Hyper-optimized tensor network contraction
Johnnie Gray^{1,2} and Stefanos Kourtis^{1,3,4}

^1 Blackett Laboratory, Imperial College London, London SW7 2AZ, United Kingdom
^2 Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, USA
^3 Department of Physics, Boston University, Boston, MA 02215, USA
^4 Institut quantique & Département de physique, Université de Sherbrooke, Québec J1K 2R1, Canada

March 12, 2021
arXiv:2002.01935v4 [quant-ph] 11 Mar 2021

Tensor networks represent the state-of-the-art in computational methods across many disciplines, including the classical simulation of quantum many-body systems and quantum circuits. Several applications of current interest give rise to tensor networks with irregular geometries. Finding the best possible contraction path for such networks is a central problem, with an exponential effect on computation time and memory footprint. In this work, we implement new randomized protocols that find very high quality contraction paths for arbitrary and large tensor networks. We test our methods on a variety of benchmarks, including the random quantum circuit instances recently implemented on Google quantum chips. We find that the paths obtained can be very close to optimal, and often many orders of magnitude better than the most established approaches. As different underlying geometries suit different methods, we also introduce a hyper-optimization approach, where both the method applied and its algorithmic parameters are tuned during the path finding. The increase in quality of contraction schemes found has significant practical implications for the simulation of quantum many-body systems and particularly for the benchmarking of new quantum chips. Concretely, we estimate a speed-up of over 10,000× compared to the original expectation for the classical simulation of the Sycamore ‘supremacy’ circuits.

1 Introduction

Since the advent of the density-matrix renormalization group algorithm, invented to study one-dimensional lattice systems of quantum degrees of freedom, tensor networks have permeated a plethora of scientific disciplines, finding use in fields such as quantum condensed matter [1–4], classical statistical mechanics [5–7], information science and big-data processing [8, 9], systems engineering [10], quantum computation [11], machine learning and artificial reasoning [12–14] and more. The underlying idea of tensor network methods is to use sparse networks of interconnected low-rank tensors to represent data structures that would otherwise be expressed in (very) high-rank tensor form, which is hard to manipulate. Due to this ubiquity, techniques to perform (multi)linear algebraic operations on tensor networks accurately and efficiently are very useful to a highly interdisciplinary community of researchers and engineers. Of these operations, tensor network contraction, i.e., the evaluation of a scalar quantity that has been expressed as a tensor network, is the most common.

When a system under consideration gives rise to a tensor network with a regular structure, such as a lattice, the renormalization group apparatus is often employed to perform tensor network contractions with controllable accuracy. This approach has been successful in tackling a variety of classical and quantum many-body problems [5–7, 15–20]. Efficient tensor network contraction is also possible in special cases in which network topology (e.g., trees), values of tensor entries, or both are restricted [21–26]. Despite these results, contracting tensor networks with arbitrary structure remains (at least) #P-hard in the general case [27, 28]. This is true, in particular, for tensor networks that model
random quantum circuits, a fact that has recently inspired proposals for quantum algorithms running on these circuits that aim towards a practically demonstrable quantum computational advantage over classical computers [11, 29–39]. The key idea is that, unlike quantum algorithms (e.g., Shor or Grover) that require deep quantum circuits and high gate fidelities — inaccessible in the near future — to become manifestly advantageous, the task of sampling bit strings from the output of random quantum circuits is expected to be hard to simulate classically even for low-depth circuits and low-fidelity gates. The precise threshold for observing such a quantum advantage is nonuniversal and ultimately depends on the efficiency of the classical simulation for each particular combination of circuit model and quantum chip architecture. This motivates the development of high-performance simulation techniques for these quantum systems, predominantly based on finding good contraction paths for tensor networks, that runs in parallel to the race for the development of higher qubit count and quality devices [40–42].

Figure 1: Sample tensor networks: (a) simplified network for a rectangular 7×7 qubit 1 + 40 + 1 depth random quantum circuit with 742 rank-3 tensors; (b) a random 5-regular network with 100 tensors, arising in, e.g., SAT problems; and (c) random planar network with 184 tensors, arising in, e.g., the statistical-mechanical evaluation of knot invariants.

Inspired by the classical simulation of quantum circuits, here we introduce a new framework for exact contraction of large tensor networks with arbitrary structure (see examples in Fig. 1). The first key idea of this framework is to explicitly construct the contraction tree for a given tensor network, combining agglomerative, divisive, and optimal drivers for forming sub-trees at different scales. The second key idea is to hyper-optimize the generation of these trees, and to do this with respect to the entire tree and thus the total contraction cost, rather than just the leading scaling, given by the line-graph tree-width for example. We also establish a powerful set of simplifications for efficiently pre-processing tensor networks prior to contraction.

Using this framework we are able to find very high-quality contraction paths, achieving speedups that scale exponentially with the number of tensors in the network compared to established approaches, for a variety of problems. The drivers we test include recently introduced contraction algorithms based on graph partitioning and community structure detection [43], previously theorized [11] and recently implemented [44] algorithms based on the tree decomposition of graphs, as well as new heuristics that we introduce in this work. Furthermore, observing that different graph structures favor different algorithms, we implement a hyper-optimization approach, where both the method applied and its parameters are varied throughout the contraction path search, leading to automatically customized contraction algorithms that often achieve near-optimal performance.

We demonstrate the new methodology introduced here on a range of benchmarks. First, we test on problems defined on random graph families, such as simulation of solving MAX-CUT with quantum approximate optimization as well as weighted model counting. We find substantial improvements in performance compared to previous methods reported in the literature. We then simulate random quantum circuits recently implemented by Google on the Bristlecone and Sycamore architectures. We estimate a speed-up of over 10,000× in the classical simulation of the Sycamore ‘supremacy’ circuits compared to what is given in [45]. In general, our algorithms outperform all others for the same task, by a wide margin on general networks and by a narrower margin on planar structures. These findings thus illustrate that our methods can lead to significant performance gains across a spectrum of tensor network applications. This is the main result of this paper.

The remainder of this paper is organized as follows. In Sec. 2 we formalize the problem of
finding the optimal contraction path for arbitrary tensor networks. In Sec. 3 we introduce and explain the various algorithms employed in our heuristics. In Sec. 4 we test our methods on a variety of benchmarks, including the random quantum circuit instances recently implemented on Google Bristlecone and Sycamore quantum chips, the simulation of the quantum adiabatic optimization algorithm for solving the MAX-CUT problem on random regular graphs, and exact weighted model counting on problem instances from a recent competition. We conclude in Sec. 5.

2 Problem statement

We denote an edge-weighted graph by G = (V, E), where V is the vertex set and the set of 2-tuples of vertex indices E ⊂ {(u, v) : u, v ∈ V} is the edge set, along with a weight function w : E → R+ that assigns a positive real number to each edge. For each vertex v, define the incidence set s_v = {e : e ∈ E and v ∈ e}, which is the set of edges incident to vertex v, such that |s_v| = d_v, the degree of vertex v.

To define a tensor network, we augment G with (i) a discrete variable x_e for each edge e ∈ E, whose set of possible values is given by D(e) with |D(e)| = w(e), (ii) an ordered tuple t_v : N_{d_v} → s_v for each vertex v ∈ V, and (iii) a multivariate function or tensor T_v : D(t_v(1)) × · · · × D(t_v(d_v)) → C, where t_v(i) denotes the ith element of tuple t_v, for every vertex v ∈ V. That w is defined to be a real-valued function even though |D(e)| ∈ Z+ ∀ e ∈ E is simply a choice that allows for extra flexibility in the design of contraction algorithms; see, e.g., the Boltzmann greedy algorithm below.

With these definitions, a tensor network contraction can be represented as a sequence of vertex contractions in graph G. Each vertex contraction removes common edges between pairs of tensors, if any, and represents a product operation on the corresponding tensors, in which one takes the inner product over common indices or an outer product if there are no common indices. For simplicity, in what follows we consider only pairwise contractions, which are common practice. Multiway contractions are also possible, but they can always be decomposed into sequences of pairwise contractions. For some applications, only a subset of V must be contracted, while in others all vertices in V are contracted into a single vertex. Here we will focus on the latter case, as it underlies the former. We will assume that G initially has no loops, i.e., edges connecting vertices to themselves, and that multiple edges are always contracted simultaneously, so that no loops occur throughout the contraction.

To represent the sequence of vertex contractions, we define a rooted binary tree B = (V_B, E_B), with the first |V| vertex indices denoting leaves, using two tuples l and r such that l(v) and r(v) are the indices of the ‘left’ and ‘right’ children of vertex v ∈ V_B, respectively, if any. This defines a tree embedding of G [46]. Finally, we assign an incidence set s_v to each v ∈ V_B, starting with leaves, according to

$$s_v = \begin{cases} \{e : e \in E \text{ and } v \in e\} & \text{if } v \text{ is a leaf index}, \\ s_{l(v)} \oplus s_{r(v)} & \text{otherwise}, \end{cases} \qquad (1)$$

with $s_i \oplus s_j = (s_i \cup s_j) \setminus (s_i \cap s_j)$. The composite (B, S), where S = {s_v : v ∈ V_B}, defines a contraction tree of G.

For a given tensor network contraction tree, one can quantify the space and time cost of contracting the network. First, the total space required for the contraction of a network is given, up to an O(|V|) prefactor, by 2^W, for contraction width

$$W = \mathrm{ec}_{\max}(B, S), \qquad (2)$$

where ec_max is the maximum edge congestion for this tree embedding of G [47]. In our notation,

$$\mathrm{ec}_{\max}(B, S) = \max_{v \in V_B} \sum_{e \in s_v} \log_2 w(e). \qquad (3)$$

A space-optimal contraction tree for G is then defined by

$$B_{\mathrm{space}}(G) = \underset{B \in \mathcal{B}_{|V|}}{\mathrm{argmin}}\; \mathrm{ec}_{\max}(B, S), \qquad (4)$$

where B_{|V|} is the set of all rooted binary trees with |V| leaves. For systems of boolean variables or qubits, w = 2 and ec_max(B, S) = max_{v ∈ V_B} |s_v|. The contraction width is then equal to the maximum vertex degree in the minors of G obtained throughout the contraction path represented by B [43], as illustrated in the example of Fig. 2. The same logic extends to any constant w.

Similarly, the time complexity of the contraction is captured by the contraction cost

$$C(B, S) = \sum_{v \in V_B} 2^{\mathrm{vc}(B,S,v)}, \qquad (5)$$
where vc is the vertex congestion [47]

$$\mathrm{vc}(B, S, v) = \sum_{e \in s_{l(v)} \cup s_{r(v)}} \log_2 w(e). \qquad (6)$$

Again using the case of qubits as an example, the number of operations required to obtain the tensor corresponding to a non-leaf vertex v by contracting its children is proportional to $2^{|s_{l(v)} \cup s_{r(v)}|}$. More precisely, assuming every contraction is an inner product, for real (complex) tensors, the associated FLOP count will be a factor of two (eight) times more than C: one (six) FLOP(s) for the multiplication and one (two) FLOP(s) for the addition. A time-optimal contraction tree for G is then

$$B_{\mathrm{time}}(G) = \underset{B \in \mathcal{B}_{|V|}}{\mathrm{argmin}}\; C(B, S). \qquad (7)$$

B_time(G) and B_space(G) are not necessarily the same and hence a strategy that aims to find one is not guaranteed to also find or approximate the other.
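To make these definitions concrete, the following Python sketch (our illustration, not the authors' reference implementation; all names are ours) builds the incidence sets of Eq. (1) for a contraction tree given as nested pairs of leaves, and accumulates the contraction width of Eqs. (2)-(3) and the cost of Eqs. (5)-(6):

```python
# Sketch: evaluate W and C for a contraction tree, following Eqs. (1)-(6).
# A tree is a nested tuple of leaves, e.g. ((0, 1), (2, 3)); `incidence`
# maps each leaf to its set of edges, `weight` maps each edge to w(e).
from math import log2

def width_and_cost(tree, incidence, weight):
    peak = [0.0]  # running maximum of Eq. (3)
    cost = [0.0]  # running sum of Eq. (5)

    def descend(node):
        if not isinstance(node, tuple):   # leaf: s_v is its edge set, Eq. (1)
            sv = incidence[node]
        else:
            sl, sr = descend(node[0]), descend(node[1])
            # vertex congestion, Eq. (6): every index touched by this
            # pairwise contraction, i.e. the union of the children's sets
            cost[0] += 2.0 ** sum(log2(weight[e]) for e in sl | sr)
            sv = sl ^ sr                  # symmetric difference, Eq. (1)
        # edge congestion, Eq. (3): indices of the tensor at this tree edge
        peak[0] = max(peak[0], sum(log2(weight[e]) for e in sv))
        return sv

    descend(tree)
    return peak[0], cost[0]

# Example: a path graph 0-1-2-3 with all bond dimensions 2.
incidence = {0: {"a"}, 1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}}
weight = {"a": 2, "b": 2, "c": 2}
print(width_and_cost(((0, 1), (2, 3)), incidence, weight))  # (2.0, 10.0)
```

For this path graph the three pairwise contractions contribute 4 + 4 + 2 = 10 to C, while the widest tensor in the tree is a rank-2 leaf, giving W = 2.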
Figure 2: For the graph shown in (a), two possible contraction trees (b) and (c), showing intermediate tensors and congestions. Each edge in a tree has an associated tensor and subgraph. The size of the tensor is exponential in the number of indices (denoted by unique colors) running along that edge — the edge congestion. Each vertex in a tree represents a pairwise contraction of two tensors, as well as a bi-partitioning of the parent edge’s subgraph (the dashed grey line shows one example of this). The cost of that pairwise contraction is exponential in the number of indices passing through that vertex — the vertex congestion. Assuming each index is the same size, the tree (c) thus has both a higher maximum contraction width (in bold) and total contraction cost than tree (b).

| Method | Optimal | Edge weights | Hyper edges | Targets |
|---|---|---|---|---|
| Exhaustive search | yes | yes | yes | total cost |
| Line graph tree decomposition | depends^a | no | yes | leading cost |
| Community detection | no | yes | no | total cost |
| Boltzmann-greedy | no | yes | yes | total cost |
| Hyper-graph partitioning | no | yes | yes | total cost |

Table 1: Contraction path optimization methods detailed in Secs. 3.1-3.5. For each method, we list its name, whether it is guaranteed to find the optimal contraction path, whether it incorporates edge weights (i.e., bond dimensions), whether it naturally handles hyper-edges, and whether it targets the total contraction cost or just the leading cost (single most expensive contraction).

^a QuickBB will eventually find the optimal contraction with respect to leading cost but not FlowCutter.

3 Tensor network contraction path optimization

We have shown that the optimization of the contraction path for a given tensor network corresponds to minimization of a vertex or edge congestion measure over the possible tree embeddings of the network. Instead of performing this minimization, here we will use methods that optimize contraction paths based on quantities that are proxies to these congestion measures, as explained below. Our heuristics are based on established algorithms for a variety of common graph theoretic tasks, such as balanced bipartitioning or community detection, some of which, unlike tree embedding, have seen decades of development and improvement, thus affording great benefits in performance to our methods. We stress, however, that all contraction path optimization tools studied in this work except for those introduced in Secs. 3.1 and 3.2 are original contributions, and that graph theory algorithms used to perform a particular task (e.g., graph partitioning) are interchangeable with any other algorithm that can perform the same task. Finally, we also note that all the algorithms we test except for the exhaustive search of Sec. 3.1 are not guaranteed to find the global minimum of the congestion measures. Nevertheless, as will be seen below, they can often get arbitrarily close to the optimum. A summary of the methods we introduce below is shown in Tab. 1.

3.1 Exhaustive search

One method for finding contraction trees is to exhaustively search through all of them and return whichever minimizes the desired target W or C. Since outer products are rarely ever beneficial, an efficient but virtually optimal way to perform this search is to adopt a dynamic programming approach that builds the tree considering connected subgraphs only [48]. We refer to this optimizer as Optimal and for our results use the version implemented in opt_einsum [49].
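For example, with the opt_einsum package installed, an optimal path for a small network can be requested directly (the four-tensor ring below is our own toy example, not one from the paper):

```python
# Sketch: exhaustive/optimal path search via opt_einsum on a toy 4-tensor
# ring (the trace of a product of four 2x2 matrices).
import numpy as np
import opt_einsum as oe

tensors = [np.random.rand(2, 2) for _ in range(4)]
# optimize="optimal" exhaustively searches all orders; recent versions also
# offer a connected-subgraph dynamic programming variant, optimize="dp"
path, info = oe.contract_path("ab,bc,cd,da->", *tensors, optimize="optimal")

print(path)                       # e.g. [(0, 3), (0, 2), (0, 1)]
print(info.opt_cost)              # total scalar operations along this path
print(info.largest_intermediate)  # size of the biggest intermediate, 2**W
```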
3.2 Line-Graph Tree Decompositions - QuickBB & FlowCutter

The most common approach to contracting arbitrary tensor networks in recent years, motivated by the results of Markov and Shi [11], has been to find a tree decomposition of the line graph of G. From this tree decomposition, an edge elimination ordering can be constructed such that the complexity of the corresponding contraction is upper bounded by the tree-width of the line graph minus one. Practically speaking, we turn an edge ordering, (e1, e2, e3, ...), into a contraction tree as follows. First, find the subgraph of G induced by the next edge in the ordering, ei. Update G by contracting all of the tensors in this subgraph to form a single vertex (if there are more than 2 tensors, use an exhaustive or greedy approach to find a contraction sequence for this small subgraph only). Repeat until all edges in the ordering have been processed.

In the tensor network literature the most commonly used tree decomposition finder is QuickBB [50], which implements a depth-first ‘branch and bound’ search. Broadly speaking this approach emphasizes performance for graphs with modest numbers of edges, where indeed QuickBB has been shown to work well [42]. More recently, the FlowCutter tree decomposition finder [51, 52] has been applied to tensor networks [44]. FlowCutter takes more of a ‘top-down’ approach which emphasizes performance on graphs with large numbers of edges. Both function as ‘any-time’ algorithms, able to yield the best found solution after setting an arbitrary time. On the other hand, neither of these optimizers takes edge weights into account, which may be a significant disadvantage in the many-body setting, where, unlike in quantum circuits, bond dimensions can vary significantly.
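In code, the conversion from an edge elimination ordering to a pairwise contraction path might look as follows (our own sketch; where a subgraph holds more than two tensors it simply pairs groups arbitrarily instead of calling an exhaustive or greedy sub-optimizer as described above):

```python
# Sketch: turn an edge elimination ordering (e1, e2, ...) into a pairwise
# contraction path. `incidence[v]` is the set of edges of leaf tensor v;
# groups of already-merged leaves are tracked as frozensets.
def ordering_to_path(order, incidence):
    groups = {v: frozenset([v]) for v in incidence}
    path = []
    for edge in order:
        # all current vertices whose subgraph contains this edge
        touching = list({groups[v] for v in incidence if edge in incidence[v]})
        while len(touching) > 1:          # contract them down to one vertex
            a, b = touching.pop(), touching.pop()
            merged = a | b
            path.append((a, b))
            for v in merged:
                groups[v] = merged
            touching.append(merged)
    return path

# Example: a triangle of three tensors sharing edges a, b, c.
inc = {0: {"a", "c"}, 1: {"a", "b"}, 2: {"b", "c"}}
print(ordering_to_path(["a", "b", "c"], inc))
# e.g. [({0}, {1}), ({0, 1}, {2})]; edge c is then already internal
```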
3.3 Community detection via edge betweenness - Hyper-GN

One of the methods for the contraction of tensor networks with arbitrary structure introduced in Ref. [43] is based on detecting communities in the network. Qualitatively, a community is a subset of the vertices in a network that is densely connected internally and sparsely connected with its complement. Detecting communities in networks is a central problem in the study of complex networks [53, 54].

The intuition behind using the community structure to contract an arbitrary tensor network is that it is advantageous to contract all the edges between vertices that belong to a community first. That is because the vertex that results from the contraction of all edges within a community, which we call a community vertex, is sparsely connected with the rest of the network. Thus, when a community structure exists and is detected in the network, the adherence of contractions to this community structure is expected to lead to community vertices with a maximum degree that is lower than that of the same number of vertices reached by an arbitrary sequence of contractions of the original network. This approach hence effectively minimizes the contraction cost, i.e., yields a contraction sequence that approximates the one defined by the space-optimal contraction tree.

A popular community structure detection algorithm is the one of Girvan and Newman [55]. It operates by evaluating a quantity called edge betweenness centrality, defined as

$$g(e) = \sum_{s,t \in V} \sigma_{st}(e) / \sigma_{st}, \qquad (8)$$

where σ_st is the total number of shortest paths between vertices s and t, and σ_st(e) is the number of those paths that pass through edge e ∈ E. The algorithm starts with an empty edge list and repeats two steps:

1. remove e′ = argmax_{e ∈ E} g(e) from E and add it to the list,
2. calculate g(e) ∀ e ∈ E,

until exhausting E. Multiple edges can be processed simultaneously, since they have the same g.

The resulting list of edges, sometimes called a dendrogram, defines the detected community structure: if one sequentially removes the list entries from E until G becomes disconnected, then the resulting connected components are the communities of G. The algorithm then proceeds by splitting each connected component into smaller communities, and the process repeats all the way down to the individual vertex level.

The output of the Girvan-Newman method is also a contraction path: one simply has to traverse the edge list in reverse, each entry defining a contraction of the endpoints of the corresponding edge. One can incorporate edge weights (and thus bond dimensions) into Eq. (8), possibly randomized with some strength τ, to generate varied paths. We call the optimizer based on repeated sampling of these paths Hyper-GN.
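A sketch of the resulting path construction using networkx (our illustration; it omits the edge-weight incorporation and the τ-strength randomization mentioned above):

```python
# Sketch: build a contraction path by repeatedly removing the highest
# betweenness edge (Girvan-Newman), then traversing the list in reverse.
import networkx as nx

def gn_contraction_path(G):
    H = G.copy()
    removal_order = []
    while H.number_of_edges() > 0:
        g = nx.edge_betweenness_centrality(H)   # Eq. (8), recomputed each step
        removal_order.append(max(g, key=g.get))
        H.remove_edge(*removal_order[-1])
    # reversed, each edge defines a contraction of its two endpoints
    return list(reversed(removal_order))

path = gn_contraction_path(nx.random_regular_graph(3, 10, seed=0))
print(path[:3])   # first few pairwise contractions, as vertex pairs
```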
3.4 Agglomerative contraction trees - Hyper-Greedy

One simple way to construct a contraction tree is greedily from the bottom up. Here, one ignores any overall structure of the graph G and instead heuristically scores each possible pairwise contraction. Based on these scores, a pair of tensors can be chosen and contracted into a new vertex, and the list of scores is then updated with any new possible contractions. Whilst we know the exact cost and output size of each pairwise contraction, we do not know the effect it might have on the cost and size of later contractions, meaning we must instead carefully choose the heuristic score function.

Given two tensors Ti and Tj whose contraction yields Tk, one reasonable choice for the heuristic cost function is

$$\mathrm{cost}(T_i, T_j) = \mathrm{size}(T_k) - \alpha \left( \mathrm{size}(T_i) + \mathrm{size}(T_j) \right) \qquad (9)$$

with α a tunable constant. If we take α = 1 then this cost is directly proportional to the change in memory should we perform the contraction, whereas taking α = 0 essentially just prioritizes the rank of the new tensor. Since we will want to sample many greedy paths we also introduce a ‘Boltzmann factor’ weighting of the costs such that the probability of selecting a pairwise contraction is

$$p(T_i, T_j) \propto \exp\left(-\mathrm{cost}(T_i, T_j)/\tau\right), \qquad (10)$$

with τ an effective temperature governing how ‘adventurous’ the path finding should be. Repeatedly generating contraction trees using this combination of cost and weighting, whilst potentially tuning both α and τ, leads to the Hyper-Greedy optimizer. Hyper-Greedy generally outperforms other greedy approaches and is quick to run, making it a simple but useful reference algorithm.
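A minimal sketch of one Boltzmann-weighted greedy selection step, following Eqs. (9) and (10) (our own code; it assumes every index has dimension 2 and is shared by at most two tensors, so the output indices of a candidate contraction are the symmetric difference of the inputs'):

```python
# Sketch: one Boltzmann-weighted greedy selection, Eqs. (9) and (10).
# `tensors` maps a tensor name to its set of indices.
import math, random

def size(indices, dim=2):
    return dim ** len(indices)

def sample_contraction(tensors, alpha=1.0, tau=1.0):
    names = list(tensors)
    cands = [(i, j) for a, i in enumerate(names) for j in names[a + 1:]
             if tensors[i] & tensors[j]]          # must share an index
    costs = [size(tensors[i] ^ tensors[j])
             - alpha * (size(tensors[i]) + size(tensors[j]))
             for i, j in cands]                   # Eq. (9)
    m = min(costs)                                # shift for stability
    weights = [math.exp(-(c - m) / tau) for c in costs]   # Eq. (10)
    return random.choices(cands, weights=weights, k=1)[0]

tn = {"A": {"a", "b"}, "B": {"b", "c"}, "C": {"c", "d"}}
print(sample_contraction(tn, alpha=1.0, tau=0.5))  # e.g. ('A', 'B')
```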

3.5 Divisive contraction trees - Hyper-Par

The greedy or agglomerative approach is a natural way to think about building contraction trees from the bottom up. However, as introduced in [43], we can also try and build contraction trees from the top down in a divisive manner. The key here is that each node in a contraction tree represents not only an effective tensor but a subgraph of the initial graph describing the full tensor network. As we ascend a contraction tree, merging two nodes corresponds to a pairwise contraction of the two effective tensors. In reverse, as we descend a contraction tree, splitting a node corresponds to a bipartitioning of the subgraph associated with that node.

Practically we start with the list of ‘childless’ vertices - initially just the root of the tree corresponding to the full graph, {V_G}. We take the next childless vertex, V, and partition it into V = V1 ∪ V2. If |V1| > 1 we append it to the list of childless vertices, and similarly if |V2| > 1. This process can be repeated until the full contraction tree is generated. Such a divisive approach is very similar to the community detection scheme introduced earlier; however, whilst the Girvan-Newman algorithm naturally yields the entire contraction tree, here we create single contractions one at a time. This allows one to combine partitioning with other optimizers. For example, we can instead partition a vertex V into k partitions, V1, V2, ..., Vk, and then use the Optimal or Hyper-Greedy optimizer to ‘fill in’ the contraction tree — essentially find the contraction path for a tensor network composed just of the tensors corresponding to each of these new subgraphs. Similarly, if the size of V drops below some threshold, we can again use either Optimal or Hyper-Greedy to find the remaining part of the contraction tree corresponding just to the leaf tensors in V.

The cost of an individual contraction - a vertex bi-partitioning - is given by the product of the dimensions of the involved indices. These include any outer indices of the subgraph, plus any indices that cross the newly created partition. Since the outer indices are independent of the partition, minimizing the number of indices cut by a partition also minimizes the cost of the corresponding contraction. This is still essentially a greedy approach - it only considers the cost of a single contraction, and strictly minimizing this cost (corresponding to choosing a min-cut) could likely create more expensive contractions down the line. However, one way to heuristically adjust this is to control how balanced to make the partitions, in other words, how much to match the size of each partition. Specifically, we can define the imbalance parameter, ε, such that |Vi| ≤ (1 + ε)|V|/k for i = 1 ... k, where k is the number of partitions. If ε is close to zero, then the partitions are forced to be very similar in size, whilst if ε is close to k the partitions are allowed to be of any size.

Taking into account the internal structure of the tensors in a problem allows for further flexibility in the recursive bipartition process, which in turn can lead to significant performance gains. As an example, consider the case of a COPY tensor, whose entries are 1 only when all indices are equal and 0 otherwise. These tensors appear, for example, when modeling circuits of controlled gates (see, e.g., Sec. 4.6.1) or satisfiability formulas [26, 43]. Each COPY tensor in a network can be replaced by any connected graph of COPY tensors without changing the result of the contraction [4]. By replacing all COPY tensors in the network with hyperedges, one can perform recursive hypergraph bipartitioning with more freedom in the search for short cuts compared to the original graph. To revert back to a ‘traditional’ tensor network after partitioning, each hyperedge can be replaced by a low-rank COPY tensor subgraph that cuts each separator at most once, as illustrated in Fig. 3. Another important use-case for hyperedges is to efficiently treat batch and output indices, though these are not benchmarked in this work.

Figure 3: (a) Segment of tensor network with six tensors, one of which (black filled circle) is a COPY tensor. (b) COPY tensor replaced by a hyperedge. Recursive hypergraph bipartitioning yields the separator hierarchy drawn as dashed lines, with thicker lines for higher level in the hierarchy. (c) After a separator hierarchy is found, the hyperedge is replaced by a connected subgraph of COPY tensors whose edges intersect each separator at most once. The results of the contraction of networks (a) and (c) are identical.

We employ the partitioner KaHyPar [56, 57] to generate our contraction trees for a number of reasons. Aside from offering state-of-the-art performance, it can handle hypergraphs (and thus arbitrary tensor expressions), allows key parameters such as the imbalance to be specified, and takes into account edge weights (and thus arbitrary bond dimensions). Repeatedly sampling contraction trees whilst tuning the parameters k, ε and the cut-off to stop partitioning leads us to the optimizer we call Hyper-Par. Note that the line graph and greedy methods of Secs. 3.2 and 3.4, respectively, also support hypergraphs natively.

In passing, we note that (hyper)graph partitioning has been used as a simplification tool for computational tasks in other research fields, see, e.g., [58].
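Schematically, the divisive construction looks as follows (our own sketch; `bipartition` stands in for a call to a real partitioner such as KaHyPar, and below the cut-off we simply nest pairs rather than calling Optimal or Hyper-Greedy):

```python
# Sketch: build a contraction tree top-down by recursive bipartitioning.
# `bipartition(vertices)` must return two non-empty vertex sets.
def divisive_tree(vertices, bipartition, cutoff=2):
    if len(vertices) == 1:
        return next(iter(vertices))
    if len(vertices) <= cutoff:
        # a real implementation would call Optimal or Hyper-Greedy here;
        # as a placeholder we nest pairs left-to-right
        vs = sorted(vertices)
        tree = vs[0]
        for v in vs[1:]:
            tree = (tree, v)
        return tree
    v1, v2 = bipartition(vertices)
    return (divisive_tree(v1, bipartition, cutoff),
            divisive_tree(v2, bipartition, cutoff))

# toy partitioner: split by sorted order (illustration only, not KaHyPar)
halve = lambda vs: (set(sorted(vs)[: len(vs) // 2]),
                    set(sorted(vs)[len(vs) // 2:]))
print(divisive_tree(set(range(6)), halve))   # ((0, (1, 2)), (3, (4, 5)))
```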
3.6 Stochastic Bayesian Optimization

The Optimal contraction tree optimizer runs until completion, whilst QuickBB and FlowCutter are natively any-time algorithms. For the remaining three optimizers – Hyper-GN, Hyper-Greedy and Hyper-Par – we use a combination of randomization and Bayesian optimization [59] to intelligently sample ever better contraction paths. This allows all three of them to run as parallel any-time algorithms.

For the Hyper-GN and Hyper-Par optimizers, randomization can be introduced as a noise of the edge weights of the initial graph G. For the Hyper-Greedy optimizer the Boltzmann sampling of greedy contractions yields another source of randomization. Due to the high sensitivity of the contraction width W and cost C to the contraction path, simply sampling many paths and keeping the best already offers significant improvements over single shot versions of these same algorithms. However, we can further improve the performance if we allow the heuristic parameters of each optimizer to be tuned as the
sampling continues. We use the baytune [60] library to perform this optimization, which uses Gaussian processes [61] to model the effect of the parameters on the target score – either W or C – and suggest new combinations which are likely to perform well.
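The overall tuning loop can be sketched as follows; we use scikit-optimize's gp_minimize purely as a stand-in for the baytune machinery, and the objective shown is a dummy placeholder for "sample a batch of paths with these parameters and score the best":

```python
# Sketch: Gaussian-process guided tuning of heuristic parameters, here the
# alpha and tau of Eqs. (9)-(10). In a real loop, `score_params` would run
# the optimizer with the given parameters and return e.g. log2 of the best
# contraction cost C found.
from skopt import gp_minimize
from skopt.space import Real

def score_params(params):
    alpha, tau = params
    # ... generate Boltzmann-greedy paths with (alpha, tau), score them ...
    return (alpha - 0.7) ** 2 + (tau - 0.3) ** 2   # dummy objective

result = gp_minimize(score_params,
                     [Real(0.0, 2.0), Real(0.01, 2.0)],  # alpha, tau ranges
                     n_calls=32, random_state=0)
print(result.x, result.fun)   # best (alpha, tau) found and its score
```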
3.7 Tensor Network Simplifications

Next we describe a series of simplifications, based simply on tensor network structure and sparsity of the tensors, that we perform iteratively until no more operations are possible. These are all designed to decrease the complexity of the tensor network prior to invoking the full contraction path finders, and are performed as efficient local searches.

The first of these is diagonal-reduction of tensor axes, as introduced for quantum circuits in [62]. For a k-dimensional tensor, $t_{i_1 i_2 \ldots i_k}$, with indices $i_1 i_2 \ldots i_k$, if for any pair $\{i_x, i_y\}$

$$t_{i_1 i_2 \ldots i_k} = 0 \;\; \forall\; i_x \neq i_y \qquad (11)$$

then we can replace t with a (k − 1)-dimensional tensor $\tilde{t}$ with elements $\tilde{t}_{\ldots i_x} = t_{\ldots i_x i_y} \delta^{i_y}_{i_x}$, where the δ copy can be implemented by re-indexing $i_y \to i_x$ everywhere else in the tensor network, thus resulting in $i_x$ becoming a hyperedge. This enables the use of the hypergraph machinery detailed in Sec. 3.5.
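Detecting such a pair of diagonal axes is a simple local check; a numpy sketch (our illustration, using a small absolute tolerance for the zero test):

```python
# Sketch: test whether axes (ax1, ax2) of tensor t satisfy Eq. (11), i.e.
# t vanishes everywhere off the ix == iy diagonal of that pair of axes.
import numpy as np

def is_diagonal_pair(t, ax1, ax2, atol=1e-12):
    d = t.shape[ax1]
    if d != t.shape[ax2]:
        return False
    off_diag = ~np.eye(d, dtype=bool)   # the antidiagonal test of Eq. (12)
    #                                     below would use a flipped eye
    u = np.moveaxis(t, (ax1, ax2), (0, 1))
    return np.allclose(u[off_diag], 0.0, atol=atol)

# a CZ gate reshaped to a rank-4 tensor is diagonal in each (in, out) pair
cz = np.diag([1.0, 1.0, 1.0, -1.0]).reshape(2, 2, 2, 2)
print(is_diagonal_pair(cz, 0, 2), is_diagonal_pair(cz, 1, 3))  # True True
```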
The second pre-processing step we perform is rank-simplification. Here we generate a greedy contraction path that targets rank reduction only (i.e. with respect to Eqs. (9) and (10) sets α = τ = 0). We then perform any of the pairwise contractions such that the rank of the output tensor is not larger than the rank of either input tensor. If the tensor network has no hyperedges, this corresponds to absorbing all rank-1 and rank-2 tensors into neighbouring tensors, a process which cannot increase the cut-weight across any partition, for example.

The third pre-processing step we perform is antidiagonal-gauging. Here, again assuming we have a k-dimensional tensor $t_{i_1 i_2 \ldots i_k}$, if for any pair of indices $\{i_x, i_y\}$ of matching size d we find

$$t_{i_1 i_2 \ldots i_k} = 0 \;\; \forall\; i_x \neq d - i_y \qquad (12)$$

then we can flip the order of either index $i_x$ or $i_y$ throughout the tensor network. This corresponds to gauging that index with a ‘reflected’ identity, for example if d = 2 the Pauli matrix X. This simplification does not help on its own but merely produces tensors which can then be diagonally reduced using the prior scheme.

The fourth simplification we perform is column-reduction. Here, if for any k-dimensional tensor $t_{i_1 i_2 \ldots i_k}$ we find an index $i_x$ and ‘column’ c such that

$$t_{i_1 i_2 \ldots i_k} = 0 \;\; \forall\; i_x \neq c \qquad (13)$$

then we can replace every tensor, $t_{\ldots i_x}$, featuring that index with the (k − 1)-dimensional tensor $\tilde{t}$ corresponding to the slice $t_{\ldots [i_x = c]}$, removing that index from the network entirely. This can be pictured as projecting the index into the basis state |c⟩.

The final possible processing step is split-simplification. Here if any tensor, t, has an exact low-rank decomposition across any bipartition of its indices – i.e. $t_{i_1 \ldots j_1 \ldots} = \sum_k l_{i_1 \ldots, k}\, r_{j_1 \ldots, k}$ with max(size(l), size(r)) < size(t) – we perform it. This is done using the SVD, and is the one simplification that increases the number of tensors in order to decrease the cut-weight across partitions.
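A numpy sketch of the split test (our illustration, for a dense tensor and a given bipartition of its axes):

```python
# Sketch: exact low-rank split of t across a bipartition of its axes via
# SVD; performed only if the two factors are jointly smaller than t.
import numpy as np

def try_split(t, left_axes, atol=1e-12):
    right_axes = tuple(ax for ax in range(t.ndim) if ax not in left_axes)
    mat = np.transpose(t, left_axes + right_axes).reshape(
        int(np.prod([t.shape[ax] for ax in left_axes])), -1)
    U, s, Vh = np.linalg.svd(mat, full_matrices=False)
    rank = int(np.sum(s > atol))          # exact numerical rank
    left = (U[:, :rank] * s[:rank]).reshape(
        *(t.shape[ax] for ax in left_axes), rank)
    right = Vh[:rank].reshape(rank, *(t.shape[ax] for ax in right_axes))
    if left.size + right.size < t.size:   # worthwhile: return the factors
        return left, right
    return None

# outer product of two matrices: the (0, 2 | 1, 3) bipartition has rank 1
t = np.einsum("ab,cd->acbd", np.random.rand(2, 2), np.random.rand(2, 2))
print(try_split(t, (0, 2)) is not None)   # True
```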

We apply the above set of simplifications iteratively but deterministically until no method can find any operation to perform. For all methods that compare to zero we use a relative precision of 10^-12 unless otherwise stated. The order they are applied in can produce very different networks – we find cycling through the order {antidiagonal-gauging, diagonal-reduction, column-reduction, rank-simplification, split-simplification} produces good results. Indeed, for quantum circuits the resulting tensor networks generally have almost no sparsity left among tensor entries. Note that for methods such as Hyper-GN, which cannot handle hyperedges, we skip the diagonal-reduction. Finally, if aiming to reuse a contraction path, one needs to maintain the sparsity structure from network to network, possibly excluding any variable tensors from the simplification steps that detect sparsity. For most circuits terminated with a layer of Hadamard gates, if one only changes the sampled bit-string x then even this is not usually necessary.

4 Results

We benchmark our contractors on six classes of tensor networks with complex geometry – random regular graphs, random planar graphs, square lattices, weighted model counting formulae, QAOA energy computation, and random quantum circuits. In each set of results we set a time limit or maximum number of shots for each of the optimizers to run for, and then target either the contraction width, W, or contraction cost, C. As a reminder, W is essentially the space requirement of the contraction (log2 of the size of the largest intermediate tensor) whilst C is the time requirement (the total number of scalar operations). The Optimal algorithm is able to search for either the minimum W or C, whilst Hyper-GN, Hyper-Greedy and Hyper-Par can target either through the guided Bayesian optimization. Finally, there is no way to specifically bias QuickBB and FlowCutter towards either W or C, so in each case the optimizer runs identically. If an optimizer can run in parallel, we allow it 4 cores to do so. An open source implementation of the optimizers, compatible with opt_einsum [49] and quimb [63], is available at [64].
To give some context to the relative scale of W and C, a complex, single precision tensor of size 2^27 requires 1 GB of memory, and a consumer grade GPU can usually achieve a few teraFLOPs in terms of performance, corresponding to C ∼ 10^15 over an hour. In the final results section we benchmark various contractions and indeed find this real-world performance. At the extreme end of the scale, the most powerful supercomputer in the world currently, Summit, has a few petabytes of memory, corresponding very roughly to W ∼ 47, though this is obviously distributed among nodes and utilizing it for a single contraction would need, among many other technical considerations, significant inter-node communication. Summit has also achieved sustained performance of a few hundred petaFLOPs [65], which over an hour might correspond to C ∼ 10^20, but is unlikely to do so if distributed contraction is required (i.e. for high W).
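As a quick sanity check of the first of these numbers (our own arithmetic, not from the paper): each complex single-precision entry takes 8 bytes, so 2^27 entries occupy 2^27 × 8 B = 2^30 B, i.e. 1 GiB:

```python
# Quick check: a complex64 tensor with 2**27 entries occupies 1 GiB.
entries = 2 ** 27
bytes_per_entry = 8                 # complex64 = two 4-byte floats
print(entries * bytes_per_entry / 2 ** 30, "GiB")   # 1.0 GiB
```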
4.1 Random Regular Graphs

We start by benchmarking tensor networks with geometries defined by random regular graphs, as studied in [43, 44]. These graphs arise in the study of many computational problems, such as satisfiability; moreover, problems defined on graphs with nonuniform degree distribution can often be reduced to equivalent problems on low-degree regular graphs [66]. For such a k-regular graph, every vertex is connected randomly to k others, with total number of vertices |V|. We treat each of the edges as tensor indices of size 2 and associate a rank-k tensor with each vertex. None of the simplifications of Sec. 3.7 are applicable. An example of such a network is shown in Fig. 1(b). For each size |V|, degree k and target ∈ {W, C}, we generate 100 sample regular graphs uniformly [67], and allow 5 minutes of search time per instance for each optimizer. The reference Optimal path finder we instead run for 24 hours, and we only show data points where all but one or two of the instances successfully terminated, so as not to bias those points towards easy instances.

The results are shown in Figs. 4(a)-(f). First of all we note that for small sizes all optimizers return similar performance, indeed close to Optimal. As |V| increases, however, the same ranking emerges in each combination of k and {W, C}: (from worst to best) QuickBB, Hyper-Greedy, FlowCutter, Hyper-GN, then finally Hyper-Par.
k=3                                                      k=4                                                   k=5
                                  60       QuickBB                                        60        QuickBB                                     60        QuickBB
                                           Hyper-Greedy                                             Hyper-Greedy                                          Hyper-Greedy
                                           FlowCutter                                               FlowCutter                                            FlowCutter
                                  50                                                      50                                                    50
                                           Hyper-GN                                                 Hyper-GN                                              Hyper-GN
            W

                                           Hyper-Par                                                Hyper-Par                                             Hyper-Par
             Contraction Width,

                                  40       Optimal                                        40        Optimal                                     40        Optimal

                                  30                                                      30                                                    30

                                  20                                                      20                                                    20

                                  10                                                      10                                                    10
                                                                            (a)                                                       (b)                                                (c)

                                           QuickBB                                                  QuickBB                                               QuickBB
                        10 21              Hyper-Greedy
                                                                                        10 21       Hyper-Greedy
                                                                                                                                              10 21       Hyper-Greedy
                                           FlowCutter                                               FlowCutter                                            FlowCutter
                        10 18              Hyper-GN                                     10 18       Hyper-GN                                  10 18       Hyper-GN
 C

                                           Hyper-Par                                                Hyper-Par                                             Hyper-Par
 Contraction Cost,

                        10 15              Optimal                                      10 15       Optimal                                   10 15       Optimal

                        10 12                                                           10 12                                                 10 12

                              10 9                                                       10 9                                                  10 9

                              10 6                                                       10 6                                                  10 6
                                                                            (d)                                                       (e)                                                (f)
                              10 3                                                       10 3                                                  10 3
                                       0   50    100      150   200   250         300           0   25    50       75     100   125     150           0   20       40          60   80    100
                                                          |V|                                                       |V|                                                  |V|

Figure 4: Mean contraction width (top row) and cost (bottom row) of random regular graphs of degree k = 3, 4, 5 (left, centre and right columns respectively) as a function of the number of vertices (tensors) in the network, |V|, for various contraction path optimizers each allowed 5 minutes to search. The shaded regions show standard deviations across 100 random graph instances. An example graph with k = 5 is shown in Fig. 1(b).

ment of Hyper-GN over previous studies [44] to the use of guided stochastic sampling. There are some interesting performance comparisons when it comes to targeting contraction width W or cost C. For example, while Hyper-Greedy beats QuickBB for width across the board, the results are much closer for contraction cost. On the other hand, the advantage of Hyper-Par over Hyper-GN and FlowCutter is much more pronounced when considering cost rather than width.
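As a concrete illustration of this style of benchmark, the following minimal sketch builds the same kind of k-regular tensor network as Fig. 4 and scores a greedy contraction path. It assumes the networkx and opt_einsum packages and the convention that W is log2 of the largest intermediate tensor size and C the total number of scalar operations; none of these choices are prescribed by the text.

```python
import networkx as nx
import numpy as np
import opt_einsum as oe

# Sketch: a random k-regular graph TN with one rank-k tensor of bond
# dimension 2 per vertex, contracted down to a scalar.
k, n = 3, 100
G = nx.random_regular_graph(k, n, seed=0)

# one einsum symbol per edge; each vertex term lists its incident edges
sym = {frozenset(e): oe.get_symbol(i) for i, e in enumerate(G.edges)}
terms = [''.join(sym[frozenset((v, u))] for u in G[v]) for v in G.nodes]
arrays = [np.random.rand(*(2,) * k) for _ in terms]
eq = ','.join(terms) + '->'

path, info = oe.contract_path(eq, *arrays, optimize='greedy')
W = np.log2(info.largest_intermediate)  # contraction width
C = info.opt_cost                       # contraction cost (scalar ops)
print(f"W = {W:.1f}, log10(C) = {np.log10(C):.1f}")
```

Repeating such a search with many seeds and several drivers, and keeping the best result, gives the flavour of the stochastic hyper-optimization compared in Fig. 4.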
4.2 Random Planar Graphs

A contrasting class of geometries to consider is that of planar graphs, encountered for example in the study of physical systems defined on a 2D lattice or in evaluating knot invariants [68]. To investigate these in a generic fashion, we generate random planar graphs with |V| ∈ [20, 200] using the Boltzmann sampler described in [69]. An instance of the generated graphs is shown in Fig. 1(c). Whilst these are much more random than square lattices, for example, we find nonetheless that the results are broadly representative. Similarly to the random regular graphs, for each vertex with k edges we associate a rank-k tensor with bond dimensions of size 2, and allow each optimizer 5 minutes per instance to explore contraction paths. In [44] it was shown that the optimal contraction path with respect to W for planar graphs can be found in polynomial time. Also, planar tensor networks can be contracted in subexponential time O(2^√|V|) as a consequence of the planar separator theorem [22, 43, 70]. In Fig. 5(a) and (b) we plot the mean contraction width, W, and cost, C, as a function of the 'side length' of the graph, √|V|. Alongside a sub-exponential scaling for all the optimizers, we see a very different ranking of optimizer performance as compared to random regular graphs, with Hyper-Greedy performing best. For small sizes, again the performance of all optimizers is close to Optimal, and in fact the difference between methods remains relatively small throughout the size range.

Figure 5: Mean contraction width W (top) and cost C (bottom) for randomly generated planar graphs as a function of number of vertices |V|, for various path optimizers each allowed 5 minutes to search. The shaded regions show standard deviations across random graph instances. The 35,162 graph instances studied are approximately uniformly distributed over the √|V| bins shown, and an example instance is shown in Fig. 1(c).

Figure 6: Contraction width W (top row) and contraction cost C (bottom row), for square lattice geometry – either with vertices representing the underlying lattice (left column) or hyper-edges (right column). Insets to right illustrate the four possible TNs with L = 5. Note that the hyper-edge case can be exactly transformed into the normal case but the reverse is not generally true.
4.3 Regular Square Lattice

To emphasize that the utility of these optimizers is not restricted to randomly structured graphs, we now compare the best of them with a naive Time Evolving Block Decimation (TEBD) style approach on a square 2D lattice. While such an approach – contracting a Matrix Product State boundary from one side to the other – usually would be combined with canonicalization and compression, doing it exactly yields a natural comparison point for a simple, manually chosen contraction path. In Fig. 6 we show W and C for such an approach (labelled TEBD-Exact), the best of Hyper-Greedy or Hyper-Par, as well as Optimal, for 2D square lattice TNs with bond dimension 2. As well as showing open and periodic boundary conditions (OBC and PBC), we show the case for when the lattice geometry is defined on hyper-edges rather than the vertices. This is a common scenario when evaluating partition functions of classical spin models. While the hyper-edges can be converted to COPY tensors to yield the standard TN geometry, this makes the TN harder to contract.

For OBC, we find W is significantly reduced from the TEBD-Exact scaling¹ of 2L (Fig. 6(a)), as well as C (Fig. 6(b)). Contracting the hyper-edge form of the TN also yields an advantage for both. For PBC the TEBD-Exact path yields the same, optimal contraction width (Fig. 6(c)) but carries a significantly worse scaling contraction cost (Fig. 6(d)). Contracting the hyper-edge form of the TN again yields an advantage for both. In all cases we see either Hyper-Greedy or Hyper-Par very closely tracks the Optimal width and cost at accessible sizes.

¹ With canonicalization but no compression the scaling would be W ∼ L.
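This comparison can be made concrete in code. The sketch below (again assuming networkx and opt_einsum, which the text does not name) pits an optimizer-found path against a manually specified tensor-by-tensor sweep in row-major order; the sweep is only a crude stand-in for the exact boundary contraction, since the true TEBD-Exact path retains MPS structure.

```python
import networkx as nx
import numpy as np
import opt_einsum as oe

# Sketch: an L x L OBC square-lattice TN with bond dimension 2.
L = 6
G = nx.grid_2d_graph(L, L)  # nodes in row-major order
sym = {frozenset(e): oe.get_symbol(i) for i, e in enumerate(G.edges)}
terms = [''.join(sym[frozenset((v, u))] for u in G[v]) for v in G.nodes]
arrays = [np.random.rand(*(2,) * len(t)) for t in terms]
eq = ','.join(terms) + '->'

# optimizer-found path
_, info = oe.contract_path(eq, *arrays, optimize='greedy')

# manual sweep: absorb each tensor, in row-major order, into one growing
# boundary tensor (opt_einsum appends every new intermediate at the end)
m = L * L
sweep = [(0, 1)] + [(0, m - 1 - i) for i in range(1, m - 1)]
_, info_sweep = oe.contract_path(eq, *arrays, optimize=sweep)

print(np.log2(info.largest_intermediate),        # optimizer width
      np.log2(info_sweep.largest_intermediate))  # sweep width
```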
4.4 Exact Weighted Model Counting

We now move on to exact weighted model counting, an important #P-complete task, central to problems of inference in graphical models, evaluating partition functions in statistical physics, calculating network reliabilities, and many others [71–73]. The problem can be cast as computing the following sum:

$ x = \sum_{\{v\}} \prod_v^{\#\mathrm{vars}} w_v \, \prod_i^{\#\mathrm{clauses}} C_{\bar{v}_i} $   (14)

where {v} ranges over all combinatorial assignments of every binary variable, w_v is a vector holding the 'positive' and 'negative' weight of variable v, and C_{v̄_i} is the i-th clause, containing variables v̄_i, given by the tensorization of the OR function. Such an expression can directly be thought of as a hyper tensor network, with tensors (nodes) w_v, C_{v̄_i} and tensor indices (hyper-edges) v. Key here is that we directly handle constructing contraction trees for such hyper-graphs, and thus do not need to map Eq. (14) into a 'normal' tensor network form.
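For illustration, here is a minimal numpy sketch of Eq. (14) on an invented two-clause formula with invented weights: each variable appears as a repeated einsum subscript shared by its weight vector and every clause containing it, which is exactly the hyper-edge structure described above.

```python
import numpy as np

# weights: [negative, positive] weight for each variable (invented values)
w = {v: np.array([0.4, 0.6]) for v in 'abc'}

def or_clause(*signs):
    """OR tensor over len(signs) literals; sign +1 means v, -1 means NOT v."""
    t = np.ones((2,) * len(signs))
    # zero out the single unsatisfying assignment
    t[tuple(0 if s > 0 else 1 for s in signs)] = 0.0
    return t

# weighted count of (a OR b) AND (NOT a OR c); index 'a' is a hyper-edge
# shared by w['a'] and both clause tensors
x = np.einsum('a,b,c,ab,ac->', w['a'], w['b'], w['c'],
              or_clause(+1, +1), or_clause(-1, +1))
print(x)  # 0.6 for these weights
```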
To test our contraction optimizers we assess all 100 private weighted model counting (track-2) instances from the Model Counting 2020 competition [74]. After constructing the tensor network representation of x we run the simplification procedure, actively renormalizing the tensors since for some instances x > 10^2000. We find the simplifications to be very powerful here – of the 100 instances, 63 simplify all the way to a single scalar, whilst the remaining 37 instances require actual contraction of a much reduced tensor network. We invoke our hyper-optimizer on these, allowing 64 repeats and access to both the greedy and KaHyPar drivers. Of these, 1 instance was exceptionally difficult (W ≳ 100), whilst the remaining (shown in Fig. 7) all had contraction paths with W < 20 and C < 10^8, making them easily contractable. Overall, the 99 solved instances compare favourably with the best score of 69 achieved in the competition [74]. For those 69 instances we confirmed all results against the ADDMC solver [75].

Figure 7: Example hyper tensor networks, post-simplification, representing weighted model counting formulae from the MCC2020 model counting competition.

4.5 QAOA Energy Evaluation

The Quantum Approximate Optimization Algorithm (QAOA) [76] is a promising approach for optimization on near-term quantum devices. It involves optimizing the energy of an ansatz circuit, followed by the sampling of potential solution bitstrings. Here we explore the first part, a task that has been studied before [77] and is identical to computing the energy of a unitary ansatz for a many-body model. The p-layer ansatz circuit for target graph G with constraint weights w_{j,k} for j, k ∈ E(G) is given by:

$ |\bar\gamma, \bar\beta\rangle = U_B(\beta_p) U_C(\gamma_p) \cdots U_B(\beta_1) U_C(\gamma_1) |+\rangle $   (15)

where

$ U_C(\gamma) = \prod_{j,k \in E(G)} e^{-i \gamma w_{jk} Z_j Z_k} $   (16)

$ U_B(\beta) = \prod_{j \in G} e^{-i \beta X_j} $   (17)

for the two length-p vectors of parameters γ̄ and β̄. The energy of this is given by a sum of local terms:

$ E = \sum_{j,k \in E(G)} w_{j,k} \langle \bar\gamma, \bar\beta | Z_j Z_k | \bar\gamma, \bar\beta \rangle $   (18)

where for each term any unitaries outside the 'reverse lightcone' of j, k can be cancelled.

We study MAX-CUT problems on random 3-regular graphs of size N, for which w_{j,k} = 1, equivalent to an antiferromagnetic Ising model. Note that whilst the problem is defined on such a graph, G, the actual tensor networks for each energy term have very different geometries compared to Sec. 4.1, since they arise from the repeated application of 3p layers of gates followed by unitary cancellation. Indeed, in the limit of large N, they are not random at all [77]. First we form the 3N/2 energy term tensor networks, and simplify each using all five methods from Sec. 3.7.
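Before turning to the tensor network results, Eqs. (15)–(18) can be cross-checked by brute force at small N. The sketch below assumes numpy and networkx and randomly chosen parameters, and uses a dense statevector, whereas the approach described above instead contracts one lightcone-reduced tensor network per energy term.

```python
import networkx as nx
import numpy as np

# Sketch: dense-statevector QAOA energy for MAX-CUT (w_jk = 1) on a
# random 3-regular graph, following Eqs. (15)-(18).
N, p = 10, 2
G = nx.random_regular_graph(3, N, seed=1)
rng = np.random.default_rng(0)
gammas, betas = rng.random(p), rng.random(p)

psi = np.full(2**N, 2**(-N / 2), dtype=complex)        # |+> on every qubit
bits = (np.arange(2**N)[:, None] >> np.arange(N)) & 1  # bit j of each index
z = 1 - 2 * bits                                       # Z_j eigenvalues

for gamma, beta in zip(gammas, betas):
    for j, k in G.edges:                               # U_C(gamma) is diagonal
        psi *= np.exp(-1j * gamma * z[:, j] * z[:, k])
    rot = np.array([[np.cos(beta), -1j * np.sin(beta)],
                    [-1j * np.sin(beta), np.cos(beta)]])  # e^{-i beta X}
    for j in range(N):                                 # U_B(beta), qubit by qubit
        psi = psi.reshape(2**(N - 1 - j), 2, 2**j)
        psi = np.einsum('st,atb->asb', rot, psi).reshape(-1)

E = sum((np.abs(psi)**2 * z[:, j] * z[:, k]).sum() for j, k in G.edges)
print(E)
```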

Figure 8: Maximum contraction width (a) and total contraction cost (b) for computing the energy of a p-layer QAOA circuit, averaged across 10 instances of random 3-regular graphs of size N. The shaded region shows the standard deviation across these instances.

We invoke our hyper-optimizer on these, allowing 64 repeats and access to both the greedy and KaHyPar drivers. In Fig. 8 we report the maximum contraction width, W_max, and total contraction cost, C_total, across terms, averaged over 10 instances of the random regular graphs, as a function of N and p.

We note that up to and including p = 4, throughout the range of N, W_max remains less than ∼28 and C_total less than ∼10^10, putting such simulations easily within the range of single workstations. As an example, on a CPU with 4 cores, performing all of the contractions for N = 54 and p = 4 takes on the order of seconds. Stepping up to p = 5 increases the difficulty significantly, especially in the N = 40–120 range. The peak here is due to cycles of length ≤ p appearing in G for small enough N, which dramatically increase the complexity of each tensor network.

4.6 Random Quantum Circuits

The final class of tensor networks we study is those corresponding to random quantum circuits executed on a range of quantum chip geometries. In particular, we look at sizes and depths previously explored in the context of so-called 'quantum supremacy' [37, 38, 45, 78]. Quantum circuits can be naturally cast as tensor networks and then simulated via contraction, as shown in [11]. In recent years, random quantum circuits have been used both as a test-bed for tensor network contraction schemes as well as setting the benchmark for demonstrating 'quantum supremacy' [41, 62, 79–82]. Practically speaking, such simulations can also allow the fidelity of real quantum chips to be benchmarked and calibrated [38, 45, 81].

The simplest quantity to compute here is the 'transition amplitude' of one computational basis state to another through a unitary describing the quantum circuit. Assuming we start with the N-qubit all-zero bit-string |0^⊗N⟩, the transition amplitude for output bit-string x can be written:

$ c_x = \langle x | U_d U_{d-1} \cdots U_2 U_1 | 0^{\otimes N} \rangle $   (19)

where we have assumed some notion of circuit depth, d, such that each unitary U_i contains a 'layer' of entangling gates, the exact composition of which depends on the specific circuit definition. The process for computing c_x takes place in several steps: (a) construct the tensor network corresponding to the circuit; (b) perform some purely structure-dependent simplifications of the tensor network; (c) find the contraction path for this simplified network; and (d) actually perform the contraction using the found path. Steps (a) and (b) are very cheap, and moreover we can re-use the path found in step (c) to contract any tensor network with matching structure but different tensor entries, such as varying x.
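The path reuse mentioned in steps (c) and (d) is easy to demonstrate. A toy opt_einsum sketch follows (the two-qubit circuit is invented and far smaller than any benchmarked geometry): the contraction path is found once from the network structure, and the resulting expression is then evaluated for every output bitstring x of Eq. (19).

```python
import numpy as np
import opt_einsum as oe

H = np.array([[1., 1.], [1., -1.]]) / np.sqrt(2)
CZ = np.diag([1., 1., 1., -1.]).reshape(2, 2, 2, 2)  # (out_a, out_b, in_a, in_b)
zero = np.array([1., 0.])
basis = [np.array([1., 0.]), np.array([0., 1.])]

# c_x = <x| CZ (H x H) |00>
eq = 'a,b,ca,db,efcd,e,f->'
shapes = [(2,), (2,), (2, 2), (2, 2), (2, 2, 2, 2), (2,), (2,)]
expr = oe.contract_expression(eq, *shapes)  # path optimized once, from shapes only

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    cx = expr(zero, zero, H, H, CZ, basis[x[0]], basis[x[1]])
    print(x, cx)  # amplitudes +-1/2 for this toy circuit
```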
4.6.1 Gate Decompositions

We find that pre-processing the tensor networks with the methods from Sec. 3.7 before attempting to find contraction paths is an important step, particularly for optimizers such as QuickBB and Hyper-Greedy that scale badly with the number of edges and vertices. A tensor network for c_x initially consists of: rank-1 tensors describing each of the input and output qubit states; rank-2 tensors describing single qubit gates; and rank-4 tensors describing two-qubit gates. The first processing step is deciding how to treat the two-qubit gates. A tensor describing such a gate can be written $g^{o_a o_b}_{i_a i_b}$, such that i_a (i_b) is the input index and o_a (o_b) the output index of qubit a (b). Whilst $g^{o_a o_b}_{i_a i_b}$ is unitary with respect to $i_a i_b \to o_a o_b$, a low rank decomposition can potentially be found by grouping the indices $\{i_a, o_a\}, \{i_b, o_b\}$ or $\{i_a, o_b\}, \{i_b, o_a\}$ and performing an SVD on the resulting matrix. In the first case this yields two rank-3 tensors:

$ g^{o_a o_b}_{i_a i_b} = \sum_{\xi=1}^{\chi} l^{o_a}_{i_a \xi} \, r^{o_b}_{i_b \xi} $   (20)

where we have dropped any zero singular vectors and absorbed the remaining singular values into either of the left and right tensors l and r, each of which is now 'local' to either qubit a or b, connected by a bond of size χ. The second case yields the same but with an effective SWAP (which can be implemented purely as a relabelling of tensor indices) of the qubit states first:

$ g^{o_a o_b}_{i_a i_b} = \sum_{\xi=1}^{\chi} \sum_{i'_a, i'_b = 1}^{2} l^{o_a}_{i'_a \xi} \, r^{o_b}_{i'_b \xi} \, \delta^{i'_b}_{i_a} \, \delta^{i'_a}_{i_b} $   (21)

The options for a gate are thus to: (a) perform no decomposition; (b) perform a spatial decomposition – Eq. (20); or (c) perform a swapped decomposition – Eq. (21). By default we only perform a decomposition if the bond dimension, χ, yielded is less than 4; all controlled gates fall into this category for a spatial decomposition, whereas the ISWAP gate for instance has χ = 2 for the swapped decomposition. Such exact decompositions would also be performed automatically using the split-simplification scheme of Sec. 3.7. Another option is to discard small but non-zero singular values, which will result in a drop in the fidelity of c_x [45, 83] – unless explicitly noted we do not perform this form of 'compression'.
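These two groupings are simple to inspect numerically. In the sketch below (numpy, with our assumed index ordering (o_a, o_b, i_a, i_b)), the operator SVD rank reveals which decomposition, if any, pays off for a given gate.

```python
import numpy as np

def decomposition_rank(g, swapped=False, tol=1e-12):
    """Rank chi of Eq. (20) (spatial) or Eq. (21) (swapped) for a
    two-qubit gate tensor g with index order (oa, ob, ia, ib)."""
    if swapped:
        m = g.transpose(2, 1, 3, 0).reshape(4, 4)  # {ia,ob} x {ib,oa}
    else:
        m = g.transpose(2, 0, 3, 1).reshape(4, 4)  # {ia,oa} x {ib,ob}
    s = np.linalg.svd(m, compute_uv=False)
    return int(np.sum(s > tol))

CZ = np.diag([1., 1., 1., -1.]).reshape(2, 2, 2, 2)
ISWAP = np.array([[1, 0, 0, 0], [0, 0, 1j, 0],
                  [0, 1j, 0, 0], [0, 0, 0, 1]]).reshape(2, 2, 2, 2)

print(decomposition_rank(CZ))                   # 2: spatial split pays off
print(decomposition_rank(ISWAP))                # 4: no useful spatial split
print(decomposition_rank(ISWAP, swapped=True))  # 2: swapped split pays off
```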
4.6.2 Random Quantum Circuit Geometries

We benchmark the contraction path optimizers against different random quantum circuits executed on three different quantum chip geometries: (i) a rectangular 7×7 lattice of 49 qubits; (ii) a 70-qubit 'Bristlecone' lattice; and (iii) a 53-qubit 'Sycamore' lattice.

For the first two we use the updated, harder versions of the random circuit definitions first suggested in [38], which are available at [84]. We adopt the notation (1+d+1) for depth d to emphasize that the first and last layers of single qubit gates (which add no real complexity) are not counted. In both cases the entangling gate used is the controlled-Z, which has a χ = 2 spatial decomposition.

For the Sycamore architecture, we use the same circuits that were defined and also actually executed in the recent work [45]. Here each two-qubit gate is a separately tuned 'fermionic simulation' gate, which has no low-rank decomposition if treated exactly. On the other hand, if a swapped decomposition is performed, the two smallest singular values are quite small, and on average discarding them leads to a fidelity drop of a fraction of a percentage point – for a single gate. If this approximation is used for every single entangling gate in the circuit, however, the error is compounded. For our main results, labelled 'Sycamore-53', we thus perform no gate decomposition and consider perfect fidelity transition amplitude calculations only. Results where the χ = 2 swapped decomposition has been used we label 'Sycamore-53*'. We also note that the definition of circuit 'cycles', m, used in [45] is about twice as hard as the rectangular and Bristlecone circuit definition of depth, d, since per layer almost all qubits are acted on with an entangling gate, rather than approximately half.

In the following table we report the number of network vertices and edges for representative depths of each circuit geometry after simplifications. The first two columns, |V|, |E|, are for the case where hyperedge introduction is avoided; the last two columns, |Ṽ|, |Ẽ|, are for the case where the full simplification scheme introduced above has been applied. Using the ratio |Ṽ|/|Ẽ| as a heuristic figure of merit, we see that the networks resulting from the Sycamore circuit model are considerably denser. One may thus anticipate that Sycamore benchmarks will be more challenging for our methods. This expectation will be borne out in Sec. 4.6.4.
Accepted in   Quantum 2021-03-06, click title to verify. Published under CC-BY 4.0.                                                14
Circuit                      |V|     |E|    |Ṽ|     |Ẽ|
Rectangular-7×7 (1+40+1)     734    1101    790     425
Bristlecone-70 (1+40+1)     1036    1554   1086     574
Sycamore-53 (m=20)           381     754    381     754
Sycamore-53* (m=20)          754    1131   1125     748

We note that if the swap decomposition is not applied to the Sycamore circuits then no diagonal-reductions can take place and the resulting simplified tensor network is the same in both cases.

4.6.3 2D Circuit Specific Optimizers – qFlex/PEPs

Before presenting results for contraction width and cost for these random circuits, we introduce one final form of contraction path optimizer that has been successfully applied to circuits acting on 2D lattices [81, 82]. Here one performs the spatial decomposition of the entangling gates, regardless of rank, such that every tensor is uniquely localized above a single qubit register. One can then contract every tensor in each of these spatial slices, resulting in a planar tensor network representing c_x with a single tensor per site. Although the two works, [81] and [82], have significant differences in terms of details (and goals beyond the computation of a single perfect fidelity amplitude), the core object treated by each is ultimately this planar tensor network, which is small enough that we can report optimal contraction widths and costs for it. We call this optimizer – which flattens the circuit tensor network into the plane before finding the optimal W or C from that point onwards – qFlex/PEPs. With regards to a swapped decomposition, in order to maintain the spatial locality of the tensors this method can only benefit from it in the first and last layer of gates [45].

4.6.4 Results

In Fig. 9(a)-(f) we report the mean contraction width, W, and cost, C, for each geometry and optimizer as a function of circuit depth, d, or cycles, m. For these large tensor networks we allow each optimizer one hour to search for a contraction path. While this is not an insignificant amount of time, we note that many optimizers converge to their best contraction paths much quicker, and moreover that contraction paths can be re-used if only changing tensor values from run to run. We show the variance in W and C across 10 instances, despite the fact the tensor network structure is the same, since all the optimizers aside from qFlex/PEPs are naturally stochastic.

We first note that across the board, the Hyper-Par optimizer again performs best, with little variance from instance to instance. Performance of the remaining optimizers is more difficult to rank. The tensor network simplification scheme employed here results in significant improvement over previous results even when using QuickBB to perform the actual path optimization, particularly when |E| or |Ẽ| is moderate. As the tensor networks get larger, QuickBB is consistently outperformed by the other line-graph based optimizer, FlowCutter.

For the Rectangular-7×7 and Bristlecone-70 circuits, which both use a CZ entangling gate, the diagonal reduction of tensors greatly simplifies the tensor networks. The methods that make use of this, aside from Hyper-Greedy, perform best here, with similar values of C, though interestingly Hyper-Par is able to target a lower contraction width. Hyper-GN and qFlex/PEPs do not use the diagonal simplification and here show similar performance.

In the case of Sycamore-53 the entangling fSim [85] gates are close to but not exactly ISWAP gates. As a result there are no diagonal reductions to be made and the simplified tensor network has no hyper-edges. Whilst FlowCutter, Hyper-GN and Hyper-Par find similar contraction widths, Hyper-Par achieves a much lower contraction cost. This is likely due to its ability to search imbalanced partition contraction trees such as 'Schrödinger style' (full wavefunction) evolution. Note that for the entangling gates an approximate swapped χ = 2 decomposition can be made, resulting in a drop in fidelity based on how many of the m layers of gates this is applied to. The qFlex/PEPs method results in [45] make use of this in the first and last layer of gates, for a drop in total fidelity of ∼5% that reduces W by ∼4 and C by ∼2^4. We only show the exact results here so as to compare all methods on exactly the same footing. If the swapped decomposition is used for all layers (Sycamore-53*) then at m = 20 the corresponding drop in total fidelity is likely to be ∼50%. For the best performing optimizers in Fig. 9(c) and (f) we find little gain in doing so. We also emphasize that for the highest values of m, the estimates for classical computation cost in [45] are not based on the qFlex [81] simulator
