ZIPPER: Exploiting Tile- and Operator-level Parallelism for General and Scalable Graph Neural Network Acceleration

Zhihui Zhang1, Jingwen Leng1, Shuwen Lu1, Youshan Miao2, Yijia Diao1, Minyi Guo1, Chao Li1, Yuhao Zhu3
1 Shanghai Jiao Tong University, 2 Microsoft Research, 3 University of Rochester

arXiv:2107.08709v1 [cs.AR] 19 Jul 2021
ABSTRACT

Graph neural networks (GNNs) start to gain momentum after showing significant performance improvement in a variety of domains including molecular science, recommendation, and transportation. Turning such performance improvement of GNNs into practical applications relies on effective and efficient execution, especially for inference. However, neither CPU nor GPU can meet these needs when considering both performance and energy efficiency. This is because accelerating GNNs is challenging due to their excessive memory usage and arbitrary interleaving of diverse operations. Besides, the semantics gap between the high-level GNN programming model and efficient hardware makes it difficult to accelerate general-domain GNNs.

To address the challenge, we propose ZIPPER, an efficient yet general acceleration system for GNNs. The keys to ZIPPER include a graph-native intermediate representation (IR) and the associated compiler. By capturing GNN primitive operations and representing them with the GNN IR, ZIPPER is able to fit GNN semantics into the hardware structure for efficient execution. The IR also enables GNN-specific optimizations including sparse graph tiling and redundant operation elimination. We further present a hardware architecture design consisting of dedicated blocks for different primitive operations, along with a run-time scheduler that maps an IR program to the hardware blocks. Our evaluation shows that ZIPPER achieves 93.6× speedup and 147× energy reduction over an Intel Xeon CPU, and 1.56× speedup and 4.85× energy reduction over an NVIDIA V100 GPU on average.

1. INTRODUCTION

Graph neural networks (GNNs) started to gain momentum once researchers began bringing graphs into DNN tasks. By leveraging the end-to-end and hierarchical learning capability of deep learning, as well as the rich structural information of graphs, GNNs achieve better performance in a variety of domains including molecular science [16], recommendation [13, 39], and transportation [7, 8].

To better unleash the power of GNNs via efficient inference execution, we first perform a thorough analysis of the general GNN design space. We find that GNNs consist of diverse primitive operations, including regular, compute-intensive general matrix multiplication operations from DNNs and irregular, memory-intensive graph operations, such as gather and scatter, from traditional graph processing. As such, systems that optimize exclusively for DNNs or traditional graph processing are sub-optimal for GNNs. Meanwhile, the primitive operations can be freely interleaved in general GNN models. As a result, prior GNN accelerators that focus on a particular kind of GNN [2, 22, 24, 34, 37] are not generally applicable.

We then build ZIPPER, an efficient yet general acceleration system for GNNs. The fundamental challenge in accelerating the general domain of GNNs is the semantics gap between the high-level GNN programming model and efficient hardware. Today's GNN programming model [36] is designed to define operations on an input graph as a whole by representing all vertices and edges as tensors. We refer to it as the classic GNN programming model. Such a programming model makes GNNs similar to conventional CNNs and thus friendly to algorithm designers. But it hides the performance-critical graph structures as well as the vertex- and edge-level operations from flexible execution, losing the opportunity of improving system efficiency.

To bridge the semantics gap, we propose a GNN-aware intermediate representation (IR) and the associated compiler, which together automatically extract the graph-specific semantics (e.g., vertex and edge computational graphs) from the classic GNN programming model. The compiler takes a GNN model described in popular GNN frameworks (e.g., DGL [36]) and generates an IR program that is later mapped to the hardware. The IR captures primitive operations from GNNs into an IR program, with its semantics fitting into the hardware structure for efficient execution. More importantly, the informative IR program enables our compiler to perform GNN-specific optimizations such as sparse graph tiling and redundant operation elimination.

Coupled with the compiler, we propose a GNN accelerator architecture to execute GNN IR programs. The accelerator hardware consists of dedicated execution blocks for different primitive operations, along with a run-time scheduler to map an IR program to the hardware blocks. For better performance, the scheduler effectively exploits GNN-specific parallelisms while respecting the dependencies enforced in an IR program. For instance, the scheduler overlaps the execution of tiles (subgraphs) that exercise different hardware resources/blocks to improve hardware utilization.

In our evaluation, we compare ZIPPER with an Intel Xeon E5-2630 v4 CPU and an NVIDIA V100 general-purpose GPU. The experimental results show that ZIPPER achieves 93.6× speedup and 147× energy reduction over the CPU on average. Compared to the GPU, ZIPPER achieves 1.56× speedup and 4.85× energy reduction.

We summarize the main contributions below:
• We perform a thorough characterization on general GNN models today. Our characterizations show that GNN workloads have a mixed set of compute-intensive and memory-intensive operators that do not simultaneously exist in either traditional graph analytics or DNNs.
• We present a GNN IR. It captures primitive operations in GNNs, which is friendly to hardware semantics. The associated compiler automatically converts a GNN model into an IR program while applying GNN-specific optimizations such as tiling and redundant computation elimination.
• We propose an efficient and flexible GNN accelerator architecture. The architecture exploits the parallelisms unique to GNNs to maximize hardware utilization, thereby improving execution efficiency. The architecture is also applicable to a broad domain of GNNs.
• We evaluate ZIPPER with detailed experiments and demonstrate average 93.6× and 1.56× speedup with 147× and 4.85× energy reduction over CPU and GPU, respectively.

Figure 1: One single layer of (a) GCN [25] and (b) GAT [35]. Each layer is drawn as a computation graph over V×F vertex tensors and E×F edge tensors, with nodes labeled by operation type: GOP (Scatter-OutEdge, Scatter-InEdge, Gather-SumDst), GEMM (FC, MV), and ELW (ReLU, LeakyReLU, Exp, +, ×, /).
2. BACKGROUND

GNN models form a large design space, and prior work [15, 37] usually focuses only on specific GNN models (e.g., GCN, or graph convolutional network [25]) and thus lacks general applicability. Since our work targets generic GNN models, this section describes the general design space of GNN models, emphasizing the common computation primitives.

GNNs extend traditional graph processing with the end-to-end learning capability of deep learning, which has led to better accuracies than the prior hand-crafted or intuition-based methods (e.g., DeepWalk [30] and node2vec [18]) in a wide variety of domains including molecular science [16], recommendation [13, 39], and transportation [7, 8].

Similar to conventional DNNs, GNNs are composed of layers, where a layer l takes as input the vertex and edge embedding matrices along with the graph structure in the form of the adjacency matrix, and outputs the new embedding matrices for layer l + 1. Different from DNNs, which may consist of a large number of layers, GNN models only have a few (usually fewer than five) GNN layers.

Capturing GNN Design Space. Owing to the end-to-end learning capability of GNNs, algorithm researchers have explored a wide variety of different GNN models, leading to an enormous model space for GNNs [40].

While the GNN design space is vast, computations in GNNs can be abstracted as two main kinds of operations [36]: 1) graph operations (GOP) and 2) neural-network operations. The latter can be further classified into either general matrix multiplication (GEMM) or element-wise (ELW) operations. These three operations (GOP, GEMM, ELW) cover all forms of computation in GNNs. This abstraction is described in the widely-used GNN library DGL [36], and is also shared by other libraries such as PyG [14] and NeuGraph [27].

We now briefly describe these three operations and show how they are used with concrete GNN models.

Graph Operations. There are two graph operations in GNNs: scatter and gather. These two operations can be considered vectorized versions of the graph propagation operations in traditional graph processing (cf. GAS [17]).
• The scatter operation distributes the embedding of each vertex to its outgoing (or incoming) edges.
• The gather operation collects and reduces the embeddings of all the incoming (or outgoing) edges of each vertex to a fixed-length vertex embedding. The specific reduction function is user-defined, and it aligns the vertices' embeddings for subsequent operations despite their differing numbers of edges.
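To make the two GOPs concrete, the following is a minimal PyTorch-style sketch of scatter and gather-by-sum over a COO edge list; the tensor names and the choice of sum as the reduction are our own illustration, not ZIPPER's implementation:

    import torch

    def scatter(x, edge_index):
        # Scatter: copy each source vertex's embedding onto its outgoing edges.
        # x: [V, F] vertex embeddings; edge_index: [2, E] (source, destination) pairs.
        return x.index_select(dim=0, index=edge_index[0])           # [E, F] edge embeddings

    def gather_sum(msg, edge_index, num_vertices):
        # Gather: reduce (here, sum) the embeddings of all incoming edges per
        # destination vertex, producing one fixed-length embedding per vertex.
        out = torch.zeros(num_vertices, msg.size(1), dtype=msg.dtype)
        return out.index_add(0, edge_index[1], msg)                  # [V, F] vertex embeddings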
General Matrix Multiplication Operations. GNNs use neural network (NN) operations such as MLP [28] and RNN [11] to transform vertex and edge embeddings. These NN-based operations are important to GNNs as they enable the learning ability on the graph. Since there is no dependency among vertices and edges, these NN operations can be performed in parallel, which is essentially general matrix multiplication.

Element-Wise Operations. GNNs also use element-wise (ELW) operations to transform vertex and edge embeddings. Common ELWs include add, exp, and RELU. ELW operations on different vertices and edges can also be executed in parallel. While an ELW operation is less compute-intensive than a GEMM, GNNs can spend a significant portion of time on ELWs owing to their quantity, as we show later.

Examples. We take a layer from two popular GNN models, GCN [25] and GAT [35], respectively, as examples to illustrate how we can express GNNs with such primitive operations. Figure 1 shows each layer's computation graph implemented in the DGL [36] library, where F, V, and E represent the embedding size, vertex number, and edge number of the input graph, respectively. We annotate the three primitive operations in the figure. The GCN model in Figure 1a is relatively simple. In contrast, the GAT model in Figure 1b shows much more complex computation patterns. The mixed operation types also show the complexity and operation diversity of GNNs.
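As a concrete illustration of how the three primitives compose, a GCN-style layer following the Scatter-OutEdge → Gather-SumDst → FC → ReLU structure of Figure 1a could be sketched as below (PyTorch-style pseudocode; the scatter and gather_sum helpers are the illustrative ones sketched above, and GCN's normalization terms are omitted):

    def gcn_layer(x, edge_index, weight, num_vertices):
        msg = scatter(x, edge_index)                       # GOP:  Scatter-OutEdge, [E, F]
        agg = gather_sum(msg, edge_index, num_vertices)    # GOP:  Gather-SumDst,   [V, F]
        out = agg.matmul(weight)                           # GEMM: FC,              [V, F']
        return out.relu()                                  # ELW:  ReLU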
3. MOTIVATION

A GNN model is by nature a flexible combination of DNNs and graph processing. The diverse characteristics of DNNs and graph processing introduce great expressiveness but also make it challenging to accelerate the computation. In this section, we first study the characteristics of GNN workloads against the well-studied DNN models and traditional graph processing algorithms on the GPU architecture. We find that the existing solutions suffer from inefficiency problems due to GNNs' diverse characteristics. We then study the root cause at both the architecture and software levels, and propose a solution with co-design of software and hardware.
Figure 2: The total memory usage (GB), broken down into workspace, feature, weight, and graph data, for PR, GAT, and SAGE on the CP, SL, and EO datasets, and for VGG16 and RN50 at batch size 256. The workspace refers to the intermediate data during the computation. GNNs (GAT and SAGE) require a much larger workspace and thus could not process large graphs (e.g., EO, which exceeds the 32 GB GPU memory).

3.1 GNN Workload Characterization

We compare and contrast GNNs against DNNs as well as traditional graph processing algorithms to understand GNNs' common and unique properties. Specifically, we start our study from two representative GNN models, GAT [35] and GraphSAGE (SAGE) [20], two DNN models, VGG [33] and ResNet (RN) [21], and a traditional graph processing workload, PageRank (PR) [5]. GNNs and the traditional graph processing algorithm are evaluated on three datasets (Table 3); DNNs are evaluated on ImageNet [10] with a batch size of 256.

We identify two opportunities for optimizing GNNs.

Observation 1: GNNs exhibit much higher memory usage than DNNs and traditional graph processing. As a result, it is hard to efficiently scale GNNs to large graphs.

Specifically, Figure 2 compares the memory usage of different algorithms on an NVIDIA V100 GPU with 32 GB memory. On the dataset SL, the GNN GraphSAGE uses 16.3 GB of GPU memory, while PageRank on the same dataset uses only 3.7 GB and VGG16 with ImageNet under batch size 256 uses 6.9 GB. We also observe a similar trend on dataset CP.

The excessive memory usage prevents GNN models from processing large graphs. Figure 2 shows that when using a large graph EO, which consists of 10.5× more vertices and 1.2× more edges than SL, both GNNs, GAT and GraphSAGE, run into the out-of-memory issue.

To understand the excessive memory usage, we further break down the memory usage into four components: the graph data, the weight matrices, the input/output feature embeddings, and the workspace, which refers to the intermediate data between operations. We observe that GNNs use a significant portion of memory for storing the intermediate data. This is fundamental because classic GNN systems operate on an entire graph within one operation.

Observation 2: Due to the irregular nature of graphs, similar to traditional graph processing, GNNs also show lower hardware utilization than CNNs. However, GNNs mix diverse primitive operations, which require different types of hardware resources. This provides a unique optimization opportunity: redistributing the hardware resources while overlapping different operations.

Primitive Operation Diversity. Figure 3 plots how the single-precision FLOP efficiency and the DRAM bandwidth utilization change over time for the four algorithms, all on a V100 GPU. These two metrics capture key computational and memory behaviors of an algorithm. The data is obtained from one iteration of PageRank and a layer in the GNN and DNN models. At the top of each figure, we annotate which primitive operation (i.e., GEMM, ELW, and GOP in Section 2) dominates an algorithm at a given moment in time.

GNNs exhibit a much more diverse mix of primitive operations than traditional graph processing and CNNs. GOP dominates the execution of PageRank, while GEMM and ELW dominate the execution of VGG. In contrast, all three primitive operations exist in GNNs; the interleaving of the three operations varies by GNN model.

It is worth noting that prior GNN accelerators [15, 22, 24, 37, 41] target only one particular type of GNN called GCN (graph convolutional network), which has a fixed primitive operation interleaving of GOP-GEMM-ELW as shown in Figure 1a. A general GNN can be much more flexible than GCN in model structure, as the GAT in Figure 1b shows, with much more complicated primitive operation mixing and interleaving.

Low Hardware Utilization. Because of the primitive operation diversity, GNNs tend to have lower hardware utilization than CNNs. As a result, accelerators built for CNNs are ill-suited for GNNs. We annotate in Figure 3 the average FLOP efficiency and DRAM bandwidth utilization for each algorithm. The FLOP efficiency on both GNNs is at least 35% lower than that of VGG; the DRAM bandwidth utilization of both GNNs is also lower than that of VGG.

A close examination of the GNN execution shows the reason. CNNs primarily rely on GEMM, which has high FLOP efficiency and high DRAM bandwidth utilization due to its regular compute and memory access patterns. GNNs mix regular GEMM kernels and irregular GOPs used in traditional graph processing (e.g., PR), which have low FLOP efficiency and DRAM bandwidth utilization due to their irregular compute and memory access patterns. As a result, GNNs tend to have lower hardware utilization than CNNs.

3.2 Inter-tile Pipelining

Our main idea is to exploit the operation diversity in GNNs and pipeline the operations to reduce the memory footprint while improving hardware utilization. Figure 4 illustrates the idea on a simple GNN with three primitive operations.

Figure 4a illustrates how GNNs are executed on today's systems, where the three operations are sequentially executed as three serialized stages, each operating on the entire graph. The intermediate data between stages encodes information for the entire graph, leading to a large memory footprint. In addition, the serialized execution leads to long execution times with low hardware resource utilization.

A common strategy to reduce the memory footprint is graph tiling [42], which partitions a graph into smaller subgraphs, a.k.a. tiles, and operates on each tile separately. Figure 4b illustrates such an idea, where the entire graph is divided into three tiles. The three tiles are processed sequentially, which reduces the memory footprint since at any given moment only a small subgraph is resident in memory.

However, this strategy degrades performance due to the bookkeeping overhead such as setting up and switching tiles. In addition, it does not address the low hardware utilization, as at any given moment only one operation is executed.

We propose to pipeline across tiles, as illustrated in Figure 4c, which retains the advantage of low memory footprint while significantly improving the performance.

Figure 3: The metric kernel traces (single-precision FLOP efficiency and DRAM bandwidth utilization, in percent, over time) of one layer or iteration of PageRank (SL), VGG16-Conv0 (batch 256), GAT (SL), and SAGE (SL). We also annotate the different phases (light element-wise, dense matrix, and sparse graph) and the mean values: PageRank (FLOP 0.11, DRAM 34.76), VGG16-Conv0 (FLOP 70.48, DRAM 66.66), GAT (FLOP 7.39, DRAM 51.68), SAGE (FLOP 17.30, DRAM 60.75).

Figure 4: An illustration of the benefits of inter-tile pipelined execution. (a) Non-tiling execution with the peak memory usage of 8 GB; (b) tiling-based execution, where the graph is partitioned into three tiles and the memory usage is 3 GB; (c) inter-tile pipelined execution, where two tiles are pipelined at the operation level with no more than 6 GB memory usage.

Figure 5: The unawareness of graph semantics in current GNN frameworks: DGL [36] (left) and PyG [14] (right).

DGL:
    x = x.matmul(self.lin).view(-1, n_heads, dim_out)
    al = (x * self.att_l).sum(dim=-1).unsqueeze(-1)
    ar = (x * self.att_r).sum(dim=-1).unsqueeze(-1)    # Vertex OPs
    g.srcdata.update({'x': x, 'al': al})
    g.dstdata.update({'ar': ar})
    g.apply_edges(fn.u_add_v('al', 'ar', 'alpha'))
    e = leaky_relu(g.edata.pop('alpha'), negative_slope)
    g.edata['alpha'] = edge_softmax(g, e)
    g.update_all(fn.u_mul_e('x', 'alpha', 'msg'),
                 fn.sum('msg', 'x'))
    out = g.dstdata['x']                               # Edge OPs
    out = out + self.bias.view(-1, n_heads, dim_out)
    out = activation(out)
    out = out.flatten(1)                               # Vertex OPs

PyG:
    x = x.matmul(self.lin).view(-1, n_heads, dim_out)
    al = (x * self.att_l).sum(dim=-1)
    ar = (x * self.att_r).sum(dim=-1)                  # Vertex OPs
    xi = x.index_select(dim=0, index=edge_list[0])
    ai = ar.index_select(dim=0, index=edge_list[1])
    aj = al.index_select(dim=0, index=edge_list[0])
    alpha = ai + aj
    alpha = leaky_relu(alpha, negative_slope)
    alpha = softmax(alpha, edge_list)
    msg = xi * alpha.unsqueeze(-1)                     # Edge OPs
    out = msg.scatter(edge_list[1], reduce='sum', dim=0)
    out = out.view(-1, n_heads * dim_out)
    out += self.bias
    out = activation(out)                              # Vertex OPs
By pipelining across tiles, operations of different tiles are overlapped and executed at the same time. Overlapping operations exercise different resources, improving the overall resource utilization and therefore leading to better performance.
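For intuition only, the following toy Python sketch mimics the inter-tile pipeline of Figure 4c in software: each primitive operation is a stage with its own worker, so OP1 of one tile can overlap with OP2 of the previous tile. In ZIPPER this overlap is realized by dedicated hardware blocks and a run-time scheduler (Section 4), not by threads:

    import queue, threading

    def pipelined_tiles(tiles, stages):
        # stages: ordered list of callables, e.g. [op1, op2, op3], one per primitive operation.
        qs = [queue.Queue() for _ in range(len(stages) + 1)]

        def worker(stage, q_in, q_out):
            while True:
                tile = q_in.get()
                if tile is None:          # sentinel: shut down this stage and notify the next
                    q_out.put(None)
                    return
                q_out.put(stage(tile))    # different stages process different tiles concurrently

        workers = [threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
                   for i, s in enumerate(stages)]
        for w in workers:
            w.start()
        for tile in tiles:                # feed tiles one by one into the pipeline
            qs[0].put(tile)
        qs[0].put(None)
        outputs = []
        while (out := qs[-1].get()) is not None:
            outputs.append(out)
        for w in workers:
            w.join()
        return outputs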
3.3 Challenges of Inter-tile Pipelining

Applying tile-level pipelining to GNNs is challenging for two reasons. First, there is a semantics gap between the classic GNN programming model and efficient hardware execution. In particular, tile-level pipelining requires us to identify the operations associated with each tile, including its vertices and edges. But the classic GNN programming model is designed to define operations on a graph as a whole without exposing vertex- and edge-level operations. This is accomplished by representing all vertices and edges as tensors. This programming model thus expresses GNN execution as tensor computations, similar to conventional CNNs. Figure 5 shows such an example from the GNN libraries in PyTorch, where the annotations mark how the vertices and edges are represented as tensors and the GNN execution is expressed as tensor computation without exposing graph semantics.

While this programming model makes GNNs similar to conventional CNNs and is thus friendly to GNN algorithm designers, it also hides the vertex- and edge-level details that are vital for efficient hardware execution.

Second, GNNs adopt a wide variety of different operations. Therefore, one must flexibly schedule a GNN onto the hardware. For instance, the GCN and the GAT in Figure 1 are drastically different. A static, fixed mapping from a GNN to the hardware is likely suboptimal in utilizing the hardware.

4. ZIPPER OVERVIEW

In this work, we propose a hardware and software co-designed system that exploits tile- and operator-level parallelism to provide efficient and scalable support for generic GNN acceleration. ZIPPER overcomes the challenges of inter-tile pipelining in Section 3.3 through a combination of software GNN compiling and hardware GNN architecture. Figure 6 shows an overview of the ZIPPER system.

First, ZIPPER proposes a GNN intermediate representation (IR) and the associated compiler, which together automatically extract the graph-specific semantics (e.g., vertex and edge computation graphs) from the classic GNN programming model. The IR is closer to the hardware, and enables an efficient hardware accelerator design and scheduling.

Second, ZIPPER proposes an accelerator architecture for executing GNN models represented as IR programs. The key to the hardware is to be flexible enough to accommodate different types and mixes of the primitive operations while being efficient by exploiting GNN-specific parallelisms and locality. The hardware achieves this by employing dedicated blocks for each primitive operation coupled with an efficient run-time scheduler, which maps GNN IR programs to the hardware substrate.

IR and Compiler. The proposed IR is structured as multiple directed acyclic graphs (DAGs) extracted from a GNN model. Each graph is labeled as a vertex segment or an edge segment, where the DAG nodes are the GNN operations for a single vertex or edge while the DAG edges are the data of the vertex or edge. The IRs are meant to be used by the compiler, which compiles the GNN model into a low- and tile-level program consisting of three functions for the source vertices of tiles (sFunction), the edges of the tiles (eFunction), and the destination vertices of partitions (dFunction), respectively, under the tiling-based execution semantics, to specify the tile data dimensions as well as the interactions between the vertices and edges. We refer to the three functions as SDE functions for simplicity.
Figure 6: The ZIPPER system overview. (a) GNN IR and Compiler: the current GNN model is lowered through the IR into a tile-level program of sFunction, eFunction, and dFunction. (b) Graph Tiling: the graph adjacency matrix is split into destination partitions and tiles given a tile size. (c) Hybrid Architecture: a scheduler and dispatcher drive a dStream and multiple s/eStreams over matrix units, vector units, a tile hub, and the embedding memory in a pipelined fashion, with off-chip access to the graph structure and vertex/edge embeddings.

Hardware Architecture. The hardware consists of two main components: the building blocks that support the execution of the various primitive operations, and a scheduler that dispatches GNN tiles to the hardware blocks.

The hardware blocks consist of both generic matrix and vector units to support GEMM and ELW operations, respectively, as well as graph-specific structures that are optimized for GOP. The scheduler generates independent work units for the source vertices and edges in tiles and the destination vertices in partitions, which we call sStreams, eStreams, and dStreams. The streams are then pipelined efficiently across the hardware blocks to achieve high utilization.

5. GNN PARALLELIZATION

In order to exploit the inter-tile pipelined execution opportunity, we first describe our graph tiling method and a generally applicable strategy to parallelize GNN models.

5.1 Graph Tiling

The fundamental building block in our proposed system is graph tiling (also called partitioning or sharding), which divides an input graph into smaller sub-graphs, a.k.a. tiles. The benefits of graph tiling are two-fold: easing the pressure of the excessive memory footprint of GNN computation and exposing the tile-level parallelism that ZIPPER leverages (Figure 4).

We adopt a tiling strategy called grid-based or regular tiling [24, 27, 42]. The idea is to divide the graph adjacency matrix into multiple smaller rectangles as tiles. Figure 7b illustrates the grid-based tiling example. Formally, we first split the vertices evenly into several destination partitions according to their vertex IDs. For each destination partition, we further split its vertices into several source partitions according to their vertex IDs. As a result, each tile corresponds to exactly one destination and one source partition and uniquely identifies the set of edges whose source and destination vertices are in the corresponding partitions.

Owing to the sparsity of the graph, the adjacency matrix of a tile, which indicates the connections between edges and vertices, is stored in a sparse format such as an edge list (COO) or compressed sparse column (CSC) to save storage [6], while tile metadata such as the edge and vertex counts are stored in a dense array.
cessive memory footprint of GNN computation and exposing                                                                           sStreams and eStreams, processing the source vertices and
tile-level parallelism that Z IPPER leverages (Figure 4).                                                                          edges in tiles respectively, and the latter as dStream for the
    We adopt a tiling strategy called grid-based or regular                                                                        destination vertices in partitions. The reason that the number
tiling [24, 27, 42]. The idea is to divide the graph adjacency                                                                     of s/eStreams is greater than the number of dStream is that
matrix into multiple smaller rectangles as tiles. Figure 7b                                                                        there can be many tiles for the same partition.
illustrates the grid-based tiling example. Formally, we first                                                                         Figure 6c-bottom shows an example of streams. The
split the vertices evenly into several destination partitions ac-                                                                  dStream first processes the embedding of a partition until
cording to their vertex IDs. For each destination partition, we                                                                    it reaches the signal and wait instructions. The signal
further split its vertices into several source partitions accord-                                                                  instruction wakes up two sStreams, which starts to load and
ing to their vertex IDs. As a result, each tile corresponds to                                                                     process a tile under the partition. The sStream ends up with a
exactly one destination and one source partition and uniquely                                                                      signal to wake up an eStreams and continue the tile process.
identifies a set of edges whose source and destination vertices                                                                    When the eStream reaches the end, it fetches the metadata
are in the corresponding partitions.                                                                                               of the next tile and checks whether the destination partition
    Owing to the sparsity in the graph, the adjacency matrix                                                                       is the current one. If true, the eStream starts a new round of
for the tile that indicates connections between edges and                                                                          s/eStreams through the signal and wait instruction; or it re-
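To make the tile construction concrete, the following sketch assigns edges to tiles purely by vertex-ID ranges. The function name and partition counts are illustrative placeholders, not ZIPPER's actual implementation.

    # A minimal sketch of grid-based (regular) tiling, assuming the graph is
    # given as a COO edge list. Names and partition counts are illustrative.
    from collections import defaultdict

    def grid_tile(edges, num_vertices, num_dst_parts, num_src_parts):
        """Map every (src, dst) edge to its (dst_partition, src_partition) tile."""
        dst_span = -(-num_vertices // num_dst_parts)   # ceiling division
        src_span = -(-num_vertices // num_src_parts)
        tiles = defaultdict(list)                      # (dst_part, src_part) -> edges
        for src, dst in edges:
            tiles[(dst // dst_span, src // src_span)].append((src, dst))
        return tiles

    # 12 vertices split into 3 destination and 3 source partitions, as in Figure 7a.
    tiles = grid_tile([(0, 3), (4, 1), (9, 10)], num_vertices=12,
                      num_dst_parts=3, num_src_parts=3)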

5.2    Multi-streamed Execution
   Based on the above grid-based graph tiling, we propose a generally applicable parallel execution mechanism for GNN models, which are combinations of GEMM, ELW, and GOP operations.
   We first need a multi-streamed parallel execution mechanism that supports fine-grained inter-stream synchronization via signal-wait pairs. The idea is to map the computation of concurrent tiles to different streams and to insert proper synchronization instructions to maintain the dependencies of the original GNN model.
   The key to this mapping is the Gather operation. This operation is applied to all the tiles under the same partition to reduce the embeddings of all the edges and source vertices for each destination vertex. As such, all the operations that depend on the result of the Gather operation need to wait for all tiles in the same partition. In the meantime, the operations that do not depend on the Gather operation can be executed in parallel.
   Following the above principle, we propose to use multiple streams for the concurrent tile computation and a single stream for the partition computation. We call the former sStreams and eStreams, which process the source vertices and edges in tiles, respectively, and the latter the dStream, which processes the destination vertices in partitions. The number of s/eStreams is greater than the number of dStreams because there can be many tiles for the same partition.
   Figure 6c-bottom shows an example of streams. The dStream first processes the embedding of a partition until it reaches the signal and wait instructions. The signal instruction wakes up two sStreams, which start to load and process a tile under the partition. Each sStream ends with a signal to wake up an eStream and continue the tile processing. When an eStream reaches the end, it fetches the metadata of the next tile and checks whether its destination partition is the current one. If so, the eStream starts a new round of s/eStreams through the signal and wait instructions; otherwise it resumes the dStream. When the dStream continues, it finishes the partition processing and fetches the next partition.
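The sketch below mimics the signal-wait handshake between a dStream and its per-tile s/eStreams with ordinary threading primitives; the stream and tile placeholders are hypothetical stand-ins for the hardware streams described above, not ZIPPER's implementation.

    # A simplified software analogue of the signal/wait pairing between streams,
    # assuming one dStream per partition and one worker thread per tile.
    import threading

    def load_source_vertices(tile): pass      # placeholder for the sStream work
    def process_edges(tile): pass             # placeholder for the eStream work (Scatter/ELW)
    def reduce_destination_vertices(p): pass  # placeholder for the partition-level Gather

    def run_partition(partition, tiles):
        tiles_done = threading.Barrier(len(tiles) + 1)   # dStream waits on all tiles

        def tile_worker(tile):                 # one worker per tile, i.e., an s/eStream pair
            load_source_vertices(tile)
            process_edges(tile)
            tiles_done.wait()                  # "signal" back to the dStream

        for tile in tiles:                     # "signal": wake up the tile streams
            threading.Thread(target=tile_worker, args=(tile,)).start()
        tiles_done.wait()                      # "wait": the Gather needs every tile
        reduce_destination_vertices(partition) # dStream resumes and finishes the partition

    run_partition(partition=0, tiles=[1, 2, 3])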
[Figure 7: Comparison of different graph tiling methods on a 12-vertex example: (a) regular graph tiling (source vertex load = 24), (b) sparse graph tiling (source vertex load = 16), (c) sparse tiling + reordering (source vertex load = 14). The vertices in (c) are arranged in descending order of their in-degrees. The maximum number of source and destination vertices in a tile is 4 and 6, respectively.]
5.3    Tile Parameter Optimization
   Although the tiling-based pipelining reduces the memory footprint of the GNN computation, a large amount of tile vertex data is loaded and computed without contributing to the final results because of the graph sparsity. Figure 7a illustrates this problem, where a source vertex without any edge in a tile causes redundant memory accesses and computation. For example, in tile 1 (the top-left one), vertex 3 is included in the tile and thus loaded to the on-chip memory. However, this vertex will only be computed as an edge source, and the result will not be propagated to any edges or destination vertices because it has no associated edges in this tile. So the on-chip memory load, as well as the associated computation, of such vertices is unnecessary in this tile.
Sparse Tiling. We use the sparse graph tiling approach [27] that embraces the graph sparsity to remove the above unnecessary processing. As illustrated in Figure 7b, only the source vertices that have associated edges in the tile are kept. Traditional graph processing systems, e.g. [31], usually prefer regular tiling to sparse tiling because their vertex/edge features are scalars: applying sparse tiling would turn efficient sequential off-chip memory accesses into much more expensive random accesses to small scalars, leading to lower overall performance even though unnecessary vertices are skipped. However, the vertex/edge features in GNNs are high-dimensional embeddings, much larger than scalars; each of them is translated into multiple off-chip memory transactions, which can match the bandwidth of sequential access while avoiding the unnecessary processing.
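As a concrete illustration of the idea (not ZIPPER's on-chip format), the sketch below keeps, for every tile, only the source vertices that actually appear in that tile's edge list, and reports the resulting source vertex load.

    # A minimal sketch of sparse tiling on top of the grid_tile() output above:
    # each tile records only the source vertices that have at least one edge in it.
    def sparse_tiles(tiles):
        kept = {}
        for key, edge_list in tiles.items():
            kept[key] = sorted({src for src, _ in edge_list})   # deduplicated sources
        return kept

    def source_vertex_load(kept):
        # Total number of source-vertex embeddings that must be loaded on-chip.
        return sum(len(srcs) for srcs in kept.values())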
Graph Reordering. The improvement of the sparse tiling is still limited because of the random vertex distribution. As shown in Figure 7b, we can observe that most source vertices have only a few (mostly one) edges to the destination partitions in a tile, which is fewer than their out-degrees. As such, there is an opportunity to gather the out-edges of a source vertex into a few tiles to further reduce the redundancy and improve vertex data reuse.
   We leverage graph reordering to fulfill this opportunity, which requires minimal modifications to the existing system design. There are lines of sophisticated graph reordering methods in traditional graph processing, but only the lightweight ones are effective there, considering the limited performance improvement [4, 12]. In contrast, applying reordering to GNNs can lead to considerable performance improvement, because the data volume and computation for a vertex are much larger than in traditional graph processing. To illustrate the effectiveness of reordering for GNNs, we use a heuristic Degree Sorting strategy that reorders the vertices according to their in-degrees, as shown in Figure 7c. The vertices with high in-degrees are arranged to the left side so that there are more blank rows on the right side for the sparse tiling. As a result, the total source vertex load is reduced.
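A sketch of the Degree Sorting heuristic follows: vertices are relabeled in descending order of in-degree before tiling, so high-degree vertices cluster into the same source partitions. The helper names are illustrative.

    # A minimal sketch of Degree Sorting: relabel vertices by descending in-degree,
    # then tile the relabeled edge list (e.g., with grid_tile + sparse_tiles above).
    from collections import Counter

    def degree_sort(edges, num_vertices):
        in_deg = Counter(dst for _, dst in edges)
        order = sorted(range(num_vertices), key=lambda v: -in_deg[v])
        new_id = {old: new for new, old in enumerate(order)}   # old ID -> new ID
        return [(new_id[s], new_id[d]) for s, d in edges], new_id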

Table 1: An example of the Graph-Native GNN IR operations.

  Operation Type          Examples                                       Operands
  Computational           GEMM (matmul, conv), ELW (add, sub, mul,       a single edge / vertex
                          div, mv, relu, exp)
  Communicational (GOP)   Scatter (sendOutEdge-recvSrc,                  a communication index
                          sendInEdge-recvDst), Gather                    and an edge / vertex
                          (sendDstSum-recvInEdge)
  Entry & Exit            Indicator (input, output)                      a string and an edge / vertex

6.    GNN IR AND COMPILING
   ZIPPER automatically compiles a high-level GNN model expressed in a classic GNN programming framework (e.g., DGL [36]) to the tile-level program that is scheduled and executed as streams on our hardware substrate. As we have explained earlier, the classic GNN programming model is centered around tensor computation and does not expose edge- and vertex-level operations. To bridge the semantics gap, we propose a graph-semantics-preserving IR that is used by the compiler to generate the SDE functions mentioned in Section 5. An end-to-end illustration of the compiling process is depicted in Figure 8. We first introduce the IR together with the corresponding compiling process, and then discuss performance optimizations based on the IR.

6.1    Graph-Native GNN IR
   We propose the Graph-Native GNN IR, which contains multiple computational DAG segments. Each segment is labeled as an edge or vertex segment and consists of IR operations as nodes that operate on the data of a single edge or vertex. Figure 8b shows two examples.
   Table 1 lists a subset of the IR operations in ZIPPER, which are designed to represent all three types of primitive operations (Section 2). In particular, the computational operations correspond to the GEMM and ELW operations in general DNNs, while the communicational operations correspond to the GOPs that exchange data between vertices and edges. All the IR operations target only one item (i.e., a single edge or vertex) for graph-semantic atomicity. The IR connects to the high-level programming model by using computational graphs and operations compatible with those of DNNs, while it recovers the graph semantics by operating on a single edge or vertex with decoupled computational graphs.
Compiler. We build a compiler to automatically translate existing GNNs expressed via the high-level programming model into our proposed IR. The compiler then uses the generated IR to produce the SDE functions, which are eventually mapped to the hardware streams to exploit tile-level pipelining.
[Figure 8: Compiling process from a high-level GNN programming model (PyG [14]) to the functions for tiling-based execution: (a) an example of current GNN code, (b) graph-native GNN IR and the edge-to-vertex optimization, (c) tiling-based code generation into sFunction, dFunction, and eFunction instructions.]
For a given GNN model, the compiler generates the functions in three steps, as illustrated in Figure 8.
Step 1: Constructing the IR with graph semantics. The compiler extracts a generalized GNN computational graph to capture the nature of the given GNN model. Specifically, it first acquires the raw computational graph from standard DNN programming frameworks such as TensorFlow and PyTorch, and then defuses and replaces the library-customized GOPs with atomic ones (i.e., Scatters and Gathers). We achieve that by maintaining a list of such operations according to the library implementations, for example apply_edges / update_all in DGL and Scatter in PyG.
   Afterward, the compiler splits the computational graph into multiple segments as the graph-native IR to reveal the graph semantics. Because the tensor types (e.g., tensor for edge or vertex) are changed only by the GOPs, the compiler simply splits the model by replacing each GOP with a pair of IR communicational operations, send and recv, as annotated by the red solid boxes in Figure 8. The output is multiple disconnected GNN model segments, and the compiler then labels the segments with graph semantics based on the communicational operations. For example, a segment containing sendOutEdge will be labeled as IR.v.x, where x denotes the segment index. The compiler finally maps the remaining computational operations to the corresponding single-item operations (i.e., operations for a single edge or vertex). It also inserts input and output markers for the whole IR as the entry and exit indicators to finish the IR construction.
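The splitting rule in Step 1 can be sketched as a small graph pass: walk the operator list and cut it at every GOP, emitting a send in the producing segment and a recv in the consuming one. The node representation below is an illustrative simplification, not the compiler's actual data structures.

    # A minimal sketch of splitting a GNN computational graph at its GOPs.
    def split_at_gops(ops):
        """ops: list of (name, kind, inputs) in topological order; kinds: 'compute'/'gop'."""
        segments, current = [], []
        for name, kind, inputs in ops:
            if kind == 'gop':                        # cut here: GOP -> send / recv pair
                current.append(('send_' + name, 'send', inputs))
                segments.append(current)             # close the producing segment
                current = [('recv_' + name, 'recv', [])]
            else:
                current.append((name, kind, inputs))
        segments.append(current)
        return segments                              # later labeled IR.v.x / IR.e.x by semantics

    segs = split_at_gops([
        ('x1', 'compute', ['x']),
        ('scatter0', 'gop', ['x1']),                 # becomes a send/recv pair at the cut
        ('msg', 'compute', ['recv_scatter0']),
    ])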
Step 2: Optimizing the structure of the IR. The proposed graph-native GNN IR not only supports the existing optimizations in deep learning, but also enables GNN-specific optimizations. The nodes and edges in our IR segments represent operators and feature embeddings, which is highly similar to a deep learning computational graph. So the existing optimizations in deep learning, e.g., subgraph substitution and operator fusion, are fully applicable to our IR. Besides, the supplementary graph semantics open up the space to explore GNN-specific optimizations, which involve both the computational graph and the input graph. We propose such an optimization based on our IR and detail it in Section 6.2.
Step 3: Generating SDE function code. Before translating the optimized IR into functions of instruction sequences, we first adapt the proposed general IR according to the low-level hardware execution model. In this paper, we target the tiling-based execution model (Section 5.2), so we need independent and separate sFunctions and dFunctions for the processing of source and destination vertices under a tile and a partition, respectively. We obtain the functions by further dividing the vertex segments into source and destination parts, as illustrated in Figure 8c. The compiler first replicates the vertex segments, and then prunes the operations unrelated to the source or destination semantics from each replica.
   Given the adapted IR, we generate the SDE functions of instructions using the hardware ISA that we describe later. The source, destination, and edge segments target the sFunction, dFunction, and eFunction, respectively. For each segment, we sort it topologically from the input markers. The input and output markers are translated into data-transfer instructions; the computational and send operations correspond to computational instructions; the recv operations are regarded as barriers and translated into synchronization instructions to ensure the multi-stream execution semantics.
   Since the proposed IR can consist of multiple disconnected segments, as in Figure 8, the sorting can deadlock, which indicates an interaction between segments. For a deadlock in a destination segment, the compiler inserts the update, signal, and wait instructions, and resumes the sorting by removing the recv operations in the next source or destination segment; for a deadlock at the end of a source segment, the compiler inserts the signal and wait instructions, and resumes the sorting by removing the recvs in an edge segment; for a deadlock at the end of an edge segment, the compiler checks whether there are still unvisited segments: if true, it simply removes the recvs in one unvisited segment and resumes the sorting; otherwise it ends the IR translation to finish the function generation.
Instruction Set Architecture (ISA). We propose the ZIPPER ISA with three types of instructions: computational, data-transfer, and synchronization instructions, as shown in Table 2. The computational instructions cover all the GEMM, ELW, and GOP operations that appear in the model computation. Each of these instructions is coarse-grained and operates on all the edges or vertices in a tile to improve performance. The ELW instructions also have matrix and vector versions aiming at different use cases. The data-transfer instructions are designed for loading and storing the embeddings of the edges and vertices. They are also coarse-grained at the tile level and can be further divided into multiple off-chip memory transactions according to the memory protocol. The synchronization instructions perform inter-stream synchronization and ensure the correct execution order.
Table 2: The details of the ISA for the ZIPPER architecture.

  Instruction Type        Examples                                                     Operands
  Computational / ELW     arithmetics (ADD, SUB, MUL, DIV), special functions          data dimensions, source and destination
                          (EXP, RELU), matrix-vector multiplication (GEMV)            addresses of embedding memory
  Computational / GEMM    general matrix multiplication (GEMM), index-guided           tile ID, data dimensions, source and
                          batched matrix multiplication (BMM)                          destination addresses of embedding memory
  Computational / GOP     edge gather (GTHR.DST.SUM, GTHR.DST.MAX),                    tile ID, data dimensions, source and
                          vertex scatter (SCTR.OUTE, SCTR.INE)                         destination addresses of embedding memory
  Data-Transfer           off-chip memory load (LD.DST, LD.SRC, LD.EDGE),              tile ID, data dimensions, destination
                          store (ST.DST)                                               address of embedding memory
  Synchronization         stream wakeup (SIGNAL.E), fetch new tile / partition ID      stream ID
                          (FCH.TILE, FCH.PTT), update (UPD.PTT), check (CHK.PTT)

6.2    IR-Based Compiling Optimization
   The graph-native IR also provides a wide optimization space for GNN models in addition to enabling low-level code generation. Optimizations from both traditional deep learning and graph processing, such as graph substitution and operator fusion, as well as GNN-specific optimizations, can be automatically applied by the compiler through the IR. We propose the edge-to-vertex (E2V) optimization to demonstrate how the IR can be used for GNN model optimization.
   The idea of E2V is to move an operation on edges to the vertices if the operation's input only involves the source or the destination of the edges. For example, we can move the two matrix-vector multiplications (MVs) to the vertex segment in Figure 8b as the edge-to-vertex optimization. Because one vertex relates to multiple edges by nature, the data scattered to those edges is the same as the source or destination vertex data. If an operation applied to the edges uses only the source or the destination data, its results are also identical across those edges, so the operation causes redundant computation.
   To detect and eliminate such redundancy, we apply the operation before the data are scattered to the edges. We first detect the redundancy by traversing and examining the operations in the edge segment starting from the recvs. Operations that use only the source or only the destination are enqueued. When an operation does not satisfy the condition, we move the operations in the queue ahead of the corresponding send in the vertex segment. Finally, we insert extra send-recv pairs for the moved operations whose results are still used in the edge segment, and restart the examination from the next recv. In fact, the E2V optimization could also be applied to the original computational graph, but that would complicate the implementation and incur several unnecessary traversals of the whole graph.
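The detection pass can be sketched as a scan over each edge segment: operations whose inputs depend only on recvSrc (or only on recvDst) values are queued for hoisting ahead of the corresponding send in the vertex segment. The op tuples and origin labels below are illustrative, not the compiler's real IR classes.

    # A minimal sketch of the edge-to-vertex (E2V) detection pass on one edge segment.
    # Each op is (name, inputs); 'recv_origin' maps a recv'd value to 'src' or 'dst'.
    def e2v_candidates(edge_ops, recv_origin):
        origin = dict(recv_origin)          # e.g., {'xi': 'src', 'xj': 'dst'}
        hoist = []                          # ops that can move to a vertex segment
        for name, inputs in edge_ops:
            sources = {origin.get(i, 'mixed') for i in inputs}
            if sources == {'src'} or sources == {'dst'}:
                origin[name] = sources.pop()
                hoist.append(name)          # later placed ahead of the matching send
            else:
                origin[name] = 'mixed'      # this op truly needs per-edge data
        return hoist

    # For the edge segment of Figure 8b: ai and aj each use only one endpoint,
    # so both MVs can be hoisted; alpha mixes src and dst and must stay on edges.
    print(e2v_candidates(
        [('ai', ['xi']), ('aj', ['xj']), ('alpha', ['ai', 'aj']), ('msg', ['xi', 'alpha'])],
        {'xi': 'src', 'xj': 'dst'}))        # -> ['ai', 'aj']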
7.    HARDWARE ARCHITECTURE
   To support the diverse primitive operations and the inter-tile pipelining in GNN computation, we design a flexible hardware substrate, as shown in Figure 6. Generally, we deploy two types of computing units for the primitive operations and launch multiple streams concurrently for the inter-tile pipelining, owing to the interleaved operations in GNN computation that use different computation resources. The streams are created and maintained by a hardware scheduler, which feeds the instructions of a ready stream to the downstream dispatcher. The dispatcher decodes and issues each incoming instruction to the target components for execution. We also design a tile hub and a unified memory for storing the edge lists and embeddings of the tiles being processed by the streams.

7.1    Hardware Component
   In the ZIPPER architecture, we deploy a set of computing units and various memory structures. We later perform a design space exploration (i.e., of the number of computing units and the memory structure parameters) to justify our choices.
Computing Unit. Two types of computing units are deployed for the different primitive operations (Section 3.1): the Matrix Unit (MU) and the Vector Unit (VU). The MU is a single systolic array including a weight buffer for the GEMMs. It executes with an output-stationary dataflow [9] where the input embeddings and weights are fed to the unit at the same time. The VU is a group of single-instruction-multiple-data (SIMD) cores for the ELWs and GOPs. The reason for offloading GOPs to the VU instead of another dedicated component is that the atomic operations in a GOP are also element-wise, with only the operands determined by the tile edge list. For executing GOPs, each core is responsible for scattering or gathering one vertex in the tile at a time and fetches the corresponding part of the edge list. Both the MUs and VUs can be instantiated with multiple instances to increase parallelism.
Memory. The on-chip memory consists of two parts: 1) a large unified embedding memory (UEM) for storing the input, intermediate, and output embeddings of edges and vertices, and 2) a small tile hub (TH) for the graph tiles, containing the edge list and other metadata such as the edge and vertex counts of the tile. Owing to the large size of the embeddings, we use eDRAM as the UEM. The eDRAM has multiple banks and connects directly with all the computing units to support the multi-streamed parallel execution. For the TH, we use a small dedicated on-chip SRAM, since the edge lists are accessed more frequently and randomly. New data are loaded to the TH when an eStream finishes a tile. For the interaction between the on-chip and off-chip memory, we design a memory controller that responds to the data-transfer instructions, where a vertex (or tile) request is converted into off-chip memory transactions according to the vertex ID and embedding size (or, for a tile request, the address and size of the previous tile).
7.2    Scheduling
   The ZIPPER hardware uses two-level scheduling, which exploits both the tile-level parallelism (TLP) and the operator-level parallelism (OLP) with streams. Recall that a stream executes the SDE functions generated from the GNN model on a tile. We implement a stream as a group of registers that represent its state.
   The first-level scheduler creates and manages the streams for the pipelined execution. It adopts a simple first-ready-first-serve policy for scheduling different streams. When a stream is ready, the scheduler fetches and feeds its current instruction to the input queue of the next-level dispatcher, and switches the stream state to issued. If the dispatcher queue is full, the scheduler stalls.
   The second-level dispatcher is responsible for decoding and issuing the incoming instructions from the scheduler. It also bookkeeps the state of each computing unit. For a computational instruction, the dispatcher finds a ready target unit and issues the instruction to it. If all the target units are busy, the instruction is added to an instruction queue until the previous instructions finish and a target unit becomes free. We set the queue size to the maximum stream number to avoid congestion. For a data-transfer or synchronization instruction, the dispatcher issues the instruction back to the scheduler for state update and to the memory controller for transaction generation, respectively.
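A software analogue of the two-level scheme is sketched below: a first-ready-first-serve scheduler pulls one instruction from each ready stream, and a dispatcher either issues it to a free unit or parks it in a bounded queue. The stream, instruction, and unit dictionaries are hypothetical simplifications of the hardware state registers.

    # A minimal sketch of the two-level scheduling (first-ready-first-serve + dispatch).
    from collections import deque

    def schedule_step(streams, dispatch_queue, max_queue):
        # Level 1: first-ready-first-serve; feed one instruction per ready stream.
        for s in streams:
            if s['state'] == 'ready' and s['insts'] and len(dispatch_queue) < max_queue:
                dispatch_queue.append(s['insts'].pop(0))
                s['state'] = 'issued'            # the stream waits until this retires

    def dispatch_step(dispatch_queue, units):
        # Level 2: decode and issue to a free unit of the matching kind (MU / VU / ...).
        if dispatch_queue:
            inst = dispatch_queue[0]
            unit = next((u for u in units if u['free'] and u['kind'] == inst['kind']), None)
            if unit is not None:
                dispatch_queue.popleft()
                unit['free'] = False             # bookkeeping until the instruction retires

    # One cycle of the two-level scheme with two streams and one MU + one VU.
    q = deque()
    streams = [{'state': 'ready', 'insts': [{'kind': 'MU'}]},
               {'state': 'ready', 'insts': [{'kind': 'VU'}]}]
    units = [{'kind': 'MU', 'free': True}, {'kind': 'VU', 'free': True}]
    schedule_step(streams, q, max_queue=4)
    dispatch_step(q, units)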
8.    EVALUATION
   In this section, we present the evaluation results for ZIPPER. We first explain the evaluation methodology and then present the detailed performance results.

8.1    Methodology
Benchmark Datasets and Models. The details of the selected datasets can be found in Table 3. We also select five popular and diverse GNN models for the evaluation.
• GCN [25] is the most famous but simplest GNN model that bridges the gap between spectral- and spatial-based approaches. It consists only of a pair of Scatter-Gather (also known as an SpMM) and a GEMM.
• GAT [35] extends GCN with a multi-head attention mechanism, involving multiple ELWs and GOPs. We only use one head in our evaluation for simplicity.
• SAGE [20] generalizes GCN with different aggregators, from which we choose maxpool in our benchmark.
• GGNN [26] employs a gated recurrent unit (GRU). We implement the GRU with separate ELWs and GEMMs on ZIPPER, but leverage the GRUCell kernel on the GPU.
• R-GCN [32] introduces GCN to graphs with multiple edge types, which correspond to different weights in the GEMM. We set the type number to 3 and randomly generate the edge type for each benchmark graph.
We run the forward pass of a single layer of each model with a typical size of 128 for the input and output embeddings in all experiments, but it is also straightforward for our accelerator to run different layers and embedding sizes.

Table 3: Details of the graph datasets [19] for evaluation.

  Dataset                    #Vertex        #Edge         Type
  ak2010 (AK)                    45,293        108,549    Redistrict Set
  coAuthorsDBLP (AD)            299,068        977,676    Citation Networks
  hollywood-2009 (HW)         1,139,905     57,515,616    Collaboration Networks
  cit-Patents (CP)            3,774,768     16,518,948    Patent Networks
  soc-LiveJournal1 (SL)       4,847,571     43,369,619    Social Networks
  europe-osm (EO)            50,912,018     54,054,660    Street Networks

System Configuration. Table 4 shows the system configuration, where the baselines are DGL 0.5 running on one NVIDIA V100 GPU with 32GB memory and on two Intel Xeon E5-2630 v4 CPUs with 256GB memory, respectively. For ZIPPER, we use a 32 × 128 systolic array as an MU and eight 32-wide SIMD cores as a VU. Intuitively, we deploy one dStream, four sStreams, and four eStreams on top of one MU and two VUs in ZIPPER. We explore the design space of the streams and computing units later in Section 8.3.

Table 4: System configurations.

                   DGL (CPU)        DGL (GPU)          ZIPPER
  Compute Unit     2.20GHz,         1.25GHz,           1GHz, 1×MU (32×128 systolic array),
                   20 cores         5120 cores         2×VUs (each with 8×SIMD32 cores)
  On-Chip Mem.     50MB             34MB               21MB (EM) + 256KB (GU) eDRAM
  Off-Chip Mem.    136GB/s DDR4     900GB/s HBM-2.0    256GB/s HBM-1.0

Performance Simulation. We develop a C++-based cycle-level architecture simulator to evaluate our system and optimizations. The simulator also integrates Ramulator [23] to simulate the interaction behavior between the off-chip High Bandwidth Memory (HBM) and the on-chip memory. We validate the performance against HyGCN [37] with our best effort, and the functionality of each operation and the tiling-based execution against DGL.
Energy Estimation. We estimate the energy consumption of ZIPPER with three main components: the multiply-accumulate (MAC) units, the on-chip memory, and the off-chip memory. For the MAC energy, we obtain the average energy of a single MAC operation by synthesizing a small systolic array with registers under TSMC 16 nm technology and multiplying it by the total number of MAC operations. We employ Cacti 6.5 [1] to measure the dynamic and leakage energy of the two on-chip memories under 32 nm technology and then convert them down to 16 nm [29]. The off-chip memory access energy is estimated with 7 pJ/bit [38].
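As a worked illustration of this three-component energy model (with made-up operation counts and an assumed per-MAC energy, not measured ZIPPER numbers), the total is simply the sum of the components:

    # Hypothetical example of the energy model described above.
    def total_energy_pj(num_macs, mac_pj, onchip_pj, offchip_bytes, bit_pj=7.0):
        offchip_pj = offchip_bytes * 8 * bit_pj      # 7 pJ/bit off-chip access energy [38]
        return num_macs * mac_pj + onchip_pj + offchip_pj

    # e.g., 1e9 MACs at an assumed 0.5 pJ each, 1e6 pJ of on-chip energy (from Cacti),
    # and 100 MB of off-chip traffic -> total energy in pJ.
    print(total_energy_pj(num_macs=1e9, mac_pj=0.5, onchip_pj=1e6, offchip_bytes=100e6))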
Area Measurement. We mainly focus on the area estimation of the on-chip memory and compute units, which is also obtained by the synthesis and Cacti 6.5 as described above. The scheduling and control logic in ZIPPER is based on state machines with simple lookup tables, which is insignificant compared to the compute units and memory.

8.2    Detailed Results
[Figure 9: ZIPPER speedup over the baseline CPU and GPU (log scale); OOM marks out-of-memory runs on the GPU.]

[Figure 11: The total off-chip memory access reduction (left) and speedup (right) of sparse tiling and reordering over the regular tiling on dataset CP.]
 the baselines. We observe that Z IPPER outperforms the CPU                                                                             and speedup (right) of sparse tiling and reordering over the
 and GPU, achieve 93.6× and 1.56× speedup on average,                                                                                   regular tiling on dataset CP.
 respectively. In the meantime, Z IPPER is able to process the
 large graph datasets and achieves the highest speedup over                                                                                We analyze the effect of each optimization in Z IPPER.
 CPU because of the application of graph tiling. In contrast,                                                                           Sparse Tiling and Reordering. We compare the off-chip
 GPU processing the graph as a whole is limited to its off-chip                                                                         memory read and execution latency with different tiling strate-
 memory size and issues the out-of-memory error. We also                                                                                gies: regular tiling, sparse tiling, and sparse tiling with re-
 observe limited speedup and even slowdown for GAT. This                                                                                ordering. Figure 11 shows the results on CP while the results
 is because DGL has their special operation support for the                                                                             also follow the same trend on other datasets. We can observe
 softmax attention in the GAT while we implement it with                                                                                that the two sparse tiling methods provide 58× and 123×
 ordinary instructions. Besides, the dataset HW is much denser                                                                          of memory access reduction on average. The lower reduc-
 than others, so there is less sparsity to exploit and makes the                                                                        tion of GAT, SAGE and GGNN is because they also access the
 sparse tiling strategy less effective. Nevertheless, Z IPPER                                                                           destination vertex embeddings, which cannot be reduced.
 still achieves considerable speedups on the other four models.                                                                            Besides, the tiling methods also achieve 48× and 135×
 Energy. Figure 10 shows the total energy reduction of Z IP -                                                                           speedup on average over the regular tiling. The reason for the
 PER over the baseline CPU and GPU. We observe that the                                                                                 low speedup of GGNN and RGCN is that the two models involve
 energy consumptions of the CPU and GPU are 147× and                                                                                    edge-type-guided batch matrix multiplication, which suffers
 4.85× that of Z IPPER on average, respectively. This is not                                                                            from a longer latency of on-chip memory access and dilutes
 surprising since we leverage multiple dedicated computing                                                                              the benefit of memory access and computation reduction. But
 units for the different GNN primitives, while the GPU and                                                                              in general, the sparse tiling with reordering can effectively
 CPU spend excessive hardware resources to flexibly support                                                                             reduce memory access and improve performance.
 various workloads. Besides, the sparse tiling and reordering                                                                           Compiling Optimization. We evaluate the compiling opti-
 also reduce a large number of redundant accesses to both the                                                                           mization effectiveness by comparing the optimized and the
 on-chip and off-chip memory, which takes a large part of the                                                                           naive implementation of the same GNN models. Figure 12
 energy consumption.                                                                                                                    shows the speedup results of GAT and SAGE, which we find the
 Area. Table 5 presents the area breakdown of the Z IPPER                                                                               opportunities for the compiling optimization. But libraries
 architecture except for HBM. The overall area of Z IPPER is                                                                            have optimized the models manually because of their high
 53.58 mm2 , which is down to 6.57% of the baseline GPU die                                                                             popularity. So we just implement a naive but straightfor-
 size. The on-chip memory (including the unified embedding                                                                              ward version of the same models using DGL. GAT and SAGE
 memory and the tile hub) consumes 97.91% area of the Z IP -                                                                            achieves 1.87× and 1.03× speedup, respectively. Our opti-
 PER , and the computing units (including one MU and two                                                                                mization also works on the baseline V100 GPU with DGL
 VUs) consume about 2.09% area.                                                                                                         and achieves a speedup of 2.36× and 1.62×.
Table 5: The area of the ZIPPER architecture.

                 One MU    One VU    Embedding Mem.    Tile Hub     Total
Area (mm²)         1.00      0.06             52.31        0.15     53.58
Percentage        1.86%     0.12%            97.63%       0.28%      100%
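As a quick consistency check of Table 5, the sketch below (an illustration only, assuming the one-MU, two-VU configuration described in the Area paragraph) recovers the reported percentages and the implied baseline die size.

    # Consistency check for Table 5 (sketch; assumes one MU and two VUs,
    # as stated in the Area paragraph).
    mu, vu, emb_mem, tile_hub = 1.00, 0.06, 52.31, 0.15   # areas in mm^2
    total = mu + 2 * vu + emb_mem + tile_hub               # = 53.58 mm^2
    print((emb_mem + tile_hub) / total)                    # ~0.9791 -> 97.91% on-chip memory
    print((mu + 2 * vu) / total)                           # ~0.0209 -> 2.09% computing units
    print(total / 0.0657)                                  # ~815 mm^2, a V100-class die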
8.3 Optimization Effectiveness

We analyze the effect of each optimization in ZIPPER.

Figure 11: The total off-chip memory access reduction (left) and speedup (right) of sparse tiling and reordering over the regular tiling on dataset CP.

Sparse Tiling and Reordering. We compare the off-chip memory reads and the execution latency under three tiling strategies: regular tiling, sparse tiling, and sparse tiling with reordering. Figure 11 shows the results on CP; the other datasets follow the same trend. We observe that the two sparse tiling methods provide 58× and 123× memory access reduction on average. The reduction is lower for GAT, SAGE, and GGNN because these models also access the destination vertex embeddings, which cannot be reduced. Besides, the two tiling methods achieve 48× and 135× speedup on average over regular tiling. The speedup is lower for GGNN and RGCN because these two models involve edge-type-guided batch matrix multiplication, which suffers from longer on-chip memory access latency and dilutes the benefit of the memory access and computation reduction. In general, however, sparse tiling with reordering effectively reduces memory accesses and improves performance.
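To make the three strategies concrete, the following sketch (our simplified illustration rather than ZIPPER's actual implementation; the square tile size and the degree-based reordering heuristic are assumptions) counts the source-vertex embeddings streamed from off-chip memory under regular tiling, sparse tiling, and sparse tiling with reordering.

    # Simplified illustration of the three tiling strategies (not ZIPPER's
    # actual implementation). Edges are (src, dst) pairs; the adjacency matrix
    # is cut into T x T tiles; the cost counted is the number of source-vertex
    # embeddings streamed from off-chip memory.
    def source_embedding_reads(edges, num_vertices, T, skip_empty):
        num_tiles = (num_vertices + T - 1) // T
        nonempty = {(d // T, s // T) for s, d in edges}   # tiles that hold edges
        reads = 0
        for dt in range(num_tiles):                       # for each destination tile
            for st in range(num_tiles):                   # visit every source tile
                if skip_empty and (dt, st) not in nonempty:
                    continue                              # sparse tiling skips empty tiles
                reads += T                                # fetch this tile's source embeddings
        return reads

    def reorder_by_degree(edges, num_vertices):
        # Toy reordering heuristic (an assumption, standing in for ZIPPER's
        # reordering): give high-degree vertices adjacent IDs so their edges
        # cluster into fewer non-empty tiles.
        deg = [0] * num_vertices
        for s, d in edges:
            deg[s] += 1
            deg[d] += 1
        order = sorted(range(num_vertices), key=lambda v: -deg[v])
        remap = {old: new for new, old in enumerate(order)}
        return [(remap[s], remap[d]) for s, d in edges]

    # regular   = source_embedding_reads(edges, n, T, skip_empty=False)
    # sparse    = source_embedding_reads(edges, n, T, skip_empty=True)
    # reordered = source_embedding_reads(reorder_by_degree(edges, n), n, T, skip_empty=True)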

Compiling Optimization. We evaluate the effectiveness of the compiling optimization by comparing optimized and naive implementations of the same GNN models. Figure 12 shows the speedup results for GAT and SAGE, the two models in which we find opportunities for the compiling optimization. Because these models are popular, the libraries have already optimized them manually, so we implement a naive but straightforward version of each model using DGL. GAT and SAGE achieve 1.87× and 1.03× speedup, respectively. Our optimization also works on the baseline V100 GPU with DGL, achieving speedups of 2.36× and 1.62×.
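For reference, the sketch below contrasts a naive formulation of GAT's attention softmax, spelled out with DGL's generic message-passing primitives, against the dedicated fused edge_softmax kernel used by the hand-optimized models. It only illustrates the gap between naive and library-optimized code; it is not the exact implementation evaluated here, and API names may differ across DGL versions.

    # Illustration only: a "naive" per-destination softmax over incoming edges,
    # written with DGL's generic message-passing primitives, versus the fused
    # edge_softmax kernel. Not the implementation evaluated in this paper;
    # API names may vary across DGL versions.
    import torch
    import dgl.function as fn
    from dgl.nn.functional import edge_softmax

    def naive_edge_softmax(g, logits):
        with g.local_scope():
            g.edata['e'] = logits
            g.update_all(fn.copy_e('e', 'm'), fn.max('m', 'emax'))       # per-destination max
            g.apply_edges(fn.e_sub_v('e', 'emax', 'eshift'))             # shift for stability
            g.edata['eexp'] = torch.exp(g.edata['eshift'])
            g.update_all(fn.copy_e('eexp', 'm'), fn.sum('m', 'denom'))   # per-destination sum
            g.apply_edges(fn.e_div_v('eexp', 'denom', 'alpha'))          # normalize each edge
            return g.edata['alpha']

    # Hand-optimized / fused equivalent:
    # alpha = edge_softmax(g, logits)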
Multi-Stream on Hybrid Architecture. We evaluate different design choices by varying the numbers of s/eStreams, VUs, and MUs. Figure 13 shows execution latencies normalized to the result of each model with two s/eStreams, one MU, and two VUs, from which we make two observations. First, there is usually a sweet spot in the s/eStream count: as we increase the number of s/eStreams, performance first improves, by up to 1.72×, and then degrades. The sweet spot, however, varies across models and architectures.
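Because the sweet spot differs across models and architecture configurations, one pragmatic way to choose the s/eStream count is a small offline sweep that keeps the lowest-latency configuration. The harness below is hypothetical (measure_latency is a placeholder for running the model on the simulator or hardware, and the candidate ranges are assumptions), not part of ZIPPER's toolchain.

    # Hypothetical sweep harness for picking the per-model s/eStream count
    # (measure_latency is a placeholder; candidate ranges are assumptions).
    from itertools import product

    def pick_config(model, measure_latency,
                    stream_counts=(1, 2, 3, 4), mu_counts=(1,), vu_counts=(2,)):
        best = None
        for streams, mus, vus in product(stream_counts, mu_counts, vu_counts):
            latency = measure_latency(model, streams=streams, mus=mus, vus=vus)
            if best is None or latency < best[0]:
                best = (latency, streams, mus, vus)
        return best   # (latency, streams, mus, vus) with the lowest latency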

Figure 10: ZIPPER energy reduction over the baselines.

Figure 12: Speedup of compiler optimization on dataset CP.
