HYGCN: A GCN ACCELERATOR WITH HYBRID ARCHITECTURE - UNPAYWALL

Page created by Felix Tran
 
CONTINUE READING
To Appear in 2020 IEEE International Symposium on High Performance Computer Architecture

                                                 HyGCN: A GCN Accelerator with Hybrid Architecture
                                                                                    † ‡ §             §         §                §             †
                                                                        Mingyu Yan          , Lei Deng , Xing Hu , Ling Liang , Yujing Feng ,
                                                                                     †                     †               † ‡                 §
                                                                         Xiaochun Ye * , Zhimin Zhang , Dongrui Fan , and Yuan Xie
                                             †
                                                 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
                                                           ‡
                                                               School of Computer Science and Technology, University of Chinese Academy of Sciences
                                                                                     §
                                                                                         University of California, Santa Barbara

                                        ABSTRACT                                                               ing and inference. Recently, GCNs attract substantial ef-
                                                                                                               forts from both the industrial and academic communities
arXiv:2001.02514v1 [cs.DC] 7 Jan 2020

                                        Inspired by the great success of neural networks, graph con-
                                        volutional neural networks (GCNs) are proposed to analyze              [3,14,18,20,28,43,44] to solve problems including node clas-
                                        graph data. GCNs mainly include two phases with distinct               sification [25], link prediction [14, 16], graph clustering [44],
                                        execution patterns. The Aggregation phase, behaves as graph            and recommendation [12]. As a result, GCNs gradually be-
                                        processing, showing a dynamic and irregular execution pat-             come a new workload family member in data-centers, such
                                        tern. The Combination phase, acts more like the neural net-            as in Google [13], Facebook [28], and Alibaba [3, 47].
                                        works, presenting a static and regular execution pattern. The             The convolutional layers occupy the major execution time
                                        hybrid execution patterns of GCNs require a design that allevi-        of GCNs through two primary execution phases: Aggrega-
                                        ates irregularity and exploits regularity. Moreover, to achieve        tion and Combination [15, 40, 47]. The Aggregation phase
                                        higher performance and energy efficiency, the design needs             maintains most graph processing behaviors. It heavily relies
                                        to leverage the high intra-vertex parallelism in Aggregation           on the graph structure that is inherently random and sparse.
                                        phase, the highly reusable inter-vertex data in Combination            Processing of each vertex requires aggregating features from
                                        phase, and the opportunity to fuse phase-by-phase execution            all its source neighbours. Unfortunately, the amount and lo-
                                        introduced by the new features of GCNs. However, existing              cation of these source neighbors vary significantly among
                                        architectures fail to address these demands.                           vertices. As a result, the computational graph [26] and mem-
                                           In this work, we first characterize the hybrid execution            ory access pattern in the Aggregation phase of each vertex
                                        patterns of GCNs on Intel Xeon CPU. Guided by the char-                are dynamic and irregular. The Combination phase acts more
                                        acterization, we design a GCN accelerator, HyGCN, using                like the neural networks. It transforms the feature vector
                                        a hybrid architecture to efficiently perform GCNs. Specif-             of each vertex to a new one using a multi layer perceptron
                                        ically, first, we build a new programming model to exploit             (MLP), which is usually expressed by a matrix-vector multi-
                                        the fine-grained parallelism for our hardware design. Sec-             plication (MVM). Due to the identical connection pattern of
                                        ond, we propose a hardware design with two efficient pro-              each neuron within a neural network layer, the computational
                                        cessing engines to alleviate the irregularity of Aggregation           graph [26] and memory access pattern in the Combination
                                        phase and leverage the regularity of Combination phase. Be-            phase of each vertex are static and regular. Besides, there
                                        sides, these engines can exploit various parallelism and reuse         are additional characteristics in these two phases that distin-
                                        highly reusable data efficiently. Third, we optimize the over-         guish GCNs from conventional workloads. First, the length
                                        all system via inter-engine pipeline for inter-phase fusion            of vertex property is short and fixed in conventional graph
                                        and priority-based off-chip memory access coordination to              analytics. However, in GCNs, the feature vector of each ver-
                                        improve off-chip bandwidth utilization. Compared to the                tex is quite long and variable across layers, which introduces
                                        state-of-the-art software framework running on Intel Xeon              high-degree intra-vertex parallelism in Aggregation phase.
                                        CPU and NVIDIA V100 GPU, our work achieves on average                  Second, the parameters in conventional MLP-based neural
                                        1509× speedup with 2500× energy reduction and average                  networks are never shared, while they can be fully shared
                                        6.5× speedup with 10× energy reduction, respectively.                  among vertices in GCNs, which induces abundant highly
                                                                                                               reusable inter-vertex data in Combination phase. Third, the
                                                                                                               two phases are executed alternatively. An inherent dataflow
                                        1.   INTRODUCTION                                                      exists between phases, providing an opportunity to fuse the
                                           Inspired by the powerful learning capability of neural net-         phase-by-phase execution.
                                        works, graph convolutional neural networks (GCNs) are pro-                To achieve high-performance and energy-efficient acceler-
                                        posed as an effective category of models to represent and              ation of GCNs, aforementioned characteristics have imposed
                                        process graph data [38, 45, 46, 47]. GCNs convert the graph            new requirements on architecture design. First, not only can
                                        data into a low dimensional space while keeping both the               the GCN architecture alleviate the irregularity in Aggregation
                                        structure and property information to the maximum extent,              phase, but it can also exploit the regularity in Combination
                                        and then construct a neural network for the consequent train-          phase. Second, it needs to exploit the high-degree intra-vertex
                                                                                                               parallelism and highly reusable inter-vertex data. Third, it is
                                         ∗ Corresponding author is Xiaochun Ye and his email is yexi-          able to efficiently fuse the execution of these two phases.
                                        aochun@ict.ac.cn.
Unfortunately, existing architectures fail to implement                  propose a hardware design to tackle irregularity and lever-
GCN-specific characteristics. For CPUs, although they can                   age regularity with Aggregation Engine and Combination
employ complex caching and prefetching techniques to offset                 Engine, respectively.
the processor-memory disparity by exploiting the regular ac-              • We propose a flexible inter-engine pipeline and a priority-
cess pattern [11], they fail to address the abundant dynamic                based memory access coordination to efficiently fuse the
and irregular data accesses in the Aggregation phase since the              execution of Aggregation phase and Combination phase.
irregularity harms the predictability of memory accesses [10].            • We implement our architecture design in RTL and evalu-
Besides, it is difficult to efficiently implement the reuse of              ate it using a detailed microarchitectural simulation. We
the highly reusable parameter data between computing units                  use four well-known GCN models on six popular graph
in CPUs as like TPU [22] and Eyeriss [8]. Thus, the energy-                 datasets. Compared to the state-of-the-art software frame-
hungry data accesses to cache introduce high energy con-                    work PyTorch Geometric [15] running on Intel Xeon CPU
sumption [8]. For GPUs, although they are well optimized                    and NVIDIA V100 GPU, our work achieves on average
for neural networks, they lack the ability to alleviate irregu-             1509× speedup with 2500× energy reduction and 6.5×
larity in Aggregation phase, which significantly hinders the                speedup with 10× energy reduction, respectively.
performance improvement [30, 41]. Furthermore, although
they leverage the regularity in Combination phase, the data
copy and synchronization between threads for the parameter                2.     BACKGROUND
reuse are expensive. For graph analytics and neural network                  GCNs follow a neighborhood aggregation scheme, where
accelerators, they are only optimized to alleviate irregularity           the feature vector of each vertex is computed by recursively
or exploit regularity, rather than both simultaneously. At last,          aggregating and transforming the representation vectors of its
all of them are short of the ability to efficiently fuse the execu-       neighbor vertices [18, 39, 47]. Fig. 1 illustrates the execution
tion of these two phases. In conclusion, existing architectures           phases of GCN models. After k iterations of aggregation via
are not the ideal platforms to execute GCNs.                              the Aggregate function and transformation via the Combine
   In this work, we first characterize the hybrid execution pat-          function, a vertex is represented by its final feature vector,
terns of GCN workloads on Intel Xeon CPU. Next, guided by                 which captures the structural information within the vertex’s
the characterization, we propose a GCN accelerator, HyGCN,                k-hop neighborhood. Table 1 lists the notations used in GCNs.
using a hybrid architecture to efficiently perform GCNs. Spe-             In this work, we mainly focus on undirected graphs and the
cially, we first propose a programming model to achieve                   inference stage rather than training.
the hardware transparency for programmers and exploit fine-
grained parallelism. It abstracts GCNs as edge-centric aggre-                                       Table 1: GCN Notations.
                                                                           Notation                Meaning              Notation                   Meaning
gation for the Aggregation phase and MVMs for the Combi-                      G                graph G = (V, E)            V                     vertices of G
nation phase. Second, we design HyGCN with two efficient                      E                   edges of G              Dv                  degree of vertex v
processing engines, Aggregation Engine and Combination En-                   e(i, j)    edge between vertex i and j    N(v) (S(v))   (sampling subset of) v’ neighbor set
                                                                            A (Ai j )   (element of) adjacent matrix       av           aggregation feature vector of v
gine, to accelerate the Aggregation and Combination phases,                   hG             feature vector of G           W             combination weight matrices
respectively. In Aggregation Engine, interval-shard graph                     hv          feature vector of vertex v       b               combination bias vectors
partitioning and window sliding-shrinking methods are in-                     X           initialized feature matrix       Z                  embedding matrix
                                                                              C               assignment matrix            ε                 learnable parameter
troduced to alleviate irregularity by increasing data reuse
and decreasing unnecessary accesses for sparsity, respec-                    Typically, the k-th layer/iteration of GCNs is formulated as
tively. Additionally, we implement a vertex-disperse process-                                          (k−1)
                                                                                   akv = Aggregate hu
                                                                                                                                   
                                                                                                              : u ∈ {N(v)} ∪ {v} ,
ing method to exploit the edge parallelism and intra-vertex                                                                               (1)
                                                                                                              hkv = Combine akv .
                                                                                                                                   
parallelism. In Combination Engine, to leverage the regular-
ity, we build multi-granular systolic arrays to perform MVMs              where hkv is the representation feature vector of vertex v at
in parallel and reuse the shared parameters. Besides, they                the k-th iteration. Simply, the Aggregate function aggregates
can be flexibly used either independently for lower latency               multiple feature vectors from source neighbors to one single
or in combination for lower energy. Third, to improve the                 feature vector, and the Combine function transforms the
overall execution, on the basis of individual optimizations               feature vector of each vertex to another feature vector using
of these two phases, we build a fine-grained inter-engine                 an MLP neural network. Note that the MLP parameters,
pipeline to fuse the phase-by-phase execution and propose a               including weights and biases, are shared between vertices.
priority-based memory access coordination for the off-chip                   In order to decrease the computational complexity, the
data accesses between the two engines.                                    Sample function is usually applied before the Aggregate
   To summarize, we list our contributions as follows:                    function to sample a subset from the neighbor vertices of
• We study an emerging domain, GCNs, from a computer                      each vertex [6, 18] as the new neighbors, specifically,
   architecture perspective and show that hybrid execution                                    S(v) = Samplek N(v) .
                                                                                                                       
                                                                                                                                          (2)
   patterns exist in GCNs. Specially, the Aggregation phase in            Sometimes, the Pool function [44] follows the Combine
   GCNs presents a dynamic and irregular execution pattern,               function to transform the original graph into a smaller graph.
   while Combination phase is static and regular.                            After several iterations, the features will be used for final
• We propose a GCN accelerator, HyGCN, using a hybrid                     prediction or classification. For the node classification prob-
   architecture to efficiently perform GCNs. First, we build a            lem, vertex feature vectors hkv at the last iteration are used for
   programming model to enable our hardware design to ex-                 prediction. For the graph classification problem, a Readout
   ploit various parallelisms inherent in this domain. Next, we           function further aggregates the hkv at the last iteration to obtain

                                                                      2
H           B    Feature Vector
                     F         B                                                                                        F                              B
                                                      C        F              A                        A
                           A                                                                                                                 A
                                                       G           D                        Shared
                 H                 C                                                      Parameters
                                                                                           (Wk, Bk)                G                                       C
                                                                                                                                                                                            Σ or Concat
                       G                                           G
                E                  D                      A                   H                        H                G
                                                                                                                   E                                       D
                                                                   E
               Input Graph                            Sample              Aggregate            Combine                 Pool (Optional)                                                 Readout
                                                                                                       Iterative
                                                                         Figure 1: Illustration of the GCN model.
the entire graph’s representation vector, i.e.                                                             and adjacent matrix Ak are produced, which can be com-
                 hG = Readout hkv | v ∈ G .
                                            
                                                        (3)                                                bined to construct a new and smaller graph. In the new graph,
                                                                                                                 k
                                                                                                           GCN pool                                             k
                                                                                                                     determines the number of vertices, and GCNembedding
   Next, we provide several typical GCN models as examples
to explain the above operations in detail.                                                                 determines the length of vertex feature vector.
   GCN is one of the most successful convolutional networks                                                   Summary. As explained above, we introduce several typ-
for graph learning [25, 38], which bridges the gap between                                                 ical operations in GCNs: Sampling, Aggregation, Combi-
spectral-based convolutions and spatial-based convolutions.                                                nation, Pooling, and Readout. Except for Combination, all
Its inference model can be described as                                                                    the operations are graph structure-dependent, which involve
                                                                                                           graph processing. Combination usually is a typical MLP
                           1                 (k−1)                                                         neural network (single layer or multiple layers). Sampling
      akv =
                                                                         
               ∑ √Dv · Du hu                          | ∀u ∈ {N(v)} ∪ {v} ,
                                                                                                (4)
                                                                                                           is used to sample a subset from neighbors, which can be
                                                                                                           done during preprocessing [20] or with random selection dur-
                                                hkv   = ReLU(W k akv + bk ).                               ing runtime [18]. Aggregation aggregates the features from
   GraphSage further adopts uniform neighbor sampling to                                                   its 1-hop neighbors. Pooling acts like the pooling layer in
alleviate receptive field expansion that effectively trades off                                            CNNs to realize graph transformation by reducing the num-
accuracy and execution time [18]. It is formulated as                                                      ber of vertices and the length of feature vectors. Readout
                         (k−1)       (k−1)                                                                 can be a simple summation [15] across vertices or further
         akv = Mean {hv
                                                       
                               } ∪ {hu     , ∀u ∈ S(v)} ,                                                  concatenation across iterations [39]. Therefore, Readout can
                                                            (5)
                               hkv = ReLU(W k akv + bk ).                                                  be viewed as an extreme Aggregation. This work focuses on
                                                                                                           Aggregation and Combination, two major phases in GCNs.
   GINConv is a simple neural architecture, and its discrimi-
native power is equal to the power of the Weisfeiler-Lehman
graph isomorphism test [39]. Vertex features learned by GIN-                                               3.    MOTIVATION
Conv can be directly used for tasks like node classification                                                  In this section, we quantitatively characterize and identify
and link prediction. We can perform this model as                                                          the hybrid execution patterns in processing GCNs. Next, we
              akv = (1 + εk ) · hv
                                              (k−1)
                                                      +                  hu
                                                                             (k−1)
                                                                                      ,                    explain our motivation behind designing a GCN accelerator.
                                                                   ∑
                                                               u∈N(v)                           (6)        3.1     Characterization on CPU
                            = MLP      hkv W , b ).       k
                                                              (akv ,     k        k
                                                                                                              We conduct quantitative characterizations using a state-of-
   For graph classification tasks, the following Readout func-                                             the-art GCN software framework PyTorch Geometric [15]
tion is further used to produce the representation of the entire                                           on Intel Xeon CPU. The execution time breakdown of GCN
graph, given the representations of individual vertices. It                                                (GCN) [25], GraphSage (GSC) [18], and GINConv (GIN)
concatenates across all iterations of GINConv to acquire the                                               [39] on several datasets [23] is illustrated in Fig. 2. The
final graph representation as                                                                              profiling results of GCN [25] on the COLLAB dataset [23]
                                                                                                           are presented in Table 2. The details of system configuration
             hG = Concat ( ∑ hkv ) | k = 1, ..., K .
                                                  
                                                             (7)                                           and datasets are shown in Section 5.1.
                                             v∈G
   DiffPool provides a general tool to realize hierarchical
                                                                                                                        Ratio of Execution Time (%)

                                                                                                                                                                       Aggregation         Combination
                                                                                                                                                      100
graph-level transformation for a broad set of input graphs [44].
                                                                                                                                                      80
It can be inserted after the Combine function of any GCNs
                                                                                                                                                      60
                                                                                                                                                               99.95

                                                                                                                                                               99.96
                                                                                                                                                               99.87

                                                                                                                                                               99.37

                                                                                                                                                               98.85
                                                                                                                                                               98.72
                                                                                                                                                               95.64
                                                                                                                                                               94.97

                                                                                                                                                                93.21

to transform the original graph to a smaller one (like the
                                                                                                                                                                 86.73

                                                                                                                                                                 82.88
                                                                                                                                                                  78.13
                                                                                                                                                                   67.71

                                                                                                                                                      40
                                                                                                                                                                    60.01
                                                                                                                                                                    55.78

pooling layer in convolutional neural networks (CNNs)). In
                                                                                                                                                      20
fact, Diffpool uses two extra GCNs to implement the graph
                                                                                                                                                       0
transformation, which follows                                                                                                                                  IB CR CS CL PB IB CR CS CL PB IB CR CS CL PB

       C(k−1) = so f tmax GCN pool
                                 k
                                    (A(k−1) , X (k−1) ) ,
                                                       
                                                                                                                                                                   GCN               GSC            GIN
                                                                                                            Figure 2: Execution time breakdown of the two phases.
               Z (k−1) = GCNembedding
                            k
                                      (A(k−1) , X (k−1) ),                                      (8)
                                                                                                           Execution Time Breakdown. Both of Aggregation and
              (k−1)T                                      (k−1)T                                           Combination phases can occupy a significant amount of exe-
    Xk = C             Z (k−1) , Ak = C                                A(k−1)C(k−1) .
                                                                                                           cution time, which implies that both need acceleration. Fig. 2
After the DiffPool transformation, a new feature matrix X k                                                illustrates their execution time ratio on different models and

                                                                                                       3
datasets. Their execution times differ due to the variable                analytics, which is determined by the input dataset and MLP
length of feature vectors and the execution flow of GCNs.                 structure. Moreover, the length of the feature vectors in each
For example, the long feature length of CR and CS datasets                vertex is usually orders of magnitude longer than that of tra-
causes more time on Combination phase for GCN and Graph-                  ditional graph analytics. This introduces high intra-vertex
Sage. Since GINCov executes Aggregation phase first, it                   parallelism. In the Combination phase, the MLP parameters
spends more time on Aggregation phase without the reduc-                  are fully shared by all vertices while non-reusable in tradi-
tion of feature length through Combination phase like the                 tional MLP models if not using the batching technique. This
other two models do.                                                      induces numerous highly reusable inter-vertex data. Besides,
Hybrid Execution Pattern. The Aggregation phase heavily                   these two phases are executed alternatively to produce the
relies on the graph structure that is inherently random and               final result, while conventional workloads iteratively perform
sparse, which results in numerous dynamic computations and                only the graph traversal or the neural network propagation.
irregular accesses. From Table 2, it is observed that each op-
eration in the Aggregation phase requires much more data to               3.2    The Need for a GCN Accelerator
be accessed from DRAM than Combination phase, resulting
in higher DRAM access energy. Besides, the extremely high                    GCNs are showing great potential in various tasks [16,
numbers of misses per kilo-instruction (MPKI) of L2 and                   18, 19, 38, 43, 46]. Many companies, such as Google [13],
L3 caches in the Aggregation phase are caused by the high                 Facebook [28], and Alibaba [47] have deployed GCNs in
randomness of neighbor indices of each vertex. In addition,               data centers, which reflects the increasing importance and
the indirect and irregular accesses render the data prefetch-             scope of upcoming applications. An efficient architecture
ing in the Aggregation phase ineffective, since it is difficult           is timely to achieve high performance and stimulate GCN
to predict the data addresses without knowing the indices                 development. Therefore, given the above characterizations,
of neighbors in advance. This causes abundant ineffectual                 we explain our motivation of designing a GCN accelerator.
memory accesses to prefetch data.                                         Design Requirements. Given the characteristics of GCNs,
   The Combination phase executes a MVM for each vertex                   we present the design requirements to perform GCNs with
with a shared MLP-based neural network, which performs                    high performance and energy efficiency. First, Aggregation
static and regular computations and accesses. Table 2 illus-              phase demands efforts to alleviate the irregularity that de-
trates that each operation in the Combination phase requires              grades performance. On the other hand, Combination phase
only small amount of data to be accessed from DRAM. This                  needs more attention to leverage the regularity to improve
is because the MVMs are very compute-intensive and the                    the intensive computations with better parallelism and faster
weight matrix of MLP is widely shared between vertices.                   synchronization. Second, the high-degree intra-vertex paral-
Nevertheless, up to 36% of execution time for shared data                 lelism and the highly reusable inter-vertex data need to be
copy and synchronization between threads is observed.                     exploited. Third, to achieve higher performance and energy
                                                                          efficiency, the execution of Aggregation phase and Combina-
    Table 2: Quantitative Characterization on CPU.                        tion phase need to be efficiently fused. Unfortunately, exist-
                                     Aggregation    Combination           ing architectures fail to address these requirements, resulting
                                                                          in the following inefficiencies.
         DRAM Byte per Ops               11.6           0.06              Inefficiencies of General-Purpose Processors. On CPUs,
     DRAM Access Energy per Ops         170nJ           0.5nJ
                                                                          the irregularity in Aggregation phase makes GCNs ill-suited
            L2 Cache MPKI                 11             1.5
                                                                          to current cache hierarchy design and data prefetching tech-
            L3 Cache MPKI                 10             0.9
     Ratio of Synchronization Time        —             36%
                                                                          niques. Besides, it is hard to efficiently reuse the highly
                                                                          reusable parameter data between compute units [8].
Table 3: Different Execution Patterns of Aggregation                         GPUs are inherently optimized for compute-intensive work-
Phase and Combination Phase.                                              loads with regular execution pattern [29] such as neural net-
                                                                          works, but handling the Aggregation phase with irregular
                               Aggregation          Combination
                                                                          memory accesses suffers from low efficiency. Besides, the
      Access Pattern        Indirect & Irregular   Direct & Regular       processing of Combination with strong parameter sharing
     Data Reusability               Low                  High             needs costly data copy and thread synchronization.
   Computation Pattern      Dynamic & Irregular    Static & Regular          Both CPUs and GPUs lack inter-phase optimization for
   Computation Intensity            Low                  High             GCN execution. To leverage the advantages of hardware-
     Execution Bound              Memory               Compute            optimized functions [4, 31], current programming framework
                                                                          for GCNs usually adopts coarse-grained execution, which
   According to above analysis, hybrid execution patterns                 results in phase-by-phase execution. This compromises the
exist in GCNs, which are summarized in Table 3. The Ag-                   design space with phase interaction, hindering the improve-
gregation phase performs dynamic and irregular execution                  ment beyond the individual optimization for each phase.
pattern, bounded by memory, while the Combination phase                   Inefficiencies of Conventional Accelerators. Specialized
is static and regular, bounded by computation.                            accelerators tailored to graph analytics or neural networks
Differences from Conventional Workloads. Beside hybrid                    gain significant speedup and energy savings compared to
execution patterns in GCNs, there are additional characteris-             general-purpose processors. Whereas, they are inefficient
tics that distinguishes GCNs from conventional workloads.                 in processing GCNs due to following reasons: i) they are
Specifically, in the Aggregation phase, the length of feature             usually only designed to either alleviate irregularity or exploit
vectors is variable rather than fixed as in traditional graph             regularity, while GCNs need both; ii) they fail to leverage the

                                                                      4
new kinds of parallelism and data reuse to further improve                 Note that in Algorithm 1 we do not express the Pool
performance; iii) single-paradigm design make them hard to              and Readout operations explicitly since they are not always
fuse the execution of the two phases.                                   needed. In fact, the Pool operation can be represented by two
Opportunities for Customization. Designing a specialized                GCNs and additional matrix operations. The GCNs can be
accelerator for a specific domain is an efficient and prevalent         performed entirely by the two engines, the matrix transposes
solution to address the inefficiencies of existing architectures,       can be executed by the flexible Aggregation engine, and the
since it can tailor the memory hierarchy and computation                matrix multiplications can be executed by the Combination
unit to the specific workload. For GCNs, we can build an                engine. The Readout operation can be expressed by an ad-
accelerator with a hybrid architecture using different opti-            ditional single vertex that connects all vertices in the graph,
mizations for the two phases. For the Aggregation phase, it             which can be accomplished by the Aggregation engine.
is possible to obtain the knowledge of graph data in advance
and schedule the accesses to alleviate the irregularity. More-            Algorithm 1: Edge- and MVM-centric Programming
over, the computation for each vertex can also be scheduled               Model for Aggregation and Combination Phase
to exploit edge parallelism and intra-vertex parallelism. For
                                                                         1 initial SampleNum;
the Combination phase, we draw inspirations from current
                                                                         2 initial SampleIndexArray;
neural network accelerators to efficiently perform MVMs in
                                                                         3 for each node v ∈ V do
parallel with parameter sharing. Beyond the individual opti-
mizations of the two phases, the serial inter-phase dataflow             4      agg_res ← init();
can be pipelined in finer grain. Moreover, all off-chip mem-                                  / Edge-centric Parallelism
ory accesses can be controlled to improve the overall memory             5      sample_idxs ← SampleIndexArray[v.nid];
access efficiency. Putting all these together, there are huge            6      for each sample_idx in sample_idxs do
opportunities to design an efficient GCN accelerator with                7            e(u, v) ← EdgeArray[sample_idx];
high performance.                                                        8            agg_res ← Aggregate(agg_res, u. f eature);
                                                                         9      end
4.    ARCHITECTURE DESIGN                                                                     / MVM-centric Parallelism
                                                                        10      v. f eature ← Combine(agg_res, weights, biases);
  In this section, we design HyGCN to support the efficient             11 end
execution of GCNs. We first introduce the programming
model and then present details of the architecture design.
4.1    Edge- and MVM-Centric PM                                         4.2     Architecture Overview
   The goal of building a programming model (PM) is to                     Based on the proposed PM, Fig. 3 depicts the architecture
exploit available parallelisms and achieve hardware trans-              of HyGCN. We construct the system using a hybrid archi-
parency for programmers [47]. For Aggregation, there are                tecture, which includes two engines (Aggregation Engine
gather- and scatter-based processing methods. Since the                 and Combination Engine) and one memory access handler.
scatter-based method usually produces large amount of atomic            A communication interface (Coordinator) is introduced to
operations and requires a synchronization after the processing          bridge these two engines. Therefore, the interference between
of all vertices, the degree of parallelism will be degraded [17].       them is mitigated and their execution pipeline is established.
On the contrary, the gather-based method can control the                       Aggregation Engine                                   Combination Engine
program behavior easily and preserve the execution paral-
                                                                                        SIMD

                                                                                                                                            PE   PE        PE
                                                                                                             Aggregation Buffer

lelism. Therefore, we select the gather-based processing in

                                                                                                                                                                                Output Buffer
                                                                                                                Coordinator

                                                                                                                                                                Activate Unit
                                                                          eSched                                                            PE   PE        PE
                                                                                        SIMD

our design. Nevertheless, this processing mode leads to in-
tensive memory access and vertex computation. To address                                                                                    PE   PE        PE
                                                                          Sampler
this problem, we employ an edge-centric PM to exploit the
                                                                                        SIMD

                                                                            Sparsity
edge-level parallelism. Each vertex possesses many incoming                Eliminator                                             vSched    PE   PE        PE

edges (neighbors), which can be aggregated in an edge-by-                           Edge Buffer   Input Buffer                             Weight Buffer
edge pipeline. In this way, workload for each vertex can be
divided into subworkloads and assigned to each computation                                              Memory Access Handler
unit for processing in parallel. For Combination, the situation
                                                                                                                           DRAM
is relatively easier. Since the computation of each vertex acts
like the MLP, we directly focus on the MVM operations.                         Figure 3: Architecture overview of HyGCN.
   Our edge- and MVM-centric PM for GCNs is shown in                       The Aggregation Engine aims to realize the efficient ex-
Algorithm 1. At each vertex v ∈ V , the sampled neighbor                ecution of irregular accesses and computations. To exploit
indices are read first, which is a subset of all neighbors. Each        the edge-level parallelism, a task scheduler (eSched) is de-
index corresponds to an edge connecting v and a neighbor ver-           signed to assign the edge processing workloads onto SIMD
tex u, i.e. e(u, v). By traversing all sampled edges connected          cores. To support the Sampling operation, we introduce a
v, all the feature vectors of corresponding neighbors can be            Sampler into the Aggregation Engine. The Sampler selects
aggregated onto the feature vector of v. Then, a Combine                edges from the edge list of each vertex using a uniform or
function can start performing the Combination phase that is             predefined distribution in terms of index interval. The former
comprised of a series of MVMs. In this PM, the edge-level               indices for edge sampling are based on dynamic generation
and MVM-level parallelism can be exploited.                             while the latter ones are predefined and can be read from

                                                                    5
off-chip memory like in [6, 20]. To reduce the latency of data         vertex cannot occupy all cores, free cores can be assigned to
access, we employ embedded DRAM (eDRAM) to cache                       other vertices. Thus, all cores are always busy without work-
various data to improve data reuse. An Edge Buffer is used             load imbalance. Moreover, since the intra-vertex parallelism
to cache edges to exploit spatial locality in the edge array.          has been exploited, the vertex latency for a single vertex is
An Input Buffer is used to cache the vertex features in X k−1          smaller than processing multiple vertices together. Further-
and an Aggregation Buffer is used to cache the intermediate            more, it also enables the immediate processing of each vertex
aggregation results, to exploit temporal locality. To hide the         in the following Combination Engine.
DRAM access latency, both the Edge Buffer and Input Buffer               Aggregating V1             Neighbor1         Neighbor2         Neighborn
adopt the double buffer technique. Specifically, we design a                       Aggregating           Feature 1         Feature 1         Feature 1

                                                                          Core 1
Sparsity Eliminator to avoid redundant feature loads of the                         Feature 1             in X(k-1)         in X(k-1)         in X(k-1)

vertices that share no edges with the aggregating vertex.                          Aggregating           Feature 8         Feature 8         Feature 8
                                                                                    Feature 8             in X(k-1)         in X(k-1)         in X(k-1)
   The Combination Engine is designed to maximize the ef-
ficiency of regular accesses and computations. In order to
improve the processing parallelism and data reuse, we adopt                        Aggregating       Feature 8i+1
                                                                                                       in X(k-1)
                                                                                                                       Feature 8i+1
                                                                                                                         in X(k-1)
                                                                                                                                         Feature 8i+1
                                                                                                                                           in X(k-1)

                                                                          Core i
                                                                                   Feature 8i+1
the well-known systolic array design [22] and modify it to
                                                                                   Aggregating       Feature 8i+8      Feature 8i+8      Feature 8i+8
be compatible with GCNs. A Weight Buffer is used to cache                          Feature 8i+8        in X(k-1)         in X(k-1)         in X(k-1)
the weight matrix to exploit their temporal locality, and an
Output Buffer is used to coalesce the write accesses of the
                                                                                                    T1                T2                Tn
                                                                                            Time:
final features. Similarly, they also leverage the double buffer        Figure 4: Vertex-disperse processing mode where the
technique to hide off-chip access latency. The Combination             workloads of each vertex are assigned to all SIMD cores.
engine takes the aggregation result of each vertex v from the
Aggregation engine and the weight matrix from the Weight               4.3.2           Graph Partitioning (Static)
Buffer as inputs to execute the MVM operation. The vSched is
responsible for the workload assignment. After the MVM op-                 We borrow the abstraction of vertex interval and edge shard
erations, an activation operation is performed by Activate Unit        from [9, 27] to partition graph data, which is the basis of our
to produce the new feature vector of vertex v. Different from          data-aware sparsity elimination in the next subsection. We do
normal systolic array, our systolic array is multi-granular that       not need explicit preprocessing to generate the intervals and
can be used as multiple smaller arrays or a whole large array          shards since we directly take the data format of compressed
under different optimization scenarios.                                sparse column (CSC) as input. As exampled in Fig. 5(a), the
   To improve the bandwidth utilization, a prefetcher is de-           16 vertices are organized as several intervals (i.e. from I1 to
signed to explicitly prefetch graph data and parameter data.           I4 , each with four vertices), and the edges are organized as
For example, the prefetching of the feature vectors is as fol-         4×4 shards (i.e. from S(1, 1) to S(4, 1), each with 16 edges
lows. The prefetcher first prefetches the edges of current             at most). The intervals and shards are disjoint.
processing vertices. After receiving these edges, Sparsity                 The feature vector length of each vertex is usually large,
Eliminator obtains the indices of neighbors from these edges           so exploiting the locality of features is critical. We group the
and sends them to the prefetcher. The prefetcher uses them             vertices within the same interval together (e.g. Ii ) and then
to prefetch the feature vectors immediately.                           process the aggregation of their source neighbors also interval
                                                                       by interval (i.e. traverse I j ), as expressed in Algorithm 2.
                                                                       Based on this flow, the feature accesses of all vertices in an
4.3     Aggregation Engine                                             interval are merged (see Fig. 5(b)). The resulting benefits
   To optimize the computation of Aggregation, we introduce            are twofold. First, the vertices in Ii usually have overlapped
a vertex-disperse processing mode. To optimize memory ac-              neighbors in I j , therefore, the loaded feature data of I j can be
cesses, we employ a static graph partition method to enhance           reused when performing feature aggregation. Second, when
data reuse and a dynamic sparsity elimination technique to             traversing all I j , the intermediate aggregated results of Ii are
reduce unnecessary data accesses.                                      remained in buffer which can also be reused when performing
                                                                       feature update. In practice, edge shards usually are not square
4.3.1    Execution Mode                                                as our simplified illustration in Fig. 5. The shard height is
   There are two processing modes for SIMD cores to pro-               determined by the capacity of Input Buffer, while the shard
cess edges in parallel. The first one is vertex-concentrated,          width is determined by the capacity of Aggregation Buffer.
where the workloads of each vertex are assigned to a single            The Edge Buffer size affects both height and width since it
SIMD core. This mode can produce the aggregated features               accommodates all edges of each shard.
of vertices in burst mode, i.e. periodically processing a group
of vertices. However, the processing latency of a single ver-          4.3.3           Data-Aware Sparsity Elimination (Dynamic)
tex (termed as vertex latency) is long, and the fast vertices             With the data reuse optimization, we further attempt to
have to wait for the slow vertices leading to workload im-             reduce the redundant accesses since the graph connections
balance. Furthermore, it also loses the parallelism that the           are sparsely distributed. To eliminate the sparsity, we propose
aggregation of each element can be performed in parallel               a window-based sliding and shrinking approach. The key idea
(i.e., intra-vertex parallelism). Therefore, we use the second         is that we first slide the window (with the same size of an
processing mode, which is shown in Fig. 4. It assigns the              edge shard) downward until an edge appears in the top row,
aggregation of elements inside the vertex feature vector of            and then we shrink the window size by moving the bottom
each vertex to all cores, termed as vertex-disperse mode. If a         row upward until an edge is met.

                                                                   6
Xk                       Sliding
            I1    I2     I3     I4                                                                      I1           I2          I3     I4                      Sliding             I1    I2     I3     I4

                                       Node ID
            1-4   5-8   9-12   13-16             X(k-1)                                                 1-4          5-8        9-12   13-16                                        1-4   5-8   9-12   13-16

                                                                         Interval 1
                                                            Interval 1
    1-4

                                                                                                                                                                            1-4
    I1

                                                                                               1-4

                                                                                                                                                                            I1
                                                                                               I1
                                                                                                       Sliding
                                                          Interval p
    5-8

                                                                                                                                                                            5-8
    I2

                                                                                               5-8

                                                                                                                                                                            I2
                                                                                               I2
                                                                                      X(k-1)
                                       Node ID
    9-12

                                                                                                                                                                            9-12
                                                                                               9-12
     I3

                                                           Store

                                                                                                                                                                             I3
                                                                                                I3
                                                          Interval 1

                                                                         Interval p
    13-16

                                                                                                                                                                            13-16
                                                                                               13-16
     I4

                                                                                                                                                                             I4
                                                                                                I4
                                                 Xk
                                                            Interval p                                                                                          Shrinking
                  (a)                                     (b)                                                              (c)                      Shrinking                (d)
Figure 5: Static graph partition for data reuse and dynamic sparsity elimination to reduce redundant accesses: (a)
interval-shard partition; (b) interval-wise feature access; (c) window sliding; (d) window shrinking.

Window Sliding. Fig. 5(c) illustrates the window sliding                                                                   interval I j is dynamically determined by window sliding and
process. For each vertex interval, the top shard window grad-                                                              shrinking (see Algorithm 4). The starting row of each neigh-
ually slides downward. It will not stop until an edge appears                                                              bor interval varies due to sliding and the interval length in
on its top row. Then a new window with the same size is                                                                    the row dimension also varies due to shrinking. In this way,
created, whose top row follows the bottom row of its previous                                                              only the feature data of remaining neighbor vertices when
window. The stop criterion is the same for every window. In                                                                performing the aggregation operation for each interval Ii are
this way, windows continuously arise, slide downward, and                                                                  loaded, which eliminates plenty of redundant accesses.
stop. All the positions where windows stop are recorded as
effectual shards.                                                                                                        Algorithm 4: GetOneEffectInterval
 Algorithm 2: Interval-wise Aggregation                                                                                                          / Window Sliding
                                                                                                                        1 while (edge(row_pos, v) == ∅ f or ∀v ∈ Ii ) do
1   for each interval Ii in X k do
2       agg_res ← init();                                                                                               2        row_pos ← row_pos + 1;
                                                                                                                        3 end
3       for each interval I j in X (k−1) do                                                                             4 winstart ← row_pos;
4           agg_res ← Aggregation(I j , agg_res);                                                                       5 winend ← row_pos +Windowheight − 1;
5       end                                                                                                             6 row_pos ← winend + 1;
6       Ii ← Combination(agg_res);                                                                                                              / Window Shrinking
7   end                                                                                                                 7 while (edge(winend , v) == ∅ f or ∀v ∈ Ii ) do
                                                                                                                        8        winend ← winend − 1;
Window Shrinking. Although the window sliding can cap-                                                                  9 end
ture most effectual edges, sparsity still exists on the bottom                                                                              (k−1) [win
                                                                                                                       10 Ie f f ectual ← X            start : winend ];
side (within the purple dashed boxes). This is because the
                                                                                                                       11 return Ie f f ectual ;
above sliding direction is downward. To reduce this part of
sparsity, we propose window shrinking here. Specifically, the
bottom row of each recorded window moves upward until                                                                         Compared to traditional graph analytics, the feature data
it meets an edge, and then the window shrinks. Fig. 5(d)                                                                   reuse from graph partitioning and redundant access reduction
illustrates the sliding and shrinking process of one window in                                                             from sparsity elimination in GCNs are considerable efforts.
detail and gives the final recorded effectual shards. Different                                                            This is because the feature of each vertex in GCNs is a vector
from previous partition, the sizes of final shards are usually                                                             with thousands of elements, while the feature data in tradi-
different due to the window shrinking.                                                                                     tional graph analytics are small, usually with one element for
                                                                                                                           each vertex. Besides, our optimization achieves more when
 Algorithm 3: Interval-wise Aggregation with Sparsity                                                                      the Sampling operation is used, which increases sparsity since
 Elimination                                                                                                               only sampled neighbors are required during Aggregation.
                            k
1 for each interval Ii in X do
                                                                                                                           4.4          Combination Engine
2     row_pos ← 1;
3     agg_res ← init();                                                                                                       The Combination operation at each vertex acts like a neu-
4     do                                                                                                                   ral network, the execution of which is regular but compute-
5         (I j , row_pos) ←                                                                                                intensive. Our design is based on the well-known systolic
                                                                                                                           array. To adapt it for the two processing modes of Aggrega-
           GetOneEffectInterval( X (k−1) , A, Ii , row_pos);                                                               tion Engine (see Fig. 4), we integrate multiple arrays rather
6         agg_res ← Aggregation(I j , agg_res);                                                                            than a single one, as shown in Fig. 6(a). A group of systolic
7     while (I j != ∅);                                                                                                    arrays is assembled to form a systolic module. We allow a
8     Ii ← Combination(agg_res);                                                                                           multigranular use of these systolic modules, including the
9 end                                                                                                                      independent working mode and cooperative working mode.

  Given the effectual shards after sparsity elimination, the                                                                4.4.1              Independent Working Mode
execution flow of Aggregation follows Algorithm 3. The                                                                        In this mode, the systolic modules work independently
only difference from Algorithm 2 is that the each neighbor                                                                 from each other. Each of them processes the MVM opera-

                                                                                                                 7
gation engine and Combination engine by the Coordinator.

                                                    Module
                         PE    PE              PE

                                                                    Aggregation Buffer
                                                      1
  Aggregation Buffer
                         PE    PE              PE
                                                                                                                         4.5.1        Latency- or Energy-Aware Pipeline

                                                    Module
                         PE    PE              PE
                                                                                                                         Ping-Pong Aggregation Buffer. To reuse the aggregation

                                                      2
                         PE    PE              PE
                                                                                                                         results produced by the Aggregation engine, we add an Ag-
                                                                                                                         gregation Buffer between the two engines. This buffer can

                                                    Module
                         PE    PE              PE

                                                      i
                         PE    PE              PE                                                                        be written by the Aggregation Engine and can be read by the
                                                                                                                         Combination Engine. Before the final aggregated results are
                          Weight Buffer                                                       Weight Buffer              generated, the Aggregation Buffer stores the partial results
                              (a)                                                               (b)                      that will be read by the Aggregation Engine for feature accu-
Figure 6: Combination Engine design: (a) multiple sys-                                                                   mulation. In order to increase the parallelism of these two
tolic modules; (b) different dataflow patterns.                                                                          engines, we implement a ping-pong buffering mechanism
                                                                                                                         where the Aggregation Buffer is split into two chunks. In
tions of a small group of vertices, as illustrated in Fig. 7(a).                                                         this way, the executions of aggregation and combination are
The weight parameters for each module in this case are di-                                                               decoupled, which enables an inter-engine pipeline.
rectly accessed from the Weight Buffer and just reused within                                                               To accommodate the needs of different applications, we
module, as depicted in Fig. 6(b). The advantage of this                                                                  provide two pipeline modes as follows.
mode is the lower vertex latency because we can process                                                                  Latency-Aware Pipeline. In this pipeline mode, the Com-
the Combination operations of this small group of vertices                                                               bination Engine works in the systolic module independent
immediately once their aggregated features are ready, without                                                            mode. The aggregated features are produced vertex by vertex
waiting for more vertices. This mode matches well with the                                                               in the Aggregation Engine, and the following combination
vertex-disperse processing mode of Aggregation Engine in                                                                 will be processed immediately once the aggregated features
Fig. 4, where the aggregated features are produced quickly                                                               of a small group of vertices are ready. Therefore, the average
but sequentially.                                                                                                        processing latency for each vertex can be lower. The overall
    Aggregating Feature                                                 Aggregating Feature
                                                                                                                         timing is illustrated in Fig. 8(a), where V denotes the vertices
          Vector                                                              Vector                                     for aggregation, and I represents the neighbor intervals.
             V1                      Systolic                                            V1
 V2                                  Module 1                       V2                                                   Energy-Aware Pipeline. The energy-aware pipeline uses
                                                                V3
                                                                                                                         the systolic module cooperative mode in the Combination
                                    Weight

                                                                                                          Systolic       Engine. The vertex-by-vertex processing changes to a burst
                                                                                                           Array         mode, where a large group of vertices will be processed
         Aggregating Feature
        V(i+1)
               Vector
                                       Systolic            V(i+1)                                                        together every time. Although the vertex latency is longer,
 V(i+2)                                Module i         V(i+2)                                                           the energy consumption can be reduced due to the weight
                                                                                                                         propagation in the merged systolic arrays without redundant
                                                                                                      Weight
                                      Weight

                                                                                                                         accesses. Fig. 8(b) presents its timing sequence.
          (a)                             (b)                                                                              X(k-1)                Xk                                               Xk
                                                                                                                                                            Small Group
Figure 7: Different use of the systolic arrays: (a) inde-                                                                   I1 V 1 V 2 V 3 V 4                               V1 V2 V 3 V4              Large Group
                                                                                                                                                        V
pendent working mode; (b) cooperative working mode.                                                                         I2                 V1 V3 V4 V2                                  V1 V3 V4
                                                                                                                                                          3                                                      V1
                                                                                                                             I3                         V1 V4                                          V1 V4    V2
4.4.2                   Cooperative Working Mode                                                                             I4                               V1 VV1                                         V1 VV3
                                                                                                                                                                    4                                             4

   Besides working separately, these systolic modules can be                                                                    Time        V2           V3    V4 V 1        Time     V2            V3      V4 V1
                                                                                                                                          (a)            Aggregation      Combination            (b)
further assembled together to simultaneously process more
vertices, as shown in Fig. 7(b). Different from the immediate                                                            Figure 8: Timing illustration of different pipeline modes:
processing of vertices, this mode requires to assemble the ag-                                                           (a) latency-aware pipeline; (b) energy-aware pipeline.
gregated features of a large group of vertices together before
performing their Combination operations. The advantage is                                                                4.5.2        Coordination of Off-chip Memory Access
that, the weight parameters can flow from the Weight Buffer                                                                It is hard to determine the memory bandwidth ratio be-
to the downstream systolic modules and then gradually to the                                                             tween the two engines since the practical workloads usually
upstream ones (see Fig. 6(b)), which are greatly reused by all                                                           vary between Aggregation and Combination. Moreover, the
systolic arrays. This helps reduce the energy consumption.                                                               separation of memory systems will increase the configuration
   No matter which working mode is selected in the Combi-                                                                overheads and cause bandwidth waste. This is the reason why
nation Engine, the weights can be reused inherently in Weight                                                            we use only one off-chip memory.
Buffer when processing different vertices. However, in tradi-                                                                         Request Issue                          Data in DRAM Row Buffer
tional neural networks, especially MLPs, the weights cannot
be shared without batching technique. The multi-granular                                                                                                        Time                                           Time
                                                                                                                                                                    (a)
systolic array design is also specific to our architecture in                                                                         Request Issue                          Data in DRAM Row Buffer
order to accommodate different application needs.
                                                                                                                                                                Time                                           Time
                                                                                                                              Edges     Input Features                        Weights        Output Features
                                                                                                                                                                    (b)
4.5                    Inter-Engine Optimization
                                                                                                                            Figure 9: Coordination of off-chip memory access.
  To efficiently fuse the phase-by-phase execution, we or-
chestrate the execution pipeline and DRAM access of Aggre-                                                                 Both the two engines access this memory at runtime, which

                                                                                                                     8
causes a frequent switching of access locations, leading to
inefficiencies. In total, there are four buffers (Edge Buffer                     Table 4: Dataset information [23, 42].
& Input Buffer in Aggregation Engine, and Weight Buffer &                    Dataset          #Vertex     Feature Length        #Edge          Storage
Output Buffer in Combination Engine) that will be used for              IMDB-BIN (IB)          2,647            136            28,624          1.5MB
                                                                           Cora (CR)           2,708           1,433           10,556           15MB
accessing the off-chip memory. Due to the interval process-
                                                                          Citeseer (CS)        3,327           3,703            9,104           47MB
ing and pipeline mechanism, these accesses usually come                  COLLAB (CL)           12,087           492           1,446,010         28MB
concurrently as shown in Fig. 9(a). If we sequentially handle             Pubmed (PB)          19,717           500            88,648           38MB
these access requests, the discontinuous addresses greatly                Reddit (RD)         232,965           602          114,615,892       972MB
degrade the utilization of row buffer within DRAM.
   To solve this problem, we predefine an access priority             Table 5: Configuration of convolution layers. Here |akv |
(edges > input f eatures > weights > out put f eatures) to as-        denotes the length of feature vector akv .
semble the discontinuous requests shown in Fig. 9(b). The                                  #Sampling Neighbors      Aggregation & Combination (MLP)
motivation in using this priority is based on the access se-
                                                                         GCN (GCN)                   —                         Add & |akv |–128
quence when processing a vertex. The access requests are               GraphSage (GSC)               25                       Max & |akv |–128
executed batch-by-batch. Therefore, low-priority accesses in            GINConv (GIN)                —                       Add & |akv |–128–128
the current batch are handled before high-priority accesses                                       GCN pool                      GCNembedding
coming at the next batch, rather than always high-priority              DiffPool (DFP)
                                                                                               Min & |akv |–128                Min & |akv |–128
accesses first. With the improved continuity, the utilization
of row buffer can be significantly enhanced. Next, we remap
these reordered addresses to index the channel and bank using         length of feature vectors. On CPU, the datasets with more
low bits. In this way, the memory channel- and bank-level             than one graphs are tested by assembling randomly selected
parallelism can be further exploited.                                 128 graphs into a large graph before processing for GCN,
                                                                      GSC, and GIN or batching the same number of graphs for
5.    EVALUATION RESULTS                                              DFP. On HyGCN, the testing methods remain the same with
   We first describe our experimental setup in Section 5.1.           CPU except that the selected graphs for DFP are processed
Next, to demonstrate the advantages of our design, we com-            one by one rather than in a batched mode.
pare HyGCN to the state-of-the-art software framework in              Baseline Platform. To compare the performance and energy
Section 5.2. Next, we give the detailed analysis of our op-           consumption of HyGCN with state-of-the-art works, we eval-
timization techniques in Section 5.3. Finally, we present a           uate PyTorch Geometric (PyG) [15] on a Linux workstation
scalability exploration of our architecture in Section 5.4.           equipped with two Intel Xeon E5-2680 v3 CPUs and 378
                                                                      GB DDR4 memory and on an NVIDIA V100 GPU, denoted
5.1    Experimental Setup                                             as PyG-CPU and PyG-GPU, respectively. Table 6 lists the
                                                                      system configurations for above implementations.
Methodology. The performance and energy of HyGCN are
measured by using the following tools.                                                   Table 6: System configurations.
   Architecture Simulator. We design and implement a cycle-                       PyG-CPU       PyG-GPU                        HyGCN
accurate and execution-driven simulator to measure execution           Compute    2.5 GHz @     1.25Ghz @           1 GHz @ 32 SIMD16 cores and
time in number of cycles. This simulator models the microar-             Unit      24 cores     5120 cores    8 systolic modules (each with 4×128 arrays)
chitectural behaviors of each module, which is integrated              On-chip
                                                                                    60MB          34MB
                                                                                                             128 KB (Input), 2 MB (Edge), 2 MB (Weight),
                                                                       Memory                                  4 MB (Output) and 16 MB (Aggregation)
with Ramulator [24] to simulate the behaviors of memory
                                                                       Off-chip   136.5GB/s     ∼900GB/s                       256GB/s
accesses to High Bandwidth Memory (HBM).                               Memory       DDR4        HBM∼2.0                       HBM∼1.0
   CAD Tools. For the measurements of area, power, and                  Note: GPU’s on-chip memory includes the register files, and L1 and L2 caches.
critical path delay (in cycles) for each module, we implement
and synthesize each module in Verilog. We use the Synopsys
Design Compiler with the TSMC 12 nm standard VT library               5.2     Overall Results
for the synthesis, and estimate the power using Synopsys                 We first apply our algorithm optimization on PyTorch Geo-
PrimeTime PX. The slowest module has a critical path de-              metric. And then, we compare our work (HyGCN) with PyG-
lay of 0.9 ns including the setup and hold time, putting the          CPU and PyG-GPU in terms of speedup, energy consumption,
HyGCN comfortably at 1 GHz clock frequency.                           utilization of DRAM bandwidth, and DRAM access. Finally,
   Memory Measurements. The area, power, and access la-               the area and power of our design is presented.
tency of the on-chip scratchpad memory are estimated using            • Algorithm Optimization on PyG Framework. To show the
Cacti 6.5 [1]. Since Cacti only supports down to 32 nm tech-          effect of our algorithm optimization on CPU and GPU plat-
nologies, we apply four different scaling factors to convert          forms, we implement our algorithm optimization proposed
them to 12 nm technology as shown in [33, 36]. The energy             in Section 4.3 on PyG framework. The graph is partitioned
of HBM 1.0 is estimated with 7 pJ/bit as in [32, 41].                 into multiple shards and they are executed shard by shard
Benchmark Graph Datasets and GCN Models. Table 4                      (see Fig. 5(a)). The number of partitions is determined by
and Table 5 provide the information of the benchmark graph            the capacity of L2 Cache and the length of feature vectors.
datasets and GCN models used in our evaluation. The datasets          Note that, PyG leverages the Pytorch Scatter library [4] for
in Table 4 are standard ones in the GCN domain. They are              the acceleration of Aggregation on both CPU and GPU. It
actually not small although the number of vertices is smaller         helps eliminate the sparsity and exploit the edge parallelism
than that used in conventional graph analytics, due to the long       by executing each vertex’s Aggregation in a hardware thread.

                                                                  9
10                                                                               1                                                                                        10000
                                                              11.9                                                                                                                                PyG-GPU    HyGCN
          8                                                                           0.8                                                                                            1000

                                                                                                                                                                                                                                                                      13,427.7
                                                                            Speedup

                                                                                                                                                                           Speedup
Speedup

          6                                                                           0.6

                                                                                                                                                                                                                3,132.4
                                                                                                                                                                                                                2,793.3

                                                                                                                                                                                                                 1,854.9
                                                                                                                                                                                                                 1,721.9
                                                                                                                                                                                              1,660.9
                                                                                                                                                                                      100

                                                                                                                                                                                                                 1,213.5
                                                                                                                                                                                                                  1,085.5
                                                                                                                                                                                                                   999.4

                                                                                                                                                                                                                   730.3
                                                                                                                                                                                                                   703.3
                                                                                                                                                                                                                    534.2
          4                                                                           0.4

                                                                                                                                                                                                 422.7

                                                                                                                                                                                                                    412.8

                                                                                                                                                                                                                    401.2
                                                                                                                                                                                                 350.6

                                                                                                                                                                                                 282.2

                                                                                                                                                                                                                     265.3
                                                                                                                                                                                                                     255.3
                                                                                                                                                                                                                     248.2
                                                                                                                                                                                                  222.7
                                                                                                                                                                                                  191.7
                                                                                                                                                                                                  167.1

                                                                                                                                                                                                                      158.1

                                                                                                                                                                                                                      107.1
                                                                                                                                                                                                   101.1

                                                                                                                                                                                                                       92.9
                                                                                                                                                                                                    87.3

                                                                                                                                                                                                                       73.4
                                                                                                                                                                                                                       71.9

                                                                                                                                                                                                                       69.9
                                                                                                                                                                                                                       62.2
                                                                                                                                                                                       10

                                                                                                         OoM

                                                                                                                                                        OoM

                                                                                                                                                                                                              OoM

                                                                                                                                                                                                                                                                                 36.8
                                                                                                                                                                                                                                                                  OoM
                                                                                                                                                                                                     30.9

                                                                                                                                                                                                                                                                                  21.0
                                                                                                                                                                                                                                                                                  20.9
          2                                                                           0.2

                                                                                                                                                                                                                                                                                  17.1
                                                                                                                                                                                                                          4.8
          0                                                                            0                                                                                               1
              IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL                   IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL                                       IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL
                    GCN               GSC               GIN          DFP                          GCN               GSC               GIN                           DFP                              GCN                    GSC                     GIN                           DFP
                                            (a)                                                                           (b)                                                                                                           (c)
Figure 10: Comparistion to PyG-CPU and PyG-GPU: Speedup of our algorithm optimization on (a) CPU and (b) GPU;
(c) Speedup over the optimized PyG-CPU. OoM means the evaluation fails in running on GPU due to out of memory.
Furthermore, the hardware-optimized libraries such as Intel                                                                           greatly reduces the feature length before performing Aggrega-
MKL [21] and NVIDIA cuBLAS library [31] are used to                                                                                   tion. This difference causes the inefficient execution of GIN
accelerate Combination on CPU and GPU, respectively.                                                                                  on CPU and GPU, while our HyGCN can maintain the per-
    Fig. 10(a) shows the speedup of PyG-CPU with our algo-                                                                            formance to a great extent due to the parallel processing and
rithm optimization (PyG-CPU-OP) over the naive one with-                                                                              data reuse. For DFP, it includes three matrix multiplications
out optimization. Thanks to the algorithm improvement,                                                                                (see Equation (8)) that can be efficiently executed on CPU
PyG-CPU-OP achieves 2.3× speedup on average. The per-                                                                                 and GPU. Therefore, our speedup when performing DFP is
formance benefits come from the reduction of frequent re-                                                                             relatively lower. The GSC model consumes significant time
placement of feature vectors since the reusable features after                                                                        on the Sampling operation in a preprocessing step, which is
graph partition and the intermediate results of Aggregation                                                                           not included in the result of PyG-CPU and PyG-GPU. For
are buffered in L2 Cache. Fig. 10(b) presents the same testing                                                                        example on the RD dataset, the preprocessing can cost up to
on GPU. The performance of PyG-GPU-OP degrades since                                                                                  15 seconds while the execution time is only 0.65 second on
only a small amount of vertices are processed for each graph                                                                          PyG-CPU and 0.0025 second on PyG-GPU. In our work, the
partition, which cannot fully utilize thousands of hardware                                                                           Sampling operation is executed together with Aggregation
threads on GPU and miss the core advantage of GPU to hide                                                                             and considered in the reported result. Thus, the performance
the access latency through many parallel threads. As a result,                                                                        of our work is lower than PyG-GPU in Fig. 10(c) but the
it is inefficient for GPU to exploit our optimization to im-                                                                          overall execution time ratio is 0.136 second v.s. 15.7 seconds.
prove performance. The optimized PyG-CPU and the naive                                                                                • Energy Consumption. As Fig. 11 shows, HyGCN con-
PyG-GPU are used as baselines in the following evaluation.                                                                            sumes only 0.04% and 10% energy on average compared
• Speedup. Fig. 10(c) depicts that HyGCN achieves average                                                                             to PyG-CPU and PyG-GPU, respectively. The energy con-
1509× and 6.5× speedup compared with PyG-CPU and PyG-                                                                                 sumption of all platforms includes the off-chip memory. Note
GPU, respectively. The performance improvement comes                                                                                  that, although the results of PyG-CPU and PyG-GPU do not
from the individual optimizations in Aggregation Engine &                                                                             include the overhead of the Sampling operation, they are still
Combination Engine, and the inter-engine pipeline & coordi-                                                                           costly. For example, the Sampling energy of GSC is 2715J
nation. First, the parallel processing in SIMD cores and sys-                                                                         on the RD dataset. In contrast, our work consumes only 1.79J
tolic arrays speed up the computations. Second, the graph par-                                                                        compared to the total 2716J in PyG-GPU.
tition and sparsity elimination increase the feature reuse and
                                                                                                                                                                                                            PyG-GPU                   HyGCN
decrease redundant accesses in Aggregation Engine, which                                                                                                           0.6
                                                                                                                                            Norm. Energy (%)

                                                                                                                                                                                                                              1.031

                                                                                                                                                                                                                                                              1.2
                                                                                                                                                                          1.1

                                                                                                                                                                                                                                                                      0.9
saves DRAM bandwidth. Third, the weight parameters are                                                                                                             0.5
reused efficiently in Combination Engine, which also helps                                                                                                         0.4
better utilize the bandwidth. Finally, the inter-engine pipeline                                                                                                   0.3
                                                                                                                                                                                                                                  0.6
                                                                                                                                                                                                                    0.5
                                                                                                                                                                                        0.5

                                                                                                                                                                                                       0.4

                                                                                                                                                                   0.2
further optimizes the parallelism and the off-chip memory
                                                                                                                                                                                                                      0.3
                                                                                                                                                                                                     OoM

                                                                                                                                                                                                                                                      OoM
                                                                                                                                                                            0.3
                                                                                                                                                                            0.003
                                                                                                                                                                            0.033
                                                                                                                                                                                            0.126
                                                                                                                                                                                            0.004
                                                                                                                                                                                            0.035
                                                                                                                                                                                                         0.005
                                                                                                                                                                                                         0.002
                                                                                                                                                                                                         0.036
                                                                                                                                                                                                                       0.121
                                                                                                                                                                                                                       0.001
                                                                                                                                                                                                                       0.110

                                                                                                                                                                                                                                      0.005
                                                                                                                                                                                                                                      0.005
                                                                                                                                                                                                                                      0.014
                                                                                                                                                                                                                                      0.002
                                                                                                                                                                                                                                      0.010
                                                                                                                                                                                                                                                          0.001
                                                                                                                                                                                                                                                                  0.066
                                                                                                                                                                                                                                                                             0.123
                                                                                                                                                                                                           0.2

                                                                                                                                                                                                                                        0.2

                                                                                                                                                                   0.1
                                                                                                                                                                                              0.2
                                                                                                                                                                                              0.2

                                                                                                                                                                                                                          0.1

                                                                                                                                                                                                                          0.1

                                                                                                                                                                                                                                         0.1

                                                                                                                                                                                                                                         0.1
                                                                                                                                                                                                                                         0.1

access coordination improves the DRAM access efficiency.                                                                                                            0
    For PyG-CPU and PyG-GPU, abundant DRAM accesses                                                                                                                        IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL

and synchronization overheads lead to performance degrada-                                                                                                                                  GCN                       GSC                     GIN                  DFP

tion. Specifically, the high randomness of neighbor indices                                                                                                        Figure 11: Normalized energy over PyG-CPU.
results in poor locality of neighbors’ feature vectors, caus-                                                                                                                           Aggregation Engine            Combination Engine        Coordinator
                                                                                                                                            Energy Breakdown (%)

                                                                                                                                                             100
ing many unnecessary DRAM accesses. From the perspec-
                                                                                                                                                                   80
tive of computation, PyG-CPU and PyG-GPU leverage the
                                                                                                                                                                   60
hardware-optimized functions (such as scatter [4] and matrix
                                                                                                                                                                   40
multiplication [31]) to perform GCNs in a coarse-grained                                                                                                           20
fashion. Although it is the best way to utilize CPU and GPU,                                                                                                        0
it loses the inter-phase parallelism and produces redundant                                                                                                               IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL
operations. The delay for data copy and synchronization                                                                                                                                     GCN                       GSC                     GIN                 DFP
between threads further degrades the performance.                                                                                                                        Figure 12: Energy breakdown of HyGCN.
    In term of models, GIN achieves better performance than                                                                              As aforementioned, GIN causes additional computations
others. The underlying reason is that GIN executes Aggrega-                                                                           and data accesses when performing Aggregation, which in-
tion first on PyG-CPU and PyG-GPU, which introduces abun-                                                                             troduces extra energy consumption on PyG-CPU. Although
dant computations and accesses since the feature vector size                                                                          HyGCN cannot reduce these computations, the optimiza-
is an order of magnitude larger than that after Combination.                                                                          tions of data reuse, sparsity elimination, and inter-engine
By contrast, other models execute Combination first, which                                                                            pipeline can reduce redundant accesses to these additional

                                                                                                                                10
You can also read