Domino: A Tailored Network-on-Chip Architecture to Enable Highly Localized Inter- and Intra-Memory DNN Computing
Kaining Zhou*, Yangshuo He*, Rui Xiao*, and Kejie Huang*
*Zhejiang University
ABSTRACT

The ever-increasing computation complexity of fast-growing Deep Neural Networks (DNNs) has called for new computing paradigms to overcome the memory wall in conventional Von Neumann computing architectures. The emerging Computing-In-Memory (CIM) architecture is a promising candidate to accelerate neural network computing. However, the data movement between CIM arrays may still dominate the total power consumption in conventional designs. This paper proposes a flexible CIM processor architecture named Domino that enables stream computing and local data access to significantly reduce data movement energy. Meanwhile, Domino employs tailored distributed instruction scheduling within a Network-on-Chip (NoC) to implement inter-memory computing and attain mapping flexibility. The evaluation with prevailing CNN models shows that Domino achieves 1.15-to-9.49× power efficiency over several state-of-the-art CIM accelerators and improves the throughput by 1.57-to-12.96×.

1.   INTRODUCTION

The rapid development of Deep Neural Network (DNN) algorithms has led to high energy consumption due to the millions of parameters and billions of operations in one inference [19, 37, 39]. Meanwhile, the increasing demand for Artificial Intelligence (AI) computing entails flexible and power-efficient computing platforms to reduce inference energy and accelerate DNN processing. However, the shrinkage of Complementary Metal-Oxide-Semiconductor (CMOS) technology nodes abiding by Moore's Law is nearing its end. Meanwhile, conventional Von Neumann architectures encounter a "memory wall": the delay and power dissipation of accessing data are much higher than those of an Arithmetic Logic Unit (ALU), a consequence of the physical separation between storage and computing components. Therefore, new computation paradigms are needed to meet the computing power demand of AI devices in the post-Moore's-Law era.

One of the most promising solutions is to adopt the Computing-in-Memory (CIM) scheme, which greatly increases parallel computation speed at much lower computation power. Recently, both volatile and non-volatile memories have been proposed as computing memories for CIM. SRAM is a volatile memory that enables partially digital [10, 17, 23, 47] or fully digital [11, 22] CIM schemes and allows weight updating during inference. Resistive Random Access Memory (ReRAM), which has shown great advantages of high density, high resistance ratio, and low reading power, is one of the most promising candidates for CIM schemes [7, 29, 36, 44, 45].

However, most research focuses only on the design of the CIM array and lacks a flexible top-level architecture for configuring the storage and computing units of DNNs. Due to the high write power of ReRAM, weight updating should be minimized [35, 38]. Therefore, new flexible interconnect architectures and mapping strategies should be employed to meet the various requirements of DNNs while achieving high component utilization and energy efficiency. There are two main challenges in designing a low-power flexible CIM processor. The first challenge is that complicated 4-D tensors have to be mapped to 2-D CIM arrays; the conventional 4-D tensor flattening method has to duplicate data or weights, which increases the on-chip memory requirement. The second challenge comes from the flexibility required to support various neural networks: partial-sums and feature maps are usually stored in external memory, and an external processor is generally required to maintain complicated data sequences and computing flows.

Network-on-Chip (NoC), with its high parallelism and scalability, has attracted a lot of attention from industry and academia [3]. In particular, an NoC can optimize the processing of DNN algorithms by organizing multiple cores uniformly under specified hardware architectures [1, 6, 8, 9, 13]. This paper proposes a tailored NoC architecture called Domino to enable highly localized inter- and intra-memory computing for DNN inference. Intra-memory computing is the same as in a conventional CIM array: Multiplication-and-Accumulation (MAC) operations are executed in memory. Inter-memory computing means that the rest of the computation (partial-sum addition, activation, and pooling) is performed in the network while data are moving between CIM arrays. Consequently, a "computing-on-the-move" dataflow is proposed to maximize data locality and significantly reduce the energy of data movement. The dataflow is controlled by distributed local instructions instead of an external/global controller or processor. A synchronization and weight duplication scheme is put forward to maximize parallel computing and throughput. A highly energy-efficient CIM array is also proposed. It is worth noting that various CIM schemes can be adopted in our proposed Domino architecture. The contributions of this paper are as follows:
• We propose an NoC-based CIM processor architecture with distributed local memory and dedicated routers for input feature maps and output feature maps, which enables flexible dataflow and reduces data movement significantly. The evaluation results show that Domino improves power efficiency and throughput by more than 15% and 57%, respectively.

• We define a set of instructions for Domino. The distributed, static, and localized instruction schedules are preloaded to Domino to control specific actions at runtime. Consequently, our scheme avoids the overhead of instruction and address movement in the NoC.

• We design a "computing-on-the-move" dataflow that performs Convolutional Neural Network (CNN) related operations with our proposed NoC and local instruction tables. MAC operations are completed in CIMs, and other associated computations like partial-sum addition, pooling, and activation are executed in the network while data are moving between CIM arrays.

The rest of the paper is organized as follows: Section 2 introduces the background of CNNs, dataflow, and NoCs; Section 3 details the opportunities and innovations with respect to Domino; Section 4 describes the architecture and building blocks of Domino; Section 5 illustrates the computation and dataflow model; Section 6 defines the instructions and their execution; Section 7 presents the evaluation setup, experimental results, and comparisons; Section 8 introduces related works; finally, Section 9 concludes this work.

2.   BACKGROUND

The high computation cost of DNNs has posed challenges to AI devices for power-efficient computing in real-time environments. Conventional processors such as CPUs and GPUs are power-hungry and inefficient for AI computation. Therefore, accelerators that improve computing efficiency are under intensive development to meet the power requirements of the post-Moore's-Law era.
2.1   CNN Basics

An essential computation operation of a CNN is convolution. Within a convolution layer (CONV layer), 3-D input activations are stacked as an Input Feature Map (IFM) to be convolved with a 4-D filter. The value and shape of the Output Feature Map (OFM) are determined by the convolution results and other configurations such as the convolution stride and padding type.

Parameter   Description
H / W       IFM height / width
C           Number of IFM / filter channels
P           Padding size
K           Convolution filter height / width
Kp          Pooling filter height / width
M           Number of filters / OFM channels
E / F       OFM height / width
S           Convolution stride
Sp          Pooling stride

Table 1: Shape parameters of a CNN.

Given the parameters in Tab. 1, and letting O, I, W, and B denote the OFM tensor, IFM tensor, weight tensors (filters), and bias (linear) tensor, respectively, the total OFM can be calculated as

$$O[m][x][y] = \sum_{c=0}^{C-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} I[c][Sx+i][Sy+j] \times W[m][c][i][j] + B[m], \quad (1)$$
$$0 \le m < M, \quad 0 \le x < E, \quad 0 \le y < F,$$
$$E = \left\lfloor \frac{H + 2P - K + S}{S} \right\rfloor, \quad F = \left\lfloor \frac{W + 2P - K + S}{S} \right\rfloor.$$

Usually, a nonlinear activation is applied to the OFM, followed by pooling to reduce the spatial resolution and suppress data variation and distortion. After a series of CONV and pooling operations, Fully Connected (FC) layers are applied to classify the target objects.
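For concreteness, Eqn. (1) can be evaluated directly with the shape parameters of Tab. 1. The following NumPy sketch is our own reference code for the math, not part of Domino or its CIM mapping; all names are illustrative:

```python
import numpy as np

def conv_layer(I, W, B, S=1, P=0):
    """Direct evaluation of Eqn. (1): I is a C x H x W IFM, W holds M
    filters of shape C x K x K, and B is the length-M bias vector."""
    C, H, W_in = I.shape
    M, _, K, _ = W.shape
    E = (H + 2 * P - K + S) // S                  # floor((H + 2P - K + S) / S)
    F = (W_in + 2 * P - K + S) // S
    I_pad = np.pad(I, ((0, 0), (P, P), (P, P)))   # zero padding of size P
    O = np.zeros((M, E, F))
    for m in range(M):
        for x in range(E):
            for y in range(F):                    # one OFM pixel per (m, x, y)
                window = I_pad[:, S * x:S * x + K, S * y:S * y + K]
                O[m, x, y] = np.sum(window * W[m]) + B[m]
    return O
```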
2.2   Dataflow

A dataflow rules how data are generated, calculated, and transmitted on an interconnected multi-core chip under a given topology. General dataflows for CNN accelerators can be categorized into four types, based on the taxonomy and terminology proposed in [42]: Weight Stationary (WS), Input Stationary (IS), Output Stationary (OS), and Row Stationary (RS).

Kwon et al. propose MAERI, a flexible dataflow mapping for DNNs with reconfigurable interconnects [25, 27]. Chen et al. present a row-stationary dataflow adapted to their spatial architecture designed for CNN processing [9]. Building on that, [8] is put forward as a more hierarchical and flexible architecture. FlexFlow is another dataflow model, dealing with the mismatch between the parallelism types of the computation and the CNN workloads [31]. These works attempt to take the best advantage of computation parallelism, data reuse, and flexibility [5, 16, 34].

2.3   Network-on-Chip

Current CIM-based DNN accelerators use a bus-based H-tree interconnect [33, 40], where, for every type of CNN, most of the latency is spent on communication [32]. A bus has limited address space and is forced to be synchronized on a complex chip. In contrast, an NoC can span synchronous and asynchronous clock domains or use asynchronous logic that is not clock-bound, improving a processor's scalability and power efficiency.

Unlike simple bus connections, routers and Processing Elements (PEs) are linked according to the network topology, which enables a variety of dataflows on an NoC. Given NNs' characteristics of massive calculation and heavy traffic requirements, the NoC is a prospective solution that provides an adaptive architectural basis. Kwon et al. propose an NoC generator that generates customized networks for dataflow within a chip [26]. Firuzan et al. present a reconfigurable NoC architecture for 3-D memory-in-logic DNN accelerators [15]. Chen et al. put forward Eyeriss v2, which achieves high performance with a hierarchical mesh structure [8].
3.   OPPORTUNITIES & INNOVATIONS

Aimed at the challenges mentioned in Section 1, we identify and point out opportunities lying in prior propositions.

3.1   Opportunities

Opportunity #1. CIM schemes significantly reduce weight movement. However, they may still have to access the global buffer or external memory to complete operations such as tensor transformation, partial addition, activation, and pooling. Methods like image-to-column and systolic matrices need to duplicate input data, leading to additional storage requirements and low data reuse. Therefore, the power consumption may still be dominated by data movement.

Opportunity #2. Some overhead still exists in instruction and address transmission on an NoC. Instructions and addresses need to be dispatched from a global controller and transmitted through the routers. These overheads cause extra bandwidth requirements and bring in long latency and synchronization issues.

3.2   Domino Innovations

In the face of the aforementioned opportunities and challenges, we employ the "computing-on-the-move" dataflow and localized static schedule tables, together with tailored routers, to avoid the cost of off-chip accesses and of transmitting instructions over the NoC. The following are our innovations in Domino's architecture.

Innovation #1. Domino adopts a CIM array to minimize the energy cost of weight refreshing and transmission. The data are transmitted through a dedicated dual-router NoC structure, where one router handles input-feature-map processing and the other is equipped with a computation unit and a schedule table for localized control and partial-sum addition. The localized data processing greatly reduces the energy for data movement.

Innovation #2. Domino adopts the "computing-on-the-move" dataflow to reduce excessive and complex input and partial-sum movement and tensor transformation. In prior CIM-based architectures, the data have to be duplicated or stored in extra buffers for future reuse, and partial-sum addition is performed either in an external accumulator or in a space-consuming adder tree outside the PE arrays [27]. The "computing-on-the-move" dataflow used in Domino aims at performing the extra computation during data movement and maximizing data locality. Domino reuses inputs by transferring them over the array of tiles and adds partial-sums along unified routers on the NoC. The overall computation can be completed on-chip, and no off-chip data accesses are required.

Innovation #3. Domino adopts localized instructions in routers to achieve flexible and distributed control. We identify that existing designs of "computing-on-the-move" dataflow, like [21], rely on an external controller or on header information, including source and destination, carried in transmission packets. These introduce additional transmission delay, bandwidth requirements, and power consumption. A local instruction table enables distributed and self-controlled computation, reducing the energy and time consumed by external instructions or control signals. The instructions fit and support Domino's "computing-on-the-move" dataflow well.

4.   DOMINO ARCHITECTURE

This section details the designed architecture for DNN processing. To boost DNN computation while maintaining flexibility, we propose an architecture called Domino. The purpose of Domino's architecture is to enable "computing-on-the-move" within a chip, so Domino assigns hardware resources uniformly and identically to each building block at every level of the hierarchy.

From a top view, Domino consists of an input buffer and Ar × Ac tiles interconnected in a 2-D mesh NoC. The number of tiles is adjusted for different applications. The weights and configuration parameters are loaded initially, and the input buffer is used to store the required input data temporarily. A Domino block is an array of tiles virtually split in the mesh NoC to serve a DNN layer. A tile contains a PE performing in-memory DNN computation and two routers transmitting the results of the tile. By this means, Domino achieves a high level of distributed computation and uniformity, making it a hierarchical, flexible, and easily reconfigurable DNN processor architecture.

4.1   Domino Block

Domino is virtually split into blocks corresponding to the layers in a neural network. Let mt and ma denote the number of rows and columns of the tile array in a block. In the initialization and configuration stages, an mt × ma array of tiles is assigned to form a block that deals with a CONV layer together with a pooling layer (if needed) or an FC layer. A block provides interconnection via bi-directional links between tiles in four directions, as shown in Fig. 1. A Domino block provides diverse organizations of tiles for various dataflows based on layer configurations, including DNN weight duplication and block reuse schemes. For instance, in pursuit of layer synchronization, a block adopts weight duplication so that it processes ma rows of pixels simultaneously. We will discuss the details of block variations in the dataflow section.

Figure 1: The top-level block diagram of the Domino architecture. A Domino block is an mt × ma array of tiles on the NoC for the computation of a DNN layer.
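As a rough illustration of how a layer's shape determines the block it occupies, the sketch below estimates (mt, ma) from the layer dimensions and the PE crossbar size. The function, its parameters, and the crossbar size used in the example are hypothetical; the actual assignment is performed by Domino's configuration stage:

```python
import math

def block_shape(layer_type, C_in, C_out, K=1, N_c=576, N_m=128, dup=1):
    """Estimate the tile-array shape (m_t, m_a) of a Domino block.
    A back-of-the-envelope sketch of Sections 4.1 and 5: N_c/N_m are the
    PE crossbar rows/columns and dup is a weight-duplication factor."""
    rows_per_point = math.ceil(C_in / N_c)   # tiles to hold one filter point
    cols = math.ceil(C_out / N_m)            # tile columns for output channels
    if layer_type == "FC":
        return rows_per_point, cols          # partitioned MVM, Eqn. (2)
    # CONV: K*K filter points, e.g. the K^2 x 1 arrangement of Fig. 6 (a),
    # replicated 'dup' times for layer synchronization (Section 5.3).
    return K * K * rows_per_point, cols * dup

# Example: a 3x3 CONV layer, 64 input / 128 output channels, on an assumed
# 576x128 crossbar, occupies a 9 x 1 block; duplicated 4x it becomes 9 x 4.
print(block_shape("CONV", 64, 128, K=3))         # (9, 1)
print(block_shape("CONV", 64, 128, K=3, dup=4))  # (9, 4)
```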
“enable" signals to I/O ports to receive or transmit data based
                                                                      on the initial configuration and the counter’s value.

                                                                      4.4    Domino Rofm
                                                                         Rofm is the key component for “computing-on-the-move”
                                                                      dataflow controlled by instructions to manage I/O ports and
                                                                      buffers, add up partial/group-sum results, and perform acti-
                                                                      vation or pooling to get convolution results. Fig. 2 shows
                                                                      the micro-architecture in Rofm, which consists of a set of
                                                                      four-direction I/O ports, input/output registers, an instruction
                                                                      schedule table, a counter to generate instruction indices, a
                                                                      Rofm buffer to store partial computation results, a reusable
                                                                      adder, a computation unit with adequate functions, and a de-
                                                                      coder. The instructions are generated by the compiler based
                                                                      on DNN configurations. The instructions in the schedule
                                                                      table are executed periodically based on initial configurations.
                                                                      The internal counter starts functioning and keeps increasing
                                                                      its value as soon as any input packet is received. The decoder
                                                                      reads in the value of counter every cycle used as the index
                                                                      of the schedule table to fetch an instruction. Instructions are
                                                                      then decoded and split into control words to control ports,
                                                                      buffers, and other computation circuits within Rofm. The
                                                                      partial-sums are added to group-sums when transferring be-
                                                                      tween tiles. The group-sums are queued in the buffer for
                                                                      other group-sums to be ready and then form a complete com-
Figure 2: A Domino tile contains two routers Rifm and                 putation result.
Rofm and a computation center PE.                                        A computation unit is equipped in each Rofm to cope with
                                                                      the non-linear operations in DNN computation, including
                                                                      activation and pooling, as shown in Fig. 2. Activation is
router transferring IFMs called Rifm, and a router transfer-
                                                                      only used in the last tile of a block. The comparator is used
ring OFMs or their partial-sums in convolution computation
                                                                      for the max-pooling, which outputs the larger value of the
called Rofm. The basic structure of a tile is illustrated in
                                                                      convolution results from adjacent Rofms. The multiplier and
Fig. 2. Rifm receives input data from one out of four direc-
                                                                      adder perform the average-pooling.
tions in each tile and controls the input dataflow to remote
Rifm, local PE, and local Rofm. The in-memory computing
starts from the Rifm buffer and ends at Analog-to-Digital
(ADC) converters in PE. The outputs of PE are sent to Rofm
for temporary storage or partial-sum addition. Rofm is con-
trolled by a series of periodic instructions to receive either
computation results or input data via a shortcut from Rifm
and maintain the dataflow to adding up partial-sums.
4.3    Domino Rifm
   As shown in Fig. 2, each Rifm possesses I/O ports in four
directions to communicate with Rifms in the adjacent tiles.
It is also equipped with a buffer called Rifm buffer to store         Figure 3: Domino PE: (a) The block diagram of the
received input data in the current cycle. Moreover, Rifm              Domino PE, (b) the neuron circuit in the Domino PE, and
has in-tile connections with PE and Rofm. The bits in the             (c) the simulation waveform of the PE.
Rifm buffer controlled by an 8-to-1 MUX are sequentially
sent to PE for MAC computations. The buffer size is 8×Nc
bits where Nc is the number of rows in PE (PE consists a              4.5    Domino PE
Nc × Nm ReRAM array which is discussed in Section 4.5). It               PE is the computation center that links a Rifm and a Rofm
supports an in-buffer shifting operation with a step size of 64       in the same tile. Fig. 3 (a) shows the architecture of the pro-
or the multiple of 64, which supports the case when the input         posed Domino PE, which is composed of a 1T1R crossbar
channels of a layer are less than Nc . A shortcut connection          array, current mirrors, two integrators, and a Successive Ap-
from Rifm to Rofm is established to support the situation that        proximation Register (SAR) ADC [50]. As shown in Fig. 3
MAC computation is skipped (i.e., the shortcut in a ResUnit).         (b), eight single-level 1T1R cells are used to represent one
A counter and a controller in Rifm decide input dataflow              8-bit weight w[7:0] . The voltage on ReRAM is clamped by the
based on the initial configuration. Once Rifm receives input          gate voltage and threshold voltage, which will generate two
packets, the counter starts to increment the counter’s value.         levels’ of current based on the status of ReRAM and access
The controller chooses to activate MAC computation or sends           transistors. Weight bits are divided into two sets (BL7−4 and

                                                                  4
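The per-cycle behavior described above can be paraphrased in software. The following sketch uses an illustrative control-word layout (the real fields are defined in Section 6) to show a counter-indexed schedule table steering buffering, partial-sum addition, and forwarding:

```python
from collections import deque

def rofm_cycle(schedule, state, packet):
    """One clock cycle of an Rofm, paraphrasing Section 4.4: the counter
    indexes the static schedule table, and the fetched control word steers
    buffering, partial-sum addition, and forwarding. The control-word
    fields here are illustrative, not Domino's actual encoding."""
    if packet is None:
        return None                                   # idle until input arrives
    cw = schedule[state["counter"] % len(schedule)]   # periodic fetch
    state["counter"] += 1                             # counter keeps increasing
    if cw["add"] and state["buffer"]:
        packet += state["buffer"].popleft()           # partial-sum -> group-sum
    if cw["store"]:
        state["buffer"].append(packet)                # queue for a later sum
        return None
    return packet                                     # forward on a Tx port

state = {"counter": 0, "buffer": deque()}
schedule = [{"add": False, "store": True}, {"add": True, "store": False}]
print(rofm_cycle(schedule, state, 3))   # None: stored, waiting for its partner
print(rofm_cycle(schedule, state, 4))   # 7: popped 3, added, forwarded
```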
Figure 3: The Domino PE: (a) the block diagram of the Domino PE, (b) the neuron circuit in the Domino PE, and (c) the simulation waveform of the PE.

4.5   Domino PE

The PE is the computation center that links the Rifm and the Rofm of the same tile. Fig. 3 (a) shows the architecture of the proposed Domino PE, which is composed of a 1T1R crossbar array, current mirrors, two integrators, and a Successive Approximation Register (SAR) ADC [50]. As shown in Fig. 3 (b), eight single-level 1T1R cells are used to represent one 8-bit weight w[7:0]. The voltage on a ReRAM cell is clamped by the gate voltage and the threshold voltage, which generates two levels of current depending on the status of the ReRAM cell and the access transistor. The weight bits are divided into two sets (BL7−4 and BL3−0). In each set, four current mirrors provide the significance for each bit line (8k, 4k, 2k, and k, where k is the gain that controls the integration speed). The outputs of the current mirrors are accumulated in the integrators. The higher four bits and the lower four bits are joined by charge redistribution between the capacitors of the two integrators with a ratio of 16:1. The significance of the input data is also realized by averaging the charge between the integrators and the ADC [50]. The simulation waveform of the integration and charge-sharing process is shown in Fig. 3 (c). In the initial configuration, the neural network weights are mapped to the ReRAM arrays in the PEs, which will be discussed in detail in Section 5.
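A purely digital stand-in for the neuron of Fig. 3 (b) helps check the weight encoding: the two four-bit groups are integrated with per-bit-line gains 8k down to k and merged with the 16:1 charge-redistribution ratio. This is our numerical reading of the text, not a circuit model:

```python
def pe_weight_readout(w_bits, k=1.0):
    """Digital model of the 8-bit weight encoding of Fig. 3 (b): eight
    single-level 1T1R cells, nibble-wise integration with gains 8k..k,
    then a 16:1 merge of the two integrator charges. Gains and the merge
    arithmetic are our interpretation of the prose, not measured values."""
    assert len(w_bits) == 8                    # w_bits[i] is bit i of the weight
    gains = [k, 2 * k, 4 * k, 8 * k]           # per-bit-line current-mirror gain
    low = sum(b * g for b, g in zip(w_bits[0:4], gains))   # BL3-BL0 integrator
    high = sum(b * g for b, g in zip(w_bits[4:8], gains))  # BL7-BL4 integrator
    return 16 * high + low                     # 16:1 charge redistribution

w = 0xB7                                       # example weight, decimal 183
bits = [(w >> i) & 1 for i in range(8)]
print(pe_weight_readout(bits))                 # 183.0 for k = 1
```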
5.   DATAFLOW MODEL

We propose the "computing-on-the-move" dataflow to reduce both data movement and data duplication. Below we introduce the dataflow in FC layers, CONV layers, and pooling layers, and then reveal how it supports different types of DNNs and achieves high performance. Other "computing-on-the-way" schemes, such as MAERI [27] and Active-Routing [21], map computation elements onto a memory network for near-memory computing and aggregate intermediate results in a tree adder, controlled by a host CPU, while transmitting. In contrast, Domino conducts the MAC operations in memory. The partial-sums are added up into group-sums, which are stored in the buffer until the other group-sums are ready for addition; the group-sums are then added up while moving along the routers, controlled by the local schedule tables. Domino only transmits data, whereas MAERI/Active-Routing packets carry extra information such as operators and flow states.
5.1   Dataflow in FC Layers

In an FC layer, IFMs are flattened to a vector with dimension 1 × Cin to perform a Matrix-Vector Multiplication (MVM) with the weights. The MVM can be formulated as y = xW, where x ∈ R^(1×Cin) and y ∈ R^(1×Cout) are the input and output vectors, respectively, and W ∈ R^(Cin×Cout) is a weight matrix.

In most cases, the input and output vector dimensions in FC layers are larger than the size of a single crossbar array in a PE. Therefore, the partitioned matrix multiplication described in Eqn. 2 is used to deal with this issue: the Cin × Cout weight matrix is divided into mt × ma smaller matrices Wij, each with dimension Nc × Nm. The input vector is divided into mt small slices xi to multiply the partitioned weight matrices. As shown in Fig. 4 (a), the sum of the multiplication results within a column of partitioned matrices is a slice yj of the output vector. The slices are concatenated to produce the complete output:

$$x = [x_1, x_2, \cdots, x_{m_t}]$$
$$W = \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1 m_a} \\ \vdots & \vdots & \ddots & \vdots \\ W_{m_t 1} & W_{m_t 2} & \cdots & W_{m_t m_a} \end{bmatrix} \quad (2)$$
$$y = [y_1, y_2, \cdots, y_{m_a}], \qquad y_j = \sum_{i=1}^{m_t} x_i W_{ij}$$

We propose a mapping scheme to efficiently handle the partitioned matrix multiplication in our tile-based Domino. As shown in Fig. 4 (b), the partitioned blocks are mapped to mt × ma tiles (mt = ⌈Cin/Nc⌉, ma = ⌈Cout/Nm⌉); in this example the array is divided into four rows and four columns. The input vectors are transmitted to all ma columns of tiles. The multiplication results, ❶ to ❹, are added while transmitting along a column. The final addition results in the last tiles of the four columns, U to Z, are small slices of the output vector. Concatenating the small slices of all columns gives the complete partitioned matrix multiplication result.

Figure 4: The proposed mapping and dataflow in FC layers: (a) the partitioned input vector and weight matrix; (b) the dataflow that transmits and adds multiplication results into a complete MVM result.
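The partitioned MVM of Eqn. (2) can be verified numerically. The sketch below tiles the weight matrix into Nc × Nm blocks, accumulates down each tile column as in Fig. 4 (b), and concatenates the column results; the sizes in the example are hypothetical:

```python
import numpy as np

def partitioned_mvm(x, W, N_c, N_m):
    """Block-partitioned MVM of Eqn. (2): slice x_i multiplies the blocks
    W_ij of column j, partial products accumulate down the column, and the
    column results y_j are concatenated. A NumPy check, not the on-chip
    implementation."""
    C_in, C_out = W.shape
    y = []
    for j in range(0, C_out, N_m):              # one tile column per slice y_j
        y_j = np.zeros(min(N_m, C_out - j))
        for i in range(0, C_in, N_c):           # accumulate along the column
            y_j += x[i:i + N_c] @ W[i:i + N_c, j:j + N_m]
        y.append(y_j)
    return np.concatenate(y)

rng = np.random.default_rng(0)
x, W = rng.random(1000), rng.random((1000, 300))
assert np.allclose(partitioned_mvm(x, W, N_c=256, N_m=128), x @ W)
```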
5.2   Dataflow in CONV Layers

Based on Eqn. 1, the computation of a pixel in the OFM is the sum of the point-wise MAC results within a sliding window. As shown in Fig. 5, we define the Nm point-wise and row-wise MAC results as the partial-sum and the group-sum, respectively. The figure illustrates our distributed way of adding partial-sums and the timing control of the "computing-on-the-move" dataflow. Let $P_a^{(l)}$ denote the partial-sum of the $l$-th pixel in the $a$-th sliding window and $G_a^{(b)}$ denote the group-sum of the $b$-th row in the $a$-th sliding window:

$$P_a^{(l)} = \sum_{c=0}^{C-1} I[Sx+i][Sy+j][c] \times W[i][j][c]$$
$$G_a^{(b)} = \sum_{c=0}^{C-1} \sum_{j=0}^{K-1} I[Sx+b][Sy+j][c] \times W[b][j][c] \quad (3)$$
$$O[x][y] = \sum_{b=0}^{K-1} G_a^{(b)}$$

where $l = iK + j$ and $a = xF + y$; $P_a^{(l)}$, $G_a^{(b)}$, $O[x][y]$, and $W[i][j][c] \in \mathbb{R}^M$. As shown in Fig. 5 (a), K partial-sums (❶, ❷, and ❸) can be added up into a group-sum (U1). Fig. 5 (b) illustrates how K group-sums (U1, U2, and U3) are added up into a complete convolution result U (and likewise for X, Y, and Z). U1, U2, and U3 are sequentially generated and summed up one by one, at different times and in different tiles. The group-sums wait in the buffer until the other group-sums are ready, or are evicted once they are no longer needed.
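To make the decomposition of Eqn. (3) concrete, the following sketch recomputes a convolution (for a single output channel, stride S, no padding) by first forming row-wise group-sums from K partial-sums and then summing the K group-sums per output pixel. It is a numerical check of the math, not the on-chip schedule:

```python
import numpy as np

def conv_by_group_sums(I, W, S=1):
    """Eqn. (3) for one output channel: point-wise partial-sums collapse
    into row-wise group-sums G_a^(b), and K group-sums give O[x][y].
    Layouts follow the equation: I is H x W x C, W is K x K x C."""
    K = W.shape[0]
    H, Wd, C = I.shape
    E, F = (H - K) // S + 1, (Wd - K) // S + 1
    O = np.zeros((E, F))
    for x in range(E):
        for y in range(F):
            for b in range(K):                 # one group-sum per filter row
                G = sum(np.dot(I[S * x + b, S * y + j, :], W[b, j, :])
                        for j in range(K))     # K partial-sums -> group-sum
                O[x, y] += G                   # group-sums -> convolution result
    return O
```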
Figure 5: (a) The defined partial-sum and group-sum; (b) the timing and location of how Domino generates an OFM with the naive dataflow.

As shown in Fig. 6 (a), the input channels and output channels of a CONV layer are mapped to the inputs and the outputs of a tile, respectively. If Nc = C and Nm = M, the K² points of the filters are mapped to K² tiles. In the case that the dimension of a weight matrix exceeds the size of the ReRAM crossbar array in a PE (Nc ≤ C and Nm ≤ M), the filters are split and mapped to ⌈C/Nc⌉ × ⌈M/Nm⌉ tiles. The tiles are placed closely to minimize the data transmission. Likewise, the input vector is split and carried in ⌈C/Nc⌉ input packets. Adding and concatenating the MAC results of these PEs generates the same result as the non-split case; this case is similar to the FC dataflow discussed earlier. When Nc > C, multiple points of a filter can be mapped to the same tile to improve the ReRAM cells' utilization and reduce the energy for data movement and partial-sum addition. In such a scenario, the CONV computation is accomplished by the in-buffer shifting operation. In case Nm ≥ 2M, the filters can be duplicated inside a tile to maximize parallel computing.

Figure 6: The proposed mapping and dataflow in a CONV layer: (a) a K² × 1 tile array mapping the weights of a CONV layer; (b) another type, a K × K tile array mapping the weights of a CONV layer; (c) adding a group-sum while transmitting MAC results in a group of tiles.

As shown in Fig. 6 (a) and (b), each split block is allocated an array of tiles that maps the weights of a CONV layer. The array has multiple mapping topologies, such as K² × 1 (mt = K², ma = 1) and K × K (mt = ma = K), in pursuit of flexible dataflow and full utilization of the tiles in the NoC array. As defined in Eqn. 3, a group-sum is the sum of K partial-sums, so we can cluster the K tiles mapped to the same filter row into a group. We categorize the CONV dataflow into three types: input dataflow, partial-sum dataflow, and group-sum dataflow.

The input dataflow. The blue arrows in Fig. 6 (a) illustrate the input dataflow in a K² × 1 array of tiles. The inputs are transmitted to the array in rows and flow through the Rifms of the K² tiles with identical I/O directions. Another type, a zigzag dataflow, is demonstrated in Fig. 6 (b) for a K × K array of tiles: when the input data reach the last tile of a group, the Rifm inverts the flow direction, and the input data are transmitted to the tile in the adjacent column.

The partial-sum dataflow. The partial-sum dataflow is shown in Fig. 6 (c), where a represents the first input data of the IFM. At each cycle, input packets are transmitted to the Rifm of the succeeding tile; meanwhile, the MAC computation in the PE is enabled if needed. Take input a as an example: in the first cycle, a is multiplied with weight A to produce a partial-sum ❶, which is transmitted by the Rofm to the next tile. In the second cycle, a is transmitted to the tile whose PE is mapped with weight B; the controller in the Rifm bypasses the MAC computation because a only multiplies with A in the convolution. In the meanwhile, ❶ is stored in the Rofm buffer waiting for addition. In the third cycle, b is multiplied with B in the PE, and partial-sum ❷ is received by the Rofm; ❶ is then popped out of the Rofm buffer and added to ❷. In general, partial-sums are added up into a group-sum U1 while transmitting along a group of tiles.

The group-sum dataflow. Adding group-sums into convolution results is also a process executed during data transmission. As shown in Fig. 6 (a) and (b), the orange arrows indicate the group-sum dataflow. Group-sums are added up to produce the convolution result U in the last tile of each group. When U1 is generated, it waits in the third tile until U2 is generated in the sixth tile; U1 is then transferred to the sixth tile to be added to U2. Similarly, U1 + U2 waits in the sixth tile for U3 to generate the complete result U. After all the linear matrix computation, the computation unit in the Rofm, controlled by instructions, takes effect: an activation function is applied to the complete convolution result in the Rofm of the last tile.
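The in-transit accumulation within one group can be sketched cycle by cycle: an input hops one tile per cycle, each tile contributes its point-wise product, and the group-sum leaves the last tile. Scalars below stand in for the Nm-wide partial-sum vectors, and the function is our behavioral sketch, not Domino's timing model:

```python
def group_pipeline(inputs, weights):
    """Partial-sum dataflow of Fig. 6 (c) for one group of K tiles: tile t
    holds one filter point, multiplies the input it is responsible for,
    and the running sum rides along the Rofm chain until the group-sum
    leaves the group's last tile."""
    K = len(weights)
    group_sums = []
    for first in range(len(inputs) - K + 1):    # sliding-window start position
        acc = 0
        for t in range(K):                      # cycle t: input reaches tile t
            acc += inputs[first + t] * weights[t]   # add partial-sum in transit
        group_sums.append(acc)                  # emitted by the last tile
    return group_sums

# A 3-tap group (K = 3) sliding over one input row:
print(group_pipeline([1, 2, 3, 4], [10, 20, 30]))  # [140, 200]
```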
5.3   Synchronization

Because of down-sampling in CONV layers, the CONV steps change from layer to layer. For example, if the pooling filter size Kp = 2 and the pooling stride Sp = 2, every four OFM pixels produce one pooling result. The computing speed of the next layer therefore has to be four times slower than that of the preceding layer, which wastes hardware resources and severely affects the computing throughput. To maximize computation parallelism and throughput, the CONV filters need to be duplicated; in other words, the weights of the preceding filters should be duplicated four times in the above example.

However, a DNN model usually has several down-sampling layers. As a result, the number of duplicated tiles may exceed the total number of tiles in Domino. Therefore, a block reuse scheme is proposed for such situations to alleviate the heavy hardware requirements.
Fig. 7 demonstrates our weight duplication and block reuse scheme for the VGG-11 model used in [23] (CIFAR-10 dataset). There are three pooling layers, and the CONV steps before these three layers are 64, 16, and 4. If all layers are synchronized, a total of 892 tiles is required to map the network. If we instead reuse the tiles of all CONV layers four times, the CONV layers run four times faster than the FC layers, and a total of 286 tiles is required to map the network. It can be concluded that there is a trade-off between chip size and throughput.

Figure 7: The duplication and reuse scheme for the VGG-11 model used in [23]; there are three pooling layers, before L5, L7, and L9. The left axis shows the number of tiles and duplications, and the right axis shows the number of reuses.
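The duplication factors behind Fig. 7 follow from a simple count: every 2 × 2, stride-2 pooling divides the downstream pixel rate by four, so a layer needs 4× more copies for each pooling layer after it. The sketch below reproduces the 64/16/4 pattern quoted above; the layer indices are hypothetical, and actual tile totals also depend on the layer shapes:

```python
def duplication_factors(n_layers, pool_after, pool_step=4):
    """Weight-duplication factor per layer for full synchronization
    (Section 5.3): pool_step-fold duplication per later pooling layer."""
    dup = []
    for layer in range(n_layers):
        later_pools = sum(1 for p in pool_after if p >= layer)
        dup.append(pool_step ** later_pools)
    return dup

print(duplication_factors(8, pool_after=[2, 4, 6]))
# [64, 64, 64, 16, 16, 4, 4, 1]
```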
5.4   Dataflow with Weight Duplication

Taking weight duplication and reuse into consideration, the dataflow should be modified to match the scheme. The overall dataflow with weight duplication is illustrated in Fig. 8. In this scenario, a block has ma duplicated K² × 1 (mt = K²) arrays of tiles, where ma = 4 is the number of duplications in this example.

Figure 8: The proposed mapping and dataflow in CONV layers with weight duplication: (a) each column of tiles is assigned duplicated weights and receives different rows of the IFM; (b), (c) adding partial-sums into group-sums while transmitting, at two different cycles.

The input dataflow. The input dataflow is demonstrated in Fig. 8 (a): four rows of data of an IFM are transmitted through the four arrays of tiles in parallel. The massive data transmission is alleviated by leveraging spatial locality. Every ma rows of data are alternately transferred to the tiles in reverse order. As shown in Fig. 8 (a), the first four rows are transmitted to the block in increasing order, and the fifth row of the IFM flows into the last column of the block. Group-sum T1 is computed in the third tile when the fourth row flows through that tile. When the fifth row flows through, group-sum T2 is computed in the sixth tile. Since T1 and T2 are generated in the same column of the block, inter-column data transfer is reduced.

The partial-sum dataflow. Fig. 8 (b) and (c) depict the partial-sum dataflow at a certain cycle and at the third cycle thereafter, respectively. In Fig. 8 (b), partial-sums ❶, ❹, and ❼ are computed in the first tiles of their groups and stored in the Rofm buffers of their succeeding tiles. After two cycles, partial-sums ❷, ❺, and ❽ are generated in the second tiles of their groups and are then added to the stored partial-sums ❶, ❹, and ❼, respectively. The computing process is similar to the dataflow without weight duplication, except that the K group-sums are computed in K columns.

The group-sum dataflow. The group-sum dataflow with weight duplication aims to reduce the data moving distance by adding up group-sums locally. Blue and pink arrows indicate the group-sum dataflow in Fig. 8 (a). We take the convolution results Q and T as instances. Based on the input dataflow and the partial-sum dataflow, group-sums Q1 and Q2 are computed in the first two columns of the block. Thus Q1 is transmitted from the first column to the adjacent column when it reaches the sixth tile. Q2 is stored in the Rofm buffer waiting for Q1; it is popped out by an instruction when the Rofm receives Q1, and the two are added up before transmitting along the second column. The pink arrows represent another situation, in which the second group-sum is not waiting in the Rofm. Based on the input dataflow, T1 and T2 are computed in the same column but in two turns of inputs. When T1 reaches the sixth tile, T2 is not ready yet; T1 is therefore stored in the Rofm buffer to wait for the fifth row of data to flow into the tiles. Once T2 is computed in the sixth tile, T1 and T2 are added up to get T1 + T2. The group-sum dataflow greatly reduces the data moving distance and frequency.
                                                                         are compared, and the pooling result Z is computed. Fig. 9
                                                                         (c) shows the block reuse case that activation results are com-
5.5   Pooling Dataflow

The computation of CONV and FC layers is processed within Domino blocks, while the computation of a pooling layer is performed during data transmission between blocks. If a pooling layer follows a CONV layer, with pooling filter size Kp = 2 and pooling stride Sp = 2, every four activation results produce one pooling result. The weight duplication situation is shown in Fig. 9 (b): in every cycle, a block produces four activation results, T to Y. While they are transmitted across tiles, the data are compared, and the pooling result Z is computed. Fig. 9 (c) shows the block reuse case, in which activation results are computed and stored in the last tile; a comparison is made when the next activation result is computed. The Rofm outputs a pooling result Z once the comparison for the pooling filter is completed. In this case, the computation frequency before pooling layers is 4× higher than that of the succeeding blocks.

Figure 9: Output in the last tile: the weight duplication or block reuse scheme is used to deal with a pooling layer.
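Behaviorally, pooling-on-the-move amounts to a running comparison in the Rofm that emits one result per pooling window. This is sketched below for max-pooling with Kp = Sp = 2; it is a functional sketch, not the cycle-accurate hardware:

```python
def pool_on_the_move(activations, Kp=2, Sp=2):
    """Max-pooling as Domino performs it between blocks (Section 5.5):
    compare activation results as they stream past and emit one pooled
    value Z per Kp*Kp window. Stride Sp = Kp, so windows do not overlap."""
    window, pooled = [], []
    for a in activations:                 # activations arrive one per cycle
        window.append(a)                  # running comparison state in Rofm
        if len(window) == Kp * Kp:        # window complete -> output Z
            pooled.append(max(window))
            window.clear()
    return pooled

print(pool_on_the_move([3, 7, 1, 5, 2, 2, 9, 4]))  # [7, 9]
```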
6.   INSTRUCTIONS AND WORKFLOW

Domino is a highly distributed and decentralized architecture and adopts self-controlled instructions, which are stored in each Rofm. The reason is that Domino tries to reduce the bandwidth demand for transmitting data or instructions through the whole NoC and to avoid the delay and skew of long-distance signals; meanwhile, localized instructions retain flexibility and compatibility for the different DNNs to be processed.
6.1   Instruction Set

A set of instructions is designed for computing and transmitting data automatically and flexibly when dealing with different DNNs. In Domino, the Rofm is responsible for sending packets and accumulating partial-sums and group-sums correctly. Because the packets transmitted through the NoC contain only DNN payloads, without any auxiliary information (i.e., a message header, a tail, etc.), there must be an intuitive way to control the behavior of each Rofm on the chip. Consequently, we define an instruction set for Domino.

An instruction consists of several fields, each of which represents a set of control words for the relevant ports or buffers, as well as the designed actions, as shown in Tab. 2. The instruction length is 16 bits, with four distinct fields: an Rx field, a Function field, a Tx field, and an opcode field. The Rx field takes responsibility for receiving packets from different ports. The Function field contains sum and buffer control in C-type instructions, and activation-function, pooling, and FC-layer control in M-type instructions. The Tx field governs the transmission from one Rofm to its four ports. There are two types of opcodes, bearing two usages: C-type (convolution type) denotes that the instruction controls the process during convolution computation, while M-type is for miscellaneous operations other than convolution, such as activation and pooling. This distributed, locally controlled structure avoids extra instruction transmission and external control signals when processing DNNs.

Bits      15-11       10-7    6-5       4-1         0
C-type    Rx Ctrl.    Sum     Buffer    Tx Ctrl.    Opc.
M-type    Rx Ctrl.    Func (10-5)       Tx Ctrl.    Opc.

Table 2: The instruction format for Domino. There are two basic types of instructions generated by the compiler.
Table 2: The instruction format for Domino. There are                   Tab. 3. Domino adopts silicon-proven ADC in [20], silicon-
two basic types of instructions generated by the compiler               proven SRAM array in [43], and the transmission parameters
to handle dataflow correctly.                                           in [4]. The rest of components are simulated by spice or
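To make the field layout concrete, the following C sketch packs and unpacks a 16-bit C-type instruction according to the bit positions in Tab. 2. The macro names and the opcode encoding (0 for C-type) are illustrative assumptions, not Domino's actual definitions.

#include <stdint.h>
#include <stdio.h>

/* 16-bit Domino router instruction, following Tab. 2:
 * [15:11] Rx Ctrl., [10:7] Sum (C-type) / [10:5] Func (M-type),
 * [6:5] Buffer (C-type), [4:1] Tx Ctrl., [0] opcode.
 * Opcode value 0 = C-type is an assumption for illustration. */
#define OPC(insn)      ((insn) & 0x1)
#define TX_CTRL(insn)  (((insn) >> 1) & 0xF)
#define BUF(insn)      (((insn) >> 5) & 0x3)   /* C-type only */
#define SUM(insn)      (((insn) >> 7) & 0xF)   /* C-type only */
#define FUNC(insn)     (((insn) >> 5) & 0x3F)  /* M-type only */
#define RX_CTRL(insn)  (((insn) >> 11) & 0x1F)

static uint16_t encode_c_type(unsigned rx, unsigned sum, unsigned buf, unsigned tx)
{
    return (uint16_t)(((rx & 0x1F) << 11) | ((sum & 0xF) << 7) |
                      ((buf & 0x3) << 5) | ((tx & 0xF) << 1));
}

int main(void)
{
    uint16_t insn = encode_c_type(0x12, 0x5, 0x2, 0x9);
    printf("rx=0x%x sum=0x%x buf=0x%x tx=0x%x opc=%x\n",
           RX_CTRL(insn), SUM(insn), BUF(insn), TX_CTRL(insn), OPC(insn));
    return 0;
}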
6.2     Schedule Table

In Domino, the instructions in a Rofm should fit both the intra- and inter-block dataflows to support the "computing-on-the-move" procedure. The compiler generates the instruction table and the configuration information for each tile based on the initial input data and the DNN structure. This highly distributed and locally controlled structure avoids extra instruction transmission and external control signals when processing DNNs.

Cycle-accurate analyses and mathematical derivation show that the instructions are periodic. During convolution computation, C-type instructions are fetched from the schedule table and executed periodically; the period p = 2(P + W) is determined by the IFM. Within a period, the actions are fixed for the given IFM and DNN configurations. Furthermore, every port's behavior exhibits the same period p with a different starting time. The control words for each port are stored in the Rofm and are generated under the assumption that the convolution stride is one. When the stride is not one, the compiler masks certain bits in the control words to "skip" some actions in the corresponding cycles and realize the required dataflow control. When a Rofm is mapped and configured to process the last row of a layer in a CNN, it also generates activation and pooling instructions, whose period is related to the pooling stride: p = 2S_p.

Once the instructions for each Rofm have been received and stored, the Rofm is configured and ready for computation. For a given Rofm, when a clock cycle begins, a counter provides an index into the Rofm to fetch the corresponding control words periodically. When the execution of an instruction ends, the state of the current Rofm is updated.
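The per-cycle behavior described above can be summarized in a short behavioral sketch. The structure and names below (rofm_t, the table depth, the mask array) are hypothetical; only the counter-indexed periodic fetch with period p = 2(P + W) and the bit-masked "skip" mechanism come from the text.

#include <stdint.h>

/* Behavioral sketch of a Rofm's control loop (hypothetical types/names). */
typedef struct {
    uint16_t table[256];     /* pre-loaded schedule table: one period of control words */
    uint16_t skip_mask[256]; /* compiler-generated masks that disable ("skip") actions
                                in cycles where the convolution stride is not one */
    uint32_t period;         /* p = 2 * (P + W), derived from the IFM */
    uint32_t start;          /* this port's starting time within the period */
    uint32_t cycle;          /* running clock-cycle counter */
} rofm_t;

/* Each clock cycle, a counter indexes the schedule table modulo the period,
 * the fetched control word is masked, and the Rofm state is updated. */
static uint16_t rofm_step(rofm_t *r)
{
    uint32_t idx  = (r->cycle + r->start) % r->period;
    uint16_t word = (uint16_t)(r->table[idx] & ~r->skip_mask[idx]);
    r->cycle++;              /* state update at the end of the instruction */
    return word;
}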

6.3     Workflow

With the technical details of Domino introduced above, the overall workflow and processing chain follow naturally. Fig. 10 shows how Domino works, in sequence, to process a DNN. These building blocks cover both the ideas and the implementations discussed above and integrate them into a whole architecture.

Figure 10: The workflow of Domino. Domino takes the network model, dataset information, and area or power specifications for compilation. The generated schedule tables and weights are mapped into the routers and CIM arrays, respectively. After initialization, Domino begins to execute instructions and infer.

7.    EVALUATION

This section evaluates Domino's characteristics and performance in detail, including energy efficiency and throughput. We also compare Domino against other state-of-the-art CIM-based architectures on several prevailing types of CNNs to show the advantages of the proposed architecture and dataflow.

7.1     Methodology

First, we specify the configuration of the Domino instance under test; second, we describe the experimental setup and benchmarks; finally, we describe the normalization methods.

7.1.1    Configuration

The configuration of Domino and its tile is displayed in Tab. 3. Domino adopts the silicon-proven ADC in [20], the silicon-proven SRAM array in [43], and the transmission parameters in [4]. The remaining components are simulated in SPICE or synthesized with Synopsys DC using a 45 nm CMOS Process Design Kit (PDK). Domino runs at a step frequency of 10 MHz for instruction updating. The data transmission frequency is 640 MHz, and the data width is 64 bits. The supply voltage is 1 V. The ReRAM area information is normalized from the silicon-proven result in [30]. In the initialization stage, each ReRAM cell in a PE is programmed with a 1-bit weight value; eight ReRAM cells make up an 8-bit weight. At the same time, each Rifm and Rofm is configured, and the schedule tables pre-load the instructions. Once the configuration of Domino is finished, the instructions are ready for execution, and Domino commences processing DNNs when the clock cycle starts.

Component     | Descript. | Energy/Compo. | Area (µm2)
ADC           | 8b×256    | 1.76 pJ       | 351.5
Integrator    | 8b×256    | 2.13 pJ       | 568.3
Crossbar cell | 8b        | 32.9 fJ       | 1.62
PE total      | Nc×Nm     | 48.1 fJ/MAC   | 341632
Buffer        | 256B×1    | 281.3 pJ      | 826.5
Ctrl. circ.   | 1         | 4.1 pJ        | 1400.6
Rifm total    | -         | -             | 2227.1
Adder         | 8b×8×2    | 0.03 pJ/8b    | 0.07
Pooling       | 8b×8      | 7.6 fJ/8b     | 34.06
Activation    | 8b×8      | 0.9 fJ/8b     | 7.07
Data buffer   | 16KiB     | 281.3 pJ      | 52896
Sched. Table  | 16b×128   | 2.2 pJ/16b    | 826.5
Input buffer  | 64b×2     | 17.6 pJ       | 51.7
Output buffer | 64b×2     | 17.6 pJ       | 51.7
Ctrl. circ.   | 1         | 28.5 pJ       | 2451.2
Rofm total    | -         | -             | 53867.1
Tile total    | -         | -             | 0.398 mm2

Table 3: The configuration information summary for Domino under evaluation.
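As a concrete illustration of the weight initialization above (one bit per ReRAM cell, eight cells per 8-bit weight), the sketch below slices a weight into cell values and reconstructs it; the LSB-first cell ordering is an assumption for illustration.

#include <stdint.h>

/* Bit-slice an 8-bit weight into eight 1-bit ReRAM cells (LSB-first order
 * assumed) and reconstruct the value the cells jointly represent. */
static void program_weight(uint8_t cells[8], uint8_t weight)
{
    for (int b = 0; b < 8; b++)
        cells[b] = (uint8_t)((weight >> b) & 0x1);  /* one bit per cell */
}

static uint8_t read_weight(const uint8_t cells[8])
{
    uint8_t w = 0;
    for (int b = 0; b < 8; b++)
        w |= (uint8_t)(cells[b] << b);              /* sum of cell_b * 2^b */
    return w;
}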
                                                                     of 25.92 TOPS/W when running VGG-19 with ImageNet,
7.1.2    Experiment Setup                                            and a valley CE of 19.99 TOPS/W running ResNet-18 with
   We select several types of CIM accelerator architectures          CIFAR-10. It denotes that the larger the DNN model size,
as the reference, including three silicon-proven architectures       the better CE it achieves. The peripheral energy in VGG-19
and another four simulated architectures. The performance            only occupies one-third of the total energy. The through-
data and the configuration parameters of the baselines are           put of Domino varies with input data size and DNN model.
taken from their papers. The following three experiments are         VGG and ResNet models with a small dataset could achieve
conducted: (1) running different architectures with the bench-       6.25E5 inferences/s high throughput. VGG-19 with Ima-
mark DNNs to compare the power efficiency; (2) evaluating            geNet dataset still has a throughput of 1.276E4 inferences/s.
the throughput of Domino under different DNNs and input;             The high throughput and low peripheral energy of Domino
and (3) estimating PE utilization with various DNN models            are benefited from the pipelined “computing-on-the-move”
and PE size.                                                         dataflow that fully utilizes the data locality.
7.1.3    Benchmarks

We evaluate Domino using representative DNN models, namely VGG-11 [23], VGG-16, VGG-19 [39], ResNet-18, and ResNet-50 [18], as model benchmarks, and choose CIFAR-10 [24] and ImageNet [14] as dataset benchmarks. We then run the benchmarks consistent with each comparison object, respectively.

7.1.4    Normalization

The variety of CIM models leads to discrepancies in many attributes that require normalization and scaling. To make a fair comparison, we normalize the technology nodes and supply voltages of the digital circuits in each design to 45 nm and 1 V according to the equations given in [41]. The precision of activations and weights is scaled to 8-bit as well, and the supply voltage of the analog circuits is also normalized to 1 V.
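For intuition only, a first-order version of such a normalization is sketched below; it captures the direction of each correction (dynamic energy growing with VDD^2 and with feature size, operation count growing with the bit-width product) but is not the exact model of [41].

/* First-order normalization sketch of a reported computational efficiency
 * (TOPS/W) to the 45 nm / 1 V / 8-bit reference point. Illustrative only;
 * the exact equations are those of [41]. */
static double normalize_ce(double ce, double vdd, double node_nm,
                           double act_bits, double w_bits)
{
    double v_scale    = vdd * vdd;                         /* E ~ VDD^2, ref 1 V */
    double node_scale = node_nm / 45.0;                    /* E ~ feature size   */
    double prec_scale = (act_bits * w_bits) / (8.0 * 8.0); /* ops ~ bit widths   */
    return ce * v_scale * node_scale * prec_scale;
}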

7.2     Performance Results

We evaluate Domino's Computational Efficiency (CE), power dissipation, energy consumption, throughput, and execution time for each experimental environment in the benchmarks and compare them with other architectures.

7.2.1    Overall Performance

We evaluate Domino on various DNN models to show that its performance generalizes to different model patterns. Tab. 4 shows Domino's system execution time, energy consumption, and computational efficiency when it runs the benchmarks with the specified configurations. The time and energy spent on initialization and compilation are not considered. We break the energy down into five parts: CIM, on-chip data moving, on-chip data memory, other computation, and off-chip data accessing. The first is the energy consumed in the PEs by MAC operations; the latter four are the energy of the peripheral circuits. Domino achieves a peak CE of 25.92 TOPS/W when running VGG-19 with ImageNet and a valley CE of 19.99 TOPS/W when running ResNet-18 with CIFAR-10, indicating that the larger the DNN model, the better the CE. The peripheral energy in VGG-19 occupies only one-third of the total energy. The throughput of Domino varies with the input data size and the DNN model. The VGG and ResNet models with the small dataset achieve a throughput as high as 6.25E5 inferences/s, and VGG-19 with the ImageNet dataset still reaches 1.276E4 inferences/s. The high throughput and low peripheral energy of Domino come from the pipelined "computing-on-the-move" dataflow, which fully utilizes the data locality.

Dataset: CIFAR-10 (VGG-11; ResNet-18) | ImageNet (VGG-16; VGG-19; ResNet-50)
Architecture (12 columns, left to right): [23] | Ours | [48] | Ours | [46]*1 | [27] | Ours | [35] | [12] | Ours | [28] | Ours
CIM type: SRAM | ReRAM | SRAM | ReRAM | ReRAM | n.a. | ReRAM | ReRAM | ReRAM | ReRAM | ReRAM | ReRAM
Tech. (nm): 16 | 45 | 65 | 45 | 40 | 28 | 45 | 32 | 65 | 45 | 65 | 40
VDD (V): 0.8 | 1 | 1 | 1 | 0.9 | n.a. | 1 | 1 | 1 | 1 | 1.2 | 1
Frequency (MHz): 200 | 10 | 100 | 10 | 100 | 200 | 10 | 1200 | 1200 | 10 | 40 | 10
Act. & W. precision: 4 | 8 | 4 | 8 | 8 | 16 | 8 | 16 | 16 | 8 | 8 | 8
# of CIM arrays: 16 | 900 | 4 | 900 | 1 | n.a. | 2500 | 2560 | 6400 | 2500 | 20352 | 900
CIM array size (kb): 288 | 512 | 16 | 512 | 64 | n.a. | 512 | 256 | 4 | 512 | 256 | 512
Area (mm2): 25 | 358.2 | 9 | 358.2 | 9 | 6 | 995 | 5.32 | 0.99 | 995 | 91 | 358.2
Tapeout/Simulation: T | S | T | S | T | S | S | S | S | S | S | S
Exec. time (us): 128 | 129.5 | 1890 | 203.5 | 0.67s | n.a. | 3471 | 6920 | n.a. | 3557 | n.a. | 2397
CIM energy (uJ): 11.47 | 36.74 | 36.11 | 26.44 | 3700 | 0 | 744.1 | 25000 | 13000 | 944.3 | 130.2 | 168.3
On-chip data moving energy (uJ): 0.16 | 2.63 | n.a. | 3.89 | n.a. | 36741 | 46.39 | n.a. | n.a. | 52.81 | 0.64 | 16.97
On-chip memory energy (uJ): 5.11 | 25.41 | n.a. | 24.21 | n.a. | 27556 | 446.4 | 2310 | 2180 | 508.1 | 71.22 | 115.41
Other comp. (uJ)*2: 2.67 | 0.48 | n.a. | 0.46 | n.a. | 50519 | 8.41 | n.a. | 1040 | 9.59 | 44.96 | 1.68
Off-chip data accessing energy (uJ): n.a. | 0 | 123.89 | 0 | 3690 | n.a. | 0 | 5650 | 4000 | 0 | 85.98 | 0
Power (W): 0.17 | 40.78 | 0.005 | 34.38 | 0.011 | 0.54 | 15.89 | 4.8 | 0.003 | 19.33 | 166 | 30.84
CE (TOPS/W): 71.39 | 23.41 | 6.91 | 19.99 | 4.15 | 0.27 | 24.84 | 0.68 | 3.55 | 25.92 | 21 | 23.14
Normalized*3 CE (TOPS/W): 9.53 | 23.41 | 2.82 | 19.99 | 9.24 | 0.36 | 24.84 | 2.73 | 12.98 | 25.92 | 22.46 | 23.14
Throughput (TOPS): 12.14 | 954.66 | 0.036 | 687.26 | 0.046 | 0.15 | 394.7 | 3.28 | 0.1 | 501 | 3488 | 713.6
Throughput (TOPS/mm2): 0.49 | 2.67 | 0.004 | 1.92 | 0.005 | 0.025 | 0.4 | 0.62 | 0.10 | 0.5 | 38.33 | 1.99
Inferences/s: 7815 | 6.25E5 | n.a. | 6.25E5 | n.a. | n.a. | 1.28E4 | n.a. | n.a. | 1.28E4 | n.a. | 1.02E5
Accuracy (%): 91.51 | 89.85 | 91.15 | 91.57 | 46 | n.a. | 70.71 | n.a. | n.a. | 72.38 | 76 | 74.89

*1 Adapted and normalized from average statistics. *2 Including CNN-related computation such as partial-sum accumulation, activation functions, and pooling operations. *3 Voltage normalized to 1 V, precision normalized to 8-bit, and digital circuits normalized to 45 nm, referring to [41].

Table 4: Domino's evaluation results and comparison under different DNN models.
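As a quick consistency check on Tab. 4 (the VGG-19 "Ours" column), the reported power and CE can be reproduced from the energy breakdown and the inference rate; the snippet below performs the arithmetic.

#include <stdio.h>

/* Cross-check of the VGG-19 (Ours) column in Tab. 4: per-inference energy
 * times inference rate should give the reported power, and throughput over
 * power should give the reported CE. */
int main(void)
{
    double e_uj  = 944.3 + 52.81 + 508.1 + 9.59 + 0.0; /* 1514.8 uJ/inference */
    double rate  = 1.276e4;                            /* inferences/s        */
    double power = e_uj * 1e-6 * rate;                 /* ~19.33 W            */
    double ce    = 501.0 / power;                      /* ~25.92 TOPS/W       */
    printf("power = %.2f W, CE = %.2f TOPS/W\n", power, ce);
    return 0;
}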

7.2.2    Computational Efficiency

The comparison between Domino and other state-of-the-art CIM architectures and a DNN accelerator is summarized in Tab. 4. The data in the table are either taken from or calculated based on the results provided in the respective papers. The silicon-proven designs usually have low throughput and power consumption due to their small chip sizes, and their typical bit width is only 4-bit because it is difficult for analog computing to achieve a high signal-to-noise ratio. The gap between the CIM computing efficiency and the system computing efficiency shows that the energy consumed by the peripheral circuits for data moving and memory accessing may still dominate the total power consumption in most designs. Benefiting from a 16 nm technology node and a tailored DNN model, Jia [23] achieves a smaller gap between the CIM computing efficiency (121 TOPS/W) and the system computing efficiency (71.39 TOPS/W) at a 4-bit configuration.

To make a fair comparison among the different designs, all circuits are normalized to 1 V and 8-bit, and digital components are further normalized to 45 nm. From Fig. 11 we can see that Domino achieves the highest system CE (25.92 TOPS/W), which is 1.15-9.49× higher than the CIM counterparts. Moreover, Domino spends a lower percentage of its energy on memory accessing and data transmission (Tab. 4). The energy breakdown shows that Domino effectively reduces the peripheral energy consumption, which is a vital factor in the overall system performance. The data locality and the "computing-on-the-move" dataflow are very efficient in reducing the overall energy consumption of DNN inference.

As for DNN accelerators, Domino clearly outperforms MAERI [27], a representative conventional digital architecture, in CE. The superiority comes not only from the use of CIM but also from the "computing-on-the-move" dataflow.

Figure 11: Performance comparisons with other architectures. (a) System computational efficiency after normalization; (b) throughput normalized to each 8-bit crossbar array cell.

7.2.3    Throughput

Domino has a distinguished advantage over other architectures in terms of throughput. The existing architectures need to store feature maps and partial-sums in external memory to support their complex dataflows, and some designs also have to access external memory to update the weights of different layers.
In Domino, the weights of the DNN are all stored in the crossbar arrays in the initial configuration, and layer synchronization with the weight duplication and block reuse schemes is further proposed to maximize the parallel computing speed of inference.

The chip size and the number of memory cells greatly affect the overall throughput of a system, and different memory types have different memory sizes (i.e., ReRAM and SRAM). As shown in Tab. 4, the area-normalized throughput of [48] is 0.004 TOPS/mm2 due to its small array size and 65 nm technology node. With a larger crossbar array and a 16 nm technology node, [23] achieves 0.486 TOPS/mm2. Our proposed scheme achieves up to 2.67 TOPS/mm2, which is better than the state-of-the-art schemes except [28]. The high density of [28] is due to a transistor-less crossbar array estimated on a 20 nm technology node, which often suffers from the write sneak-path issue and makes it very difficult to program the ReRAM cells to the desired state. Therefore, we normalize the throughput to an 8-bit cell, which reflects both throughput and PE utilization. As shown in Fig. 11(b), our design achieves 16.19 MOPS/8-b-cell throughput, which is 3.10× higher than TIMELY and 270× higher than CASCADE. Although [47] supports zero-skipping and implements a ping-pong CIM that computes while refreshing, and [23] provides various mapping strategies to improve throughput, Domino still outperforms them by 7.36× and 1.57×, respectively. The high throughput of Domino comes from the synchronization, the weight pre-loading, the data locality, and the pipelined dataflow, which reduce the computing latency while maximizing parallelism.
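As a reading aid for the area-normalized figures above (taking the VGG-11 "Ours" column of Tab. 4):

954.66 TOPS / 358.2 mm2 ≈ 2.67 TOPS/mm2,

which matches the reported entry.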
7.2.4    Utilization Rates

The size of the crossbar array greatly affects the CIM power efficiency and the cell utilization in the PEs. Fig. 12 depicts the crossbar array utilization over all layers of four neural network models (VGG-11, VGG-16, ResNet-18, and ResNet-50) under three crossbar array configurations (128×128, 256×256, and 512×512). Although the mapping strategies in Domino improve the cell utilization in the PEs, utilization remains low in the first few layers because the numbers of input and output channels are much smaller than the side length of the crossbar array. The average utilization of the four models is 98%, 96%, 93%, and 92% with a 128×128 crossbar array; it drops to 90%, 89%, 86%, and 79% with a 256×256 array, and to 75%, 76%, 67%, and 54% with a 512×512 array. The lower utilization of ResNet comes from its architecture, in which layers with few channels are prevalent. Although a smaller crossbar array has higher utilization, it sacrifices CIM computing efficiency, which is 31.4 TOPS/W, 41.58 TOPS/W, and 49.38 TOPS/W at these three configurations. It is therefore a trade-off between utilization and computational efficiency.

Figure 12: Utilization rates over all layers in four DNN models considering three PE crossbar configurations.
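The trend in Fig. 12 can be approximated with a simple occupancy model; the mapping assumed below (a K×K convolution with C_in input and C_out output channels occupying about K^2·C_in rows and C_out columns of one array) is our illustrative reading, not Domino's exact mapping.

/* Rough per-layer crossbar utilization: fraction of rows times fraction of
 * columns actually occupied (large layers are assumed to fill the array). */
static double layer_utilization(int k, int cin, int cout, int rows, int cols)
{
    double used_rows = (double)(k * k * cin);
    double used_cols = (double)cout;
    if (used_rows > rows) used_rows = rows;
    if (used_cols > cols) used_cols = cols;
    return (used_rows / rows) * (used_cols / cols);
}
/* Example: a first VGG layer (k=3, cin=3, cout=64) on a 512x512 array uses
 * only 27 rows and 64 columns, i.e., about 0.7% utilization, while deeper
 * layers with hundreds of channels fill the array - hence the early layers
 * drag down the averages, and more so for larger arrays. */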
8.   RELATED WORK

Various CIM architectures have been proposed, adopting different designs, computing strategies, and architectures. SRAM- or DRAM-based CIM chips usually need to update their weights by accessing off-chip memory [10, 17, 23, 47], resulting in high energy consumption and latency. [23] utilizes 1152 input rows to fit a 3×3 convolution filter, which leads to low cell utilization for other DNN configurations. ISAAC [38] is a ReRAM-based accelerator that uses ReRAM crossbars to store weights and eDRAM to store feature maps, organized by routers and H-trees. PipeLayer [40] stores both filters and feature maps in ReRAM crossbars and computes them in a pipelined dataflow. The interface between the analog and digital domains also contributes to high power consumption: [50] proposes a binary input pattern for CIM, instead of a power-hungry DAC, to reduce the interface energy, improving both computing efficiency and reliability; CASCADE [12] exploits the significance of the bit lines in the analog domain to minimize A/D conversions, thus enhancing power efficiency; and TIMELY [28] likewise relies on an analog current adder to increase the CIM block size, on TDCs and DTCs to reduce the power consumption, and on input data locality to reduce data movement. However, the uncertainty caused by clock jitter and the timing variation caused by the complex ReRAM crossbar network greatly reduce the ENOB of the converters, and the large sub-chip and the weight duplication inside the crossbar array significantly reduce the utilization of the CIM arrays. To date, most designs do not provide on-chip control circuits that maintain the complicated dataflow; the feature maps or partial-sums therefore have to be stored in external memory and managed by additional processors that transform the feature-map flow and synchronize the computing latency of the different data paths. Accessing feature maps and partial-sums has been the leading energy consumer in most designs, and reducing the data movement energy has become one of the most important research topics in CIM processor design.

9.   CONCLUSION

This paper has presented a tailored Network-on-Chip architecture called Domino with highly localized inter- and intra-memory computing for DNNs. The key contributions and innovations can be summarized as follows: Domino (1) changes the conventional NoC tile structure by using two dual routers for different usages aimed at DNN processing, enabling the architecture to substitute the PE with different types of CIM arrays; (2) constructs a highly distributed and self-controlled processor architecture and proposes an efficient matching dataflow to perform "computing-on-the-move"; and (3) defines a set of instructions for the routers, which are executed periodically to process DNNs.

Benefiting from this design, Domino has unique features: (1) it eliminates data access to memory during a single inference, (2) it minimizes on-chip data movement, and (3) it achieves high computational efficiency and throughput. Compared with the competitive architectures, Domino has achieved a 1.15-to-9.49× power-efficiency improvement and has improved the throughput by 1.57-to-12.96× over several current advanced architectures [2] [49] [30].

REFERENCES

[1] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.

[2] M. Bavandpour, M. R. Mahmoodi, and D. B. Strukov, "Energy-efficient time-domain vector-by-matrix multiplier for neurocomputing and beyond," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 66, no. 9, pp. 1512–1516, 2019.

[3] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler, "Slim NoC: A low-diameter on-chip network topology for high energy efficiency and scalability," SIGPLAN Not., vol. 53, no. 2, pp. 43–55, Mar. 2018. [Online]. Available: https://doi.org/10.1145/3296957.3177158

[4] V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, "Cycle-accurate network on chip simulation with Noxim," ACM Trans. Model. Comput. Simul., vol. 27, no. 1, Aug. 2016. [Online]. Available: https://doi.org/10.1145/2953878

[5] L. Chang, X. Ma, Z. Wang, Y. Zhang, Y. Ding, W. Zhao, and Y. Xie, "DASM: Data-streaming-based computing in nonvolatile memory architecture for embedded system," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 9, pp. 2046–2059, 2019.

[6] K.-C. J. Chen, M. Ebrahimi, T.-Y. Wang, and Y.-C. Yang, "NoC-based DNN accelerator: A future design paradigm," in Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip (NOCS '19), New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3313231.3352376

[7] W. Chen, K. Li, W. Lin, K. Hsu, P. Li, C. Yang, C. Xue, E. Yang, Y. Chen, Y. Chang, T. Hsu, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang, "A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 494–496.

[8] Y. Chen, T. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.

[9] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16), IEEE Press, 2016, pp. 367–379. [Online]. Available: https://doi.org/10.1109/ISCA.2016.40

[10] Z. Chen, X. Chen, and J. Gu, "15.3 A 65nm 3T dynamic analog RAM-based computing-in-memory macro and CNN accelerator with retention enhancement, adaptive analog sparsity and 44TOPS/W system energy efficiency," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 240–242.

[11] Y. D. Chih, P. H. Lee, H. Fujiwara, Y. C. Shih, C. F. Lee, R. Naous, Y. L. Chen, C. P. Lo, C. H. Lu, H. Mori, W. C. Zhao, D. Sun, M. E. Sinangil, Y. H. Chen, T. L. Chou, K. Akarvardar, H. J. Liao, Y. Wang, M. F. Chang, and T. Y. J. Chang, "16.4 An 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 252–254.

[12] T. Chou, W. Tang, J. Botimer, and Z. Zhang, "CASCADE: Connecting RRAMs to extend analog dataflow in an end-to-end in-memory processing paradigm," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52), New York, NY, USA: Association for Computing Machinery, 2019, pp. 114–125. [Online]. Available: https://doi.org/10.1145/3352460.3358328

[13] M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang, "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.

[14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

[15] A. Firuzan, M. Modarressi, M. Daneshtalab, and M. Reshadi, "Reconfigurable network-on-chip for 3D neural network accelerators," in 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), 2018, pp. 1–8.

[16] D. Fujiki, S. Mahlke, and R. Das, "In-memory data parallel processor," SIGPLAN Not., vol. 53, no. 2, pp. 1–14, Mar. 2018. [Online]. Available: https://doi.org/10.1145/3296957.3173171

[17] R. Guo, Z. Yue, X. Si, T. Hu, H. Li, L. Tang, Y. Wang, L. Liu, M. F. Chang, Q. Li, S. Wei, and S. Yin, "15.4 A 5.99-to-691.1TOPS/W tensor-train in-memory-computing processor using bit-level-sparsity-based optimization and variable-precision quantization," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 242–244.

[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[19] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.

[20] Y. Hu, C. Shih, H. Tai, H. Chen, and H. Chen, "A 0.6V 6.4fJ/conversion-step 10-bit 150MS/s subranging SAR ADC in 40nm CMOS," in 2014 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2014, pp. 81–84.

[21] J. Huang, R. Reddy Puli, P. Majumder, S. Kim, R. Boyapati, K. H. Yum, and E. J. Kim, "Active-routing: Compute on the way for near-data processing," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 674–686.

[22] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-memory acceleration of deep neural network training with high precision," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 802–815.

[23] H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, and N. Verma, "15.1 A programmable neural-network inference accelerator based on scalable in-memory computing," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 236–238.

[24] A. Krizhevsky, "Learning multiple layers of features from tiny images," University of Toronto, 2012.

[25] H. Kwon, A. Samajdar, and T. Krishna, "A communication-centric approach for designing flexible DNN accelerators," IEEE Micro, vol. 38, no. 6, pp. 25–35, 2018.

[26] H. Kwon, A. Samajdar, and T. Krishna, "Rethinking NoCs for spatial neural network accelerators," in Proceedings of the Eleventh IEEE/ACM International Symposium on Networks-on-Chip (NOCS '17), New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: https://doi.org/10.1145/3130218.3130230

[27] H. Kwon, A. Samajdar, and T. Krishna, "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18), New York, NY, USA: Association for Computing Machinery, 2018, pp. 461–475. [Online]. Available: https://doi.org/10.1145/3173162.3173176

[28] W. Li, P. Xu, Y. Zhao, H. Li, Y. Xie, and Y. Lin, "TIMELY: Pushing data movements and interfaces in PIM accelerators towards local and in time domain," in Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA '20), IEEE Press, 2020, pp. 832–845. [Online]. Available: https://doi.org/10.1109/ISCA45697.2020.00073

[29] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C. Xue, W. Chen, J. Tang, Y. Wang, M. Chang, H. Qian, and H. Wu, "33.2 A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 500–502.