2020 IEEE 36th International Conference on Data Engineering (ICDE)
Optimization of GPU-based Sparse Matrix
Multiplication for Large Sparse Networks
Jeongmyung Lee, Seokwon Kang, Yongseung Yu, Yong-Yeon Jo, Sang-Wook Kim, Yongjun Park
Department of Computer Science
Hanyang University, Seoul, Korea
{jeongmyung, kswon0202, dydtmd1991, jyy0430, wook, yongjunpark}@hanyang.ac.kr
Abstract—Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and extract important information based on the matrix representation. As it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced on graphics processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize the GPU resources fully, owing to the load imbalance between threads in the expansion process and high memory contention in the merge process. Furthermore, even though several outer-product-based spGEMM techniques have been proposed to solve the load balancing problem on expansion, they still do not utilize the GPU resources fully, because severe computation load variations exist among the multiple thread blocks.
To solve these challenges, this paper proposes a new optimization pass called Block Reorganizer, which balances the total computations of each computing unit on target GPUs, based on the outer-product-based expansion process, and reduces the memory pressure during the merge process. For expansion, it first identifies the actual computation amount for each block, and then performs two thread block transformation processes based on their characteristics: 1) B-Splitting to transform a heavy-computation block into multiple small blocks and 2) B-Gathering to aggregate multiple small-computation blocks into a larger block. While merging, it improves the overall performance by performing B-Limiting to limit the number of blocks on each computing unit. Experimental results show that it improves the total performance of kernel execution by 1.43x on average, when compared to the row-product-based spGEMM, for NVIDIA Titan Xp GPUs on real-world datasets.
Index Terms—Sparse matrix multiplication; sparse network; GPU; linear algebra

I. INTRODUCTION
Matrix multiplication is one of the core kernels in various data-mining applications, such as social network services (SNSs) and graph analytics, and is used to extract key information. Based on the rapid growth of the size of sparse networks, the extraction of valuable information required for various operations, such as ranking [1], similarity computation [2], [3], and recommendation [4], [5], has become a critical challenge. Weighted graphs are typically used to model such network data and are represented in matrix form, where each element contains an edge weight between two nodes. Matrix multiplication based on the adjacency matrix format is widely used to extract useful information from the original data.
Because matrix multiplication is a data-parallel operation, graphics processing units (GPUs) are considered to be the most appropriate accelerators for its speed-up, as they provide high computational throughput using single-instruction, multiple-thread (SIMT) programming models, such as CUDA [6] and OpenCL [7]. A GPU generally consists of a set of Streaming Multiprocessors (SMs). OpenCL/CUDA programs are executed on GPUs by allocating Thread Blocks (TBs) or Cooperative Thread Arrays (CTAs), which are groups of threads, to each SM in parallel (in this work, we use the terms thread block and CTA interchangeably).
The main challenge is developing an efficient matrix multiplication technique considering the data-specific characteristics of sparsity and power-law degree distribution [8]. Typical sparse networks contain a much smaller number of edges with non-zero values, compared to the number of all possible edges between nodes, and therefore, most of the elements in a sparse matrix have a value of zero. To reduce the memory waste caused by sparsity, matrices are typically represented in a sparse format [9]. Sparse networks also commonly have power-law distributions [8], where a very small number of hub nodes have extremely large numbers of connections and most other nodes have very small numbers of connections. Because of the power law, the distribution of non-zero elements is often highly skewed, and the resulting matrices for sparse networks generally contain a few rows with large numbers of non-zero elements, while a large number of rows have only a few non-zero elements.
There have been several previous studies on implementing efficient sparse matrix multiplication (spGEMM) for two sparse matrices on GPUs, including cuSPARSE [10] and CUSP [11]. These techniques generally consist of row-product-based intermediate data expansion and parallel data merge processes. Despite their promising performance, GPU resources are still not fully utilized. First, the row-product-based expansion process often leads to poor load balancing among threads due to the irregular distributions of target sparse networks. Second, excessive memory accesses during the parallel merge process frequently lead to lower performance than expected because of the significant memory contention they cause. Although several improved row-product-based techniques, such as bhSPARSE [12], have recently been introduced, experimental results have shown that they still suffer from the poor thread-level load balancing of the row-product-based scheme and from high performance overhead during the merge process when performing multiplication on highly irregular matrices.
To overcome these limitations, several new spGEMM approaches have been introduced by adopting the outer-product (column-row product) scheme [13], [14]. Outer-product-based expansion is expected to produce higher performance than row-product-based expansion, because the computational loads of all threads in a TB are identical. However, the outer-product is not yet an ideal solution. First, the outer-product algorithm creates another load imbalance problem among SMs because of the high block-level workload variance. In the outer-product scheme, each TB is formulated by a column and a row of the input matrices. Therefore, the resulting TBs consist of several computation-heavy TBs (overloaded blocks) from the few columns and rows with huge numbers of non-zero elements, and a massive number of computation-light TBs (underloaded blocks) with large numbers of zero elements. As a result, the SMs that execute overloaded blocks can become a performance bottleneck, while all other SMs are idle.
Second, the outer-product scheme is mainly effective for expansion, and the merge performance remains the same or might even become worse, because it produces intermediate results in a matrix form during expansion, whereas the row-product produces the intermediate results in a single-row form [15]. Therefore, full matrix-wise accumulation may be slower than row-wise accumulation owing to the additional column address indexing.
To address these limitations, we propose a novel outer-product-based spGEMM optimization pass referred to as the Block Reorganizer. It first identifies the computation amount of each block and categorizes the blocks as overloaded blocks, normal blocks, and underloaded blocks, based on their computational loads. It then performs two different optimizations in the expansion process: Block Splitting for overloaded blocks and Block Gathering for underloaded blocks. Block Splitting is the process of dividing an overloaded block into multiple small blocks for better load balancing. For underloaded blocks, the Block Reorganizer performs the Block Gathering process by creating a combined block from multiple underloaded blocks to increase intra-SM computation unit utilization and improve latency hiding efficiency via fast context-switching support. After executing all operations to produce intermediate results during the expansion process, Block Limiting is applied to improve performance further during the merge process. Block Limiting is the process where each merging block is forced to execute solely on the allocated SM in order to minimize resource contention.
This paper provides the following three contributions:
• An in-depth analysis of the inefficient resource utilization of outer-product operations on GPUs, including the expansion and merge processes, on real-world datasets.
• The design of a novel optimization framework for efficient sparse matrix multiplication based on the outer-product scheme. To achieve this objective, we offer three key techniques:
1) Block Splitting: it divides original blocks into several small blocks for better load balancing.
2) Block Gathering: it merges several underloaded blocks into a combined block for better SM resource utilization and latency hiding effectiveness.
3) Block Limiting: it prevents the blocks from executing with other blocks on an SM to minimize resource contention.
• An extensive evaluation of the effectiveness of the Block Reorganizer framework using synthetic and real-world datasets on multiple target GPUs.

II. BACKGROUND
A. GPU Architectures and SIMT Programming Model
GPUs are accelerators that provide high throughput by maximizing data parallelism using an SIMT programming model such as CUDA [6] and OpenCL [7], which enables multiple independent threads to execute the same instructions concurrently. In such programming languages, a thread is the basic unit of execution, and several threads are grouped into TBs or CTAs. A TB is the main scheduling unit for execution on GPUs, and the threads within a TB are affected by barrier operations for synchronization. For NVIDIA GPUs in particular, a number of threads (typically 32) are also grouped into another scheduling unit, called a warp. In NVIDIA GPUs, the threads in a warp are executed in lock-step, similar to SIMD accelerators [16].
Fig. 1: (a) A GPU architecture overview and (b) an effect of shared memory requirement per thread block on thread block allocation.
To support such operations efficiently, recent GPUs are equipped with multiple SMs to execute the kernel instructions of allocated TBs in an SIMD manner. Each SM contains multiple computing cores, a large register file, an L1 cache, and a shared memory, as shown in Figure 1 (a). To hide memory access latency, GPUs also allow fast context switching between warps. Thus, GPUs attempt to allocate the maximum allowable number of threads to an SM within the resource limit.
The number of threads allocated to an SM is limited by resource usage (e.g., shared memory and register files). For example, the shared memory requirement for each TB can change the total number of allowable TBs on an SM, as shown in Figure 1 (b). Although the number of threads in a TB is determined statically, not all threads are always executed identically, owing to branch divergence. In this paper, we refer to the threads that perform actual computations as effective threads.
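The effect shown in Figure 1 (b) can be observed directly with the CUDA occupancy API: raising the dynamic shared memory requested per TB lowers the number of TBs that can reside on one SM. The following is a minimal sketch; the kernel and the two candidate sizes are illustrative assumptions rather than values from this paper.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float *out) {           // stand-in for a real spGEMM kernel
        if (threadIdx.x == 0 && blockIdx.x == 0) out[0] = 1.0f;
    }

    int main() {
        int blocks_small = 0, blocks_large = 0;
        const int block_size = 256;
        // same kernel, two different dynamic shared-memory requests per thread block
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_small, dummy_kernel, block_size, 4096);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_large, dummy_kernel, block_size, 24576);
        printf("resident TBs per SM: %d (4 KB/TB) vs. %d (24 KB/TB)\n", blocks_small, blocks_large);
        return 0;
    }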
(Figure: thread execution of row-product spGEMM over a CSR input; ptr = [0 3 5 6 8], idx = [0 2 3 0 1].)
Algorithm 1: Outer-product based spGEMM pseudocode.
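As a rough illustration of the outer-product (column-row product) expansion used throughout this paper, the sketch below assigns one thread block per column/row pair, with A stored column-wise and B row-wise. All array names, the precomputed offset array, and the thread mapping are assumptions for exposition; this is not the paper's actual kernel.

    // Outer-product expansion sketch: block k multiplies column k of A (CSC) by row k of B (CSR).
    // out_off[k] is a precomputed write offset so each pair owns a private slice of the
    // intermediate matrix; one thread per non-zero of row k of B, looping over column k of A.
    __global__ void outer_product_expand(
        const int *acol_ptr, const int *arow_idx, const float *aval,   // A in CSC
        const int *brow_ptr, const int *bcol_idx, const float *bval,   // B in CSR
        const long long *out_off,
        int *out_row, int *out_col, float *out_val)
    {
        int k     = blockIdx.x;
        int a_beg = acol_ptr[k], a_len = acol_ptr[k + 1] - a_beg;
        int b_beg = brow_ptr[k], b_len = brow_ptr[k + 1] - b_beg;
        long long base = out_off[k];

        for (int j = threadIdx.x; j < b_len; j += blockDim.x) {   // effective threads = nnz(b_k*)
            int   bc = bcol_idx[b_beg + j];
            float bv = bval[b_beg + j];
            for (int i = 0; i < a_len; ++i) {                     // every thread walks column k of A
                long long o = base + (long long)i * b_len + j;
                out_row[o] = arow_idx[a_beg + i];
                out_col[o] = bc;
                out_val[o] = aval[a_beg + i] * bv;
            }
        }
    }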
Fig. 3: (a) Execution time variance of outer-product-based spGEMM between SMs (Titan Xp), (b) thread block distribution at different numbers of effective threads, and (c) execution time distribution between the expansion and merge processes.
1) Overloaded block: As discussed in the previous section, sparse matrices often have a power-law degree distribution, where some rows and columns related to the hub nodes contain massive numbers of non-zero elements, whereas others have only a few non-zero elements. Therefore, the several overloaded blocks used to perform multiplications of the columns and rows related to the hub nodes incur a substantial amount of computation, while the other blocks (underloaded blocks) perform very few computations. When overloaded blocks are scheduled to a few SMs and underloaded blocks are scheduled to the rest of the SMs, the SMs with the underloaded blocks must remain idle after completing their tasks until all computations of the overloaded blocks on the other SMs are completed.
Figure 3 (a) presents the variation in the SM-level execution time of the expansion phase when running outer-product spGEMM operations on multiple sparse network datasets on an NVIDIA Titan Xp architecture containing 30 SMs. In Figure 3 (a), the execution times for all SMs in the GPU are presented in descending order for each dataset; the five sparse matrices on the left have relatively regular distributions, but the five sparse matrices on the right have skewed distributions. In this figure, one can see that irregularity leads to high execution time variation between SMs. When an overloaded block is scheduled to an SM, the block occupies the SM for a long period while the other small blocks are scheduled to the remaining available SMs. Workload redistribution from long-running SMs to idle SMs is therefore the key challenge for performance improvement on skewed matrices. For example, SM utilization for the "loc-Gowalla" and "as-Caida" sets is less than 20% owing to the small numbers of long-running SMs.
2) Underloaded block: Another issue is that most rows/columns in sparse matrices have zero non-zero elements, or fewer non-zero elements than the warp size, except for the rows/columns related to hub nodes. Underloaded blocks for multiplication of those columns and rows contain small numbers of effective threads with small computations, and they lead to substantial performance degradation on GPUs.
While the five left-hand matrices in Figure 3 (a) exhibit fair load balancing of SMs, another inefficiency is generated by underloaded blocks. In Figure 3 (b), most of the thread blocks have fewer than 32 effective threads for many matrices. In this situation, two main reasons exist for the significant performance degradation in each SM. First, multiple computing cores within an SM are idle when executing underloaded blocks with fewer than 32 threads, because 32 threads are executed in a lock-step manner, as described in Section II-A. Second, the memory latency hiding technique based on fast context switching cannot be utilized, because no eligible warps for context switching exist when a warp stalls for several cycles owing to a memory access. Therefore, generating larger blocks by aggregating several underloaded blocks is highly recommended for further performance enhancement.
3) Overhead on merge: In this work, the merge process was implemented in a manner similar to the widely used Gustavson dense accumulator algorithm [19], which uses a temporary array with a length equal to the dimension of the target matrix. Using the dense accumulator algorithm gives the advantage of aggregating elements without sorting overhead. For implementing the algorithm on GPUs, we used atomic functions to manage parallel execution. As shown in Figure 3 (c), high merge latency exists when the merge process is performed for rows with large nnz, because the block requires a massive number of memory transactions, which can lead to performance degradation due to significant memory resource contention. Several recent studies [17], [18] have also reported that allocating the maximum number of blocks on GPUs does not always guarantee the best performance, because resource contention may decrease overall performance when excessive threads are allocated. Therefore, the over-allocation of merging blocks on an SM should be avoided.
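For illustration, a dense-accumulator merge in the spirit described above can be sketched as follows. The kernel below is an assumption for exposition; the names, the per-row extent arrays, and the choice to count rather than compact the surviving non-zeros are not taken from the paper.

    // Merge sketch: each block merges one row of the intermediate matrix C-hat using a dense
    // accumulator of length n (the matrix dimension), so duplicates are summed without sorting.
    // acc_pool and out_nnz_per_row are assumed zero-initialized (e.g., via cudaMemset);
    // acc_pool holds one accumulator per resident block.
    __global__ void merge_rows_dense_acc(
        int n, int n_rows,
        const long long *row_beg, const long long *row_end,   // extent of each row's entries in C-hat
        const int *chat_col, const float *chat_val,
        float *acc_pool, int *out_nnz_per_row)
    {
        float *acc = acc_pool + (long long)blockIdx.x * n;     // accumulator owned by this block
        for (int r = blockIdx.x; r < n_rows; r += gridDim.x) {
            for (long long e = row_beg[r] + threadIdx.x; e < row_end[r]; e += blockDim.x)
                atomicAdd(&acc[chat_col[e]], chat_val[e]);     // atomics resolve same-column collisions
            __syncthreads();
            int local = 0;                                     // count survivors, then reset for next row
            for (int c = threadIdx.x; c < n; c += blockDim.x) {
                if (acc[c] != 0.0f) ++local;
                acc[c] = 0.0f;
            }
            atomicAdd(&out_nnz_per_row[r], local);
            __syncthreads();
        }
    }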
B. Beyond Conventional Approaches
Several insights have been derived from comparisons between several spGEMM algorithms and from the analysis of conflicts between GPU characteristics and sparse network characteristics. First, the outer-product scheme is a better expansion technique than the row-product scheme owing to its superior thread-level load balancing within a block, but the block-level load imbalance problem must be solved by considering both overloaded and underloaded blocks. Second, the performance of the merge process must be improved as well, by reducing resource contention through adjusting the block allocation to each SM.
Based on these insights, we propose several intuitive high-level solutions for improved spGEMM performance. We first perform preprocessing to classify column-row product blocks into three different categories based on their computational loads: overloaded, normal, and underloaded blocks. Overloaded blocks are then split into multiple small blocks to be distributed to different SMs. For underloaded blocks, we improve performance by gathering multiple underloaded blocks into a single combined block, to maximize the number of effective threads. We also improve merge performance by limiting the number of allocated merging blocks on SMs.
Fig. 4: An overview of the Block Reorganizer.

IV. BLOCK REORGANIZER
A. Overview
The Block Reorganizer is an optimization method for accelerating sparse matrix multiplication by applying an improved block-level load balancing mechanism that is adaptive to sparse network characteristics. The Block Reorganizer is based on the outer-product scheme and applies several novel load balancing techniques based on an in-depth understanding of GPU architectures. Figure 4 presents a conceptual view of the Block Reorganizer, which is proposed to improve resource utilization during both the expansion and merge processes.
As shown in Figure 4, the Block Reorganizer first precalculates the workload sizes of all blocks that perform the column-by-row products. The blocks are then classified into three groups of overloaded, normal, and underloaded blocks based on the sizes of their workloads. We will refer to a set of overloaded column/row pairs having numerous non-zero elements as a Dominator. A Low performer is a set of underloaded column/row pairs that require only a few computations due to their insufficient number of effective threads.
Following categorization, dominator pairs are split into multiple smaller column/row pairs (block splitting). Multiple underloaded blocks are gathered to generate larger blocks (block gathering). The newly created combined blocks can be efficiently executed on GPUs by maximizing thread-level parallelism through both high utilization of in-SM computing cores and better latency hiding using fast context switching between warps. After all elements are generated and stored in the intermediate matrix Ĉ, elements with the same indices are merged to produce the final matrix C. To achieve better throughput by avoiding excessive memory contention, we adjust the number of thread blocks allocated to an SM.
B. Precalculation & Workload Categorization
The Block Reorganizer first calculates nnz(Ĉ) to allocate the upper-bound memory space for C. There are two different ways to compute this memory space, as shown in Figure 4, and we employ both methods for later optimizations. The row-wise nnz is used to relocate the outer-product elements with the same row closer together for a faster merge process. We also calculate the block-wise nnz for workload classification.
Because of the irregular distributions of sparse networks, the outer-product of a dominator pair produces a massive number of non-zero elements compared to the other remaining pairs. As a single column/row pair operation is assigned to a single block, the execution time for overloaded blocks can be much greater than the total execution time for all remaining blocks. This often leads to poor load balancing between SMs, and is one of the main causes of performance degradation on skewed matrices. For low performer pairs, the underutilization of in-SM computing units is another reason for poor performance. Therefore, different optimization techniques are required for each column/row pair category.
Based on the block-wise nnz estimation, all dominator pairs are identified from the input matrices (A, B). Because of the sparse data characteristics, the number of dominator pairs is typically small, and the threshold ratio for identifying dominator pairs should be selected carefully. In this study, blocks that produce more than the threshold number of elements (threshold = nnz(Ĉ)/(#blocks × α)) are classified as dominators. The criteria for classification can be changed by adjusting the value of α based on the target sparse network characteristics. Highly skewed networks can have lower α values, but social networks with several medium-size hub nodes should have high α values to avoid selecting too many dominator pairs. The dominators are copied into new temporary matrices (A′, B′), while blocks with fewer than 32 (the warp size) effective threads are classified as underloaded blocks.
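A host-side classification pass following this rule might look like the sketch below; acol_ptr/brow_ptr, K, num_blocks, and alpha are assumed names, and the per-pair work is estimated as nnz(a_*k) × nnz(b_k*).

    // Sketch: classify each column/row pair k as dominator, normal, or low performer.
    #include <vector>
    void classify_pairs(const int *acol_ptr, const int *brow_ptr, int K,
                        int num_blocks, double alpha,
                        std::vector<int> &dominators, std::vector<int> &normals,
                        std::vector<int> &low_performers)
    {
        std::vector<long long> work(K);
        long long nnz_chat = 0;                               // upper bound on nnz of the intermediate matrix
        for (int k = 0; k < K; ++k) {
            work[k] = (long long)(acol_ptr[k + 1] - acol_ptr[k]) *
                      (brow_ptr[k + 1] - brow_ptr[k]);
            nnz_chat += work[k];
        }
        double threshold = (double)nnz_chat / (num_blocks * alpha);
        for (int k = 0; k < K; ++k) {
            int eff_threads = brow_ptr[k + 1] - brow_ptr[k];  // nnz(b_k*) = effective threads of the block
            if ((double)work[k] > threshold)   dominators.push_back(k);
            else if (eff_threads < 32)         low_performers.push_back(k);
            else                               normals.push_back(k);
        }
    }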
C. Expansion Optimization
1) Block Splitting: We propose the Block-splitting technique for better block-level workload balance. Block-splitting is applied to overloaded blocks that are generated by dominator vectors, in order to distribute heavy workloads evenly across multiple SMs. As expressed in Equation (2), the outer-product operations for each pair are independent of each other, without the possibility of data reuse. Therefore, a pair can be separated and modified without affecting the results of the other blocks. The dominator column vector, which is copied into the temporary matrix A′, is divided into multiple smaller columns by modifying the column pointer values. This then creates a mapper array for storing the mapping between the divided vector pairs. The multiple divided blocks execute their own products by referencing the mapper array, and therefore, the overloaded workload can be reallocated to multiple SMs to achieve fair load balancing.
Fig. 5: B-Splitting: an overloaded block is split into multiple small blocks.
Figure 5 illustrates a detailed example of the block-splitting process and highlights its effectiveness. First, the dominator vectors a∗0 and b0∗ (originally from the input matrices A and B) are copied into the matrices A′ and B′. During the splitting process, several elements from each column vector are shifted to the next vector sequentially. This operation can be accomplished by simply expanding the pointer index of the sparse-format matrix, as shown in Figure 5. A mapper array is constructed to track all of the divided vector pairs so that they produce the same results as the original vector pairs. As a result, the overloaded block requiring 25 computations is split into three smaller blocks.
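Following the mechanism described above, a host-side sketch of splitting one dominator column in the copied CSC structure could be written as below; the container names, the mapper layout, and the power-of-two heuristic are illustrative assumptions rather than the paper's implementation.

    // Sketch: split dominator pair k into `split` sub-columns by emitting extra column
    // boundaries into the copied pointer array A'ptr; every sub-column is mapped back to
    // pair k so it still multiplies against the same row k of B.
    #include <vector>
    #include <algorithm>
    void split_dominator(const int *acol_ptr, int k, int num_sms,
                         std::vector<int> &aptr_prime, std::vector<int> &mapper)
    {
        int beg = acol_ptr[k], end = acol_ptr[k + 1], len = end - beg;
        int split = 1;
        while (split < num_sms) split <<= 1;                 // heuristic 2^n factor, >= number of SMs
        split = std::min(split, std::max(len, 1));           // never create empty sub-columns
        int chunk = (len + split - 1) / split;
        for (int s = 0; s < split; ++s) {
            aptr_prime.push_back(std::min(beg + s * chunk, end));   // new sub-column start
            mapper.push_back(k);                                    // divided block -> original pair
        }
        aptr_prime.push_back(end);                                  // closing boundary
    }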
Block splitting not only improves SM-level load balancing, but also provides improved cache performance. Because global memory access requires hundreds of cycles, spatial and temporal data localities should be fully utilized. Block-splitting forces multiple SMs to share identical vectors, thereby increasing the probability of re-referencing data across SMs and preventing the data from being evicted due to memory space shortage. As a result, additional performance gains are achieved.
Determining the splitting factors for dominators is important, because the performance improvement depends heavily on these factors. Due to the irregularity of sparse matrices, it is difficult to identify an optimal factor that can be applied to all datasets. Even within dominator groups, the nnz of the vectors varies, and the splitting factor for each vector should be selected carefully. From a GPU architectural view, overloaded blocks should be divided into a number of smaller blocks that is greater than the total number of SMs. The number of effective threads within each block should also be larger than the warp size to guarantee full utilization of in-SM cores. Based on these two insights, we decided to choose the splitting factor (2^n) heuristically. Column vectors, where the number of elements is equal to the number of computations per thread, are split into several smaller vectors in a greedy manner. On the other hand, row vectors, where the number of elements corresponds to the number of threads, are not split, to guarantee a sufficient number of effective threads in each block.
2) Block Gathering: Because of the irregularity of sparse matrices, executing kernels with a fixed thread block size is inefficient, and therefore, executing blocks with an appropriate thread block size is required to avoid thread waste. However, as shown in Figure 3 (b), underloaded blocks, which are generated by low performer groups, contain fewer effective threads than the minimum block size (32). In the proposed method, nnz(b_i*) indicates the number of effective threads within a block. As shown in Figure 3 (b), for some networks, most row vectors have fewer than 32 non-zero elements. This means that several computing units in an SM are idle when executing such blocks, because the threads in a warp are executed in a lock-step manner, as discussed in Section II-A. Thus, thread-level parallelism cannot be fully utilized through concurrent executions.
Having an insufficient number of effective threads in a block also significantly decreases performance, as latency hiding using fast context switching cannot be applied. When the currently active warp cannot issue its next instruction for any reason, the warp scheduler chooses and schedules another warp among the eligible warps to hide the latency. However, latency hiding based on fast warp-level context switching cannot be applied here, as underloaded blocks contain only a small number of warps with effective threads (typically only one).
To solve the problem, we propose Block Gathering, which is intuitive and can be applied easily. In Block Gathering, the original underloaded blocks are first transformed into micro-blocks, which generate exactly the same results as the original underloaded blocks although they have fewer threads than the original blocks (block-compaction). Multiple micro-blocks are then combined into a large combined block with multiple partitions, which has the same number of threads as the original underloaded blocks.
Fig. 6: B-Gathering: several underloaded blocks are combined into a large block through block-compaction.
For block-gathering, it is relatively easy to determine the optimal value of the gathering factor. In general, the number of threads in a block is set to a power of two. When the number of threads of an underloaded block is in the range of 2^(n-1) to 2^n, the gathering factor is set to 32/2^n. For example, if a thread block contains 2 effective threads, the gathering factor is 16, to fill the 32-thread block completely.
To illustrate this concept, we present a simple gathering scenario in Figure 6. Here, the size of the thread block is set to 16 for simplicity, and "before gathering" represents the original underloaded blocks. The original block indices are binned based on the corresponding numbers of effective threads. The blocks contained in bin 1 are compressed into a single block with gathering factor 4, and the blocks in bin 2 are gathered with factor 2. However, the blocks in bin 3 are not gathered, to avoid serialization.
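The power-of-two binning above can be captured in a small helper; this is an illustrative sketch, and the function name and packing comment are assumptions.

    // Sketch: gathering factor for an underloaded block, derived from its effective thread count.
    int gathering_factor(int eff_threads, int block_size = 32) {
        int width = 1;
        while (width < eff_threads) width <<= 1;   // round the micro-block up to 2^n threads
        return block_size / width;                 // e.g., 2 effective threads -> factor 16
    }
    // Micro-blocks that share the same factor f are then packed f at a time into one
    // block_size-thread combined block, each occupying its own partition.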
D. Merge Optimization: Block Limiting
After generating all non-zero elements in the intermediate result matrix Ĉ, elements with the same indices are merged into unique elements. This merging process is highly memory intensive and has little computational overhead, meaning it is sensitive to memory throughput. Similar to the input matrices, the result matrix often has a power-law distribution. Therefore, during the merging process, some thread blocks can generate too many memory requests and incur substantial performance degradation by reducing the throughput of the L2 cache, which is shared by multiple SMs [17], [18].
Fig. 7: B-Limiting: extra shared memory is allocated to alleviate resource contention while merging long rows.
Based on this insight, we propose the B-Limiting technique, which reduces resource contention by limiting the number of blocks allocated to an SM. Figure 7 illustrates the B-Limiting process. The allowable number of blocks is determined by the resource requirements of each block. Therefore, we allocate extra shared memory to the merge kernel functions in order to reduce the number of blocks on an SM [20].
Because allocating the maximum number of blocks on an SM generally yields the best GPU performance, the block-limiting technique should be applied carefully, only when it is expected to be better than the traditional allocation scheme. Block-limiting is therefore currently applied only to the large rows of Ĉ whose nnz exceeds a given threshold (threshold = nnz(Ĉ)/(#blocks × β)), where β is currently 10 to show a fair performance gain.
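In CUDA, one way to realize this limiting is to pad the merge kernel launch with otherwise unused dynamic shared memory, so that shared-memory capacity becomes the occupancy limiter. The sketch below assumes a hypothetical merge_long_rows kernel and the 6144-byte step used later in the experiments; it is not the paper's actual launch code.

    // Sketch: cap resident merge blocks per SM by padding the launch with unused dynamic
    // shared memory; merge_long_rows and its arguments are placeholders for the real kernel.
    void launch_limited_merge(int grid_dim, int block_dim, int limiting_factor /*, kernel args */) {
        size_t pad_bytes = (size_t)limiting_factor * 6144;    // 6144-byte steps, as in Section VI
        merge_long_rows<<<grid_dim, block_dim, pad_bytes>>>(/* long-row arguments */);
        int resident = 0;                                     // optional: inspect resulting residency
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&resident, merge_long_rows, block_dim, pad_bytes);
    }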
E. Putting It All Together
In this section, an example workflow is presented for the YouTube data, for a better understanding of how the three techniques are combined in the Block Reorganizer. The Block Reorganizer first estimates the block-wise nnz and the row-wise nnz. Workload categorization is then performed based on this information. If the block-wise load of an (a_*i, b_i*) pair exceeds the threshold, the pair is classified as a Dominator. If the row-wise nnz exceeds a certain threshold, the corresponding rows are determined to cause resource contention during merging. For YouTube, 713 pairs are classified as dominators, and 362,736 pairs are classified as low performers. 12,657 rows are also selected to use B-Limiting during merging. The overloaded blocks from the dominator group are then split into smaller blocks using a splitting factor. As a result, the B-Splitting technique shows a 10.4% performance gain, with SM utilization improved from 16% to 99%.
In contrast, low performer vector pairs are binned into four groups. Depending on their thread ranges, underloaded blocks are gathered and compressed into single, same-sized blocks. This B-Gathering technique shows a 6.7% performance gain. After generating all non-zero elements, B-Limiting is applied to reduce memory contention in the merging process. Extra shared memory is allocated when performing the merge process for long rows in order to limit the number of allocated blocks on an SM. As a result, the B-Limiting technique shows a 16.8% performance gain with a 32% L2 cache throughput improvement. Finally, the combination of the three techniques improves the total performance by 41.5% for the YouTube data.

V. EXPERIMENTAL ENVIRONMENT
Implementation: The Block Reorganizer is implemented as an executable binary, written in the CUDA [6] programming language and compiled using NVCC 8.0. The Block Reorganizer first reads the input matrices and precalculates the block-wise workloads. It then applies the three optimization techniques called B-Splitting, B-Gathering, and B-Limiting. All preprocesses are performed on the target GPUs except for B-Splitting, which is performed on the host CPUs. When all preprocesses are completed, the sparse matrix multiplication kernel is executed.
System Configuration: In our experiments, we evaluated the Block Reorganizer mainly on a real machine with an Intel Xeon E5-2060 (2.10 GHz) CPU with 64 GB of main memory and an NVIDIA Titan Xp GPU [21] with 12 GB of global memory, as shown in Table I. We also tested the Block Reorganizer on additional systems to determine its scalability: a Xeon E5 and NVIDIA Tesla V100 system (DGX Station), and a Xeon Gold and NVIDIA RTX 2080 Ti system (Table I).
Performance Measurement: Our spGEMM algorithm generates output data in an unordered CSR format similar to the Gustavson merge algorithm [19]. Therefore, we present our performance results in two different ways for fairness. We first compare Block Reorganizer performance to a baseline spGEMM, which uses a row-product-based expansion and a Gustavson merge process, and to four widely used spGEMM libraries (cuSPARSE, CUSP, and bhSPARSE for GPUs, and MKL for CPUs) [13], in order to measure the performance difference to other open libraries. We then perform a detailed analysis of each Block Reorganizer technique and compare the results to the performance of the baseline spGEMM. All
experimental results include the overhead, except the data transfer time between the host and the device, because spGEMM is an application kernel whose results will be used on the GPU. The overhead includes the precalculation, workload classification, and preprocessing for block-splitting.
For the Block Reorganizer and the baseline spGEMM, basic memory-related optimizations considering shared memory utilization, cache blocking, and memory coalescing are applied to maximize performance.
Dataset: A total of 28 real-world datasets from the Stanford large network dataset collection [28] and the Florida matrix suite [27] were used for computing C = A². Table II lists detailed information for the tested real-world datasets. We chose specific datasets by considering the distribution and size of each matrix; the datasets from the Stanford large network dataset collection generally exhibit irregular distributions, whereas the datasets from the Florida matrix suite generally exhibit regular distributions. We also used synthetic datasets generated using R-MAT [29], [30] to evaluate both C = A² and C = AB.

TABLE I: Target system configurations
                          System 1             System 2 [22]        System 3
CPU                       Xeon E5-2640v4 [23]  Xeon E5-2698v4 [23]  Xeon Gold 5115 [24]
Number of cores/threads   10 / 20              20 / 40              10 / 20
Max CPU clock             3.40 GHz             3.60 GHz             3.40 GHz
Memory                    64 GB                256 GB               128 GB
GPU                       Titan Xp [21]        Tesla V100 [25]      RTX 2080 Ti [26]
Number of SMs             30                   80                   68
Max GPU clock             1582 MHz             1380 MHz             1545 MHz
CUDA capability           6.1 (Pascal)         7.0 (Volta)          7.5 (Turing)
OS                        Ubuntu 16.04         Ubuntu 18.04         Ubuntu 16.04
Baseline: NVIDIA cuSPARSE v2, CUSP 0.4.0, bhSPARSE, MKL

TABLE II: Real-world datasets from the Florida SuiteSparse collection [27] and the Stanford large network dataset collection [28] (dimension, nnz(A), nnz(C))
filter3D        106k   2.7M   20.1M      ship             140k   3.7M   23.0M
harbor          46k    2.3M   7.5M       protein          36k    2.1M   18.7M
sphere          81k    2.9M   25.3M      2cube_sphere     99k    854k   8.6M
accelerator     118k   1.3M   17.8M      cage12           127k   1.9M   14.5M
hood            215k   5.2M   32.7M      m133-b3          196k   782k   3.0M
majorbasis      156k   1.7M   7.9M       mario002         381k   1.1M   6.2M
mono_500Hz      165k   4.8M   39.5M      offshore         254k   2.1M   22.2M
patents_main    235k   548k   2.2M       poisson3Da       13k    344k   2.8M
QCD             48k    1.8M   10.4M      scircuit         167k   0.9M   5.0M
power197k       193k   3.3M   38.0M      youtube          1.1M   2.8M   148M
as-caida        26k    104k   25.6M      sx-mathoverflow  87k    495k   17.7M
loc-gowalla     192k   1.8M   456M       emailEnron       36k    359k   29.1M
slashDot        76k    884k   75.2M      epinions         74k    497k   19.6M
web-Notredame   318k   1.4M   16.0M      stanford         275k   2.2M   19.8M

VI. EVALUATIONS
In this section, we show the effectiveness of the Block Reorganizer, along with the techniques used within it: block-splitting, block-gathering, and block-limiting. Section VI-A shows the performance improvement and analyses on real-world datasets. Section VI-B presents an examination of the effectiveness of the techniques across multiple GPU architectures, and Sections VI-C and VI-D present an analysis of the performance impact of various dataset characteristics using synthetic datasets.
A. Evaluation on Real-World Datasets
Figures 8 and 9 show the normalized and absolute performance of the Block Reorganizer compared to four widely used spGEMM libraries and to our two baselines based on the row- and outer-products. The X-axes represent the datasets, and the Y-axes represent the relative performance based on the row-product baseline (Figure 8) and the absolute performance in GFLOPS (Figure 9). Based on the figures, the Block Reorganizer achieves a performance gain of 1.43x over the row-product baseline, while the outer-product baseline and the libraries show only 0.95x, 0.29x, 0.22x, 0.55x, and 0.48x speedups, respectively. The Block Reorganizer also shows high coverage, as it exhibits the best performance on most datasets.
Fig. 8: Speedup of spGEMM operations for the row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and the Block Reorganizer on real-world datasets. All data are normalized to the row-product-based spGEMM performance.
Fig. 9: Absolute performance of spGEMM operations for the row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and the Block Reorganizer on real-world datasets.
Block-splitting and block-limiting are generally effective for irregular data that require numerous calculations and memory accesses per block. However, block-gathering can be applied to most matrices due to the high sparsity of the matrices, regardless of regularity. Figure 10 shows the performance improvement of the three techniques over the outer-product baseline. Block-gathering, which is applied to all sparse matrices, shows the highest coverage of matrices. However, for some matrices with high skewness (mostly in the Stanford datasets), block-gathering on the underloaded blocks cannot improve performance significantly, because the execution time is dominated by the overloaded blocks or by the merging process. For these datasets, block-splitting and block-limiting are very effective. Consequently, block-limiting, block-splitting, block-gathering, and the Block Reorganizer show average performance gains of 1.05x, 1.05x, 1.28x, and 1.51x, respectively.
Fig. 10: Relative performance of B-Splitting, B-Gathering, B-Limiting, and the Block Reorganizer.
1) Better load balancing with block-splitting: To evaluate the effect of block-splitting on load balancing, we define a new metric, the load balancing index (LBI), as shown in Equation (3). The LBI indicates the average execution time of all SMs normalized to the SM with the longest execution time:

    LBI = (1/N) * sum_{i=1..N} cycles(SM_i) / max_j cycles(SM_j),  where N is the number of SMs in the GPU.   (3)
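Concretely, given per-SM cycle counts (for example, collected with a profiler), the LBI of Equation (3) can be computed as in this small sketch:

    #include <vector>
    #include <algorithm>

    // Sketch: load balancing index from per-SM cycle counts; 1.0 means perfectly balanced,
    // small values mean a few long-running SMs dominate the execution time.
    double load_balancing_index(const std::vector<long long> &cycles) {
        long long longest = *std::max_element(cycles.begin(), cycles.end());
        double sum = 0.0;
        for (long long c : cycles) sum += (double)c / (double)longest;
        return sum / cycles.size();
    }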
Figure 11 shows the LBI values and execution times of the dominators for 10 Stanford datasets with increasing splitting factors. As long-running overloaded blocks are the main performance bottleneck for these datasets, only the execution time of the dominator blocks is measured to show the effect of block-splitting. The X-axis indicates splitting factors from 1 to 64, and the Y-axis represents the LBI values and relative performance gains normalized to the performance with a splitting factor of 1. When the splitting factor increases, corresponding LBI and performance increments are observed. The LBI values converge to more than 90% when the splitting factors almost equal the number of SMs in the target GPU. This implies that, even with a hardware scale-up that increases the number of SMs, block-splitting remains an effective technique to improve performance. By applying block-splitting, the LBI increases from 0.17 to 0.96, and the dominator performance is improved by 8.68x on average.
Fig. 11: Load balancing effectiveness when applying B-Splitting.
2) Better cache performance with block-splitting: Some matrices, such as "loc-gowalla," "sx-mathoverflow," and "slashDot," are observed to improve even when the splitting factor becomes larger than the number of existing SMs and there is no further significant LBI improvement. This performance gain is mainly due to better cache utilization: block-splitting improves the L2 cache throughput, mainly by splitting the overloaded blocks. Memory transactions are originally concentrated in a few overloaded blocks, and the transactions are distributed to multiple divided blocks using block-splitting. Thus, L2 cache utilization can be significantly improved by distributing the divided blocks so that they share the same memory spaces.
Fig. 12: L2 cache throughput improvements using B-Splitting.
Figure 12 shows the improvement in L2 cache throughput when splitting overloaded blocks, measured using the NVIDIA nvprof profiler [31]. The X-axis represents the datasets and the Y-axis shows the L2 cache throughput. For all datasets, block-splitting shows a substantial L2 cache throughput improvement of 8.9x on average. This explains the further performance gain when the splitting factor is larger than the number of SMs.
3) Better latency hiding efficiency with block-gathering: To prove the effectiveness of block-gathering, we profiled the
kernel to observe the changes in the ratio of effective threads using nvprof. The sync stall percentage is used as a metric to demonstrate the ratio of effective threads, as numerous synchronization stalls exist when many non-effective threads await the completion of the computations of a few effective threads. Figure 13 shows the percentage of stalls due to thread synchronization. The X-axis represents the datasets and the Y-axis represents the percentage of sync stalls. As shown in Figure 13, the percentage of sync stalls decreases greatly when the block-gathering technique is applied.
Fig. 13: Changes in sync stalls when applying B-Gathering.
As discussed, underloaded blocks cannot efficiently hide latency due to the insufficient number of effective threads. Therefore, most non-effective threads wait for the effective threads to execute their instructions. By applying block-gathering to underloaded blocks to increase the number of effective threads in a block, most stalls on synchronization disappear, leaving only memory stalls. Consequently, block-gathering greatly increases the performance for underloaded blocks.
4) Less resource contention with block-limiting: Limiting the number of blocks on an SM is effective for memory-intensive kernels, as it alleviates the resource contention. Thus, it is expected to increase the performance of merging kernels that process many elements. Figure 14 shows the effect of block-limiting on the L2 cache throughput. The X-axis represents the 10 Stanford datasets on which block-limiting is applied, and the Y-axis represents the L2 cache throughput with different limiting factors. The limiting factor indicates the additionally allocated shared memory size used to adjust the number of blocks on a single SM. For the experiment, the size of the allocated memory increases in steps of 6144 bytes. As shown in the figure, the L2 cache throughput improves as the limiting factor increases, up to a certain point, and decreases after that point. The reason for the performance degradation is that the performance loss due to lower warp occupancy outgrows the gain from reducing cache contention. As the distribution of matrices varies highly, it is difficult to find an optimal point for each matrix. In this study, the limiting factor is set to a constant value of 4 × 6144 to show a fair performance gain. Consequently, the L2 cache read and write throughputs increase by 1.49x and 1.52x on average, respectively.
Fig. 14: L2 cache throughput improvements using B-Limiting.
B. Performance Scalability on Different Architectures
To verify the scalability of the Block Reorganizer on various GPU architectures, we tested the performance on three devices of different generations: Titan Xp, Tesla V100, and RTX 2080 Ti, as shown in Table I. Figure 15 represents the normalized performance after applying the Block Reorganizer technique on the target GPUs. The X-axis represents the devices, and the Y-axis represents the normalized performance gain of each technique over the row-product baseline. As shown in the figure, the Block Reorganizer shows the best performance across all the target GPU architectures, while the outer-product baseline shows a performance level similar to the row-product baseline. This is because the main problems of sparsity and skewness exist on all the GPU architectures, and the three main techniques proposed by the Block Reorganizer can solve these problems successfully. Therefore, 1.43x, 1.66x, and 1.40x speedups over the row-product baseline were achieved on the Titan Xp, Tesla V100, and RTX 2080 Ti, respectively.
Fig. 15: Performance scalability on various GPUs.

TABLE III: Synthetic datasets
C = A²
  S (scalability):  s1: N = 250,000, 62,500 elements;  s2: N = 500,000, 250,000 elements;  s3: N = 750,000, 562,500 elements;  s4: N = 1,000,000, 1,000,000 elements;  parameters (0.45, 0.15, 0.15, 0.25)
  P (skewness):     p1-p4: N = 1M, 1M elements;  parameters p1 (0.25, 0.25, 0.25, 0.25), p2 (0.45, 0.15, 0.15, 0.25), p3 (0.55, 0.15, 0.15, 0.15), p4 (0.57, 0.19, 0.19, 0.05)
  SP (sparsity):    sp1: 4M elements, sp2: 3M, sp3: 2M, sp4: 1M;  N = 1M;  parameters (0.25, 0.25, 0.25, 0.25)
C = AB (edge-factor = 16)
  scale 15: A: N = 32,768, 440,747 elements;   B: N = 32,768, 440,024 elements
  scale 16: A: N = 65,536, 908,672 elements;   B: N = 65,536, 909,957 elements
  scale 17: A: N = 131,072, 1,864,289 elements; B: N = 131,072, 1,868,244 elements
  scale 18: A: N = 262,144, 3,806,124 elements; B: N = 262,144, 3,801,872 elements

C. Evaluation on Synthetic Datasets (C = A²)
In the previous sections, we discussed the effectiveness of the Block Reorganizer on real-world datasets compared to the libraries and our customized baseline. To show the general applicability of the Block Reorganizer, we tested its effectiveness using synthetic datasets of contrasting characteristics, as shown in Table III. In these synthetic datasets, we changed the following important factors: the number of nodes (S: scalability), the skewness (P: power-law), and the sparsity (SP).
1) Scalability (dataset S): The first four matrices (s1-s4) in Figure 16 (a) show the performance changes when changing the matrix size. When the matrix is very small, cuSPARSE
shows the best performance. However, as the matrices become larger, its performance drops significantly, and it eventually shows the lowest performance among the compared methods. In contrast, the Block Reorganizer shows low performance on small matrices, as the execution time of the matrix multiplication itself is small and the performance is mainly affected by the preprocessing overheads. However, as the matrices become larger, it shows the best performance over all other methods.
Fig. 16: (a) Speedup of spGEMM libraries and the Block Reorganizer normalized to the row-product baseline on synthetic datasets for C = A² operations, and (b) speedup on C = AB operations.
2) Skewness (dataset P): The next four matrices (p1-p4) in Figure 16 (a) show the performance changes when increasing the matrix skewness. The X-axis represents the matrices used for the evaluation, and the Y-axis represents the performance normalized to the baseline. With an increase in the skewness level, cuSPARSE and bhSPARSE exhibit performance degradation similar to that on the real datasets. In contrast, the Block Reorganizer shows substantial performance gains in all cases owing to its wide coverage. Notably, block-splitting and block-limiting improve performance mainly for highly skewed data, by solving the load imbalance and high resource contention problems.
3) Sparsity (dataset SP): The last four matrices (sp1-sp4) in Figure 16 (a) show the performance changes when decreasing the matrix density. bhSPARSE shows high performance over the other spGEMMs for relatively dense matrices. However, as the matrices become sparser, the Block Reorganizer outperforms all other methods, mainly by applying block-gathering.
D. Evaluation on Synthetic Datasets (C = AB)
To prove the generality of our approach, we also evaluated the performance of the Block Reorganizer for C = AB cases, in addition to C = A². As shown in Table III, the last four sets of input matrix pairs (A, B) are synthetically generated with two parameters, scale and edge-factor. The size of the target matrix is set to 2^scale, and the number of non-zero entries is set to edge-factor × 2^scale. The performance is evaluated by increasing the scale parameter from 15 to 18 while the edge-factor parameter is fixed to 16, as in Graphulo [32].
Figure 16 (b) shows the normalized performance of the Block Reorganizer for the C = AB cases. The X-axis represents the four spGEMM matrix pairs, and the Y-axis represents the relative performance normalized to the row-product baseline. As shown in the figure, the Block Reorganizer shows fair speedups across all input matrix pairs. C = AB operations do not generate as dense an output matrix as C = A² operations [32]. Therefore, block-gathering is an effective optimization, because most thread blocks are categorized into underloaded blocks with only a few overloaded blocks. Consequently, the Block Reorganizer achieves an average performance gain of 1.09x over the baseline, which is the best of the compared techniques. The gain also appears scalable as the input size increases.

VII. RELATED WORKS
There have been many previous studies on spGEMM. NVIDIA and Intel provide libraries to support fast spGEMM [10], [11], [33]. Furthermore, several optimized techniques have also been proposed [13], [34]-[42].
In more detail, regularization [35], input categorization [36], and resource optimization [37] techniques have been proposed for spGEMM on GPUs. From the perspective of load balancing, lbGEMM [13] greatly improved performance by introducing an outer-product scheme to solve the thread-level load balancing problem. AC-spGEMM [39] also improved overall performance substantially by using thread-level load balancing on row-product-based spGEMM. Akbudak [40] improved merging performance by increasing the matrix locality, orchestrating partitioned and permuted workloads in order to reduce communication overheads between processors. Kernert [41] and Patwary [42] improved cache locality using adaptive tiling of the target matrices.
However, as discussed in Section III, these techniques are not optimally suitable for matrix multiplication for SNS analysis, because they do not consider the power-law degree distribution [10], [11], [33], SM-level load balancing, or in-SM resource utilization problems [13], [35]-[37]. Our outer-product-based approach also shows stable performance gains across various target matrices by natively resolving the thread-level load imbalance problem, without introducing complex per-row-level load balancing techniques, which often require additional control overhead to maintain per-row linked list structures [39].
We propose three novel techniques for better load balancing and resource utilization. Several related studies have also been proposed [17], [18], [43], [44]. Thread Tailor [43] adjusted the number of threads by combining multiple CPU threads into a merged thread based on profile results. Lee [18] and Kayiran [17] showed that allocating the maximum number of TBs on GPUs does not always guarantee the best performance, and suggested hardware-level approaches for finding and allocating the optimal number of TBs. Ho [44] introduced thread pairing, which merges two threads into one thread to vectorize operations on GPUs. These approaches are partially related to our approach.
VIII. CONCLUSION
This work proposed a novel optimization pass called the Block Reorganizer for outer-product-based spGEMM, with three block-level optimization techniques: B-Splitting, B-Gathering, and B-Limiting. The Block Reorganizer first identifies overloaded and underloaded thread blocks and then applies different techniques to them. It solves the SM-level load imbalance problem by splitting overloaded blocks into multiple small blocks using B-Splitting. For underloaded blocks, it increases in-SM computing unit utilization by gathering multiple underloaded blocks into a single block using B-Gathering. It also limits the number of allocated thread blocks on an SM using B-Limiting when overloaded rows exist in the merging process. Based on the three optimization techniques, it shows an average speedup of 1.43x in execution time compared to the baseline for the 28 real-world datasets on a target server-class GPU.

IX. ACKNOWLEDGMENTS
Thanks to Myung-Hwan Jang and Hyuck-Moo Gwon for all their help and feedback. We also thank the anonymous reviewers who provided good suggestions for improving the quality of this work. This work was supported by the Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1901-03. Yongjun Park is the corresponding author.

REFERENCES
[1] D.-H. Bae et al., "Constructing seminal paper genealogy," in Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011, pp. 2101-2104.
[2] G. He et al., "Parallel SimRank computation on large graphs with iterative aggregation," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 543-552.
[3] Y. Cai et al., "Efficient algorithm for computing link-based similarity in real world networks," in 2009 Ninth IEEE International Conference on Data Mining. IEEE, 2009, pp. 734-739.
[4] Y. Dong et al., "Link prediction and recommendation across heterogeneous social networks," in 2012 IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 181-190.
[5] Y. Koren et al., "Matrix factorization techniques for recommender systems," Computer, no. 8, pp. 30-37, 2009.
[6] J. Nickolls et al., "NVIDIA CUDA software and GPU parallel computing architecture," in Microprocessor Forum, May 2007.
[7] KHRONOS Group, "OpenCL - the open standard for parallel programming of heterogeneous systems," 2010, http://www.khronos.org.
[8] J. Leskovec et al., "Graph evolution: Densification and shrinking diameters," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 2, 2007.
[9] C. W. Kessler and C. Smith, "The SPARAMAT approach to automatic comprehension of sparse matrix computations," in Proceedings of the Seventh International Workshop on Program Comprehension. IEEE Computer Society, 1999, pp. 200-207.
[10] NVIDIA, "NVIDIA cuSPARSE library," http://developer.nvidia.com/cusparse.
[11] S. Dalton et al., "CUSP: Generic parallel algorithms for sparse matrix and graph computations," 2014, version 0.5.0. [Online]. Available: http://cusplibrary.github.io/
[12] W. Liu and B. Vinter, "An efficient GPU general sparse matrix-matrix multiplication for irregular data," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 370-381.
[13] Y.-Y. Jo et al., "Efficient sparse matrix multiplication on GPU for large social network analysis," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 1261-1270.
[14] S. Pal et al., "OuterSPACE: An outer product based sparse matrix multiplication accelerator," Feb. 2018, pp. 724-736.
[15] J. J. Elliott and C. M. Siefert, "Low thread-count Gustavson: A multithreaded algorithm for sparse matrix-matrix multiplication using perfect hashing," in 2018 IEEE/ACM 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), Nov. 2018, pp. 57-64.
[16] S. K. Raman et al., "Implementing streaming SIMD extensions on the Pentium III processor," IEEE Micro, vol. 20, no. 4, pp. 47-57, 2000.
[17] O. Kayiran et al., "Neither more nor less: Optimizing thread-level parallelism for GPGPUs," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT '13). IEEE Press, 2013, pp. 157-166. [Online]. Available: http://dl.acm.org/citation.cfm?id=2523721.2523745
[18] M. Lee et al., "Improving GPGPU resource utilization through alternative thread block scheduling," in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014, pp. 260-271.
[19] F. G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250-269, 1978.
[20] Y. Yu et al., "A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper)."
[21] NVIDIA, "NVIDIA Titan Xp graphics cards," 2017, https://www.nvidia.com/en-us/titan/titan-xp/.
[22] NVIDIA, "NVIDIA DGX Station," 2017, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/nvidia-dgx-station-datasheet.pdf.
[23] Intel, "Intel Xeon E5-2600 model specification," 2016, https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-brief.html.
[24] Intel, "Intel Xeon Gold 5115 model specification," 2017, https://ark.intel.com/content/www/kr/ko/ark/products/120484/intel-xeon-gold-5115-processor-13-75m-cache-2-40-ghz.html.
[25] NVIDIA, "NVIDIA Tesla V100," 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[26] NVIDIA, "NVIDIA RTX 2080 Ti graphics cards," 2018, https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/.
[27] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1-1:25, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663
[28] "Stanford large network dataset collection," http://snap.stanford.edu/data.
[29] D. Chakrabarti et al., "R-MAT: A recursive model for graph mining," in Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 442-446.
[30] D. Zheng et al., "FlashGraph: Processing billion-node graphs on an array of commodity SSDs," in 13th USENIX Conference on File and Storage Technologies (FAST 15), 2015, pp. 45-58.
[31] NVIDIA, Profiler User's Guide, 2018, http://docs.nvidia.com/cuda/pdf/CUDA_Profiler_Users_Guide.pdf.
[32] D. Hutchison et al., "Graphulo implementation of server-side sparse matrix multiply in the Accumulo database," in 2015 IEEE High Performance Extreme Computing Conference (HPEC), Sep. 2015, pp. 1-7.
[33] Intel, "Intel Math Kernel Library," 2003, https://software.intel.com/en-us/mkl.
[34] B. Xie et al., "CVR: Efficient vectorization of SpMV on x86 processors," in Proceedings of the 2018 International Symposium on Code Generation and Optimization. ACM, 2018, pp. 149-162.
[35] J. Zhang and L. Gruenwald, "Regularizing irregularity: Bitmap-based and portable sparse matrix multiplication for graph data on GPUs," in Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). ACM, 2018, p. 4.
[36] C. Hong et al., "Efficient sparse-matrix multi-vector product on GPUs," in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2018, pp. 66-79.
[37] J. Liu et al., "Register-based implementation of the sparse general matrix-matrix multiplication on GPUs," in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, 2018, pp. 407-408. [Online]. Available: http://doi.acm.org/10.1145/3178487.3178529
[38] F. Gremse et al., "GPU-accelerated sparse matrix-matrix multiplication by iterative row merging," SIAM Journal on Scientific Computing, vol. 37, pp. C54-C71, 2015.
[39] M. Winter et al., "Adaptive sparse matrix-matrix multiplication on the GPU," in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, 2019, pp. 68-81.
[40] K. Akbudak and C. Aykanat, "Simultaneous input and output matrix partitioning for outer-product-parallel sparse matrix-matrix multiplication," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C568-C590, 2014.
[41] D. Kernert et al., "Topology-aware optimization of big sparse matrices and matrix multiplications on main-memory systems," in 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016, pp. 823-834.
[42] M. M. A. Patwary et al., "Parallel efficient sparse matrix-matrix multiplication on multicore platforms," in International Conference on High Performance Computing. Springer, 2015, pp. 48-57.
[43] J. Lee et al., "Thread Tailor: Dynamically weaving threads together for efficient, adaptive parallel applications," in Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010, pp. 270-279.
[44] N.-M. Ho and W.-F. Wong, "Exploiting half precision arithmetic in NVIDIA GPUs," in 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2017, pp. 1-7.