INTRODUCTION TO GPU COMPUTING - Jeff Larkin, June 28, 2018
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Power for CPU-only Power for the Bay Area, CA
Exaflop Supercomputer = (San Francisco + San Jose)
HPC’s Biggest Challenge: Power
33CUDA ECOSYSTEM 2018
CUDA CUDA
GTC
DOWNLOADS REGISTERED
ATTENDEES
IN 2017 DEVELOPERS
8,000+
3,500,000 800,000
7CUDA APPLICATION ECOSYSTEM
From Ease of Use to Specialized Performance
CUDA-C++
CUDA Fortran
Directives and Specialized
Applications Frameworks Libraries
Standard Languages Languages
8ACCELERATED COMPUTING
10X PERFORMANCE & 5X ENERGY EFFICIENCY FOR HPC
GPU Accelerator
CPU Optimized for
Optimized for Parallel Tasks
Serial Tasks
10ACCELERATED COMPUTING
10X PERFORMANCE & 5X ENERGY EFFICIENCY FOR HPC
GPU CPU Strengths
Accelerator
CPU Optimized for
• Very large main memory
Parallel Tasks
Optimized for • Very fast clock speeds
Serial Tasks • Latency optimized via large caches
• Small number of threads can run
very quickly
CPU Weaknesses
• Relatively low memory bandwidth
• Cache misses very costly
• Low performance/watt
11ACCELERATED COMPUTING
10X PERFORMANCE & 5X ENERGY EFFICIENCY FOR HPC
GPU Strengths GPU Accelerator
CPU Optimized for
• High bandwidth main memory Parallel Tasks
Optimized
• Latency tolerant for
via parallelism
• SignificantlySerial
more Tasks
compute
resources
• High throughput
• High performance/watt
GPU Weaknesses
• Relatively low memory capacity
• Low per-thread performance
12ACCELERATED COMPUTING
10X PERFORMANCE & 5X ENERGY EFFICIENCY FOR HPC
GPU Accelerator
CPU Optimized for
Optimized for Parallel Tasks
Serial Tasks
13Speed v. Throughput
Speed Throughput
Which is better depends on your needs…
*Images from Wikimedia Commons via Creative Commons 14Accelerator Nodes
CPU and GPU have distinct
RAM RAM memories
• CPU generally larger and slower
• GPU generally smaller and faster
Execution begins on the CPU
• Data and computation are
offloaded to the GPU
CPU and GPU communicate via PCIe
• Data must be copied between
these memories over PCIe
• PCIe Bandwidth is much lower
than either memories
PCIe
15HOW GPU ACCELERATION WORKS
Application Code
Compute-Intensive Functions
Rest of Sequential
Small % of CPU Code
GPU Code CPU
+ 163 WAYS TO PROGRAM GPUS
Applications
OpenACC Programming
Libraries
Directives Languages
“Drop-in” Easily Accelerate Maximum
Acceleration Applications Flexibility
17SIMPLICITY & PERFORMANCE
Simplicity Accelerated Libraries
Little or no code change for standard libraries; high performance
Limited by what libraries are available
Compiler Directives
High Level: Based on existing languages; simple and familiar
High Level: Less control over performance
Parallel Language Extensions
Expose low-level details for maximum performance
Often more difficult to learn and more time consuming to implement
Performance
18CODE FOR SIMPLICITY & PERFORMANCE
• Implement as much as possible using
Libraries portable libraries.
• Use directives to rapidly
Directives accelerate your code.
Languages • Use lower level languages
for important kernels.
19GPU DEVELOPER ECO-SYSTEM
Debuggers GPU Compilers Auto-parallelizing
Numerical
Packages & Profilers & Cluster Tools Libraries
C
cuda-gdb C++ BLAS
MATLAB NV Visual Profiler Fortran OpenACC FFT
Mathematica Parallel Nsight Java mCUDA LAPACK
NI LabView Visual Studio Python OpenMP NPP
pyCUDA Allinea Ocelot Video
TotalView Imaging
GPULib
Consultants & Training OEM Solution Providers
ANEO GPU Tech
20GPU LIBRARIES
21LIBRARIES: EASY, HIGH-QUALITY ACCELERATION
EASE OF USE Using libraries enables GPU acceleration without in-depth
knowledge of GPU programming
“DROP-IN” Many GPU-accelerated libraries follow standard APIs, thus
enabling acceleration with minimal code changes
QUALITY Libraries offer high-quality implementations of functions
encountered in a broad range of applications
PERFORMANCE NVIDIA libraries are tuned by experts
22GPU ACCELERATED LIBRARIES
“Drop-in” Acceleration for Your Applications
Linear Algebra NVIDIA
cuFFT,
FFT, BLAS, cuBLAS,
SPARSE, Matrix cuSPARSE
Numerical & Math
RAND, Statistics NVIDIA
Math Lib NVIDIA cuRAND
Data Struct. & AI
Sort, Scan, Zero Sum GPU AI – Board GPU AI –
Games Path Finding
Visual Processing NVIDIA
Image & Video NVIDIA Video
NPP Encode
23DROP-IN ACCELERATION
In Two Easy Steps
int N = 1DROP-IN ACCELERATION
With Automatic Data Management
int N = 1DROP-IN ACCELERATION
With Automatic Data Management
int N = 1DROP-IN ACCELERATION
With Explicit Data Management
int N = 1EXPLORE CUDA LIBRARIES
developer.nvidia.com/gpu-accelerated-libraries
28OPENACC DIRECTIVES
29OpenACC is a directives- Add Simple Compiler Directive
based programming approach
main()
to parallel computing {
designed for performance
#pragma acc kernels
{
and portability on CPUs
}
}
and GPUs for HPC.
30University of Illinois
main() PowerGrid- MRI Reconstruction
{
#pragma acc kernels
//automatically runs on GPU
{
OpenACC }
}
70x Speed-Up
2 Days of Effort
Simple | Powerful | Portable
RIKEN Japan
NICAM- Climate Modeling
Fueling the Next Wave of
8000+
Scientific Discoveries in HPC Developers
using OpenACC
7-8x Speed-Up
5% of Code Modified
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf 31
31
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7SINGLE CODE FOR MULTIPLE PLATFORMS
OpenACC - Performance Portable Programming Model for HPC
AWE Hydrodynamics CloverLeaf mini-App, bm32 data set
80x 77x
PGI OpenACC
POWER Intel OpenMP
60x
Speedup vs Single Haswell Core
IBM OpenMP
52x
Sunway
x86 CPU 40x
x86 Xeon Phi
20x
NVIDIA GPU 9x 10x 10x 11x 11x
9x
PEZY-SC
0x Dual Haswell 1 Tesla 1 Tesla
Dual Broadwell Dual POWER8
P100 V100
Systems: Haswell: 2x16 core Haswell server, four K80s, CentOS 7.2 (perf-hsw10), Broadwell: 2x20 core Broadwell server, eight P100s (dgx1-prd-01), Minsky: POWER8+NVLINK, four P100s,
RHEL 7.3 (gsn1).
Compilers: Intel 17.0, IBM XL 13.1.3, PGI 16.10.
Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7 2016; CloverlLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC
(MPI+OpenACC)
Data compiled by PGI November 2016, Volta data collected June 2017
32OpenACC COMPILER DIRECTIVES
Parallel C Code Parallel Fortran Code
void saxpy(int n, subroutine saxpy(n, a, x, y)
float a, real :: x(:), y(:), a
float *x, integer :: n, i
float *y) !$acc kernels
{ do i=1,n
y(i) = a*x(i)+y(i)
#pragma acc kernels
enddo
for (int i = 0; i < n; ++i) !$acc end kernels
y[i] = a*x[i] + y[i]; end subroutine saxpy
}
... ...
// Perform SAXPY on 1M elements ! Perform SAXPY on 1M elements
saxpy(1PROGRAMMING
LANGUAGES
34Standard C CUDA C Parallel C
__global__
void saxpy(int n, float a, void saxpy(int n, float a,
float *x, float *y) float *x, float *y)
{ {
for (int i = 0; i < n; ++i) int i = blockIdx.x*blockDim.x + threadIdx.x;
y[i] = a*x[i] + y[i]; if (i < n) y[i] = a*x[i] + y[i];
} }
int N = 1THRUST C++ TEMPLATE LIBRARY
Serial C++ Code
with STL and Boost
Parallel C++ Code
int N = 1CUDA FORTRAN
Standard Fortran Parallel Fortran
module mymodule contains module mymodule contains
subroutine saxpy(n, a, x, y) attributes(global) subroutine saxpy(n, a, x, y)
real :: x(:), y(:), a real :: x(:), y(:), a
integer :: n, i integer :: n, i
do i=1,n attributes(value) :: a, n
y(i) = a*x(i)+y(i) i = threadIdx%x+(blockIdx%x-1)*blockDim%x
enddo if (iStandard Python
PYTHON Numba Parallel Python
import numpy as np
import numpy as np from numba import vectorize
@vectorize(['float32(float32, float32,
def saxpy(a, x, y): float32)'], target='cuda')
return [a * xi + yi def saxpy(a, x, y):
for xi, yi in zip(x, y)] return a * x + y
x = np.arange(2**20, dtype=np.float32) N = 1048576
y = np.arange(2**20, dtype=np.float32)
# Initialize arrays
A = np.ones(N, dtype=np.float32)
B = np.ones(A.shape, dtype=A.dtype)
cpu_result = saxpy(2.0, x, y) C = np.empty_like(A, dtype=A.dtype)
# Add arrays onGPU
C = saxpy(2.0, X, Y)
38
http://numpy.scipy.org https://numba.pydata.orgENABLING ENDLESS WAYS TO SAXPY
CUDA New Language
C, C++, Fortran Support
• Build front-ends for Java, Python, R, DSLs
• Target other processors like ARM, FPGA, GPUs,
x86 LLVM Compiler
For CUDA
CUDA Compiler Contributed to NVIDIA x86 New Processor
GPUs CPUs Support
Open Source LLVM
39CUDA TOOLKIT – DOWNLOAD TODAY!
Everything You Need to Accelerate Applications
CUDA DOCUMENTATION GETTING STARTED RESOURCES INDUSTRY APPLICATIONS
Installation Best Practices
Guide Guide
Programming CUDA Tools
Guide Guide
API Reference Samples
developer.nvidia.com/cuda-toolkit
40You can also read