A DEEP DIVE INTO THE LATEST HPC SOFTWARE - Robbie Searles | Solutions Architect, NVIDIA Corporation

PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU and Network

Accelerated Standard Languages

std::transform(par, x, x+n, y, y,
    [=](float x, float y){ return y + a*x; });

do concurrent (i = 1:n)
   y(i) = y(i) + a*x(i)
enddo

import legate.numpy as np
…
def saxpy(a, x, y):
    y[:] += a*x

Incremental Portable Optimization

#pragma acc data copy(x,y) {

  ...

  std::transform(par, x, x+n, y, y,
      [=](float x, float y){ return y + a*x; });

  ...

}

Platform Specialization

__global__
void saxpy(int n, float a, float *x, float *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] += a*x[i];
}

int main(void) {
  ...
  cudaMemcpy(d_x, x, ...);
  cudaMemcpy(d_y, y, ...);

  saxpy(...);

  cudaMemcpy(y, d_y, ...);
  ...
}
Acceleration Libraries: Core | Math | Communication | Data Analytics | AI | Quantum
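As a concrete companion to the platform-specialization column above, the following is a minimal, self-contained CUDA sketch of the same saxpy; the problem size, launch configuration, and omitted error checking are illustrative choices, not taken from the slide.

// Minimal CUDA saxpy sketch (illustrative sizes; error checking omitted).
// Build with, e.g., "nvcc saxpy.cu".
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

int main(void) {
  const int n = 1 << 20;
  const float a = 2.0f;
  float *x = new float[n], *y = new float[n];
  for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 3.0f; }

  float *d_x, *d_y;
  cudaMalloc((void**)&d_x, n * sizeof(float));
  cudaMalloc((void**)&d_y, n * sizeof(float));
  cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

  saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);   // one thread per element

  cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("y[0] = %f\n", y[0]);                  // expect 5.0

  cudaFree(d_x); cudaFree(d_y);
  delete[] x; delete[] y;
  return 0;
}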
NVIDIA HPC SDK
                Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud

DEVELOPMENT
  Programming Models:      Standard C++ & Fortran | OpenACC & OpenMP | CUDA
  Compilers:               nvcc | nvc | nvc++ | nvfortran
  Core Libraries:          libcu++ | Thrust | CUB
  Math Libraries:          cuBLAS | cuTENSOR | cuSPARSE | cuSOLVER | cuFFT | cuRAND
  Communication Libraries: HPC-X (MPI, UCX, SHMEM, SHARP, HCOLL) | NVSHMEM | NCCL

ANALYSIS
  Profilers: Nsight Systems | Nsight Compute
  Debugger:  cuda-gdb (host & device)
                                                 Develop for the NVIDIA Platform: GPU, CPU and Interconnect
                                                  Libraries | Accelerated C++ and Fortran | Directives | CUDA
                                                             7-8 Releases Per Year | Freely Available

NVIDIA PERFORMANCE LIBRARIES
                                    Math Library Directions

Seamless Acceleration: Tensor Cores, Enhanced L2$ & SMEM
Scaling Up:            Multi-GPU and Multi-Node Libraries
Composability:         Device Functions
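To illustrate the "seamless acceleration" direction, here is a hedged sketch of a drop-in cuBLAS GEMM call; the matrix sizes, the TF32 math-mode hint, and the omitted error checking are illustrative assumptions, not from the deck.

// Sketch: drop-in GEMM via cuBLAS (C = alpha*A*B + beta*C), illustrative sizes.
// Link with -lcublas; error checking omitted for brevity.
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void) {
  const int n = 1024;                       // square matrices, column-major
  const float alpha = 1.0f, beta = 0.0f;
  std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);

  float *dA, *dB, *dC;
  cudaMalloc((void**)&dA, A.size() * sizeof(float));
  cudaMalloc((void**)&dB, B.size() * sizeof(float));
  cudaMalloc((void**)&dC, C.size() * sizeof(float));
  cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  // Optionally allow TF32 Tensor Core math on Ampere-class GPUs (cuBLAS 11+).
  cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);
  cublasDestroy(handle);

  cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("C[0] = %f\n", C[0]);         // expect 2048 (= 2*n) for these inputs
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}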

IMPROVED PERFORMANCE ON KEY HPC KERNELS
                                                Sparse Matrix x Vector and Sparse Triangular Solver

[Charts: speed-up on an A100 GPU. Left: cusparseSpMV vs. cusparseCsrmvEx (merge-path), 579 SuiteSparse matrices sorted by speed-up. Right: cusparseSpSV vs. cusparseXcsrsv2, 261 SuiteSparse matrices sorted by speed-up.]
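For readers still on the legacy cusparseCsrmvEx path, the sketch below shows roughly what a call through the generic cusparseSpMV API looks like; the tiny diagonal matrix and the algorithm/compute-type choices are illustrative assumptions, not from the deck.

// Sketch: sparse y = alpha*A*x + beta*y with the cuSPARSE generic API.
// Link with -lcusparse; error checking omitted. Matrix data is illustrative.
#include <cstdio>
#include <cuda_runtime.h>
#include <cusparse.h>

int main(void) {
  // 4x4 diagonal CSR matrix: diag(1, 2, 3, 4)
  const int rows = 4, cols = 4, nnz = 4;
  int   hOffsets[] = {0, 1, 2, 3, 4};
  int   hColumns[] = {0, 1, 2, 3};
  float hValues[]  = {1.f, 2.f, 3.f, 4.f};
  float hX[] = {1.f, 1.f, 1.f, 1.f}, hY[4] = {0.f, 0.f, 0.f, 0.f};
  const float alpha = 1.0f, beta = 0.0f;

  int *dOffsets, *dColumns; float *dValues, *dX, *dY;
  cudaMalloc((void**)&dOffsets, sizeof(hOffsets));
  cudaMalloc((void**)&dColumns, sizeof(hColumns));
  cudaMalloc((void**)&dValues, sizeof(hValues));
  cudaMalloc((void**)&dX, sizeof(hX));
  cudaMalloc((void**)&dY, sizeof(hY));
  cudaMemcpy(dOffsets, hOffsets, sizeof(hOffsets), cudaMemcpyHostToDevice);
  cudaMemcpy(dColumns, hColumns, sizeof(hColumns), cudaMemcpyHostToDevice);
  cudaMemcpy(dValues,  hValues,  sizeof(hValues),  cudaMemcpyHostToDevice);
  cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);
  cudaMemcpy(dY, hY, sizeof(hY), cudaMemcpyHostToDevice);

  cusparseHandle_t handle;   cusparseCreate(&handle);
  cusparseSpMatDescr_t matA; cusparseDnVecDescr_t vecX, vecY;
  cusparseCreateCsr(&matA, rows, cols, nnz, dOffsets, dColumns, dValues,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
  cusparseCreateDnVec(&vecX, cols, dX, CUDA_R_32F);
  cusparseCreateDnVec(&vecY, rows, dY, CUDA_R_32F);

  size_t bufferSize = 0; void *dBuffer = nullptr;
  // CUSPARSE_SPMV_ALG_DEFAULT in recent toolkits (older 11.x spell it CUSPARSE_MV_ALG_DEFAULT).
  cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                          vecX, &beta, vecY, CUDA_R_32F,
                          CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
  cudaMalloc(&dBuffer, bufferSize);
  cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
               vecX, &beta, vecY, CUDA_R_32F,
               CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

  cudaMemcpy(hY, dY, sizeof(hY), cudaMemcpyDeviceToHost);
  std::printf("y = %g %g %g %g\n", hY[0], hY[1], hY[2], hY[3]);  // expect 1 2 3 4

  cusparseDestroySpMat(matA); cusparseDestroyDnVec(vecX); cusparseDestroyDnVec(vecY);
  cusparseDestroy(handle);
  cudaFree(dOffsets); cudaFree(dColumns); cudaFree(dValues);
  cudaFree(dX); cudaFree(dY); cudaFree(dBuffer);
  return 0;
}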
cuSOLVER DISTRIBUTED MULTI-GPU LU SOLVER
Scalability on Top 5 supercomputers

• Supports standard 2D block cyclic distribution between processes
• > 2.2X speed-up A100 over V100
• Early Access available: https://developer.nvidia.com/cudamathlibraryea

[Chart: cuSOLVERMp LU with pivoting, performance [TFLOPS] vs. number of GPUs (16-1024) on a DGX A100 SuperPOD and the Summit supercomputer (V100).]
Data shown is for double-complex LU factorization with 27882 rows/cols per device.
HPL-AI UPDATE
                               Available on NGC as part of the HPC Benchmarks Container

• The first HPC Benchmarks release (2020.10) on NGC provided the HPL, HPL-AI, and HPCG benchmarks with optimized performance on NVIDIA-accelerated HPC systems.
• The 2021.04 release delivers a 2.6X speed-up for HPL-AI on a DGX SuperPOD.

https://ngc.nvidia.com/catalog/containers/nvidia:hpc-benchmarks

[Chart: HPL-AI vs. HPL performance on a DGX SuperPOD. PetaFLOPS (HPL-AI) vs. # of A100 80GB GPUs (up to 1024) for HPL-AI 20.10, HPL-AI 21.04, and HPL, with annotated speed-ups of 2.6x and 10.9x.]
PROGRAMMING THE NVIDIA PLATFORM
                                             CPU, GPU and Network

    Accelerated Standard Languages

std::transform(par, x, x+n, y, y,
    [=](float x, float y){ return y + a*x; });

do concurrent (i = 1:n)
   y(i) = y(i) + a*x(i)
enddo

import legate.numpy as np
…
def saxpy(a, x, y):
    y[:] += a*x

PARALLEL C++

• Composable, compact and elegant
• Easy to read and maintain
• ISO Standard
• Portable – nvc++, g++, icpc, MSVC, …

C++ with OpenMP:

static inline
void CalcHydroConstraintForElems(Domain &domain, Index_t length,
                                 Index_t *regElemlist, Real_t dvovmax, Real_t& dthydro)
{
#if _OPENMP
  const Index_t threads = omp_get_max_threads();
  Index_t hydro_elem_per_thread[threads];
  Real_t dthydro_per_thread[threads];
#else
  Index_t threads = 1;
  Index_t hydro_elem_per_thread[1];
  Real_t dthydro_per_thread[1];
#endif
#pragma omp parallel firstprivate(length, dvovmax)
  {
    Real_t dthydro_tmp = dthydro ;
    Index_t hydro_elem = -1 ;
#if _OPENMP
    Index_t thread_num = omp_get_thread_num();
#else
    Index_t thread_num = 0;
#endif
#pragma omp for
    for (Index_t i = 0 ; i < length ; ++i) {
      Index_t indx = regElemlist[i] ;

      if (domain.vdov(indx) != Real_t(0.)) {
        Real_t dtdvov = dvovmax / (FABS(domain.vdov(indx)) + Real_t(1.e-20)) ;

        if ( dthydro_tmp > dtdvov ) {
          dthydro_tmp = dtdvov ;
          hydro_elem = indx ;
        }
      }
    }
    dthydro_per_thread[thread_num] = dthydro_tmp ;
    hydro_elem_per_thread[thread_num] = hydro_elem ;
  }
  for (Index_t i = 1; i < threads; ++i) {
    if (dthydro_per_thread[i] < dthydro_per_thread[0]) {
      dthydro_per_thread[0] = dthydro_per_thread[i];
      hydro_elem_per_thread[0] = hydro_elem_per_thread[i];
    }
  }
  if (hydro_elem_per_thread[0] != -1) {
    dthydro = dthydro_per_thread[0] ;
  }

  return ;
}

Parallel C++17:

static inline
void CalcHydroConstraintForElems(Domain &domain, Index_t length,
                                 Index_t *regElemlist, Real_t dvovmax, Real_t &dthydro)
{
  dthydro = std::transform_reduce(
    std::execution::par, counting_iterator(0), counting_iterator(length),
    dthydro, [](Real_t a, Real_t b) { return a < b ? a : b; },
    [=, &domain](Index_t i)
    {
      Index_t indx = regElemlist[i];
      if (domain.vdov(indx) == Real_t(0.0)) {
        return std::numeric_limits<Real_t>::max();
      } else {
        return dvovmax / (std::abs(domain.vdov(indx)) + Real_t(1.e-20));
      }
    });
}
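To try the ISO C++ pattern in isolation, here is a minimal, self-contained sketch (not taken from LULESH) of the same transform_reduce min-reduction; it assumes thrust::counting_iterator as the index source, which is one common way to supply the counting_iterator used above.

// Minimal sketch of a transform_reduce min-reduction over an index range.
// Assumes the HPC SDK's Thrust headers for counting_iterator; build with,
// e.g., "nvc++ -stdpar=gpu min_dt.cpp" (or -stdpar=multicore for the CPU).
#include <cmath>
#include <cstdio>
#include <execution>
#include <limits>
#include <numeric>
#include <vector>
#include <thrust/iterator/counting_iterator.h>

int main() {
  const int n = 1 << 20;
  std::vector<double> vdov(n, 0.0);
  vdov[12345] = -3.0;                        // one "active" element
  const double dvovmax = 0.1;
  const double* v = vdov.data();             // raw pointer: capturable by value

  double dt = std::transform_reduce(
      std::execution::par,
      thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(n),
      std::numeric_limits<double>::max(),
      [](double a, double b) { return a < b ? a : b; },   // min reduction
      [=](int i) {
        return v[i] == 0.0 ? std::numeric_limits<double>::max()
                           : dvovmax / (std::fabs(v[i]) + 1.e-20);
      });

  std::printf("dt = %g\n", dt);
  return 0;
}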
LULESH PERFORMANCE
                             Relative Performance – Higher is Better

[Bar chart: relative performance of GCC and NVC++ builds of LULESH. OpenMP on 64c EPYC 7742: 1 and 1.03; Standard C++ on 64c EPYC 7742: 1.53 and 2.08; Standard C++ on A100: 13.57.]
                                                                Same ISO C++ Code
C++ STANDARD PARALLELISM
                                                                                    MAIA

[Charts: MAIA CPU performance on a dual-socket EPYC 7742 (relative performance for gcc OpenMP, gcc Standard C++, and nvc++ Standard C++) and MAIA A100 performance (nvc++ Standard C++ on 2x EPYC 7742 vs. on A100). Same Standard C++ code throughout.]
FORTRAN STANDARD PARALLELISM
                                        NWChem and GAMESS with DO CONCURRENT

[Charts: NWChem TCE CCSD(T) kernel speed-up for Standard Fortran on a 2s 20c Xeon 6148, Standard Fortran on A100, and OpenACC Fortran on A100 (same Standard Fortran code); GAMESS performance on V100 (NERSC Cori GPU), time in seconds vs. water cluster size (W8-W80) for OpenMP target offload and Standard Fortran.]

GAMESS results from Melisa Alkan and Gordon Group, Iowa State
PROGRAMMING THE NVIDIA PLATFORM
                                                      CPU, GPU and Network

Accelerated Standards:

import numpy as np
import pandas as pd

size = num_rows_per_gpu * num_gpus
key_l = np.arange(size)
payload_l = np.random.randn(size) * 100.0
lhs = pd.DataFrame({"key": key_l, "payload": payload_l})

key_r = key_l // 3 * 3
payload_r = np.random.randn(size) * 100.0
rhs = pd.DataFrame({"key": key_r, "payload": payload_r})

out = lhs.merge(rhs, on="key")

Platform Specialization:

import cudart

alloc1 = cudart.cudaMalloc(10)
alloc2 = cudart.cudaMalloc(10)
cudart.cudaMemset(alloc1, 5, 10)
cudart.cudaMemcpy(alloc2, alloc1, 10,
                  cudart.cudaMemcpyDeviceToDevice)

Acceleration Libraries: Core | Math | Communication | Data Analytics | AI | Quantum
PYTHON STANDARD PARALLELISM
Introducing Legate

Legate transparently accelerates and scales existing Numpy
and Pandas workloads

Program from the edge to the supercomputer in Python by
changing 1 import line

Pass data between Legate libraries without worrying about
distribution or synchronization requirements

Alpha release available at github.com/nv-legate

# import numpy as np
import legate.numpy as np
for _ in range(iter):
    un = u.copy()

    vn = v.copy()
    b = build_up_b(rho, dt, dx, dy, u, v)
    p = pressure_poisson_periodic(b, nit, p, dx, dy)

…

LEGATE NUMPY
Results from “CFD Python”
https://github.com/barbagroup/CFDPython

import legate.numpy as np

for _ in range(iter):
    un = u.copy()
    vn = v.copy()
    b = build_up_b(rho, dt, dx, dy, u, v)
    p = pressure_poisson_periodic(b, nit, p, dx, dy)

    u[1:-1, 1:-1] = (
        un[1:-1, 1:-1]
        - un[1:-1, 1:-1] * dt / dx * (un[1:-1, 1:-1] - un[1:-1, 0:-2])
        - vn[1:-1, 1:-1] * dt / dy * (un[1:-1, 1:-1] - un[0:-2, 1:-1])
        - dt / (2 * rho * dx) * (p[1:-1, 2:] - p[1:-1, 0:-2])
        + nu
        * (
            dt
            / dx ** 2
            * (un[1:-1, 2:] - 2 * un[1:-1, 1:-1] + un[1:-1, 0:-2])
            + dt
            / dy ** 2
            * (un[2:, 1:-1] - 2 * un[1:-1, 1:-1] + un[0:-2, 1:-1])
        )
        + F * dt
    )

[Chart: distributed NumPy performance (weak scaling), time in seconds vs. number of GPUs (1-1024, relative dataset size) for cuPy and Legate.]

  Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython
  Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations.
  Journal of Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
ACCELERATED STANDARDS
Parallel performance for wherever your code needs to run

std::transform(std::execution::par, x, x+n, y, y,
  [=] (auto xi, auto yi) { return yi + a*xi; });

                    do concurrent (i = 1:n)
                      y(i) = y(i) + a*x(i)
                    enddo

CPU:  nvc++ -stdpar=multicore          GPU:  nvc++ -stdpar=gpu
      nvfortran -stdpar=multicore            nvfortran -stdpar=gpu
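As a hedged, self-contained example of code that builds with the flags above, here is a complete saxpy using std::transform with a parallel execution policy; the file and variable names are illustrative.

// Minimal sketch: a complete saxpy using ISO C++ parallel algorithms.
// Build examples (one binary per target):
//   nvc++ -stdpar=gpu saxpy.cpp        (offload to GPU)
//   nvc++ -stdpar=multicore saxpy.cpp  (multicore CPU)
#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
  const int n = 1 << 20;
  const float a = 2.0f;
  std::vector<float> x(n, 1.0f), y(n, 3.0f);

  // y = a*x + y, expressed with std::transform and a parallel execution policy
  std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                 [=](float xi, float yi) { return yi + a * xi; });

  std::printf("y[0] = %f\n", y[0]);  // expect 5.0
  return 0;
}

The same source builds unmodified for multicore CPUs or GPUs by changing only the -stdpar target.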
THE NVIDIA HPC SDK IS READY FOR ARM

Auto Vectorization
  Vector Intrinsics:
    float32x2_t vadd_f32 (float32x2_t a, float32x2_t b);
    float32x4_t vdupq_n_f32 (float32_t value);
    float32x2_t vget_high_f32 (float32x4_t a);

Host Parallelism
  Standard Parallelism

FP Models
  • Precise
  • Fast
  • Relaxed

[Chart: estimated SPEC OMP® 2012 and SPEC CPU® 2017 floating point speed and rate geomean on an 80-core Ampere Altra; OMP 2012, FP Speed 2017, and FP Rate 2017 results for GNU 11.1.0, NVHPC 21.5, and NVHPC 21.7.]

SPEC OMP and SPEC CPU are registered trademarks of the Standard Performance Evaluation Corporation (www.spec.org)
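For comparison with auto-vectorization and standard parallelism, here is a hedged sketch of the same saxpy written with hand-coded NEON intrinsics from arm_neon.h; the function name and loop structure are illustrative, not from the deck.

// Sketch: saxpy with hand-written NEON intrinsics on Arm (AArch64).
#include <arm_neon.h>

void saxpy_neon(int n, float a, const float *x, float *y) {
  const float32x4_t va = vdupq_n_f32(a);       // broadcast the scalar a
  int i = 0;
  for (; i + 4 <= n; i += 4) {                 // 4 floats per 128-bit vector
    float32x4_t vx = vld1q_f32(x + i);
    float32x4_t vy = vld1q_f32(y + i);
    vy = vfmaq_f32(vy, va, vx);                // vy += a * vx (fused multiply-add)
    vst1q_f32(y + i, vy);
  }
  for (; i < n; ++i)                           // scalar tail
    y[i] += a * x[i];
}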
NSIGHT VISUAL STUDIO CODE EDITION

A Visual Studio Code extension that provides:

•   CUDA code syntax highlighting

•   CUDA code completion

•   Build warnings/errors

•   Debug CPU & GPU code

•   Remote connection support via SSH

•   Available on the VS Code Marketplace now!

[Screenshot: debugging session showing the Variables view, CPU & GPU registers, watched CPU & GPU variables, session status, debugger commands, the CUDA call stack, and CUDA focus.]
                                                             https://developer.nvidia.com/nsight-visual-studio-code-edition
NVIDIA HPC SDK
                Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud

DEVELOPMENT
  Programming Models:      Standard C++ & Fortran | OpenACC & OpenMP | CUDA
  Compilers:               nvcc | nvc | nvc++ | nvfortran
  Core Libraries:          libcu++ | Thrust | CUB
  Math Libraries:          cuBLAS | cuTENSOR | cuSPARSE | cuSOLVER | cuFFT | cuRAND
  Communication Libraries: HPC-X (MPI, UCX, SHMEM, SHARP, HCOLL) | NVSHMEM | NCCL

ANALYSIS
  Profilers: Nsight Systems | Nsight Compute
  Debugger:  cuda-gdb (host & device)
                                                 Develop for the NVIDIA Platform: GPU, CPU and Interconnect
                                                  Libraries | Accelerated C++ and Fortran | Directives | CUDA
                                                             7-8 Releases Per Year | Freely Available
