XGC with Kokkos/Cabana: Plasma Physics on Summit and Beyond - 2020 Performance, Portability ...

XGC with Kokkos/Cabana: Plasma Physics on Summit and Beyond

             A. Scheinberg (1,4), S. Ethier (1), G. Chen (2), S. Slattery (3), R. Bird (2),
             E. D’Azevedo (3), S.-H. Ku (1), CS Chang (1)
                              In collaboration with ECP-CoPA

                               P3HPC 2020, Sept. 2

(1) Princeton Plasma Physics Laboratory
(2) Los Alamos National Laboratory
(3) Oak Ridge National Laboratory
(4) Jubilee Development
WDMApp Requires Exascale Computers and Beyond

  • Gigaflops
       – 5-D electrostatic ion physics in simplified circular cylindrical geometry
  • Teraflops
       – Core: 5D ion-scale electromagnetic physics in torus
       – Edge: ion+neutral electrostatic physics in torus
  • Petaflops
       – Core: 5D ion scale Maxwellian ion + electron electromagnetic
       – Edge: non-Maxwellian plasma, electrostatic physics
  • Exaflops
       – Core-edge coupled 5D electromagnetic study of whole-device ITER, including ion-scale
         turbulence + local electron-scale turbulence, profile evolution, large-scale instability,
         plasma-material interaction, rf heating, and energetic particles
  • Beyond
       – A WDMApp that includes necessary engineering reactor components, and applicable to
         leading alternate concepts (including stellarators); and possibly 6D whole device modeling
2 Exascale Computing Project
XGC outline
  • Gyrokinetic (i.e. 5D) particle-in-cell code on an unstructured grid

  [Figure: XGC main loop drawn as a cycle: field solve, electron push (x60), ion push, transfer of
   particle data between compute nodes, collisions/sources/diagnostics, charge scatter.
   Side panel labels: RK Step 1 (electron push), RK Step 2 (electron push), collisions.]

Why adopt Cabana and Kokkos?
  • Portability!
  • Let Kokkos and Cabana handle data management and kernel execution for
    easy portability between architectures
  • Reduce compiler dependencies (e.g. only PGI on Summit)

  [Figure: compiler support for OpenACC and Cuda Fortran among GNU, PGI and IBM]

  • Provide an easy/flexible framework for porting more kernels to GPU
  • Avoid code duplication
     – 3 previous versions: original, vectorized, Cuda. Would Cabana make it 4?
Old implementation of XGC

  Old setup:
  • All Fortran/Cuda Fortran

  How does a Fortran code adopt a C++ programming model?

  [Figure: code structure (key: Fortran, C++). Main program: Setup; Main Loop: Deposition,
   Field solve, Push (Cuda), Collisions (OpenACC)]

A Kokkos implementation of XGC

  New setup:
  • Keep Fortran main and kernels
  • C++ interface
       – “Light touch”: Localized modification
       – Gradual implementation
  • Unified, optimized code base

  [Figure: code structure (key: Fortran, C++). Main program: Setup, Setup; Main Loop:
   Kokkos interface with Deposition (Kokkos), Field solve, Kokkos interface with Push (Kokkos),
   Collisions (OpenACC)]

Data layout with Cabana
  • Cabana (ECP-CoPA): a library for particle-based applications
        – Built on Kokkos
        – Provides AoSoA (array of structures of arrays) for versatile layout

      C++
       // Define Cabana structure type:
       //   double[6] = phase, double[3] = constants, int = global id
       using ParticleDataTypes = Cabana::MemberDataTypes< double[6], double[3], int >;
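
The layout implied by this member list can be pictured without Cabana. The following is a minimal sketch in plain C++, not Cabana's implementation: the names `ParticleSoA`, `AoSoA`, `kVectorLength`, `phase` and `global_id` are hypothetical, and the inner width of 32 is only the CPU value quoted elsewhere in this deck. The lane index is made the fastest-varying one so that each block has the same shape as the Fortran `ptl_type` (`ph(vector_length,6)`) used by the kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical inner-array width (the deck quotes 32 on CPU, 1 on GPU).
constexpr std::size_t kVectorLength = 32;

// One "structure of arrays" block holding kVectorLength particles.
// The lane index varies fastest, like Fortran ph(vector_length,6).
struct ParticleSoA {
    double ph[6][kVectorLength];  // phase-space coordinates
    double ct[3][kVectorLength];  // per-particle constants
    int    gid[kVectorLength];    // global particle id
};

// Array of SoA blocks: particle p lives in block p / kVectorLength,
// lane p % kVectorLength. The last block may contain unused padding lanes.
struct AoSoA {
    std::vector<ParticleSoA> blocks;

    explicit AoSoA(std::size_t n_ptl)
        : blocks((n_ptl + kVectorLength - 1) / kVectorLength) {}

    std::size_t num_soa() const { return blocks.size(); }

    double& phase(std::size_t p, std::size_t dim) {
        assert(dim < 6 && p / kVectorLength < blocks.size());
        return blocks[p / kVectorLength].ph[dim][p % kVectorLength];
    }

    int& global_id(std::size_t p) {
        assert(p / kVectorLength < blocks.size());
        return blocks[p / kVectorLength].gid[p % kVectorLength];
    }
};
```

With this shape, 100 particles occupy 4 blocks, and particle 33 is lane 1 of block 1. Setting the width to 1 degenerates to an array of structures, which is one way to read the "1 on GPU" setting.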

Executing the Kokkos parallel_for
   • Kernel is called in a Kokkos parallel_for
C++
 Kokkos::RangePolicy<> range_policy( 0, n_items ); // n_ptl on GPU, n_structs on CPU

 // Execute parallel_for: one call to the Fortran kernel per item
 Kokkos::parallel_for( "my_operation", range_policy, KOKKOS_LAMBDA( const int idx )
 {
   push_f( p_loc + idx, idx );
 });

      • Must cast Cabana array into Fortran type for use in Fortran kernels
      Fortran
      module ptl_module
        use, intrinsic :: ISO_C_BINDING
        ! Mirrors one Cabana SoA block: phase, constants, global id
        type, BIND(C) :: ptl_type
           real (C_DOUBLE) :: ph(vector_length,6)
           real (C_DOUBLE) :: ct(vector_length,3)
           integer (C_INT) :: gid(vector_length)
        end type ptl_type
      end module

      • Inner loops for vectorization on CPU
      Fortran
      subroutine push_f(particle_vec, i_vec) BIND(C,name='push_f')
        USE, INTRINSIC :: ISO_C_BINDING
        use ptl_module
        type(ptl_type) :: particle_vec
        integer(C_INT), value :: i_vec
        integer :: i
        do i=1, simd_size ! 32 on CPU, 1 on GPU
          ... ! Vectorizable loop that advances particle positions
        end do
      end subroutine
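
The two comments in the code above ("n_ptl on GPU, n_structs on CPU" and "32 on CPU, 1 on GPU") describe one loop nest with two shapes. Below is a minimal serial sketch of that idea, assuming a toy one-dimensional push: the outer `for` is only a stand-in for `Kokkos::parallel_for`, and `LaunchShape`, `make_shape` and `push_all` are hypothetical names, not XGC functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical launch configuration: number of outer (parallel) iterations
// and number of particles handled inside each one.
struct LaunchShape {
    std::size_t n_items;    // outer iterations
    std::size_t simd_size;  // inner, vectorizable iterations
};

// CPU: pass vector_length = 32, giving one outer iteration per SoA block.
// GPU: pass vector_length = 1, giving one outer iteration per particle.
inline LaunchShape make_shape(std::size_t n_ptl, std::size_t vector_length) {
    assert(vector_length > 0);
    const std::size_t n_structs = (n_ptl + vector_length - 1) / vector_length;
    return LaunchShape{n_structs, vector_length};
}

// Serial stand-in for the parallel loop: advances a toy position by v * dt.
// Returns how many particles were actually updated.
inline std::size_t push_all(std::vector<double>& x, const std::vector<double>& v,
                            double dt, LaunchShape shape) {
    std::size_t updated = 0;
    for (std::size_t idx = 0; idx < shape.n_items; ++idx) {    // parallel in the real code
        for (std::size_t i = 0; i < shape.simd_size; ++i) {    // vectorizable inner loop
            const std::size_t p = idx * shape.simd_size + i;
            if (p >= x.size()) break;                          // skip padding lanes
            x[p] += v[p] * dt;
            ++updated;
        }
    }
    return updated;
}
```

Both shapes visit every particle exactly once; only the split between outer and inner iterations changes, which is what lets the same kernel source serve both back-ends.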

Timing on Summit (256 nodes)
   • Overall speed-up: 15x over CPU only
   • CPU-GPU communication costs low
         – Actual electron transfer time shown
         – Favors simple approach to communication

   [Figure: time-step timelines, CPU-only vs. CPU+GPU; electron push dominates the CPU-only run.
    CPU+GPU breakdown for RK Step 1 and RK Step 2:
      CPU: ion scatter, ion push, ion shift, other
      GPU: electron scatter and electron push (Cuda via Kokkos), collisions (OpenACC)]

Summit performance comparison and scaling

   [Figure: old CPU version, old specialized Cuda version, Cabana version]

Cori KNL performance comparison and scaling

   [Figure: old CPU version, old specialized CPU version, Cabana version]

Transition to C++
       • Diverse architectures coming up in the near future
             – Challenges lie ahead for portability

                    Supercomputer   Year       Petaflops   Architecture   Language
                    Summit          2019       200         Nvidia GPUs    Cuda
                    Perlmutter      2020       100         Nvidia GPUs    Cuda
                    Aurora          2021 (?)   1,000       Intel GPUs     SYCL
                    Frontier        2021       1,500       AMD GPUs       HIP
                    Fugaku          2021       1,000       ARM            Fortran/C++

             – Support generally better for C++ than Fortran
       • Easier use of Kokkos/Cabana if code is in C++

The Cabana Fortran implementation of XGC

  The “Cabana Fortran” implementation
  • Keep Fortran main and kernels
  • C++ interface
       – “Light touch”: Localized modification
       – Gradual implementation
  • Unified, optimized code base
  • Downsides:
       – Inflexible macros
       – No HIP/SYCL support
       – Tedious data transfer

  [Figure: code structure (key: Fortran, C++). Main program: Setup, Setup; Main Loop:
   Kokkos interface with Deposition (Kokkos), Field solve, Kokkos interface with Push (Kokkos),
   Collisions (OpenACC)]

Kokkos C++ Implementation of XGC

  Current setup:
  • Transition to C++ continues
       – Main loop in C++ for easier memory management
  • Integrated Kokkos/Cabana
  • Arrays (field etc.) passed from Fortran, copied to Kokkos views
  • No explicit Cuda or OpenMP
  • Ready for any architectures with Kokkos and OpenACC
       – In theory

  [Figure: code structure (key: Fortran, C++). Main program: Setup, Setup; Main Loop:
   Deposition (Kokkos), Field solve, Push (Kokkos), Collisions (OpenACC)]

Converting the collision kernel to Kokkos

  Motivation:
  • Pitfalls of multiple programming models (Kokkos and OpenACC)
       – Memory management
       – Compiler compatibility
       – More opportunities for something to go wrong
  • Converting to C++ anyway

  [Figure: code structure (key: Fortran, C++). Main program: Setup, Setup; Main Loop:
   Deposition (Kokkos), Field solve, Push (Kokkos), Collisions (Kokkos)]

Converting the collision kernel to Kokkos
       • Problem: Collisions computed separately for each mesh node
            – ~5,000 mesh nodes per GPU
            – ~20 Kokkos kernels each

            – Kernels loop over ~1,000 elements -> GPUs underutilized

            – Some calculations still on CPU (harder to port) -> More GPU idle time
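
A rough back-of-envelope check of the figures above shows why the GPU sits mostly idle. This is a sketch under stated assumptions: it takes one GPU thread per element, and it uses NVIDIA V100 limits (80 SMs, up to 2048 resident threads per SM) that do not come from the slides. All names are hypothetical.

```cpp
#include <cassert>

// Approximate figures quoted on the slide.
constexpr long kMeshNodesPerGpu   = 5000;
constexpr long kKernelsPerNode    = 20;
constexpr long kElementsPerKernel = 1000;

// Assumption, not from the slides: NVIDIA V100 hardware limits.
constexpr long kSMs             = 80;
constexpr long kMaxThreadsPerSM = 2048;

// Total kernel launches needed per GPU for one collision step.
constexpr long kernel_launches_per_gpu() { return kMeshNodesPerGpu * kKernelsPerNode; }

// Maximum number of threads the GPU can keep resident at once.
constexpr long resident_thread_capacity() { return kSMs * kMaxThreadsPerSM; }

// Fraction of resident-thread capacity filled when `concurrent_kernels`
// kernels run at the same time, one thread per element.
inline double fill_fraction(long concurrent_kernels) {
    assert(concurrent_kernels > 0);
    return static_cast<double>(concurrent_kernels * kElementsPerKernel)
           / static_cast<double>(resident_thread_capacity());
}
```

Under these assumptions a single kernel fills under 1% of the resident-thread capacity, and the step needs about 100,000 small launches per GPU, so running several kernels concurrently is the natural lever.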

Converting the collision kernel to Kokkos
       • Approach: Multiple streams
            – Already done in our OpenACC implementation
            – Kokkos also supports Cuda streams
            – OpenMP parallel region, each OpenMP thread gets its own Cuda stream

       • Result:
            – GPU usage much higher
            – 25% speed-up from OpenACC Fortran version
            – Still room for improvement (2-4x)?

       • Downside: Possible portability challenges
            – Will multiple streams be a viable option for various
              Kokkos back-ends and architectures?

       [Figure: results for 1, 2, 4, 8, 12 and 14 streams, with a 1/n_threads reference line
        and the OpenACC version marked]
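
The work division behind the multi-stream approach can be sketched without Kokkos, OpenMP or Cuda. The sketch below only computes which mesh nodes each stream would handle and the ideal 1/n_threads reference; the round-robin assignment is an assumption made for illustration (the slides do not say how nodes are scheduled), and `assign_nodes` and `ideal_relative_time` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed static round-robin: stream s handles nodes s, s + n, s + 2n, ...
// In the real code each OpenMP thread would submit its nodes' kernels
// to its own Cuda stream.
inline std::vector<std::vector<std::size_t>> assign_nodes(std::size_t n_nodes,
                                                          std::size_t n_streams) {
    assert(n_streams > 0);
    std::vector<std::vector<std::size_t>> work(n_streams);
    for (std::size_t node = 0; node < n_nodes; ++node) {
        work[node % n_streams].push_back(node);
    }
    return work;
}

// Time relative to a single stream if all streams overlapped perfectly,
// i.e. the 1/n_threads reference.
inline double ideal_relative_time(std::size_t n_streams) {
    assert(n_streams > 0);
    return 1.0 / static_cast<double>(n_streams);
}
```

For example, 5,000 mesh nodes over 8 streams gives 625 nodes per stream. Measured timings would be expected to fall short of the ideal line once the GPU saturates or CPU-side work becomes the bottleneck.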

Summary
 • XGC with Kokkos/Cabana is performing well on Summit and Cori KNL
 • All major kernels offloaded to GPU with Kokkos
       – Electron push, collisions; also charge deposition, sorting

 • More compiler flexibility (no longer tied to PGI on Summit)

   Future challenges
 • Moving more XGC kernels to Cabana framework
       – More optimization possible

 • GPU-GPU communication
       – Potentially rely on Cabana for this

 • Ensuring diverging developments can benefit
       – ECP-WDM projects (coupling with GENE, GEM, HPIC for whole-device modeling) and other science goals
         are on different branches
