Performance Engineering - Basic skills and knowledge - RRZE Moodle

Page created by Cory Mason
Performance Engineering

Basic skills and knowledge
Optimizing code: The big picture
 1 Reduce algorithmic work
 2 Minimize processor work (instruction code)
 3 Distribute work and data for optimal utilization of parallel resources
 4 Avoid slow data paths
 5 Use most effective execution units on chip
 6 Avoid bottlenecks
 [Figure: two memory domains with four cores each, showing SIMD and FMA units and the L1, L2, and L3 cache levels]

Performance Engineering Basics                                                                                       (c) NHR@FAU 2021   2
Focus on time to solution
  Metrics often used in PE publications:
       •   MFlops/s
       •   Cache miss rate
       •   Speedup

 Time to solution is all that matters!

Runtime contributions and critical path
     Every activity adds a runtime contribution
     Simplest case: all runtime contributions accumulate to time to solution
     Due to concurrency, runtime contributions can overlap with each other
     The critical path is the series of runtime contributions that do not overlap
      and together form the total runtime

 Anything that takes time is only relevant for optimization if it appears on the
 critical path!
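
 As a worked example, consider two hypothetical runtime contributions, computation and communication. The sketch below assumes they either run strictly one after the other or overlap perfectly; the function names and the idealized overlap model are illustrative assumptions, not part of the slides.

```c
/* No concurrency: all runtime contributions accumulate. */
double runtime_serial(double t_comp, double t_comm)
{
    return t_comp + t_comm;
}

/* Perfect overlap (an idealizing assumption): only the longer
   contribution lies on the critical path. */
double runtime_overlapped(double t_comp, double t_comm)
{
    return (t_comp > t_comm) ? t_comp : t_comm;
}
```

 With 8 s of computation and 3 s of communication, the serial case takes 11 s and the overlapped case 8 s. Cutting communication to 1 s leaves the overlapped runtime at 8 s, because communication is not on the critical path.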

Performance Engineering process

 [Figure: flow chart of the iterative performance engineering process]
 Start from a runtime profile. For every hotspot, iteratively:
       •   Algorithm/code analysis
       •   Application benchmarking
       •   HPM performance profile
       •   Optional: performance model, traces/HW metrics
       •   Identify performance issues
       •   Develop performance expectation
       •   Change runtime configuration or optimize implementation

Runtime profiling with gprof
 gprof is instrumentation-based.
 Compile and link with the -pg switch:
 icc -pg -O3 -c myfile1.c

 Execute the application. During execution a file gmon.out is generated.
 Analyze the results with:
 gprof ./a.out | less

 The output contains three parts: A flat profile, the call graph, and an alphabetical
 index of routines.

 The flat profile is what you are usually interested in.

Runtime profile with gprof: Flat profile
 For each routine the flat profile lists:
       •   Time spent in the routine itself
       •   How often it was called
       •   How much time was spent per call

 Output is sorted according to total time spent in routine.
Sampling-based runtime profile with perf
 Call executable with perf:
 perf record -g ./a.out
 Analyze the results with:
 perf report

 Advantages vs. gprof:
       •   Works on any binary without [...]
       •   Also captures OS and runtime symbols

Command line version of Intel Amplifier

 Works out of the box for MPI/OpenMP parallel applications.

 Example usage with MPI:
 mpirun -np 2 amplxe-cl -collect hotspots -result-dir myresults -- a.out

  Compile with debugging symbols
  Can also resolve inlined C++ routines
  Many more collect modules available including
   hardware performance monitoring metrics

Application benchmarking
 Performance measurements have to be accurate, deterministic and reproducible.

 Components for application benchmarking:

       •   Timing
       •   Documentation
       •   Affinity control


 Always run benchmarks on an EXCLUSIVE SYSTEM!

Timing within program code
 For benchmarking, an accurate wallclock timer (end-to-end stop watch) is required:
  clock_gettime(), POSIX compliant timing function
  MPI_Wtime() and omp_get_wtime(), standardized programming-model-
   specific timing routines for MPI and OpenMP

 #include <time.h>   /* clock_gettime(), struct timespec */

 double getTimeStamp()
 {
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return (double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9;
 }

 Usage:
     double S, E;
     S = getTimeStamp();
     /* measured code region */
     E = getTimeStamp();
     return E-S;
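
 Short code regions may run for less time than the timer can resolve reliably. One common remedy, sketched here as an assumption and not taken from the slides, is to repeat the region until a minimum runtime is reached and report the average time per call. The helper timeKernel(), the VecSum type, and the example kernel vecSum() are hypothetical names introduced for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime() in strict C modes */
#endif
#include <time.h>

/* Wallclock time stamp in seconds, same approach as the slide's timer. */
static double getTimeStamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9;
}

/* Hypothetical helper: run `kernel` repeatedly, doubling the repetition
   count until one measurement lasts at least `min_seconds`.
   Returns the average wallclock time per kernel call in seconds. */
double timeKernel(void (*kernel)(void *), void *arg, double min_seconds)
{
    long reps = 1;
    double elapsed;

    for (;;) {
        double S = getTimeStamp();
        for (long i = 0; i < reps; i++) {
            kernel(arg);
        }
        double E = getTimeStamp();
        elapsed = E - S;
        if (elapsed >= min_seconds) {
            break;
        }
        reps *= 2;
    }
    return elapsed / (double)reps;
}

/* Illustrative kernel: sum a vector. The result is stored in a volatile
   field so the compiler cannot remove the loop. */
typedef struct {
    const double *data;
    long n;
    volatile double sink;
} VecSum;

void vecSum(void *arg)
{
    VecSum *v = (VecSum *)arg;
    double s = 0.0;
    for (long i = 0; i < v->n; i++) {
        s += v->data[i];
    }
    v->sink = s;
}
```

 A call such as timeKernel(vecSum, &v, 0.2) would then measure for at least 0.2 s; the threshold is a free choice and should be well above the timer resolution.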


System configuration and clock
 [Figure: two sockets, each with attached memory, annotated with the settings below]
       •   Turbo mode
       •   Frequency control
       •   Uncore clock
       •   QPI snoop mode
       •   Prefetcher settings
       •   Transparent huge pages
       •   Memory configuration
       •   NUMA balancing

 Tool for system state dump (requires Likwid tools):

Turbo mode and frequency control

 Query turbo mode steps with
 likwid-powermeter -i
 The reported steps are valid for scalar or SSE code; refer to vendor docs for
 AVX and AVX512 steps.

 Query and set frequencies with
 likwid-setFrequencies
 Always measure the actual frequency with an HPM tool. Using likwid-perfctr,
 any performance group reports clock frequencies.

Benchmark planning
  Two main variations:

  Core count
       •   Scale within memory domain
       •   Scale across [...]
       •   Scale across [...]
       •   Scaling baseline: one core (choosing the right scaling baseline)

  Dataset size
       •   Measure with one process (to start with)
       •   Scan dataset size in fine steps
       •   Verify the data volumes with a HPM tool
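
 One possible way to scan the dataset size in fine steps is a geometric progression, which keeps the relative step width constant across all sizes. The following is a minimal sketch under that assumption; the function names nextSize() and scanSizes() and the growth factor are hypothetical choices, not prescribed by the slides.

```c
#include <stddef.h>

/* Next problem size in a geometric scan. Always advances by at least
   one element so that small sizes with small factors do not stall. */
size_t nextSize(size_t n, double factor)
{
    size_t next = (size_t)((double)n * factor);
    return (next > n) ? next : n + 1;
}

/* Fill `sizes` with problem sizes from n_min up to and including n_max
   (if hit exactly), writing at most max_count entries.
   Returns the number of entries written. */
size_t scanSizes(size_t n_min, size_t n_max, double factor,
                 size_t *sizes, size_t max_count)
{
    size_t count = 0;
    for (size_t n = n_min; n <= n_max && count < max_count;
         n = nextSize(n, factor)) {
        sizes[count++] = n;
    }
    return count;
}
```

 A factor close to 1 (for example 1.1 or 1.2) yields fine steps; each generated size would then be benchmarked separately.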
Best practices for Performance profiling
 Focus on resource utilization and instruction decomposition!
 Metrics to measure:
       •   Operation throughput (Flops/s)
       •   Overall instruction throughput (CPI)
       •   Instruction breakdown: FP instructions, loads and stores, branch
           instructions, other instructions
       •   Instruction breakdown to SIMD width (scalar, SSE, AVX, AVX512 for X86);
           only arithmetic instructions on most architectures
       •   Data volumes and bandwidths to main memory (GB and GB/s)
       •   Data volumes and bandwidths to different cache levels (GB and GB/s)

 Useful diagnostic metrics are:
       •   Clock frequency (GHz)
       •   Power (W)

 All above metrics can be acquired using performance groups:
 MEM_DP, MEM_SP, BRANCH, DATA, L2, L3
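
 The derived metrics follow from raw event counts and the measured runtime by simple arithmetic. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the counts are plain input values that would come from an HPM tool, and a cache line size of 64 bytes is assumed.

```c
/* Cycles per instruction (CPI). */
double cpi(double cycles, double instructions)
{
    return cycles / instructions;
}

/* Operation throughput in MFlops/s. */
double mflopsPerSecond(double flops, double seconds)
{
    return flops / seconds / 1.0e6;
}

/* Data volume in GB from a number of transferred cache lines,
   assuming 64-byte cache lines. */
double dataVolumeGB(double cachelines)
{
    return cachelines * 64.0 / 1.0e9;
}

/* Bandwidth in GB/s for the same transfer. */
double bandwidthGBs(double cachelines, double seconds)
{
    return cachelines * 64.0 / seconds / 1.0e9;
}

/* Average clock frequency in GHz from elapsed core cycles. */
double clockGHz(double cycles, double seconds)
{
    return cycles / seconds / 1.0e9;
}
```

 For example, 1.0e9 cache lines transferred in 4 s correspond to a data volume of 64 GB and a bandwidth of 16 GB/s.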
The Performance Logbook
  Manual and knowledge collection on how to build, configure, and run the application

  Document activities and results in a structured way

  Learn about best practice guidelines for performance engineering

  Serve as a well-defined and simple way to exchange and hand over performance [...]

 The logbook consists of a single markdown document, helper scripts, and directories
 for input, raw results, and media files.

HPC Wiki
  Generic HPC-specific documentation Wiki

  Open for participation, login using SSO via eduGAIN

  Covers specifically performance engineering topics and skills

  Still work in progress, but improving!

