MUCOSIM WS 2020/2021 (SEE ALSO UNIVIS) - RRZE MOODLE

Page created by Sylvia Byrd
 
MuCoSim WS 2020/2021
http://tiny.cc/MuCoSim (see also univis)

Prof. G. Wellein
9.11.2020
MuCoSim Seminar WS 2020/2021

Time & place
• Monday 4pm – 5:30pm
• Zoom meetings until in-person seminars are allowed
• Updates: moodle

Requirements
• 1+1 talks
• 1 written report

What you get
• 5 ECTS credits
• Invaluable insights :-)

Mission

 We care about performance!

 MuCoSim SS 2020 3
Mission

Optimization & Parallelization on all modern compute architectures

→ Benchmarking & Performance Measurement

→ Understand interaction between code & hardware
 This is ours!
→ Performance modelling: Roofline model & ECM model

→ Performance tools:

 likwid – lightweight performance tools
 (https://github.com/RRZE-HPC/likwid)
 kerncraft – Loop Kernel Analysis & Performance Modeling Toolkit
 (https://github.com/RRZE-HPC/kerncraft)

What to do in the seminar – two groups of projects
Performance Measurement, Analysis and Optimization

• Get familiar with some code (C/C++/Fortran)
• Carefully measure and report (performance) numbers for
 (various) modern compute device(s)
• Implement (small) code modifications and measure their impact
• Build a (simple) performance model if necessary/possible

Performance Tools (likwid, kerncraft, OSACA)

• Analyse and/or extend feature set of tools
• Compare with other tools

What we expect

 • Basic knowledge of C, C++, or Fortran
 • Basic knowledge of Linux shell usage incl. editing
 • Basic knowledge of OpenMP and/or MPI parallelization (some
 projects)
 • Basic knowledge of Python or some other capable scripting
 language
 • Nice, but not strictly required: PTfS lecture (summer term)

 • You need to actively participate in two hands-on sessions
 where you learn
 • how to access and use our machines,
 • how to compile and run a code,
 • how to use our benchmarking and analysis tool likwid

Performance Measurement, Analysis and Optimization
Coupled oscillators as a model for parallel execution (Georg Hager)
• Synchronization phenomena with coupled oscillators are an intensely
 studied subject
 $$\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i)$$

 https://en.wikipedia.org/wiki/Kuramoto_model

• Parallel, communicating
 processes can be modeled
 as coupled oscillators
 • Compute-communicate
 phases are like oscillation
 • Communication acts as
 coupling
MuCoSim WS 2020/2021 16.11.20 8
Coupled oscillators as a model for parallel execution cont’d (Georg Hager)
 • Synchronization and desynchronization play important roles in
 parallel computing
 • Task: Simulate a (modified)
 Kuramoto model and adjust
 parameters to mimic parallel
 execution of coupled processes
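
As a starting point for the simulation task, the standard (unmodified) Kuramoto model can be integrated with a few lines of Python. This is only a minimal sketch: the explicit Euler scheme, the all-to-all coupling, the Gaussian frequency distribution, and all names (`kuramoto_step`, `order_parameter`, `simulate`) are assumptions made for illustration, not part of the seminar material.

```python
import math
import random


def kuramoto_step(theta, omega, coupling, dt):
    """Advance all phases by one explicit Euler step of the Kuramoto model."""
    n = len(theta)
    new_theta = []
    for i in range(n):
        # Mean-field coupling term: (K/N) * sum_j sin(theta_j - theta_i)
        interaction = sum(math.sin(theta[j] - theta[i]) for j in range(n))
        dtheta = omega[i] + coupling / n * interaction
        new_theta.append(theta[i] + dt * dtheta)
    return new_theta


def order_parameter(theta):
    """Return r in [0, 1]; r close to 1 means the phases are synchronized."""
    n = len(theta)
    re = sum(math.cos(t) for t in theta) / n
    im = sum(math.sin(t) for t in theta) / n
    return math.hypot(re, im)


def simulate(n=16, coupling=2.0, dt=0.01, steps=2000, seed=0):
    """Simulate n oscillators and return the final order parameter."""
    rng = random.Random(seed)
    # Assumption: random initial phases, natural frequencies around 1.0
    theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    omega = [rng.gauss(1.0, 0.1) for _ in range(n)]
    for _ in range(steps):
        theta = kuramoto_step(theta, omega, coupling, dt)
    return order_parameter(theta)


if __name__ == "__main__":
    for k in (0.0, 0.5, 2.0):
        print(f"K = {k}: r = {simulate(coupling=k):.3f}")
```

In the parallel-execution analogy, each oscillator would stand for one process and the coupling strength for the influence of communication; a modified model would replace the all-to-all sum with the actual communication topology.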

Analyze dense matrix-vector multiplication (Thomas Gruber)
 • Dense MVM is a common operation in HPC
 • Often part of HPC courses

 Task:
 • Establish simple performance model(s) for dMVM
 • Perform hardware measurements using LIKWID on different CPUs
 • Compare results with model and make refinements
 • Propose optimizations for naïve algorithm
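
A first, very simple performance model for the naïve dMVM could be a Roofline estimate. The sketch below assumes double precision, that the matrix is streamed from memory exactly once, and that the vectors stay in cache; the machine numbers in the usage example are hypothetical placeholders, not measurements.

```python
def naive_dmvm(a, x):
    """Naive dense matrix-vector multiplication y = A * x (row-major A)."""
    y = [0.0] * len(a)
    for i, row in enumerate(a):
        for j, a_ij in enumerate(row):
            y[i] += a_ij * x[j]  # 2 flops (multiply + add) per matrix element
    return y


def dmvm_code_balance(bytes_per_element=8):
    """Code balance B_c in bytes/flop for the inner loop.

    Assumption: only A[i][j] comes from main memory (one load per
    iteration); x and y are served from cache.
    """
    flops_per_iteration = 2
    return bytes_per_element / flops_per_iteration


def roofline(peak_gflops, mem_bw_gbs, code_balance):
    """Roofline prediction P = min(P_peak, b_S / B_c) in GFlop/s."""
    return min(peak_gflops, mem_bw_gbs / code_balance)


if __name__ == "__main__":
    # Hypothetical machine: 100 GFlop/s peak, 40 GB/s memory bandwidth
    print(roofline(100.0, 40.0, dmvm_code_balance()), "GFlop/s")
```

Comparing such a prediction with measured memory traffic shows whether the cache assumption for the vectors holds, which is where model refinements typically start.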

Analyze branch prediction systems of modern architectures (Thomas Gruber)

 • Common codes contain a lot of conditions → branches
 • CPUs try to predict the outcome to speculatively execute code
 sections
 • CPUs provide measurement facilities for branching

 Task:
 • Determine at what level of detail branching can be analyzed
 • How do mispredictions limit code execution (stalls, pipeline
 drains, …)?

HPCG (Christie L. Alappat)
 HPCG is a supercomputing benchmark
 (https://www.hpcg-benchmark.org) used to rank the world's
 most powerful supercomputers. The benchmark solves
 a linear system of equations using a multigrid-preconditioned
 conjugate-gradient (CG) algorithm.

• Experiment with the SpMV kernel and do a layer condition analysis of the HPCG
 matrix in CRS format. (Code given)
• Understand the SymGS kernel and its dependency problem, and implement level
 scheduling to parallelise the code. (Code for level scheduling will be given.)
• Now try to vectorize the SymGS code with the level scheduling scheme.
• If successful, run the entire analysis on the world's most powerful CPU (A64FX).
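
To make the two kernels more concrete, here is a small sketch of SpMV on a CRS matrix and of how levels for a forward Gauss-Seidel sweep can be derived from the sparsity pattern. It is an illustration only: the function names and the tiny test matrix are made up, and the code handed out in the seminar may be organized differently.

```python
def spmv_crs(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A * x with A stored in CRS format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def compute_levels(row_ptr, col_idx):
    """Assign each row a level for a forward (lower-triangular) sweep.

    Row i depends on every row j < i with a nonzero A[i][j], so its level
    is one more than the highest level among those rows.
    """
    n_rows = len(row_ptr) - 1
    level = [0] * n_rows
    for row in range(n_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            if col < row:
                level[row] = max(level[row], level[col] + 1)
    return level


def group_by_level(level):
    """Collect rows per level; rows within one level are independent."""
    groups = [[] for _ in range(max(level) + 1)] if level else []
    for row, lvl in enumerate(level):
        groups[lvl].append(row)
    return groups


if __name__ == "__main__":
    # Hypothetical 4x4 pattern: rows 0 and 1 are independent,
    # row 2 depends on both, row 3 depends on row 2.
    row_ptr = [0, 1, 2, 5, 7]
    col_idx = [0, 1, 0, 1, 2, 2, 3]
    print(group_by_level(compute_levels(row_ptr, col_idx)))
```

Rows that share a level can be processed in parallel, while the levels themselves must be executed in order; the number of rows per level therefore bounds the available parallelism.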

Study the caching behaviour of Intel YASK code (Christie L. Alappat)

 YASK is a stencil DSL framework developed by Intel.
 YaskSite is an in-house library built on top of YASK
 to support performance modeling of YASK-generated
 stencils. The topic concerns analyzing YaskSite's
 performance model using the pycachesim cache
 simulator.

 Pic source: https://software.intel.com/en-us/articles/eight-optimizations-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso

• Understand the pycachesim and YASK interfaces and couple them.
• Test for deviations from the analytical predictions for different star-shaped
 stencils, especially long-range ones.
• If the results deviate, identify which component is missing in the analytical model.
• Study the impact of spatial and temporal blocking.

Porting an MD force kernel to CUDA (Jan Eitzinger)
• Target code: MD-Bench (in-house Mini-App)
• Sequential C re-implementation of Mantevo Mini-MD
• Fewer than 1000 LOC

Task
• Port the force calculation kernel to GPU using CUDA
• Optional: Profile and analyse performance
• Optional: Optimize the performance

 https://github.com/RRZE-HPC/MD-Bench
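
To illustrate the loop structure that such a port has to deal with, the following sketch computes pairwise forces in plain Python. It assumes a Lennard-Jones pair potential with a cutoff and an all-pairs loop without neighbor lists; these are assumptions for illustration, and the actual MD-Bench kernel should be taken from the repository above.

```python
def lj_forces(positions, cutoff=2.5, epsilon=1.0, sigma=1.0):
    """Return the Lennard-Jones force on every atom (all-pairs, no neighbor list)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    cutoff_sq = cutoff * cutoff
    # The outer loop iterations are independent of each other; in a GPU
    # port, one natural (assumed) mapping is one thread per atom i.
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = xi - positions[j][0]
            dy = yi - positions[j][1]
            dz = zi - positions[j][2]
            rsq = dx * dx + dy * dy + dz * dz
            if rsq < cutoff_sq:
                sr2 = sigma * sigma / rsq
                sr6 = sr2 * sr2 * sr2
                # |F|/r = 48 * eps * (sigma/r)^6 * ((sigma/r)^6 - 0.5) / r^2
                factor = 48.0 * epsilon * sr6 * (sr6 - 0.5) / rsq
                forces[i][0] += dx * factor
                forces[i][1] += dy * factor
                forces[i][2] += dz * factor
    return forces


if __name__ == "__main__":
    # Two atoms at distance sigma repel each other along the x axis.
    print(lj_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))
```

Because every atom only accumulates into its own force entry, this full-neighbor formulation needs no atomic updates, at the cost of computing each pair twice.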

Performance Tools
Analyze Mini-Apps and Kernels with Intel Advisor (Georg Hager)
Intel Advisor provides insights into hardware utilization of
applications and advice for code optimization – including a roofline
analysis (wow!)

1. Get familiar with the tool
2. Analyze several (existing) kernels
 and applications
3. Compare Intel results with existing
 performance models and knowledge
 about bottlenecks.

Adding and testing PAPI to likwid-bench (Thomas Gruber)

 • PAPI provides an abstraction layer for various measurement
 facilities (e.g. hardware performance counters)
 • likwid-bench is a micro-benchmarking suite with assembly
 kernels

 Task:
 • Add PAPI calls to likwid-bench for common measurement groups
 (L2, L3, FLOPS_DP, FLOPS_SP, …)
 • Compare measurements of PAPI with LIKWID measurements

OSACA for Raspberry Pi 4 (Julian Hammer)

 • Create OSACA in-core execution model and validate for
 ARM Cortex-A72 architecture

 Software and techniques involved:
 • assembly, OSACA, asmbench, ibench, Python

Validation Fuzzer for OSACA with asmbench (Julian Hammer)

 • Create random benchmarks with fuzzing techniques using
 the asmbench tool

 • Compare results with IACA, OSACA and LLVM-MCA

 Software and techniques involved:
 • Python, fuzzing, llvm, llvm-ir, assembly, git

Kernel Explorer (Julian Hammer)

 • Build a website where Kerncraft can be used in the browser

 • Think compiler explorer (godbolt.org), but with Kerncraft

 Software and techniques involved:
 • Python, $webframework (e.g., django, flask), JS, HTML, CSS,
 Docker
Performance Analysis with Paraver (Ayesha Afzal)
Paraver – offline trace analysis tool (timelines, 2D/3D tables, statistics)
Dimemas – message passing simulator
Extrae – instrumentation

Tasks
• First talk: Getting familiar with Paraver and exploring the tool with simple test cases
 • Downloads: sources / binaries, Linux / Windows / Mac
 • Documentation: training guides, tutorial slides

• Second talk: Analysis of composite distributed applications with tool-provided features
 • Analyzing variability: time, IPC, instructions, cache miss ratio, …
 • Trace manipulation: filtering, cutting, …
 • Play around with latency and bandwidth parameters: network sensitivity, ideal machine, …
 • Through clustering: identify structure, track scalability, …
 • …
Required skills
• Basic knowledge of C/C++ and code parallelization with MPI

Provided material
• MPI parallelized benchmarks and algorithms
 (e.g., spMvM (irregular matrices), Jacobi (regular), ray tracer (load imbalances), etc.)

Parallel efficiency = LB eff * Comm eff
Parallel efficiency refinement: LB * μLB * Tr
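
The first factorization can be evaluated directly from per-process timings. The sketch below assumes the commonly used definitions (load balance efficiency = average over maximum useful compute time, communication efficiency = maximum useful compute time over total runtime); the function name and the timing numbers are hypothetical, and in practice the inputs would come from a trace.

```python
def efficiencies(useful_times, runtime):
    """Return (load balance eff., communication eff., parallel eff.).

    useful_times: per-process time spent in useful computation
    runtime:      wall-clock time of the whole parallel run
    """
    avg_useful = sum(useful_times) / len(useful_times)
    max_useful = max(useful_times)
    lb_eff = avg_useful / max_useful      # 1.0 means perfectly balanced work
    comm_eff = max_useful / runtime       # 1.0 means no time lost outside compute
    return lb_eff, comm_eff, lb_eff * comm_eff


if __name__ == "__main__":
    # Hypothetical timings (seconds) for four MPI processes
    lb, comm, par = efficiencies([8.0, 6.0, 10.0, 8.0], runtime=12.5)
    print(f"LB eff = {lb:.2f}, Comm eff = {comm:.2f}, Parallel eff = {par:.2f}")
```

With these definitions the product reduces to average useful time divided by runtime, so the split only tells you where the loss comes from, not how large it is.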
Open Talks from last semester
 Student (Tutor)               Topic (state)

 Michael Holzmann (Hager)      Stencils on Tsubasa (2nd talk pending)
 Maniranam (Dominik Ernst)     Modern Languages (1st talk)
 Maniranam (Dominik Ernst)     Modern Languages (2nd talk)
 Matthias König (TG)           Dense matrix transpose (2nd talk)
 Ravi Chandra (TG)             Threading models (1st talk, 30.11.)
 Ravi Chandra (TG)             Threading models (2nd talk)
Time schedule

 Date          Topic

 23.11.2020    Kerncraft on RasPi4 (T. Auerochs, 2nd talk)
 27.11.2020    Mandatory Intro Hands-On (Part 1)
 30.11.2020    Threading models in modern programming languages (R. Chandra, 1st talk)
 07.12.2020    Mandatory Intro Hands-On (Part 2)
