Correctness Tools: Ocelot Emulator - Andrew Kerr, Si Li, Magda Slawinska, Sudhakar Yalamanchili, Georgia Tech


Overview
• Base Infrastructure: Ocelot Dynamic Execution
  Environment
• The Ocelot Emulator
  – A research tool for detailed instruction level
    analysis
• Writing correctness checkers
Ocelot Dynamic Execution Infrastructure
Ocelot Vision: Multiplatform Dynamic Compilation

[Figure: language front-end, data parallel IR, and just-in-time code generation and optimization for data intensive applications. Image credits: esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]

• Environment for i) compiler research, ii) architecture
  research, and iii) productivity tools


Ocelot Structure [1]

[Figure: CUDA application, nvcc, PTX kernel]

• Ocelot depends on NVCC’s PTX compilation flow
• Analysis and transformations on PTX kernels
• Execute stock CUDA applications without modification

[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.
The PTX Emulator

NVIDIA’s PTX Execution Model

• Parallel Thread Execution (PTX)
• Explicit memory hierarchy
• Cooperative thread arrays (CTAs) and ordering constraints
• Array of multiprocessors, each executing a CTA (coarse grain)
• SIMD multiprocessors (fine grain)
• Single instruction multiple thread (SIMT) execution
• Memory bandwidth optimizations

Emulator Trace Generator Interface

Emulator broadcasts events to trace generators during execution
•   Events describe instruction execution, memory addresses, etc.
•   Used for error checking, instrumentation, and simulation

Trace Generator Interface
                 //! Base class for generating traces
                 class TraceGenerator {
                 public:
                     TraceGenerator();
                     virtual ~TraceGenerator();

                      // called when a traced kernel is launched to
                      // retrieve some parameters from the kernel
                      virtual void initialize(
                          const executive::ExecutableKernel& kernel);

                      // Called whenever an event takes place.
                      virtual void event(const TraceEvent & event);

                      // called when an event is committed
                      virtual void postEvent(const TraceEvent & event);

                      // Called when a kernel is finished. There will
                      //   be no more events for this kernel.
                      virtual void finish();
                 };
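
The callback order implied by this interface can be sketched without Ocelot itself. The block below is a minimal illustration, not Ocelot code: the `sketch` namespace, `Event`, `Generator`, `OrderRecorder`, and `runKernel` are hypothetical stand-ins, and the loop in `runKernel` only mimics how an emulator might broadcast to every attached generator before and after each instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical, simplified event; NOT Ocelot's TraceEvent.
struct Event {
    std::size_t pc;  // program counter of the dynamic instruction
};

// Stand-in base class with the same four callbacks as the interface above.
class Generator {
public:
    virtual ~Generator() {}
    virtual void initialize(const std::string&) {}
    virtual void event(const Event&) {}
    virtual void postEvent(const Event&) {}
    virtual void finish() {}
};

// Records the order in which callbacks arrive.
class OrderRecorder : public Generator {
public:
    std::vector<std::string> log;

    void initialize(const std::string& kernelName) override {
        log.push_back("initialize:" + kernelName);
    }
    void event(const Event& e) override {
        log.push_back("event@" + std::to_string(e.pc));
    }
    void postEvent(const Event& e) override {
        log.push_back("postEvent@" + std::to_string(e.pc));
    }
    void finish() override { log.push_back("finish"); }
};

// Stand-in for an emulator's broadcast loop over attached generators.
inline void runKernel(const std::string& name, std::size_t instructionCount,
                      const std::vector<Generator*>& generators) {
    for (Generator* g : generators) g->initialize(name);
    for (std::size_t pc = 0; pc < instructionCount; ++pc) {
        Event e{pc};
        for (Generator* g : generators) g->event(e);      // before execution
        // ... the instruction itself would execute here ...
        for (Generator* g : generators) g->postEvent(e);  // after it commits
    }
    for (Generator* g : generators) g->finish();
}

}  // namespace sketch
```

Attaching one `OrderRecorder` and running a two-instruction kernel yields six log entries: one `initialize`, an `event`/`postEvent` pair per instruction, and one `finish`.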

TraceEvent Object
• Captures execution of a dynamic instruction,
  including
  – PC and internal representation of PTX instruction
  – Set of active threads executing instruction
  – Kernel grid and CTA dimensions
  – Memory addresses and size of transfer
  – Branch target(s)
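
To make the list above concrete, here is a rough sketch of a record carrying the same kinds of information. It is an assumption-laden stand-in, not the layout of Ocelot's `TraceEvent`: every field name, the `Dim3` type, and both helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

struct Dim3 {
    unsigned x, y, z;
};

// Hypothetical stand-in for the data a trace event carries.
struct Event {
    std::size_t pc;                          // PC of the dynamic instruction
    std::vector<bool> active;                // one flag per thread in the CTA
    Dim3 gridDim;                            // kernel grid dimensions
    Dim3 blockDim;                           // CTA dimensions
    std::vector<std::uint64_t> addresses;    // one address per active thread
    std::size_t accessSize;                  // bytes transferred per thread
    std::vector<std::size_t> branchTargets;  // empty for non-branch instructions
};

// Number of threads that execute this dynamic instruction.
inline std::size_t activeThreads(const Event& e) {
    std::size_t n = 0;
    for (bool a : e.active) {
        if (a) ++n;
    }
    return n;
}

// Total bytes moved by a memory instruction across all active threads.
inline std::size_t bytesTransferred(const Event& e) {
    return e.addresses.size() * e.accessSize;
}

}  // namespace sketch
```

A trace generator that measures memory intensity could accumulate `bytesTransferred` per event and divide by the dynamic instruction count when the kernel finishes.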

Using Trace Generators
• Implement TraceGenerator interface
  – override methods:
     • initialize(), event(), postEvent(), finish()

• Add to Ocelot runtime
  – explicitly: ocelot::addTraceGenerator()
  – or, add to trace::TraceConfiguration
• Online analysis or serialize event traces
• Link applications with libOcelotTrace
PTX Emulator – CUDA Debugging – Illegal Memory Accesses
 // file: memoryCheck.cu
 __global__ void badMemoryReference(int *A) {
   A[threadIdx.x] = 0; // line 3 - faulting store
 }
 int main() {
   int *invalidPtr = (int *)0x0234;  // arbitrary pointer does not refer
                                     //   to an existing memory allocation

   int *validPtr = 0;
   cudaMalloc((void **)&validPtr, sizeof(int)*64);

   // assumed launch configuration: 1 CTA of 64 threads
   badMemoryReference<<< 1, 64 >>>( invalidPtr );
   return 0;
 }

 ==Ocelot== Ocelot PTX Emulator failed to run kernel
 "_Z18badMemoryReferencePi" with exception:
 ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0
     - Global memory access 0x234 is not within any allocated or mapped range.
 ==Ocelot==
 ==Ocelot== Nearby Device Allocations
 ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
 ==Ocelot==
 ==Ocelot== Near memoryCheck.cu:3:0
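
The bounds check behind a report like this can be sketched as a lookup against the list of live allocations. The block below is an illustrative sketch, not the MemoryChecker implementation: `Allocation` and `BoundsChecker` are hypothetical names, and only the address values are taken from the output above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// One device allocation: [base, base + size).
struct Allocation {
    std::uint64_t base;
    std::size_t size;
};

// Hypothetical checker: an access is valid only if it lies entirely
// inside a single known allocation.
class BoundsChecker {
public:
    void addAllocation(std::uint64_t base, std::size_t size) {
        allocations_.push_back({base, size});
    }

    bool isValid(std::uint64_t address, std::size_t bytes) const {
        for (const Allocation& a : allocations_) {
            // Written to avoid overflow in address + bytes.
            if (address >= a.base && bytes <= a.size &&
                address - a.base <= a.size - bytes) {
                return true;
            }
        }
        return false;
    }

private:
    std::vector<Allocation> allocations_;
};

}  // namespace sketch
```

With the 256-byte allocation at 0x12fa2e0 registered, a 4-byte store to 0x234 is rejected, as is a store that starts inside the allocation but runs past its end.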
PTX Emulator – CUDA Debugging – Race Detection
// file: raceCondition.cu
__global__ void raceCondition(int *A) {
  __shared__ int SharedMem[64];

  SharedMem[threadIdx.x] = A[threadIdx.x];

  // no synchronization barrier!

  A[threadIdx.x] = SharedMem[63 - threadIdx.x];        // line 9 - faulting load
}
. . .
// assumed launch configuration: 1 CTA of 64 threads
raceCondition<<< 1, 64 >>>( validPtr );
. . .

==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with
    exception:
==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252]
  - Shared memory race condition, address 0xfc was previously written by thread 63
    without a memory barrier in between.

==Ocelot== Near raceCondition.cu:9:0
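
One simple way to detect this kind of race is to remember, per shared-memory address, which thread last wrote it since the most recent barrier. The sketch below assumes that approach; it is not the RaceDetector source. `SharedRaceDetector` and its methods are hypothetical, and the sketch only covers write-then-read conflicts between different threads at word-aligned addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

namespace sketch {

// Hypothetical detector for write-then-read races on shared memory.
class SharedRaceDetector {
public:
    // Remember which thread wrote this address since the last barrier.
    void store(unsigned thread, std::uint32_t address) {
        lastWriter_[address] = thread;
    }

    // True if a different thread wrote the address with no barrier since.
    // If 'writer' is non-null, it receives the id of the writing thread.
    bool loadRaces(unsigned thread, std::uint32_t address,
                   unsigned* writer = nullptr) const {
        auto it = lastWriter_.find(address);
        if (it == lastWriter_.end() || it->second == thread) return false;
        if (writer) *writer = it->second;
        return true;
    }

    // A barrier orders all earlier writes before all later reads.
    void barrier() { lastWriter_.clear(); }

private:
    std::unordered_map<std::uint32_t, unsigned> lastWriter_;
};

}  // namespace sketch
```

Replaying the kernel above, each of 64 threads stores to address 4 * threadIdx.x, then thread 0 loads address 0xfc, which thread 63 wrote; the detector flags it. Calling `barrier()` between the stores and the load clears the conflict.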

PTX Emulator – Interactive Debugger
• Interactive command-line debugger implemented using
  TraceGenerator interface
• Enables
  – Inspection of application state
  – Single-stepping of instructions
  – Breakpoints and watchpoints
• Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically

$ ./TestCudaSequence
A_gpu = 0x16dcbe0
(ocelot-dbg) Attaching debugger to kernel 'sequence'
(ocelot-dbg)
(ocelot-dbg) watch global address 0x16dcbe4 s32[3]
set #1: watch global address 0x16dcbe4 s32[3]
  - 12 bytes
(ocelot-dbg)
(ocelot-dbg) continue
st.global.s32 [%r11 + 0], %r7
watchpoint #1 - CTA (0, 0)
  thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
  old value = -1
  new value = 2
  thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
  old value = -1
  new value = 4
  thread (3, 0, 0) - store to 0x16dcbec 4 bytes
  old value = -1
  new value = 6
break on watchpoint
(ocelot-dbg)
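
The watchpoint in this session covers 12 bytes (three s32 values) starting at 0x16dcbe4. Deciding whether a store triggers it reduces to an interval-overlap test, sketched below. `Watchpoint` and `hit` are hypothetical names and this is not the debugger's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

// Hypothetical watchpoint over the byte range [base, base + bytes).
struct Watchpoint {
    std::uint64_t base;
    std::size_t bytes;

    // True if the access [address, address + size) overlaps the watched range.
    bool hit(std::uint64_t address, std::size_t size) const {
        return address < base + bytes && base < address + size;
    }
};

}  // namespace sketch
```

With `Watchpoint{0x16dcbe4, 12}`, the 4-byte stores to 0x16dcbe4, 0x16dcbe8, and 0x16dcbec reported above all hit, while a 4-byte store to 0x16dcbe0 (the value printed for A_gpu) would not.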

Current Trace Generators

Trace Generator        Function
Branch                 Measures control flow uniformity and branch divergence
Instruction            Static and dynamic instruction count
Integrated Debugger    GDB-like interface
Kernel Dimension       Kernel grid and block dimensions
Machine Attributes     Observe and record machine characteristics
Memory                 Working set size, memory intensity, memory efficiency
Memory Checker         i) Bounds checks, ii) alignment checks, and iii) uninitialized
                       loads (except from texture or global memory)
Memory Race Detector   Race conditions on shared memory
Parallelism            MIMD and SIMD parallelism limits
Performance Bound      Compute and memory throughput
Shared Computation     Extent of data flow among threads
Warp Synchronous       Hot-paths/regions for warp synchronous execution

Hands-On Exercises

Running Ocelot on Keeneland
• Available as a pre-built binary
   – /lustre/medusa/smagg/tutorial-ocelot.tgz
       • ocelot/
       • keeneland-demo/
            – applications/
            – trace-generators/
       • readme.txt
   – Assumes:
       • 64-bit Linux platform
       • NVIDIA GPUs are available
       • PTX Emulator fallback

Configuring & Executing Applications
•   Edit configure.ocelot
     –   Controls Ocelot’s initial state
     –   Located in application’s startup directory
     –   trace specifies which trace generators are initially attached
     –   executive controls device properties
•   trace:
     –   memoryChecker – ensures
     –   raceDetector – enforces synchronized access to .shared
     –   debugger – interactive debugger
•   executive:
     –   devices:
             •   List of Ocelot backend devices that are enabled
             •   nvidia – NVIDIA GPU backend
             •   emulated – Ocelot PTX emulator (trace generators)
     –   not listed:
             •   llvm – efficient execution of PTX on multicore CPU
             •   amd – translation to AMD IL for PTX on AMD RADEON GPU

{
    trace: {
        memoryChecker: {
            enabled: true,
            checkInitialization: false
        },
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: true
        },
        debugger: {
            enabled: false,
            kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
            alwaysAttach: true
        },
    },
    executive: {
        devices: [ "emulated" ],
    }
}
Writing Custom Trace Generators
           Instruction Level Analysis Tools

Sample Trace Generator: SIMD Utilization
•   Implement TraceGenerator interface
     –   Override callback methods as needed
     –   Add counters and other data structures
     –   Example: compute SIMD utilization for variety of warp sizes
     –   ActivityFactorGenerator.{h, cpp}
•   Compile into static library
     –   Global constructor adds to applications during startup
     –   Selectively link with applications to test

     //
     // keeneland-demo/trace-generators/ActivityFactor.h
     //
     class ActivityFactorGenerator: public TraceGenerator {
     public:
         ActivityFactorGenerator();
         virtual ~ActivityFactorGenerator();

         virtual void initialize(const executive::ExecutableKernel& kernel);
         virtual void event(const TraceEvent & event);
         virtual void postEvent(const TraceEvent & event);
         virtual void finish();

     protected:
         const executive::EmulatedKernel *emulatedKernel;
         size_t invocation;

         size_t dynamicInstructionCount;
         std::vector< size_t > warpInstructionCount;
         std::vector< size_t > warpParticipatingThreadCount;
     };

Sample Trace Generator: initialize()
•   Override callbacks
     –   TraceGenerator::initialize() permits querying kernel launch configuration
     –   In this example, initialize counters according to number of participating threads
     //
     // keeneland-demo/trace-generators/ActivityFactor.cpp
     //
     void ActivityFactorGenerator::initialize(const executive::ExecutableKernel& kernel) {
         emulatedKernel = static_cast<const executive::EmulatedKernel *>(&kernel);

         dynamicInstructionCount = 0;
         warpInstructionCount.clear();
         warpParticipatingThreadCount.clear();

         for (size_t ws = 1; ws

Sample Trace Generator: event()
•   Override callbacks
     –   TraceGenerator::event() called before instruction is executed
     –   In this example, increment dynamic instruction counters
     void ActivityFactorGenerator::event(const TraceEvent & event) {
         ++dynamicInstructionCount;
         for (size_t lgWs = 0; lgWs < warpInstructionCount.size(); ++lgWs) {
             size_t ws = (1

Sample Trace Generator: finish()
•   Override callbacks
     –   TraceGenerator::finish() called when kernel execution complete
     –   In this example, save computed warp sizes to JSON document for post-processing
         void ActivityFactorGenerator::finish() {
             std::stringstream ss;
             ss
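
The initialize(), event(), and finish() listings above are incomplete, so the block below is not the tutorial's ActivityFactor.cpp. It is a self-contained sketch of one plausible way to compute SIMD utilization for several warp sizes, under stated assumptions: warp sizes are the powers of two up to a caller-supplied maximum, threads are grouped into warps by consecutive index, and a warp issues an instruction whenever at least one of its threads is active. The class name, method signatures, and JSON field names are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical SIMD-utilization counter, independent of Ocelot's headers.
class ActivityFactor {
public:
    // One counter pair per warp size 1, 2, 4, ... up to maxWarpSize.
    void initialize(std::size_t maxWarpSize) {
        dynamicInstructions_ = 0;
        warpSizes_.clear();
        warpInstructions_.clear();
        participatingThreads_.clear();
        for (std::size_t ws = 1; ws <= maxWarpSize; ws *= 2) {
            warpSizes_.push_back(ws);
            warpInstructions_.push_back(0);
            participatingThreads_.push_back(0);
        }
    }

    // One dynamic instruction; active[t] tells whether CTA thread t executes it.
    void event(const std::vector<bool>& active) {
        ++dynamicInstructions_;
        for (std::size_t i = 0; i < warpSizes_.size(); ++i) {
            const std::size_t ws = warpSizes_[i];
            for (std::size_t base = 0; base < active.size(); base += ws) {
                std::size_t count = 0;
                for (std::size_t t = base; t < base + ws && t < active.size(); ++t) {
                    if (active[t]) ++count;
                }
                if (count > 0) {  // this warp issues the instruction
                    ++warpInstructions_[i];
                    participatingThreads_[i] += count;
                }
            }
        }
    }

    // Fraction of SIMD lanes doing useful work for the i-th warp size.
    double utilization(std::size_t i) const {
        if (warpInstructions_[i] == 0) return 0.0;
        return static_cast<double>(participatingThreads_[i]) /
               (static_cast<double>(warpInstructions_[i]) * warpSizes_[i]);
    }

    // Serialize the results as a small JSON document.
    std::string finish() const {
        std::stringstream ss;
        ss << "{ \"dynamicInstructions\": " << dynamicInstructions_
           << ", \"utilization\": [";
        for (std::size_t i = 0; i < warpSizes_.size(); ++i) {
            if (i) ss << ", ";
            ss << "{ \"warpSize\": " << warpSizes_[i]
               << ", \"value\": " << utilization(i) << " }";
        }
        ss << "] }";
        return ss.str();
    }

private:
    std::size_t dynamicInstructions_ = 0;
    std::vector<std::size_t> warpSizes_;
    std::vector<std::size_t> warpInstructions_;
    std::vector<std::size_t> participatingThreads_;
};

}  // namespace sketch
```

For a 4-thread CTA that executes one instruction with threads 0 and 1 active and another with threads 0 and 2 active, warp size 1 is fully utilized, warp size 2 reaches 4/6, and warp size 4 reaches 4/8.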

Sample Trace Generator: Installation
•   Add via Ocelot API at program initialization time
     –   Global constructor invokes ocelot::addTraceGenerator() API call
     –   ActivityFactorGenerator is included whenever CUDA programs are executed by PTX emulator
    //
    // TraceConfiguration.cpp
    //
    #include "./ActivityFactor.h"
    #include 
    namespace trace {
    class ExternalTraceConfiguration {
    protected:
        ActivityFactorGenerator activityFactor;

    public:
        ExternalTraceConfiguration() {
            ocelot::addTraceGenerator(activityFactor, false);
        }
    };
    }
    static trace::ExternalTraceConfiguration singleton; // Global constructor called here
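
The global-constructor technique used here can be tried in isolation. The sketch below swaps Ocelot's runtime for a hypothetical registry so that it compiles on its own; `registry()`, `addGenerator()`, and the class names are stand-ins rather than Ocelot API. It shows only that a static object's constructor runs before main() and can therefore register a generator without any change to the application.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Stand-in for a trace generator base class.
class Generator {
public:
    virtual ~Generator() {}
};

// Stand-in registry. A function-local static is constructed on first use,
// which sidesteps static initialization order problems across files.
inline std::vector<Generator*>& registry() {
    static std::vector<Generator*> generators;
    return generators;
}

// Stand-in for the runtime's registration call.
inline void addGenerator(Generator& g) { registry().push_back(&g); }

class CountingGenerator : public Generator {};

// Owns a generator and registers it from its constructor.
class ExternalConfiguration {
public:
    ExternalConfiguration() { addGenerator(generator_); }

private:
    CountingGenerator generator_;
};

// Constructed during static initialization, before main() runs.
static ExternalConfiguration singleton;

}  // namespace sketch
```

Because registration happens in a static constructor, linking the object file into an application is enough to attach the generator, which is why the tutorial controls inclusion by selectively linking the static library.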

Sample Trace Generator: Build & Execute
•   Compile CUDA programs with NVCC
     –       Link with libocelot.so
     –       Link with libOcelotTrace.a to include custom trace generators
               •   Built-in trace generators (MemoryChecker, RaceDetector, Debugger) always included

         #
         # keeneland-demo/Makefile
         #

         libOcelotTraceSources = trace-generators/ActivityFactor.cpp trace-generators/TraceConfiguration.cpp
         libOcelotTrace.a: $(libOcelotTraceSources)
             g++ -c -I$(OcelotIncPath) $(libOcelotTraceSources) -Wall -std=c++0x
             ar rcs libOcelotTrace.a TraceConfiguration.o ActivityFactor.o

         #
         Deps = libOcelotTrace.a libsdk.a
         LIBFLAGS = -L. -L../ocelot/lib -lsdk -locelot -lOcelotTrace

         VectorAdd: $(Deps) applications/vectorAdd/vectorAdd.cu
             nvcc -o VectorAdd $(CXXFLAGS) applications/vectorAdd/vectorAdd.cu -arch=sm_20 $(LIBFLAGS)
Thank You
 http://code.google.com/p/gpuocelot/

Questions?