PERFORMANCE AND ENERGY ANALYSIS WITH THE SNIPER MULTI-CORE SIMULATOR


TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR,
KENZO VAN CRAEYNEST, MATHIJS ROGIERS
AND LIEVEN EECKHOUT

HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013
7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY

INTEL EXASCIENCE LAB

• Software and hardware for ExaFLOPS-scale machines
• Collaboration between Intel, imec and 5 Flemish universities
• Study Space Weather as an HPC workload

[Figure: lab activities: Space Weather Modeling, Visualization, Simulation Toolkit, Architectural Simulation, Hardware design]

GAINING APPLICATION INSIGHT – OVERVIEW

• Sniper Overview
• Interval Simulation and CPI Stacks
• Accuracy and Validation
• HW/SW Co-optimization
• Sniper Internals
    – Running Sniper
    – Validation Details
    – Sampled Simulation Details
    – Interval Simulation Details

THE SNIPER MULTI-CORE SIMULATOR
OVERVIEW

TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR,
KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

TRENDS IN PROCESSOR DESIGN: CORES

Number of cores per node is increasing
    – 2001: Dual-core POWER4
    – 2005: Dual-core AMD Opteron
    – 2011: 10-core Intel Xeon Westmere-EX
    – 2012: Intel MIC Knights Corner (60+ cores)

[Images: Westmere-EX (source: Intel); Xeon Phi (MIC) (source: Intel)]

DEMANDS ON SIMULATION ARE INCREASING

• Increasing cache sizes
    – Simulation requires realistic application working sets
    – Scaled-down applications might not exhibit the same behavior
• Increasing core counts
• Multi-threaded workloads
• New solutions are needed

[Figure: LLC cache sizes in Mbytes (0 to 60) over time (Jan-93 to Dec-14) for IPF and x86]
[Image: Xeon Phi (source: Intel)]

NODE-COMPLEXITY IS INCREASING

• Significant HPC node architecture changes
    – Increases in core counts
        • More, lower-power cores (for energy efficiency)
    – Increases in thread (SMT) counts
    – Cache-coherent NUMA
• Optimizing for efficiency
    – How do we analyze our current software?
    – How do we design our next-generation software?

[Image source: Wikimedia Commons]

OPTIMIZING TOMORROW’S SOFTWARE

• Design tomorrow’s processor using today’s hardware
• Optimize tomorrow’s software for tomorrow’s processors
• Simulation is one promising solution
    – Obtain performance characteristics for new architectures
    – Architectural exploration
    – Early software optimization

UPCOMING CHALLENGES

• Direct hardware execution does not provide insight
    – Reasons for a loss in performance are not always easy to determine
    – Does not allow for performance prediction
• Future systems will be diverse
    – Varying processor speeds
    – Varying failure rates for different components
    – Homogeneous applications show heterogeneous performance
• Software and hardware solutions are needed to solve these challenges
    – Handle heterogeneity (reactive load balancing)
    – Handle fault tolerance
    – Improve power efficiency at the algorithmic level (extreme data locality)
• Hard to model accurately with analytical models

FAST AND ACCURATE SIMULATION IS NEEDED

• Simulation use cases
    – Pre-silicon software optimization
    – Architecture exploration
• Cycle-accurate simulation is too slow for exploring multi/many-core design space and software
• Key questions
    – Can we raise the level of abstraction?
    – What is the right level of abstraction?
    – When to use these abstraction models?

FAST OR ACCURATE SIMULATION?

[Figure: two plots of performance versus architecture (A to E), one for a cycle-accurate simulator and one for a higher-abstraction-level simulator; some points are marked "?"]

THE ARCHITECTURE DESIGN WATERFALL

[Figure: design waterfall.
 Horizontal axis: design process (time).
 Vertical axis: # architectures considered (10^10, 10^5, 1000, 10, 1).
 Stages: analytical models; high-level simulation; cycle-accurate simulation.
 Workload labels: program characteristics; representative applications; microbenchmarks; traces; applications; benchmarks.
 Annotation: pre-silicon software optimization, co-design.]

SNIPER: A FAST AND ACCURATE SIMULATOR

• Hybrid simulation approach
    – Analytical interval core model
    – Micro-architecture structure simulation
        • branch predictors, caches (incl. coherency), NoC, etc.
• Hardware-validated, Pin-based
• Models multi/many-cores running multi-threaded and multi-program workloads
• Parallel simulator scales with the number of simulated cores
• Available at http://snipersim.org

TOP SNIPER FEATURES

•   Interval Simulation Core Model
•   Multi-threaded Application Sampling
•   CPI Stacks and Interactive Visualization
•   Parallel Multithreaded Simulator
•   x86-64 and SSE2 support
•   Validated against Core2, Nehalem
•   Thread scheduling and migration
•   Full DVFS support
•   Shared and private caches
•   Modern branch predictor
•   Supports pthreads and OpenMP, TBB, OpenCL, MPI, …
•   SimAPI and Python interfaces to the simulator
•   Many flavors of Linux supported (Redhat, Ubuntu, etc.)

SNIPER LIMITATIONS

• User-level
    – Perfect for HPC
    – Not the best match for workloads with significant OS involvement
• Functional-directed
    – No simulation / cache accesses along false paths
• High-abstraction core model
    – Not suited to model all effects of core-level changes
    – Perfect for memory subsystem or NoC work
• x86 only

SNIPER HISTORY

•   November 2011: SC’11 paper, first public release
•   May 2012, version 3.0: Heterogeneous architectures
•   November 2012, version 4.0: Thread scheduling and migration
•   December 2012, version 4.1: Visualization (2D and 3D)
•   April 2013, version 5.0: Multi-threaded application sampling
•   June 2013, version 5.1: Advanced visualization
•   Today: 400+ downloads from 45 countries

THE SNIPER MULTI-CORE SIMULATOR
INTERVAL SIMULATION & CPI STACKS

TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR,
KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

NEEDED DETAIL DEPENDS ON FOCUS

Component          Single-event time scale    Required sim time     Note
RTL                single clock cycle         millions of cycles    Too slow
OOO execution
Core memory ops
L1 cache access
LLC access
Off-socket         microseconds               seconds               Not accurate enough

INTERVAL MODEL

Out-of-order core performance model with in-order simulation speed

[Figure: effective dispatch rate over time, split into interval 1, interval 2 and interval 3 by miss events (I-cache miss, branch misprediction, long-latency load miss)]

D. Genbrugge et al., HPCA’10
S. Eyerman et al., ACM TOCS, May 2009
T. Karkhanis and J. E. Smith, ISCA’04, ISCA’07
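The interval idea can be sketched in a few lines: between miss events the core dispatches at its effective rate, and each miss event adds a penalty before dispatch resumes. This is a minimal first-order sketch and not Sniper's implementation; the class name, function name, event kinds and all penalty values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class MissEvent:
    """One miss event that ends an interval (kinds and penalties are assumed)."""
    kind: str            # e.g. "icache", "branch", "llc-load"
    penalty_cycles: int  # cycles lost before dispatch resumes


def interval_cycles(num_instructions, dispatch_width, events):
    """First-order estimate: base dispatch time plus the miss penalties."""
    # Ceiling division: cycles needed to dispatch every instruction
    # at the effective dispatch rate when no miss event occurs.
    base = -(-num_instructions // dispatch_width)
    # Each miss event stalls dispatch for its own penalty.
    return base + sum(e.penalty_cycles for e in events)


# Hypothetical instruction stream with one event of each kind.
events = [
    MissEvent("icache", 10),
    MissEvent("branch", 15),
    MissEvent("llc-load", 200),
]
cycles = interval_cycles(4000, 4, events)
print(cycles, cycles / 4000)  # total cycles and the resulting CPI
```

The sketch simply adds every penalty. Long-latency loads that overlap in time (MLP) would need to share a penalty, which is left out here.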

KEY BENEFITS OF THE INTERVAL MODEL

•      Models superscalar OOO execution
•      Models impact of ILP
•      Models second-order effects: MLP
•      Allows for constructing CPI stacks

CYCLE STACKS

• Where did my cycles go?
• CPI stack
    – Cycles per instruction
    – Broken up in components
• Normalize by either
    – Number of instructions (CPI stack)
    – Execution time (time stack)
• Different from miss rates: cycle stacks directly quantify the effect on performance

[Figure: CPI stack with components Base, Branch, I-cache, L2 cache]
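The two normalizations can be illustrated with a short sketch. The cycle counts and the instruction count are made-up values, and the function names are assumptions for illustration; they are not part of Sniper's tooling.

```python
def cpi_stack(cycles_by_component, num_instructions):
    """Normalize per-component cycles by the instruction count (CPI stack)."""
    return {name: c / num_instructions for name, c in cycles_by_component.items()}


def time_stack(cycles_by_component):
    """Normalize per-component cycles by the total cycle count (time stack)."""
    total = sum(cycles_by_component.values())
    return {name: c / total for name, c in cycles_by_component.items()}


# Hypothetical cycle counts for the components named in the figure.
cycles = {"base": 500, "branch": 100, "icache": 50, "l2": 350}
instructions = 2000

cpi = cpi_stack(cycles, instructions)
frac = time_stack(cycles)
print(cpi)   # per-component CPI; the values sum to the total CPI
print(frac)  # fraction of execution time per component; sums to 1
```

Both stacks carry the same information at different scales: the CPI stack components add up to the overall CPI, while the time stack components add up to 100% of execution time.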

CYCLE STACKS FOR PARALLEL APPLICATIONS

By thread: heterogeneous behavior in a homogeneous application?

[Figure: memory hierarchy diagram with labels L1 (x8), L2 (x8), L3 data, L3, DRAM]

USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR

• Scale input: application becomes DRAM bound
• Scale core count: sync losses increase to 20%

THE SNIPER MULTI-CORE SIMULATOR
SIMULATOR ACCURACY AND HARDWARE VALIDATION

TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR,
KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

HARDWARE VALIDATION

• Why validation?
    – Debugging
    – Verifying modeling assumptions
    – Balance between accuracy and generality
        • e.g.: loop buffer in Nehalem/Westmere; uop-cache in Sandy Bridge
• Current status:
    – Validated against Core2 (internal, results @ SC’11)
    – Nehalem ongoing (public version)

EXPERIMENTAL SETUP: ARCHITECTURE

[Figure: simulated architecture with per-core L1-I and L1-D caches, L2 caches, L3 caches, and DRAM]

INTERVAL PROVIDES NEEDED ACCURACY

The interval core model provides consistent accuracy (25% average absolute error) with a minimal slowdown.
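The slide does not define how the average absolute error is computed. A minimal sketch, assuming it is the relative error between simulated and measured values averaged over all benchmarks, could look as follows; the function name and the sample runtimes are hypothetical.

```python
def average_absolute_error(simulated, measured):
    """Mean of |simulated - measured| / measured over all benchmarks."""
    if len(simulated) != len(measured) or not measured:
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(s - m) / m for s, m in zip(simulated, measured)]
    return sum(errors) / len(errors)


# Hypothetical runtimes in seconds for three benchmarks.
measured = [10.0, 20.0, 40.0]
simulated = [12.0, 15.0, 44.0]
error = average_absolute_error(simulated, measured)
print(f"{error:.1%}")
```

Taking the absolute value first keeps over- and under-estimates from cancelling each other in the average.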

APPLICATION OPTIMIZATION

• Splash2-Raytrace shows very bad scaling behavior
• CPI stack shows why: heavy lock contention
• Conversion to use locked increment instruction helps

SIMULATOR PERFORMANCE

Sniper currently scales to 2 MIPS

Typical simulators run at 10s-100s KIPS, without scaling
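To put these rates in perspective, the wall-clock time for a fixed workload can be worked out directly. The workload size of one billion instructions is an arbitrary assumption chosen for illustration, and the function name is hypothetical.

```python
def simulation_seconds(instructions, instructions_per_second):
    """Wall-clock seconds needed to simulate a number of instructions."""
    return instructions / instructions_per_second


workload = 1e9  # hypothetical workload: one billion instructions
rates = [("10 KIPS", 10e3), ("100 KIPS", 100e3), ("2 MIPS", 2e6)]
for label, rate in rates:
    hours = simulation_seconds(workload, rate) / 3600
    print(f"{label:>8}: {hours:.2f} hours")
```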

MANY-CORE SIMULATIONS

High simulation speed up to 1024 simulated cores
     – Efficient simulation: L1-based benchmarks execute faster
     – Host system: dual-socket Xeon X5660 (6-core Westmere), 96 GB RAM

POWER-AWARE HW/SW CO-OPTIMIZATION

TREVOR E. CARLSON, WIM HEIRMAN,
KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

POWER-AWARE HW/SW CO-OPTIMIZATION

  • Hooked up McPAT (Multi-Core Power, Area, Timing framework) to Sniper’s output statistics
  • Evaluate different architecture directions (45nm to 22nm) with near-constant area
  • Compare performance, energy efficiency [Heirman et al., PACT 2012]

[Figure: core and cache floorplans for the baseline (2x quad-core) and four alternatives: 8 cores; 16 cores, no L3, stacked DRAM; 16 slow cores; 16 thin cores]

POWER-AWARE HW/SW CO-OPTIMIZATION

               Baseline       8-core       3D           Low-frequency   Dual-issue
#Cores         2 x 4          8            16           16              16
Frequency      2.66 GHz       3.059 GHz    3.059 GHz    1.8 GHz         3.059 GHz
Voltage        1.2 V          1.2 V        1.2 V        1.025 V         1.2 V
Issue width    4              4            4            4               2
ROB size       128            128          128          128             32
L2/core        256 KB         512 KB       256 KB       256 KB          256 KB
L3             2 x 8 MB       32 MB        -            2 x 8 MB        2 x 8 MB
Area           2 x 243 mm²    151 mm²      181 mm²      208 mm²         187 mm²
Max. power     2 x 99 W       80 W         130 W        58 W            102 W
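The low-frequency configuration lowers both frequency and voltage. As a rough illustration of why that pays off, the sketch below applies the textbook first-order relation that dynamic power scales with frequency times voltage squared. This relation is an assumption brought in for illustration and is not how the table values were obtained: it ignores static power, and the configurations also differ in core count and caches. The function name is hypothetical.

```python
def dynamic_power_ratio(f_new, v_new, f_ref, v_ref):
    """Ratio of dynamic power, assuming P is proportional to f * V^2."""
    return (f_new / f_ref) * (v_new / v_ref) ** 2


# Frequency (GHz) and voltage (V) taken from the table:
# low-frequency cores (1.8 GHz, 1.025 V) versus 3.059 GHz, 1.2 V cores.
ratio = dynamic_power_ratio(1.8, 1.025, 3.059, 1.2)
print(f"per-core dynamic power ratio: {ratio:.2f}")
```

Under this assumption a core at 1.8 GHz and 1.025 V draws a bit under half the dynamic power of a core at 3.059 GHz and 1.2 V.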

POWER-AWARE HW/SW CO-OPTIMIZATION

• Heat transfer: stencil on regular grid
    – Used in the ExaScience Lab as component of Space Weather modeling
    – Important kernel, part of Berkeley Dwarfs (structured grid)
• Improve memory locality: tiling over multiple time steps
    – Trade off locality with redundant computation
    – Optimum depends on relative cost (performance & energy) of computation, data transfer → requires integrated simulator

[Figure 7: Illustration of three iterations of the heat transfer simulation, applied to an 8×8 tile. To satisfy the data dependencies up to the third step of the stencil, redundant computations (on the darker dots) are performed at the boundaries of the tile. From [11].]

[Figure: performance (GFLOP/s) versus arithmetic intensity (FLOP/byte), bounded by peak memory bandwidth and peak floating-point performance, with redundant computation marked. Series: total performance; useful performance (256² tiles); useful performance (128² tiles).]
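The trade-off between locality and redundant computation can be sketched with two small models. The first is the roofline bound suggested by the performance-versus-arithmetic-intensity plot: attainable performance is the lower of the compute peak and bandwidth times arithmetic intensity. The second counts redundant boundary work for tiling over several time steps, assuming a radius-1 stencil on a square tile. The machine numbers, the fixed arithmetic intensity and the function names are illustrative assumptions, not values from the study.

```python
def roofline(arithmetic_intensity, peak_gflops, peak_gbytes_per_s):
    """Attainable GFLOP/s: the lower of the compute and bandwidth limits."""
    return min(peak_gflops, arithmetic_intensity * peak_gbytes_per_s)


def computed_points(tile, steps):
    """Grid points computed for one tile over `steps` fused time steps.

    Assumes a radius-1 stencil on a square tile: to produce a tile x tile
    result after `steps` steps, step i has to cover a region that is
    (steps - i) points wider on every side.
    """
    return sum((tile + 2 * (steps - i)) ** 2 for i in range(1, steps + 1))


def useful_fraction(tile, steps):
    """Share of the computed points that is not redundant boundary work."""
    return steps * tile ** 2 / computed_points(tile, steps)


# Hypothetical machine: 64 GFLOP/s peak and 16 GB/s memory bandwidth.
# The arithmetic intensity is held at 2 FLOP/byte for both tile sizes,
# which is a simplification.
for tile in (128, 256):
    total = roofline(2.0, 64.0, 16.0)
    useful = total * useful_fraction(tile, steps=3)
    print(tile, total, round(useful, 2))
```

In this model larger tiles waste a smaller share of work on their boundaries, while fusing more time steps raises both data reuse and redundant work. Finding the best balance is what the bullet on relative cost describes.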

POWER-AWARE HW/SW CO-OPTIMIZATION
• Match tile size to L2 size, find optimum between locality and redundant work – depending on their (performance/energy) cost
• Isolated optimization:
   – Fix HW architecture, explore SW parameters
   – Fix SW parameters, explore HW architecture
• Co-optimization yields 1.66x more performance, or 1.25x more energy efficiency, than isolated optimization

[Figure: (a) Performance (simulated time steps per second), y-axis Steps/time (1/s); (b) Energy efficiency (simulated time steps per Joule), y-axis Steps/Energy (1/J). Panels: 8-core, 3D, low-frequency, dual-issue. x-axis: Arithmetic intensity (FLOP/byte), 0 to 4, with a second tick row of 32, 64, 128, 256, 512]

SAMPLED SIMULATION OF MULTI-THREADED APPLICATIONS

TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
  
                                                                            	
  

OVERVIEW
• How can we create a representative sample of a multi-threaded application?
• Prior Work
• Key Contributions of this Work
• Results and Evaluation

WORKLOAD REDUCTION IS THE KEY
• Many workload reduction techniques exist today
   – Reduction
      • Smaller input sizes
      • Reduced numbers of iterations
   – Sampling: only part of the workload needs to be simulated in detail, whole-program performance can be extrapolated
      • SimPoint
      • SMARTS
      • FlexPoints

SAMPLING MULTI-THREADED WORKLOADS
• Define: synchronizing multi-threaded application
   – Use locks (mutexes), barriers, etc.
   – Application where multiple threads are working to solve a problem together
• Multi-threaded application complexities
   – We want to determine application runtime, not CPI
   – Can be different performance per thread (e.g. NUMA, load imbalance)
   – Instruction count cannot be used to determine fast-forward length (per-thread CPI, thread idle time)

MULTI-THREADED SAMPLING
• Goal
   – Reduce multi-threaded application simulation time
   – Accurately predict application runtime
• Key Contributions
   – Sampling in time is a requirement for sampling simulation of multi-threaded applications
   – Take into account thread details during fast-forwarding
      • Thread synchronization (mutexes, barriers, etc.)
      • Per-thread CPI
   – Application phase behavior is critical for accurate sampling

CURRENT SAMPLING SOLUTIONS
• Current multi-threaded solutions are not sufficient
   – Flex Points
      • Specifically designed for non-synchronizing throughput (server) workloads
      • Issue: Assumes no correlation between threads
   – COTSon’s Dynamic Sampling (Argollo et al., Ryckbosch et al.)
      • Issue: Doesn’t properly handle synchronization during fast-forwarding

Wenisch et al., IEEE MICRO 2006
Argollo et al., ACM SIGOPS Operating Systems Review

MULTITHREADED FAST-FORWARDING
• Use time as the base unit for sampling
• Propagate time from waker to waiter (as in detailed)
• Use instruction count as a low-overhead fast-forwarding method
• Use per-thread non-idle CPI from recent detailed interval

[Figure: traces labeled IPC 0 and IPC 1 over time, spanning a detailed, a fast-forward and a second detailed interval, with wait and wake events marked]
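The fast-forwarding rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not Sniper's implementation: the names `ThreadClock`, `fast_forward` and `wake` are hypothetical, time is kept in femtoseconds, and each thread's non-idle CPI is assumed to have been measured in its most recent detailed interval.

```python
class ThreadClock:
    """Per-thread simulated time during fast-forwarding (hypothetical helper)."""

    def __init__(self, non_idle_cpi, freq_ghz):
        self.non_idle_cpi = non_idle_cpi  # assumed: taken from the last detailed interval
        self.freq_ghz = freq_ghz
        self.time_fs = 0.0

    def fast_forward(self, instructions):
        # Functional execution only yields an instruction count, so convert it
        # to time: instructions * CPI = cycles, cycles / GHz = ns, ns * 1e6 = fs.
        cycles = instructions * self.non_idle_cpi
        self.time_fs += cycles / self.freq_ghz * 1e6


def wake(waker, waiter):
    # Propagate time from waker to waiter: the waiter cannot resume
    # earlier than the moment the waker released it.
    waiter.time_fs = max(waiter.time_fs, waker.time_fs)


t0 = ThreadClock(non_idle_cpi=0.5, freq_ghz=2.0)
t1 = ThreadClock(non_idle_cpi=2.0, freq_ghz=2.0)
t0.fast_forward(1000)      # 500 cycles at 2 GHz
t1.fast_forward(1000)      # 2000 cycles at 2 GHz
wake(waker=t1, waiter=t0)  # t0 was blocked until t1 reached the barrier
print(t0.time_fs, t1.time_fs)
```

Both threads executed the same number of instructions, yet the faster thread ends up at the slower thread's time after the wake-up. This is why the same instruction count can correspond to different elapsed times per thread.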

APPLICATIONS ARE PERIODIC

[Figure: npb-ft, class A, 8 threads]

IDENTIFY PERIODICITIES
• Application periodicities are identified in a micro-architecture-independent manner

[Figure, left: BBV Autocorrelation. npb-ft, class A, 8 threads, with 550k and 1.14M insn periodicities]
[Figure, right: OMP Call Structure. npb-lu, class A, 8 threads, with high variability (not used)]
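As a rough illustration of how autocorrelation exposes a period, the sketch below finds the dominant lag of a one-dimensional signal. It makes simplifying assumptions: a scalar per-interval metric stands in for the basic block vectors used on the slide, the function names are hypothetical, and the input is a synthetic sine wave rather than benchmark data.

```python
import math


def autocorrelation(signal, lag):
    """Normalized autocorrelation of a sequence at the given lag."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal)
    cov = sum((signal[i] - mean) * (signal[i + lag] - mean)
              for i in range(n - lag))
    return cov / var


def dominant_period(signal, max_lag):
    """Lag with the highest autocorrelation after the initial decay.

    Small lags always correlate strongly, so skip ahead until the
    autocorrelation first drops to zero or below, then take the maximum.
    """
    acf = [autocorrelation(signal, lag) for lag in range(max_lag + 1)]
    lag = 1
    while lag <= max_lag and acf[lag] > 0:
        lag += 1
    if lag > max_lag:
        return None  # no periodicity visible within max_lag
    return max(range(lag, max_lag + 1), key=lambda k: acf[k])


# Synthetic signal that repeats every 50 intervals.
signal = [math.sin(2 * math.pi * i / 50) for i in range(1000)]
print(dominant_period(signal, max_lag=120))
```

Because only the repetition in the signal is examined, no timing model is involved, which matches the micro-architecture-independent intent stated above.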

SAMPLING PROCESS
• Sampling sufficiently above or below the period will minimize error

[Figure labels: Periodicities, Runtime limit, Best Region, Good Region]

RESULTS
• Predicted Most-Accurate Results
   – Average absolute error of 3.5%
   – Average speedup of 2.9x, maximum of 5.8x

THE SNIPER MULTI-CORE SIMULATOR
RUNNING SIMULATIONS AND PROCESSING RESULTS

WIM HEIRMAN, TREVOR E. CARLSON, KENZO VAN CRAEYNEST, IBRAHIM HUR AND LIEVEN EECKHOUT
  
                                                                                                       	
  

OVERVIEW
• Obtain and compile Sniper
• Running
• Configuration
• Simulation results
• Interacting with the simulation
   – SimAPI: application
   – Python scripting

RUNNING SNIPER
• Download Sniper
   – http://snipersim.org/w/Download
      • Download tar.gz
      • Git clone
   ~/sniper$ export SNIPER_ROOT=$(pwd) #optional
   ~/sniper$ make
• Running an application
   ~/sniper$ ./run-sniper -- /bin/true
   ~/sniper/test/fft$ make run

RUNNING SNIPER
• Integrated benchmarks distribution
   – http://snipersim.org/w/Download_Benchmarks
   ~/benchmarks$ export BENCHMARKS_ROOT=$(pwd)
   ~/benchmarks$ make
   ~/benchmarks$ ./run-sniper -p splash2-fft \
                              -i small -n 4
• Standardizes input sets and command lines
• Includes SPLASH-2, PARSEC

REGION OF INTEREST
• Skip benchmark initialization and cleanup
• Mark code with ROI begin / end markers
   – SimRoiStart() / SimRoiEnd() in your own application
   – $ ./run-sniper --roi -- test/fft/fft
• Already done in benchmarks distribution
   – benchmarks/run-sniper implies --roi
   – Use --no-roi to override
• Cache warming during pre-ROI period
   – Use --no-cache-warming to override

MPI WORKLOADS
• Supports single-node shared-memory MPI
   – MPICH2 and derivatives (Intel MPI, etc.)
• Add --mpi to the Sniper options when running mpirun
   – Example:
     ~/sniper/test/mpi$ ../../run-sniper --mpi -n 4 \
         -c gainestown -- mpirun -np 4 ./pi
• Hybrid single-node MPI+OpenMP applications are also supported

[Image: Intel MPI, Source: Intel]

SIMULATION RESULTS
• Files created after each simulation:
   – sim.cfg: all configuration options used for this run (includes defaults, all -c and -g options)
   – sim.out: basic statistics (number of cycles, instructions per core, cache access and miss rates, …)
   – sim.stats[.sqlite3]: complete set of all recorded statistics at key points in the simulation (start, roi-begin, roi-end, stop)
• Use the sniper_lib Python package for parsing

SIMULATION RESULTS
sniper_lib.get_results() parses sim.cfg and sim.stats, and returns the configuration and the statistics (roi-end minus roi-begin) for all cores

~/sniper/tools$ python
>>> import sniper_lib
>>> results = sniper_lib.get_results(resultsdir = '..')
>>> print(results)
{'config': {'general/total_cores': '64',
            'perf_model/core/frequency': '2.66', …},
 'results': {'performance_model.instruction_count': [123],
             'performance_model.elapsed_time': [23000000], …}}

SIMULATION RESULTS
• Let’s compute the IPC for core 0
• Core frequency is variable (DVFS), so the cycle count has to be computed
   – Time is in femtoseconds, frequency in GHz

>>> instrs = results['results']['performance_model.instruction_count'][0]
>>> cycles = (results['results']['performance_model.elapsed_time'][0]
...           * float(results['config']['perf_model/core/frequency'])
...           * 1e-6)  # femtoseconds -> nanoseconds
>>> ipc = instrs / cycles
>>> print('%.1f' % ipc)
2.0
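The same calculation can be wrapped in a small helper that covers every core. This is a sketch under stated assumptions: `all_core_ipc` is a hypothetical name, the `sample` dictionary only mimics the layout printed on the slide so that the code runs without Sniper, and the configured frequency is assumed to hold for the whole interval (no DVFS changes).

```python
def all_core_ipc(results):
    """Return one IPC value per core from a get_results()-style dictionary."""
    stats = results['results']
    instrs = stats['performance_model.instruction_count']
    times_fs = stats['performance_model.elapsed_time']
    freq_ghz = float(results['config']['perf_model/core/frequency'])

    ipcs = []
    for count, time_fs in zip(instrs, times_fs):
        # femtoseconds -> nanoseconds, then ns * GHz = cycles
        cycles = time_fs * 1e-6 * freq_ghz
        # A core that never ran has zero elapsed time; report 0 instead of dividing.
        ipcs.append(count / cycles if cycles > 0 else 0.0)
    return ipcs


# Stand-in data shaped like the dictionary shown on the slide.
sample = {
    'config': {'general/total_cores': '64',
               'perf_model/core/frequency': '2.66'},
    'results': {'performance_model.instruction_count': [123, 0],
                'performance_model.elapsed_time': [23000000, 0]},
}
print(['%.2f' % ipc for ipc in all_core_ipc(sample)])
```

If the frequency changes during the run, the configured value no longer describes the whole interval, so the cycle count would have to come from another source.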

SIMULATION RESULTS
• CPI stacks (user of sniper_lib)
$ ./tools/cpistack.py [--time|--cpi|--abstime]

                     CPI     CPI %    Time %
Core 0
  depend-int        0.20    23.42%    23.42%
  depend-fp         0.16    18.94%    18.94%
  branch            0.12    14.04%    14.04%
  ifetch            0.04     4.16%     4.16%
  mem-l1d           0.21    24.41%    24.41%
  mem-l3            0.02     2.72%     2.72%
  mem-dram          0.05     5.73%     5.73%
  sync-mutex        0.02     2.59%     2.59%
  sync-cond         0.03     3.01%     3.01%
  other             0.01     0.97%     0.97%

  total             0.84   100.00%     0.00s
Core 1
  depend-int        0.20    23.92%    23.92%
  depend-fp         0.16    18.79%    18.79%
  branch            0.12    13.72%    13.72%
  mem-l1d           0.20    24.06%    24.06%
  mem-l3            0.06     6.79%     6.79%
  sync-mutex        0.04     5.22%     5.22%
  sync-cond         0.05     5.60%     5.60%
  other             0.02     1.89%     1.89%

  total             0.85   100.00%     0.00s
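The relation between the CPI and CPI % columns can be illustrated with a few lines of Python. This is a sketch, not part of cpistack.py: `cpi_stack_percentages` is a hypothetical helper, and the inputs are the two-decimal Core 0 values printed above, so the computed percentages differ slightly from the table, which was presumably produced from unrounded values.

```python
def cpi_stack_percentages(components):
    """Express each CPI component as a percentage of the total CPI."""
    total = sum(components.values())
    return {name: 100.0 * cpi / total for name, cpi in components.items()}


# Core 0 components as printed on the slide (already rounded to two decimals).
core0 = {
    'depend-int': 0.20, 'depend-fp': 0.16, 'branch': 0.12, 'ifetch': 0.04,
    'mem-l1d': 0.21, 'mem-l3': 0.02, 'mem-dram': 0.05,
    'sync-mutex': 0.02, 'sync-cond': 0.03, 'other': 0.01,
}

percentages = cpi_stack_percentages(core0)
# Print the components from largest to smallest contribution.
for name, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print('%-12s %6.2f%%' % (name, pct))
```

Sorting by contribution puts the largest component first, which for Core 0 is the L1 data cache component.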

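INTERACTING WITH SNIPER

[Diagram: the application (built from binary, input/cmdline and configuration) talks to the Sniper simulator through SimAPI; Python scripts also interact with the Sniper simulator; the simulator produces statistics, which feed visualization]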

APPLICATION SIMAPI
• Calling simulator API functions from your C program
   #include
   – SimInSimulator()
      • Return 1 when running inside Sniper, 0 when running natively
   – SimGetProcId()
      • Return processor number of caller
   – SimRoiStart() / SimRoiEnd()
      • Start/end detailed mode (when using ./run-sniper --roi)
   – SimSetFreqMHz(proc, mhz) / SimGetFreqMHz(proc)
      • Set / get processor frequency (integer, in MHz)
   – SimUser(cmd, arg)
      • User-defined function

PYTHON SCRIPTING
• Low-level script
• Execute “foo” at each barrier synchronization

import sim_hooks
def foo(t):
  # Called with the current simulated time at each barrier synchronization
  print('The time is now %s' % t)
sim_hooks.register(sim_hooks.HOOK_PERIODIC, foo)

PYTHON SCRIPTING
• Access configuration, statistics, DVFS
• Live periodic IPC trace:
   – See scripts/ipctrace.py for a more complete example

import sim

class IPCTracer:
  def setup(self, args):
    # Call self.periodic once per simulated microsecond
    sim.util.Every(1*sim.util.Time.US, self.periodic)
    self.instrs_prev = 0
  def periodic(self, t, t_delta):
    freq = sim.dvfs.get_frequency(0)
    cycles = t_delta * freq * 1e-9  # fs * MHz -> cycles
    instrs = int(sim.stats.get('performance_model', 0,
                               'instruction_count'))
    print('IPC = %s' % ((instrs - self.instrs_prev) / cycles))
    self.instrs_prev = instrs

INTERVAL CORE SIMULATION DETAILS

TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
  
                                                                                  	
  

OVERVIEW
• Simulation Methodologies
   – Trace, Integrated, Functional-directed
• Core Models
   – One-IPC
   – Interval
• Interval Model and Simulation Detail
• CPI-Stacks

ONE-IPC MODELING – TOO SIMPLE?
• Simple high-abstraction model, often used in uncore studies
• Alternative for memory access traces
   – Aims to provide more-realistic access patterns
   – Allows for timing feedback
• But: One-IPC core models do not exhibit ILP/MLP
   – Memory request rates are not as accurate as more detailed simulators
   – # outstanding requests incorrect: underestimate required queue sizes
   – No latency is hidden: overestimate runtime improvements

CONSTRUCTING CPI STACKS
• Interval simulation: track why time is advanced
   – No miss events
      • Dispatch instructions at base CPI
      • Increment base component
   – Miss event
      • Fast-forward time by X cycles
      • Increment component by X

[Figure: stacked CPI bar with components L2 cache, I-cache, Branch, Base]
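The bookkeeping described above can be sketched in a few lines. This is a deliberately simplified illustration, not Sniper's interval model: the function name, the event names and the fixed per-event penalties are all hypothetical, and overlap between miss events is ignored.

```python
from collections import defaultdict


def build_cpi_stack(events, base_cpi, penalties):
    """Accumulate cycles per component and normalize by instruction count.

    events: one entry per dispatched instruction, either None (no miss
            event) or the name of the miss event that instruction caused.
    """
    cycles = defaultdict(float)
    count = 0
    for event in events:
        count += 1
        cycles['base'] += base_cpi             # dispatch at base CPI
        if event is not None:
            cycles[event] += penalties[event]  # fast-forward time by X cycles
    return {name: c / count for name, c in cycles.items()}


# Assumed penalties in cycles, chosen for illustration only.
penalties = {'branch': 15, 'l2-cache': 200}

# 1000 instructions: 10 branch mispredictions and 2 L2 misses.
events = [None] * 988 + ['branch'] * 10 + ['l2-cache'] * 2

stack = build_cpi_stack(events, base_cpi=0.25, penalties=penalties)
for name, cpi in stack.items():
    print('%-10s %.2f' % (name, cpi))
print('%-10s %.2f' % ('total', sum(stack.values())))
```

Because every advance of simulated time is charged to exactly one component, the components add up to the total CPI by construction.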

INTERVAL SIMULATION FROM 30,000 FEET

[Diagram: core pipeline with Fetch (I$, BP), Decode, Issue Queue, Execution Units, LSQ, ROB and Commit, connected to the Cache Hierarchy and DRAM]

Interval simulation considers instructions (in-order) at dispatch
• dispatch not possible
   – Instruction cache / TLB miss
   – Branch misprediction (not dispatching useful instructions)
   – Front-end refill after misprediction
   – ROB full: long-latency miss at head of ROB

INTERVAL SIMULATION FROM 30,000 FEET

[Diagram: same core pipeline as on the previous slide]

Interval simulation considers instructions (in-order) at dispatch
• dispatch not possible
• dispatch possible: at rate governed by ROB
   – Little’s law: progress rate = #elements / time spent in queue
   – Computed using ROB fill and critical path through ROB
      • Computed using dynamic instruction dependencies and latencies
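A small worked example may help make the Little's law estimate concrete. The sketch below is an assumption-laden illustration, not Sniper code: the function names, the window encoding (latency plus producer indices) and the cap at a dispatch width are all choices made here for clarity.

```python
def critical_path(window):
    """Longest dependency chain, in cycles, through the instruction window.

    window: list of (latency, producers) tuples in program order, where
            producers holds indices of earlier instructions in the window.
    """
    finish = []
    for latency, producers in window:
        start = max((finish[p] for p in producers), default=0)
        finish.append(start + latency)
    return max(finish, default=0)


def dispatch_rate(window, dispatch_width):
    """Little's law: rate = elements in the queue / time spent in the queue."""
    path = critical_path(window)
    if path == 0:
        return float(dispatch_width)
    return min(float(dispatch_width), len(window) / float(path))


# Six instructions in the window; instruction 1 depends on 0, 2 on 1, 4 on 3.
window = [
    (1, []),   # 0
    (3, [0]),  # 1: finishes at cycle 4
    (1, [1]),  # 2: finishes at cycle 5, end of the critical path
    (1, []),   # 3
    (2, [3]),  # 4
    (1, []),   # 5
]
print(critical_path(window))
print(dispatch_rate(window, dispatch_width=4))
```

With six instructions in flight and a five-cycle critical path, the estimated rate is 1.2 instructions per cycle, well below the assumed width of 4. A window of independent instructions would instead be limited by the width.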

LONG BACK-END MISS EVENTS
ISOLATED LONG-LATENCY LOAD

[Figure] S. Eyerman et al., ACM TOCS, May 2009

LONG BACK-END MISS EVENTS
OVERLAPPING LONG-LATENCY LOADS

[Figure] S. Eyerman et al., ACM TOCS, May 2009

CORE MODELS (ONGOING)
• Key question: required accuracy vs. simulation speed / simulator complexity
   – Cycle-accurate memory request stream for uncore studies
   – Accurate performance impact (overlap) of memory latency
   – Implementation complexity when making changes (research)
• Interval model
   – Issue contention (structural hazards)
   – Cycle-accurate memory hierarchy support
• ROB-based model
   – Free issue contention and cycle-driven memory support
   – Higher accuracy, slower (~2x total)

SAMPLED SIMULATION OF MULTI-THREADED APPLICATIONS

TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT

MULTITHREADED FAST-FORWARDING
• Use time as the base unit for sampling
    – Time is common across threads, unlike instructions
• Use instruction count as a low-overhead fast-forwarding method
    – Functional-execution only provides instruction count, but we still require time for fast-forwarding
• Use per-thread non-idle CPI from previous detailed interval

[Figure: IPC over time, with detailed, fast-forward and detailed intervals marked along the time axis]
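The conversion between time and instruction count during fast-forwarding can be sketched as follows. Each thread's non-idle CPI, measured in the previous detailed interval, turns a fast-forward duration in cycles into a per-thread instruction budget. The function names, thread ids and CPI values are hypothetical; this only illustrates the arithmetic, not Sniper's implementation.

```python
def instruction_budgets(ff_cycles, cpi_per_thread):
    """Instructions each thread should execute to cover ff_cycles of time.

    cpi_per_thread: mapping of thread id to the non-idle CPI observed
        in the previous detailed interval.
    """
    return {tid: int(ff_cycles / cpi) for tid, cpi in cpi_per_thread.items()}


def elapsed_cycles(instructions, cpi):
    """Time estimate for a thread that executed `instructions` functionally."""
    return instructions * cpi


# Thread 0 ran at CPI 0.5 and thread 1 at CPI 2.0 in the last detailed interval.
budgets = instruction_budgets(1_000_000, {0: 0.5, 1: 2.0})
print(budgets)
```

Both threads advance by the same amount of time even though their instruction counts differ by a factor of four, which is the reason for sampling in time rather than in instructions.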

SAMPLE SELECTION

It is possible to get good accuracy at high speed, but not reliably

[Figure: axes labeled Detailed (D) and fast-forward (F/D)]

MAIN PROBLEM: ALIASING
• When application exhibits periodicity near detailed interval length, aliasing errors
    [Figure: IPC over time with detailed intervals marked. Near one period: average not OK. Exactly one period: average OK.]
• New problem to multi-threaded sampling:
    – SMARTS uses >10,000 sampling units: average IPC is obtained
    – SimPoint sampling units can still alias application periods
    – Key insight: we need single sample accuracy for fast-forward IPC
• Sampling parameters determined by application periodicity
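The aliasing effect can be reproduced with a toy example. A hypothetical application alternates between 5 time units at IPC 2.0 and 5 at IPC 1.0, so its true mean IPC is 1.5. A detailed interval of exactly one period always measures 1.5, while an interval of 8 units (near one period) measures a value that depends on where it starts. The pattern and window lengths are made up for illustration.

```python
def window_mean_ipc(pattern, window, phase):
    """Mean IPC over `window` time units, starting at offset `phase`
    into a repeating per-time-unit IPC pattern."""
    n = len(pattern)
    total = sum(pattern[(phase + i) % n] for i in range(window))
    return total / window


pattern = [2.0] * 5 + [1.0] * 5   # one application period, true mean 1.5

# Detailed interval equal to the period: the same result at every phase.
exact = {window_mean_ipc(pattern, 10, p) for p in range(10)}
# Detailed interval near the period: the result depends on the phase.
near = [window_mean_ipc(pattern, 8, p) for p in range(10)]

print(exact)
print(min(near), max(near))
```

A single sample taken with the 8-unit window can be off by more than 8% in either direction, which matters because each sample's IPC drives the following fast-forward interval.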

IDENTIFY PERIODICITIES
• We do this in an architecture-independent way
• Sampling sufficiently above or below the period will minimize error

[Figure legend: D = Detailed period, F = Fast-forward (multiple of D)]

EXPERIMENTAL SETUP
• Sniper Multi-core Simulator
   – Nehalem-style architecture
       • 2 sockets, 4 cores per socket
       • 2.66 GHz, 128-entry ROB
       • 32 KB L1-I, 32 KB L1-D, 256 KB L2 per core, 8 MB L3 per 4 cores
• Benchmarks
   – NAS Parallel Benchmarks 3.3.1, class A inputs
   – Parsec 2.1, simlarge input set
   – SPEC OMP2001, train input set

THREAD SYNCHRONIZATION COMPARISON

Even with oracle per-thread CPI knowledge up-front, our proposed methodology provides a more accurate solution

MULTI-THREADED SAMPLING
• Key Contributions
   – Sampling in time is a requirement for sampling simulation of multi-threaded applications
   – Take into account thread details during fast-forwarding
       • Thread synchronization
       • Per-thread CPI
   – Taking into account application phase behavior is critical for accurate sampling
• Predicted Most-Accurate Results
   – Average absolute error of 3.5% across applications
   – Average speedup of 2.9x, maximum of 5.8x
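The two headline metrics can be computed as in the sketch below. It assumes that error is the relative difference between the runtime predicted by sampled simulation and the runtime from full detailed simulation, and that speedup is the ratio of simulation wall-clock times; the slides do not spell out these definitions. All numbers are made up for illustration and are not results from the talk.

```python
def average_absolute_error(predicted, reference):
    """Mean of |predicted - reference| / reference over all applications."""
    errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)


def speedups(full_sim_seconds, sampled_sim_seconds):
    """Per-application simulation speedup of sampled over full simulation."""
    return [f / s for f, s in zip(full_sim_seconds, sampled_sim_seconds)]


# Illustrative values only: predicted vs. reference runtimes for two apps.
err = average_absolute_error([103.0, 96.0], [100.0, 100.0])
# Illustrative simulation times in seconds for the same two apps.
sp = speedups([600.0, 900.0], [200.0, 300.0])

print(round(err * 100, 1), "% average absolute error")
print(sum(sp) / len(sp), "average speedup,", max(sp), "maximum")
```

Taking the absolute value before averaging prevents over- and under-predictions from cancelling out.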

MULTI-THREADED SAMPLING RELEASE
• Sniper 5.0 Release
    – Multi-threaded sampling infrastructure
    – Available from:
        • http://snipersim.org

Interval core model, CPI-stacks, advanced visualization support, automatic topology generation, parallel multi-threaded simulator, multi-program and multi-threaded application support, x86 and x86-64 support, hardware validated, full DVFS support, shared and private cache support, scheduling support, heterogeneous configuration, modern branch predictor, OpenMP, MPI, TBB, OpenCL, integrated benchmarks, SPLASH-2, most of Parsec, McPAT integration, SimAPI, Python scripting, single-option debugging, modern OS support, Pin-based, statistics database, stackable configurations

RESULTS
• Predicted Fastest Results
   – Average speedup of 3.8x, maximum of 8.4x
   – Average absolute error of 5.1%

THE SNIPER MULTI-CORE SIMULATOR
SIMULATOR ACCURACY AND HARDWARE VALIDATION

TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT

EXPERIMENTAL SETUP
• Benchmarks
   – Complete SPLASH-2 suite
       • 1 to 16 threads
       • Linux pthreads API
   – Extensive use of microbenchmarks to tune parameters and track down problems
• Hardware
   – Four-socket Intel Xeon X7460 machine
   – Core2 (45nm, Penryn) with 6 cores/socket

INTERVAL: GOOD OVERALL ACCURACY

Good accuracy for the entire benchmark suite

INTERVAL: BETTER RELATIVE ACCURACY
• Application scalability is affected by memory bandwidth
• Interval model provides more realistic memory request streams, which results in a more accurate scaling prediction

VALIDATING FOR NEHALEM

[Figure: two charts of HW Measurement / Sniper Result, y-axis from 0 to 2]

REFERENCES
• Sniper website
   – http://snipersim.org/
• Download
   – http://snipersim.org/w/Download
   – http://snipersim.org/w/Download_Benchmarks
• Getting started
   – http://snipersim.org/w/Getting_Started
• Questions?
   – http://groups.google.com/group/snipersim
   – http://snipersim.org/w/Frequently_Asked_Questions

GAINING INSIGHT INTO PROGRAM PERFORMANCE

TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
