Running NIMROD on PlayStation 3 - Initial Efforts Ping Zhu K. Germaschewski (UNH)

Running NIMROD on PlayStation 3
               Initial Efforts

                  Ping Zhu
       University of Wisconsin-Madison

             in collaboration with
        K. Germaschewski (UNH)
A. Hammond and C. R. Sovinec (UW-Madison)

       NIMROD Summer Meeting (GA)
           August 27-28, 2008
Roadrunner: fastest supercomputer (as of June 2008) [http://www.top500.org]

  Roadrunner breaks petaflops barrier for first time (05/25/2008)
CBE: Cell Broadband Engine

   - 1 PPE: Power Processing Element
   - 8 SPEs: Synergistic Processing Elements
   - Asymmetric multicore architecture
Hybrid computing systems have made petaflops happen

   - Traditional clusters have difficulty reaching a petaflop (PF):
       - limited processor core performance
       - limits on network size
       - programming challenges
   - Hybrid architecture: traditional cluster + accelerators
   - Roadrunner:
       - Accelerator: CBE (IBM PowerXCell 8i)
       - 6,480 dual-core AMD Opterons, 12,960 IBM PowerXCell 8i
       - 1.026 petaflops sustained on LINPACK (http://www.lanl.gov/orgs/hpc/roadrunner)
Roadrunner hybrid architecture

                          (http://en.wikipedia.org)

  Hybrid computing platform: a pathway to sustained petascale
  performance
PS3: An accessible CBE platform

   - PS3: 1 CBE with 6 usable SPEs, $399 (40GB) (as of June 2008)
   - IBM BladeCenter QS21: 2 CBEs, $6,995 (as of August 2008)
   - Roadrunner (based on QS22): 12,960 CBEs, about $100 million (LANL/NNSA)
   - PS3: an affordable CBE platform for scientific computing
PS3 Clusters and Emerging Petascale Computing Systems

   - PS3 clusters: prototype for hybrid petascale systems
       - North Carolina State University: first 8-PS3 cluster (Mueller, 2007)
       - UNH: OpenGGCM-C reached x25 overall acceleration (Germaschewski 2008)
       - UNH: NSF (PetaApps) 4-year, $1.5 million grant (Raeder 2008);
         40-PS3 cluster (8 teraflops total in theory)
       - Other places and other codes (MIT, VPIC, GS2, etc.)
   - Emerging petascale computing systems
       - Roadrunner (LANL/IBM, funded by DOE/NNSA, delivered 2008):
         first classified petascale system
       - Blue Waters (NCSA/IBM, funded by NSF, to be delivered 2011):
         first open scientific research petascale system (Power7,
         multicore but not CBE)
Porting NIMROD to PS3

   - Goal: prepare NIMROD for CBE-based petascale systems
   - Opportunity:
       - CBE system and application (numerical) paradigms and
         libraries are quickly emerging (e.g. IBM, PETSc, and many
         papers)
       - The trend is still at its beginning.
   - Challenge:
       - Codes based on explicit schemes are easier to accelerate
       - Sparse matrices and direct solvers may not be straightforward
         to accelerate
System Preparation: Fedora Core 7 was installed

    - Installation instructions are available online
    - Download the FC7-ppc and Cell Addon ISOs
    - Update the PS3 firmware (2.00)
    - Format the PS3 hard drive (10GB for GameOS)
    - Install the Linux bootloader from the Cell Addon
    - Change GameOS to Other OS (Linux)
    - Use kboot and anaconda to finish the installation (1.5 hrs)
A screenshot from my PS3
Environment Configuration: IBM Cell SDK 3.0 installed

    - IBM Cell SDK (Software Development Kit) for Multicore
      Acceleration (version 3.0) (http://www.ibm.com)
    - Available for RHEL5.1 and Fedora 7
    - 3 package types: Developer, Product, and Extras.
    - The Developer package includes:
          - Accelerator Library Framework (ALF)
          - BLAS linear algebra library
          - GNU Fortran compiler for PPE and SPE
          - SIMD MATH library
    - The Extras package includes:
          - FFT Library (1D and 2D)
          - SPU Timer Library and Timing Tool
          - OProfile: tools for profiling user-level and kernel-level code
          - Random Number Generator Library
Accelerate NIMROD on PS3: A first attempt

    - Build nimuw on PS3
          - First built without acceleration
          - Almost the same as on Bassi (ppc64), but using gfortran.
          - Runs fine on the PPE only (without SPEs involved).
    - Select a function to accelerate
          - Identify the most computation-intensive function
          - Depends on the problem to solve, scheme, solver, and library
          - Iterative solvers are preferred over direct solvers
          - For solver='diagonal', the major operation is the matvec product.
    - Details:
          - iter_cg_f90.f: iter_solve_real, matrix_mod.f: matvec
          - matrix_mod.f: matvec_real
          - matvec_real: matvecgg_real_rbl
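Accelerate Linear Algebra with CBE BLAS
    - Block (dense) matrix-vector multiply reduces to a
      vector-vector (dot) product
          DO iq=1,nqr
            result(iq,0,0)=SUM(matrix(:,0:,0:,iq,0,0)*vector(:,:1,:1))
          ENDDO

    - BLAS library in Cell SDK 3.0
          - 3 levels: vec-vec, mat-vec, mat-mat
          - 2 APIs: PPE and SPE
          - First step: use the PPE BLAS API
    - Replace SUM with DDOT (ppu blas)
          ! Dot product of one flattened matrix block with the vector.
          ! Declarations are assumed here; the slide omitted them.
          FUNCTION matcolvec(nvec,matcol,vec) RESULT(prod)
            INTEGER, INTENT(IN) :: nvec
            DOUBLE PRECISION, INTENT(IN) :: matcol(nvec),vec(nvec)
            DOUBLE PRECISION :: prod
            DOUBLE PRECISION, EXTERNAL :: DDOT
            prod = DDOT(nvec,matcol,1,vec,1)
          END FUNCTION matcolvec

          ! Flatten both operands to rank-1 arrays of length n1
          n1=SIZE(vector(:,:1,:1))
          DO iq=1,nqr
            result(iq,0,0)= &
              matcolvec(n1,RESHAPE(matrix(:,0:,0:,iq,0,0),(/n1/)), &
                           RESHAPE(vector(:,:1,:1),(/n1/)))
          ENDDO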
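
The equivalence behind the SUM-to-DDOT replacement can be checked in isolation. The following is a minimal sketch, not NIMROD code: the program name, array shapes, and block sizes are hypothetical, and the Fortran intrinsic DOT_PRODUCT stands in for DDOT so that the example builds without linking a BLAS library.

```fortran
PROGRAM sum_vs_dot
  IMPLICIT NONE
  ! Hypothetical block sizes, chosen only for illustration
  INTEGER, PARAMETER :: nqc = 3, nqr = 2
  DOUBLE PRECISION :: matrix(nqc,0:1,0:1,nqr), vector(nqc,0:1,0:1)
  DOUBLE PRECISION :: res_sum(nqr), res_dot(nqr)
  INTEGER :: iq, n1

  CALL RANDOM_NUMBER(matrix)
  CALL RANDOM_NUMBER(vector)

  n1 = SIZE(vector)
  DO iq = 1, nqr
    ! Original form: elementwise product followed by SUM
    res_sum(iq) = SUM(matrix(:,:,:,iq)*vector)
    ! Flattened form: dot product of two rank-1 arrays of length n1
    ! (DOT_PRODUCT is a stand-in for DDOT in this sketch)
    res_dot(iq) = DOT_PRODUCT(RESHAPE(matrix(:,:,:,iq), (/n1/)), &
                              RESHAPE(vector, (/n1/)))
  END DO

  ! The two forms should agree to rounding error
  PRINT *, 'max |SUM - DOT_PRODUCT| =', MAXVAL(ABS(res_sum - res_dot))
END PROGRAM sum_vs_dot
```

Note that RESHAPE applied to an array section typically produces a contiguous temporary copy, so the flattened form does extra memory traffic on every call. Whether that cost matters in NIMROD would need to be measured.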
Performance: A simple test case
(Linear Shear-Alfven wave)

    - Environment variable: BLAS_NUMSPES=4 (for level 1 and 2
      BLAS); could be >4 for level 3 BLAS.
    - Input parameters
                gridshape='rect',
                periodicity='both',
                geom='lin',
                per_length=0.7071068,

    - Total (wallclock) time compared for 4 different mesh sizes
       mx x my                8x8       64x64     128x128     256x256
       ppu SUM (no spu)    4.34E+01   2.28E+02   1.27E+03   2.50E+04
       ppu DDOT (w spu)    3.29E+01   4.19E+02   2.02E+03   2.90E+04
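
The timings above are easier to compare as a ratio of DDOT time to SUM time. The short sketch below only recomputes that ratio from the tabulated numbers; the program and variable names are hypothetical. A ratio below 1 means the DDOT version was faster.

```fortran
PROGRAM timing_ratio
  IMPLICIT NONE
  INTEGER, PARAMETER :: ncase = 4
  ! Mesh labels and wallclock times copied from the table above
  CHARACTER(LEN=7), PARAMETER :: mesh(ncase) = &
       (/ '8x8    ', '64x64  ', '128x128', '256x256' /)
  REAL, PARAMETER :: t_sum(ncase)  = (/ 4.34E+01, 2.28E+02, 1.27E+03, 2.50E+04 /)
  REAL, PARAMETER :: t_ddot(ncase) = (/ 3.29E+01, 4.19E+02, 2.02E+03, 2.90E+04 /)
  INTEGER :: i

  ! Print the DDOT/SUM time ratio for each mesh size
  DO i = 1, ncase
    WRITE(*, '(A7,2X,F5.2)') mesh(i), t_ddot(i)/t_sum(i)
  END DO
END PROGRAM timing_ratio
```

The ratios come out to roughly 0.76, 1.84, 1.59, and 1.16 for the four meshes, so in this test the DDOT version was faster only for the 8x8 mesh.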
Comparison of Scaling
Summary

   - Petascale performance has been achieved on a hybrid system
   - PS3 is a prototype of the hybrid system
   - We have started porting NIMROD to the PS3 and to the CBE
     platform in general
   - Preliminary efforts and issues were presented.
Porting NIMROD to CBE Platform: A long-term project

    - This is an exploratory study that may lead to a long-term
      project
    - Approaches:
          - Use vendor-built CBE libraries: BLAS, FFT (from IBM SDK)
          - Develop our own customized CBE functions.
          - A mix of the two above.
    - Requires coordinated planning, effort, and support