GRAPE-DR Project: a combination of peta-scale computing and high-speed networking - Kei Hiraki, Department of Computer Science, The University of Tokyo

GRAPE-DR Project:
a combination of peta-scale computing
and high-speed networking

                      Kei Hiraki

                      Department of Computer Science
                      The University of Tokyo

Computing System for real Scientists
• Fast Computation at low cost
   – High-speed CPU, huge memory, good graphics
• Very fast data movements
   – High-speed network, global file system, replication facilities
• Transparency to local computation
   – No complex middleware and no modification to existing
     software

• Real Scientists are not computer scientists
• Computer scientists are not a workforce for real scientists

Our goal

    • HPC system
          – as an infrastructure of scientific research
          – Simulation, data intensive computation, searching and
            data mining
    • Tools for real scientists
          – High-speed, low-cost, and easy to use
            ⇒ Today's supercomputers cannot provide this

                • High-speed / low-cost computation
                • High-speed global networking          ⇒ GRAPE-DR project
                • OS / all-IP network technology

GRAPE-DR

• GRAPE-DR: very high-speed attached processor
  – 2 Pflops (2008): possibly the fastest supercomputer
  – Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 512-node cluster system
  – 512 Gflops / chip
  – 2 M processors / system

• Semi-general-purpose processing
  – N-body simulation, gridless fluid dynamics
  – Linear solvers, molecular dynamics
  – High-order database searching

Data Reservoir
• Sharing Scientific Data between distant research institutes
   – Physics, astronomy, earth science, simulation data

• Very high-speed single-file transfer on Long Fat pipe Networks
   – > 10 Gbps, > 20,000 km, > 400 ms RTT
     (see the window-size sketch after this list)

• High utilization of available bandwidth
   – Transferred file data rate > 90% of available bandwidth
      • Including header overheads and initial negotiation overheads

• OS and file system transparency
   – Storage-level data sharing (high-speed iSCSI protocol on stock TCP)
   – Fast single-file transfer
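
The quoted path parameters imply a very large bandwidth-delay product, which is what makes single-stream transfer over a long fat pipe hard. A minimal sketch of that arithmetic, assuming the 10 Gbps and 400 ms figures above (illustrative C, not part of the Data Reservoir software):

    /* Bandwidth-delay product of a long fat pipe network (LFN):
     * to keep the pipe full, the TCP window must cover rate * RTT. */
    #include <stdio.h>

    int main(void) {
        const double rate_bps = 10e9;   /* assumed link rate: 10 Gbps */
        const double rtt_s    = 0.400;  /* assumed round-trip time    */

        double bdp_bytes = rate_bps / 8.0 * rtt_s;      /* bytes in flight */
        printf("Bandwidth-delay product: %.0f MB\n", bdp_bytes / 1e6);
        /* ~500 MB, far beyond default TCP windows, hence the need for
         * window scaling and careful tuning even on stock TCP. */
        return 0;
    }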

Our Next generation Supercomputing system

        ● Very high-performance computer
        ● Supercomputing/networking system for non-CS areas

        GRAPE-DR processor
        ・very fast (512 Gflops/chip)
        ・low power (8.5 Gflops/W)
        ・low cost
        ・1 chip: 512 Gflops; 1 system: 2 Pflops

        All-IP computer architecture (Keio Univ., Univ. of Tokyo)
        ・All components of a computer (processor, memory, display,
          keyboard, mouse, disks, etc.) have their own IP addresses
          and connect freely

        Long-distance TCP technology
        ・Same performance from local to global communication
        ・30,000 km, 1.1 GB/s

        ⇒ Next-generation supercomputing system
Peak Speed of Top-End Systems

[Chart: peak FLOPS of top-end systems versus year (roughly 1970-2050),
comparing parallel systems (Earth Simulator, 40 TFLOPS; ORNL 1P; BG/P 1P;
the KEISOKU system; GRAPE-DR, 2 PFLOPS) with single processor chips
(SIGMA-1, Core 2 Extreme 3 GHz).]
History of Speedup (Some Important Systems)

[Chart: peak FLOPS versus year for notable systems, including CDC 6600,
HITAC 8800, Cray-1, Cray-2, SX-2, CM-5, NWT, SR8000, ASCI Red,
Earth Simulator, GRAPE-6, BlueGene/L and GRAPE-DR.]
History of Power Efficiency

[Chart: flops/W versus year for Cray-1, Cray-2, ASCI Red, VPP5000,
Earth Simulator, SX-8, ASCI Purple, HPC2500, BlueGene/L, GRAPE-6 and
GRAPE-DR (estimated), plotted against the power-efficiency curve needed
for future systems; recent machines show slower power-efficiency
improvement.]
GRAPE-DR project (2004 to 2008)

        •     Development of practical MPP system
              – Pflops systems for real scientists
        •     Sub projects
              – Processor system
              – System software
              – Application software
                • Simulation of stars and galaxies, CFD, MD, linear systems
        •     2-3 Pflops peak, 20 to 30 racks
        •     0.7 – 1Pflops Linpack
        •     4000 – 6000 GRAPE-DR chips
        •     512 – 768 servers + Interconnect
        •     500 kW/Pflops

GRAPE-DR system

•   2 Pflops peak, 40 racks
•   512 servers + Infiniband
•   4000 GRAPE-DR chips
•   0.5 MW/Pflops (peak)
•   World's fastest computer?

       • GRAPE-DR (2008)
       • IBM BlueGene/P (2008?)
       • Cray Baker (ORNL) (2008?)

GRAPE-DR chip

[Diagram: GRAPE-DR chips with attached memory, connected to an IA server
over a PCI-Express x8 I/O bus.]

512 Gflops/chip, 2 Tflops/card
2 Pflops/system, 2 M PEs/system
20 Gbps InfiniBand interconnect
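
These headline figures are consistent with simple multiplication, assuming each PE retires one single-precision multiply and one add per cycle at the 500 MHz clock quoted later in the deck (a back-of-the-envelope reading, not a statement on this slide):

    \[
      512\ \mathrm{PEs} \times 500\ \mathrm{MHz} \times 2\ \mathrm{flops/cycle}
        = 512\ \mathrm{Gflops/chip}\ (32\text{-bit}),
    \]
    \[
      4\ \mathrm{chips/card} \times 512\ \mathrm{Gflops} \approx 2\ \mathrm{Tflops/card},
      \qquad
      4000\ \mathrm{chips} \times 512\ \mathrm{Gflops} \approx 2\ \mathrm{Pflops/system}.
    \]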
Node Architecture
        •   Accelerator attached to a host machine via the I/O bus
        •   Instruction execution between shared memory and local data
        •   All the PEs in a chip execute the same instruction (SIMD);
            a toy C model of this scheme follows the diagram

[Diagram: the host machine drives a control processor that distributes
instructions and data to the GRAPE-DR chip; groups of processing elements
(PEs) work out of shared memory blocks, results are collected back to the
host, and the host connects to other hosts over the network.]
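
A toy C model of this execution scheme, purely illustrative: it emulates on the host what the control processor and PE array do in hardware (broadcast, one shared instruction stream, collection of results).

    /* Toy model of the SIMD scheme: one instruction stream applied to
     * every PE's local data, then a reduction back to the host. */
    #include <stdio.h>

    #define NUM_PE 512                 /* PEs per GRAPE-DR chip */

    int main(void) {
        static double local_a[NUM_PE], local_b[NUM_PE], local_c[NUM_PE];
        double broadcast = 2.0;        /* value broadcast by the control processor */

        /* host/control processor loads each PE's local memory */
        for (int pe = 0; pe < NUM_PE; pe++) {
            local_a[pe] = pe;
            local_b[pe] = 1.0;
        }

        /* "one instruction": every PE performs the same multiply-add on
         * its own data; there is no PE-to-PE communication */
        for (int pe = 0; pe < NUM_PE; pe++)
            local_c[pe] = broadcast * local_a[pe] + local_b[pe];

        /* collection: results are reduced and returned to the host */
        double sum = 0.0;
        for (int pe = 0; pe < NUM_PE; pe++)
            sum += local_c[pe];

        printf("reduced result = %g\n", sum);
        return 0;
    }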
GRAPE-DR Processor chip
   • SIMD architecture
   • 512 PEs
   • No interconnections between PEs

   [Block diagram labels: per-PE local memory, 64-bit FALU and IALU,
   conditional execution; broadcasting memory and shared memory blocks;
   collective operations over a broadcast/collective tree; an I/O bus
   linking main memory on the host server and external shared memory.]
Processing Element
          • 512 PE in a chip

• 90 nm CMOS
• 18 mm × 18 mm
• 400 M transistors
• BGA 725

• 512 Gflops (32-bit)
• 256 Gflops (64-bit)
• 60 W (max)

Chip layout in a processing element

   Module name        Color
   grf0               red
   fmul0              orange
   fadd0              green
   alu0               blue
   lm0                magenta
   others             yellow

GRAPE-DR Processor chip

    Engineering sample (2006)
          512 processing elements per chip (largest number in a chip)
          Working on a prototype board (at the designed speed)
          500 MHz, 512 Gflops (fastest chip in production)
               Power consumption: max. 60 W, idle 30 W
                     (lowest power per computation)

GRAPE-DR Prototype boards

        • For evaluation (1 GRAPE-DR chip, PCI-X bus)
        • Next step is a 4-chip board

GRAPE-DR Prototype boards(2)

        •   2nd prototype
        •   Single-chip board
        •   PCI-Express x8 interface
        •   On-board DRAM
        •   Designed to run real applications
        •   (Mass-production version will have 4 chips)

Block diagram of a board
        •   4 GRAPE-DR chips ⇒ 2 Tflops (single), 1 Tflops (double)
        •   4 FPGAs for control processors
        •   On-board DDR2 DRAM, 4 GB
        •   Interconnection by a PCI-Express bridge chip
[Diagram: four GRAPE-DR chips, each with a control processor (FPGA) and
DRAM, connected through a PCI-Express bridge chip to a PCI-Express x16
host interface.]
Compiler for GRAPE-DR
• GRAPE-DR optimizing compiler
  – Parallelization with global analysis
  – Generation of multi-threaded code for complex
    operations

  ○ Sakura-C compiler
    ・ Basic optimizations
          (PDG, flow analysis, pointer analysis, etc.)
    ・ Source program in C ⇒ intermediate language
       ⇒ GRAPE-DR machine code
  ○ Currently 2x to 3x slower than hand-optimized codes
    (see the example kernel below)
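
As a concrete, hypothetical illustration of the kind of C input such a compiler targets, the loop below has fully independent iterations; whether it can be spread across the 512 PEs hinges on exactly the flow and pointer analyses listed above, with restrict letting the alias analysis succeed.

    #include <stdio.h>

    /* Hypothetical compiler input: independent iterations, SIMD-friendly.
     * 'restrict' asserts that out[] never overlaps in[]. */
    static void square_norms(int n, const double *restrict in,
                             double *restrict out)
    {
        for (int i = 0; i < n; i++) {
            double x = in[2 * i], y = in[2 * i + 1];
            out[i] = x * x + y * y;
        }
    }

    int main(void) {
        double in[8] = {3, 4, 1, 1, 0, 2, 5, 12};
        double out[4];
        square_norms(4, in, out);
        for (int i = 0; i < 4; i++)
            printf("%g ", out[i]);     /* prints: 25 2 4 169 */
        printf("\n");
        return 0;
    }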

GRAPE-DR Application software

•   Astronomical simulation (N-body; see the force-kernel sketch after this list)
•   SPH(Smoothed Particle Hydrodynamics)
•   Molecular dynamics
•   Quantum molecular simulation
•   Genome sequence matching
•   Linear equations on dense matrices
    – Linpack
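
For reference, the core of the astronomical (N-body) and SPH workloads is a pairwise force sum like the minimal sketch below; GRAPE-class hardware evaluates this double loop in parallel. Illustrative code only, not the production library (units with G = 1, Plummer softening eps2).

    #include <math.h>
    #include <stdio.h>

    /* Minimal direct-summation gravity kernel (softened acceleration). */
    static void accel(int n, double pos[][3], const double mass[],
                      double eps2, double acc[][3])
    {
        for (int i = 0; i < n; i++) {
            double a[3] = {0.0, 0.0, 0.0};
            for (int j = 0; j < n; j++) {
                double d[3], r2 = eps2;
                for (int k = 0; k < 3; k++) {
                    d[k] = pos[j][k] - pos[i][k];
                    r2 += d[k] * d[k];
                }
                double inv_r3 = 1.0 / (r2 * sqrt(r2));   /* softened 1/r^3 */
                for (int k = 0; k < 3; k++)
                    a[k] += mass[j] * d[k] * inv_r3;     /* i == j term is zero */
            }
            for (int k = 0; k < 3; k++)
                acc[i][k] = a[k];
        }
    }

    int main(void) {
        double pos[2][3] = {{0, 0, 0}, {1, 0, 0}};
        double mass[2]   = {1.0, 1.0};
        double acc[2][3];
        accel(2, pos, mass, 1e-6, acc);
        printf("ax of particle 0 = %g\n", acc[0][0]);    /* close to +1 */
        return 0;
    }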

Application fields for GRAPE-DR
[Diagram: application fields placed according to which resource dominates.]

  Computation speed (GRAPE-DR, GPGPU, etc.):
     N-body simulation, molecular dynamics, dense linear equations,
     finite element method, quantum molecular simulation,
     genome matching, protein folding

  Balanced performance: general-purpose CPUs

  Memory bandwidth (vector machines):
     fluid dynamics, weather forecasting, earthquake simulation

  Network bandwidth (IBM BlueGene/L, P):
     QCD, FFT-based applications
Eflops system
        • Around 2011, 10 Pflops systems will appear:
              • Next Generation Supercomputer by Japanese Government

              • IBM BlueGene/Q system
              • Cray Cascade system (ORNL)
              • IBM HPCS

        Architectures
              – Clusters of commodity MPUs (SPARC / Intel / AMD / IBM POWER)
              – Vector processors
              – Densely integrated low-power processors

              – Power efficiency of roughly 1 to 2.5 MW/Pflops

Peak Speed of Top-End Systems (revisited)

[Chart: the same peak-FLOPS-versus-year plot, here annotated with the
KEISOKU system, ORNL 1P, BG/P 1P, Earth Simulator 40T, GRAPE-DR 2 PFLOPS,
and projected future Intel processor chips (process node and core count
left unspecified on the slide).]
History of Power Efficiency (repeated)

[Chart: the same flops/W-versus-year plot shown earlier.]
Development of Supercomputers

[Chart: a genealogy of supercomputers from about 1970 to the late 2000s,
grouped into vector machines (CDC7600, TI ASC, STAR-100, the Cray line
from Cray-1 to Cray-X1 and BlackWidow, the NEC SX series, the Fujitsu
VP/VPP series, Earth Simulator), SIMD machines (ILLIAC IV, DAP, MPP,
CM-1/CM-2, MP-1/MP-2, GRAPE-6, GRAPE-DR), distributed-memory machines
(Cosmic Cube, iPSC, nCUBE, Paragon, T3D/T3E, ASCI Red, ASCI White,
the IBM SP and BlueGene lines, CP-PACS, SR8000/SR11000, XT3) and
shared-memory machines (IBM 360/67, C.mmp, HEP, Multimax, Sequent,
KSR-1, Starfire, Origin, Altix), with markers distinguishing the fastest
system of its time from research/special-purpose systems.]
Eflops system
        • Requirements for building Eflops systems
              • Power efficiency
                 • Max power: ~30 MW
                 • Efficiency must be at least about 30 Gflops/W
              • Number of processor chips
                 • ~300,000 chips? (3 Tflops/chip)
              • Memory system
                 • Smaller memory per processor chip ⇒ lower power
                   consumption
                 • Some kind of shared-memory mechanism is a must
              • Limitation of cost
                 • 1 billion $ / Eflops ⇒ 1000 $/Tflops
                   (the arithmetic is spelled out below)
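
The three targets follow from simple division, using the 30 MW power cap, the 3 Tflops/chip building block and the 1 billion dollar budget quoted above:

    \[
      \frac{10^{18}\ \mathrm{flops}}{30 \times 10^{6}\ \mathrm{W}} \approx 33\ \mathrm{Gflops/W},
      \qquad
      \frac{10^{18}\ \mathrm{flops}}{3 \times 10^{12}\ \mathrm{flops/chip}} \approx 3.3 \times 10^{5}\ \mathrm{chips},
      \qquad
      \frac{\$10^{9}}{10^{6}\ \mathrm{Tflops}} = \$1000/\mathrm{Tflops}.
    \]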
Effective approaches
        • General purpose MPUs (clusters)

        • Vectors

        • GPGPU

        • SIMD (GRAPE-DR)

        • FPGA-based simulation engines

Unsuitable architecture
        • Vectors
              •   Too much power for numerical computation
              •   Too much power for wide memory interface
              •   Less effective for a wide range of numerical algorithms
              •   Too expensive for computing speed

        • General purpose MPUs (clusters)
              • Too low FPU speed per die size
              • Too much power for numerical computation
              • Too many processor chips for stable operation

Unsuitable architecture
        • Vectors
              – Current version 65nm Technology
                 • 0.06 Gflops/W peak
                 • $2,000,000/Tflops

        • General purpose MPUs (clusters)
              – Current version 65nm Technology
                 • 0.3 Gflops/W peak (BlueGene/P, TACC etc.)
                 • $200,000/Tflops

      Goals: 30 Gflops/W, $1000/Tflops, 3 Tflops/chip
      (see the gap estimate below)
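
Taking simple ratios of the figures on this slide, the goals imply roughly the following improvement factors over the current 65 nm generation:

    \[
      \text{Vectors: } \frac{30}{0.06} = 500\times \ \text{in Gflops/W},
      \qquad
      \frac{2{,}000{,}000}{1000} = 2000\times \ \text{in \$/Tflops};
    \]
    \[
      \text{Commodity MPUs: } \frac{30}{0.3} = 100\times,
      \qquad
      \frac{200{,}000}{1000} = 200\times.
    \]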
General-Purpose CPU

[Figure (J. Makino 2007).]
Architectural Candidates
        • SIMD (GRAPE-DR)
              •   30%   FALU
              •   20%   IALU
              •   20%   Register file
              •   20%   Memory
              •   10%   Interconnect
              (about 50% of the die is used for computation: very efficient)
        • GPGPU
              • Less expensive due to large production volumes
              • Less efficient than a properly designed SIMD processor
                 • Larger die size
                 • Larger power consumption

GPGPU (1)

[Figure (J. Makino 2007).]
GPGPU (2)

[Figure (J. Makino 2007).]
Comparison with GPGPU

[Figure (J. Makino 2007).]
Comparison with GPGPU (2)

[Figure (J. Makino 2007).]
FPGA-Based Simulator

[Figure (J. Makino 2007).]
[Figure (Kawai, Fukushige, 2006).]
Astronomical Simulation

[Figure (Kawai, Fukushige, 2006).]
Grape-7 Board (Fukushige et al.)

[Figures (Kawai, Fukushige, 2006): the GRAPE-7 board.]
SIMD v.s. FPGA
        • High efficiency of FPGA-based simulation
              – Small-precision calculation
              – Bit manipulation
              – Use of general-purpose FPGAs is essential to reduce
                cost.

              – Current version, 65 nm technology:
                 • 8 Gflops/W peak
                 • $105,000/Tflops

              – 30 Gflops/W peak will be feasible by 2017
              – $1000/Tflops will be a tough target
SIMD v.s. FPGA
              – GRAPE-DR based simulation
              – Double precision calculation
              – Programming is much easier than FPGA based
                approach
                 • HDL based programming v.s. C programming
              – Use of latest semiconductor technology is the key

              – Current version, 90 nm technology:
                 • 4 Gflops/W peak
                 • $10,000/Tflops
              – 30 Gflops/W peak will be feasible by 2017
              – $1000/Tflops will be feasible by 2017
Systems in 2017

        • Eflops system

              – Necessary conditions
                 • Cost: ~1000 $/Tflops
                 • Power consumption: 30 Gflops/W -> 30 MW

              – GRAPE-DR technology (see the scaling note below)
                 • 45 nm CMOS -> 4 Tflops/chip, 40 Gflops/W
                 • 22 nm CMOS -> 16 Tflops/chip, 160 Gflops/W

              – GPGPU: low double-precision performance
                 • Power consumption, die size
              – ClearSpeed
                 • Low performance (25 Gflops/chip)
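
The two GRAPE-DR projections are mutually consistent with plain density scaling: halving the feature size roughly quadruples the logic that fits in the same die and power budget (my reading of the slide, assuming constant die size):

    \[
      \left(\frac{45\ \mathrm{nm}}{22\ \mathrm{nm}}\right)^{2} \approx 4,
      \qquad
      \frac{16\ \mathrm{Tflops/chip}}{4\ \mathrm{Tflops/chip}} = 4,
      \qquad
      \frac{160\ \mathrm{Gflops/W}}{40\ \mathrm{Gflops/W}} = 4.
    \]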
To FPGA community
        • Key issues to achieve next generation HPC
              – Low cost. Special-purpose FPGAs will not survive in HPC.
              – Low power. Low power is the most important feature of
                FPGA-based simulation.
              – High speed. 500 Gflops/chip? (short precision)

        • Programming environment
              –   More important to HPC users (who are not skilled in
                  hardware design)
              –   Programming using conventional programming languages
              –   Good compilers, run-time systems and debuggers
              –   New algorithms that utilize short-precision arithmetic
                  are necessary
