Announcing Tesla K20 Family NVIDIA Tesla Update - Sumit Gupta General Manager

Page created by Willie Payne

Hobbies & Interests

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Announcing Tesla K20 Family NVIDIA Tesla Update - Sumit Gupta General Manager

Announcing Tesla K20 Family
   NVIDIASumit
           Tesla    Update
                Gupta
   Supercomputing’12
        General Manager
     Tesla Accelerated Computing
           Sumit Gupta
           General Manager
     Tesla Accelerated Computing

                                   1

Today’s information is embargoed until
November 12 – 6:00 am US Pacific Time

Accelerated Computing Meets
                              Increased Demand for Science
                                                                            50x
                                                                                   Top500 Systems   OEM Systems
                                                                                   Industry Apps    Universities
                                                                            40x

                                                                            30x
                                                                                           Fermi
                                                                                         Launches
                                                                            20x

                                                                            10x

                                                                             0
http://www.teragridforum.org/mediawiki/images/f/f8/TGQR_2011Q1_Report.pdf         2008      2009     2010          2011       2012
                                                                                                                     Normalized to 2008

March of the GPUs
                                                     Maxwell
                  16

                  14
GFLOPS per Watt

                  12

                  10

                   8

                   6                        Kepler

                   4
                                  Fermi
                   2   Tesla

                       2008       2010       2012     2014

Tesla K20 Family
1   World’s Fastest, Most Efficient Accelerator

    Powered by CUDA: World’s Most Pervasive Parallel
2   Programming Model

3   Delivers World Record Performance for Scientific Apps

Announcing Tesla K20 Accelerator Family

                                          Tesla K20X   Tesla K20

                  Peak Double Precision     1.31 TF     1.17 TF

                  Peak Single Precision     3.95 TF     3.52 TF

                   Memory Bandwidth        250 GB/s     208 GB/s
  Tesla K20X
                      Memory size            6 GB        5 GB

K20X: 3x Faster Than Fermi
DGEMM
 TFlops
  1.5

   1

                                                      1.22
  0.5

                                    0.43
               0.17
   0
          Xeon E5-2687Wc      Tesla M2090 (Fermi)   Tesla K20X
          (8 core, 3.1 Ghz)

K20X: Most Efficient Accelerator
Linpack
 TFlops
  4.0
                                                                          76%
                                                                       Efficiency
  3.0

                    61%
  2.0            Efficiency

  1.0                                                                     2.25
                   1.03
  0.0
                 Fermi Server                                       Kepler Server
           2x SB CPUs + 2x M2090s                               2x SB CPUs + 2x K20X
                        Server Configuration: Dual socket E5-2680, 2.7 GHz + 2 GPUs

Titan: World’s #1 Open Science Supercomputer
                   18,688 Tesla K20X GPUs
      27 Petaflops Peak: 90% of Performance from GPUs
      17.59 Petaflops Sustained Performance on Linpack

K20X: Most Energy Efficient Accelerator
                       Current Green500 List
Titan K20X System
       Beats
 #1 on Green500:
   BlueGene/Q

2142.77 MFLOPS/W

30 Petaflops in 30 Days

K20 / K20X Availability

           Shipping this week
General Availability: November-December

Tesla K20 Family
1   World’s Fastest, Most Efficient Accelerator

    Powered by CUDA: World’s Most Pervasive Parallel
2   Programming Model

3   Delivers World Record Performance for Scientific Apps

CUDA: World’s Most Pervasive Parallel
              Programming Model

              Institutions with   629 University Courses
      8,000   CUDA Developers        In 62 Countries

  1,500,000   CUDA Downloads

395,000,000   CUDA GPUs Shipped

CUDA Apps Grows 60%, Accelerating Key Apps
# of Apps
                                                      Top Supercomputing Apps
  200                             61% Increase                         AMBER              LAMMPS
                                                 Computational
                                                                      CHARMM               NAMD
                                                   Chemistry          GROMACS             DL_POLY
  150                                                                 QMCPACK             Gaussian
                                                   Material
                   40% Increase                                    Quantum Espresso       NWChem
                                                   Science             GAMESS              VASP
                                                                        COSMO              CAM-SE
  100                                              Climate &
                                                                        GEOS-5              NIM
                                                   Weather                                  WRF
                                                                        Chroma                  GTS
   50                                               Physics             Denovo                 ENZO
                                                                         GTC                   MILC
                                                                   ANSYS Mechanical     ANSYS Fluent
                                                     CAE              MSC Nastran        OpenFOAM
    0                                                               SIMULIA Abaqus        LS-DYNA
            2010       2011            2012
                                                                 Accelerated, In Development

Leading Apps Now Accelerated by GPUs

Fluid Dynamics   Structual Mechanics   Life Sciences

                                       CHARMM

Tesla K20 Family
1   World’s Fastest, Most Efficient Accelerator

    Powered by CUDA: World’s Most Pervasive Parallel
2   Programming Model

3   Delivers World Record Performance for Scientific Apps

Fastest Performance on Scientific Applications
                     Tesla K20X Speed-Up over Sandy Bridge CPUs

  Higher Ed   MATLAB (FFT)*

    Physics         Chroma

     Earth      SPECFEM3D
   Science

  Molecular          AMBER
  Dynamics

                          0.0x        5.0x          10.0x                    15.0x                      20.0x
                                                           System Config- CPU results: Dual socket E5-2687w, 3.10 GHz
                                                               GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs
                                                     *MATLAB results comparing one i7-2600K CPU vs with Tesla K20 GPU

Record Breaking Simulation
                                                          Discover better materials for
                                                               magnetic storage
      New Record 10+ PFLOPS

        Old Record 3.1 PFLOPS

                 Effort 2% Lines of Code

2011 Gordon Bell Winner at 3.08 Petaflops on K Computer
                                                          WL-LSMS: Material Science

Applications Scale to 1000s of GPUs
                  Material Science                                           Molecular Dynamics
 Compute       QMCPACK, 3x3x1 Graphite                                        NAMD, 100x STMV
                                                             ns/day
 Efficiency
                                                            2.0
1500000

1250000
                                                            1.5
1000000

 750000                                                     1.0

 500000
                                                            0.5
 250000

       0                                                    0.0
           0     500       1000    1500      2000    2500             128           256          512            768
                        # of Compute Nodes                                         # of Compute Nodes
               Cray XK7-Tesla K20X    Cray XK7-CPU                          Cray XK7 - K20X    Cray XK7 - CPU

The Era of Accelerated Computing is Here

                                           Era of
                                   Accelerated Computing

                            Era of
                   Distributed Computing
      Era of
Vector Computing

          1980           1990              2000            2010   2020

SC12 News Summary
1   Introducing the Tesla K20 Accelerator Family

2   New CUDA Accelerated Apps and Growing Ecosystem

3   Record Setting Performance on Scientific Applications

    Embargoed Until Nov 12 – 6:00 am US PT

Customers Seeing Impressive K20 Speedups
         “Tesla K20 GPU is 2.3x faster than Tesla M2070, and
          no change was required in our code!
                                               ”
                                   Associate Professor in Mechanical Engineering
                                                                  Inanc Senocak

         “Results are amazing!   It is 160x faster than our CPU
           code and 2.5x faster than Fermi for our solutions
                                                                              ”
                                                  Professor in Computer Science
                                                                   Estaban Clua

         “Tesla K20 is very impressive. Our application
           runs 20x faster compared to a Sandy Bridge CPU.
                                                                              ”
                                                              Research Scientist
                                                   Oreste Villa, Antonino Tumeo

Teaching Parallel Programming with CUDA
   “I have found GPU programming using CUDA to be one of the easiest ways
    to introduce students to parallel programming.
                                                  ”        Professor Eric Darve
                                                            Stanford University

   “My students are amazed to find how easy the parallel programming with
    CUDA is and are thrilled by the performance from NVIDIA GPUs.
                                                                  ”
                                                       Professor Miaoqing Huang
                                                          University of Arkansas

   “CUDA allows me to teach students with no prior parallel programming
    experience to parallelize real-world apps in just a few weeks.
                                                                   ”
                                                            Professor Chris Lupo
                                                        Cal Poly San Luis Obispo

OpenACC Makes GPU Accelerator Easier
                        Design alternative fuels with
     4x Faster          up to 50% higher efficiency
Jaguar       Titan
42 days     10 days

  Minimal Effort
  with OpenACC
    Modified

Kepler: GPU Acceleration Made Easier Than Ever
            Hyper-Q                     Dynamic Parallelism
Easy speed-up for legacy MPI codes   GPU generates work for itself

Kepler: GPU Acceleration Made Easier Than Ever

                         Hyper-Q: 32 MPI jobs per GPU                            Dynamic Parallelism: GPU Generates Own Work
                        Easy Speed-up for Legacy MPI Apps                               Less Effort, Higher Performance
                                    CP2K- Quantum Chemistry                                                                                         Quicksort
                       20x                                                                                         4.0x
                                                                       3x                                                                                                            2x

                                                                                    Relative Sorting Performance
Speedup vs. Dual K20

                       15x                                                                                         3.0x

                       10x                                                                                         2.0x

                        5x                                                                                         1.0x

                        0x                                                                                         0.0x
                             0         5           10            15         20                                            0                           5                         10
                                              Number of GPUs                                                                        Increasing Problem Size (# of Elements)          Millions

                                 K20 with Hyper-Q     K20 without Hyper-Q                                                     Without Dynamic Parallelism     With Dynamic Parallelism

All Accelerators Programmed the Same Way

  Method                Xeon Phi                            GPU
                     Limited Support
                                                         Broad Support
   Libraries   Few functions in Intel MKL for
                                                  BLAS, FFT, MAGMA, CULA, …
                      offload mode

                                                          OpenACC
                       Proprietary
  Directives    Xeon Phi specific directives
                                                  Based on portable, industry
                                                          standard

                         Proprietary                       CUDA
  Language
               Vector intrinsics, like assembly     Simple C/C++/Fortran
  Extensions            programming                      extensions

You can also read