Announcing Tesla K20 Family NVIDIA Tesla Update - Sumit Gupta General Manager
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Announcing Tesla K20 Family
NVIDIASumit
Tesla Update
Gupta
Supercomputing’12
General Manager
Tesla Accelerated Computing
Sumit Gupta
General Manager
Tesla Accelerated Computing
1Accelerated Computing Meets
Increased Demand for Science
50x
Top500 Systems OEM Systems
Industry Apps Universities
40x
30x
Fermi
Launches
20x
10x
0
http://www.teragridforum.org/mediawiki/images/f/f8/TGQR_2011Q1_Report.pdf 2008 2009 2010 2011 2012
Normalized to 2008March of the GPUs
Maxwell
16
14
GFLOPS per Watt
12
10
8
6 Kepler
4
Fermi
2 Tesla
2008 2010 2012 2014Tesla K20 Family
1 World’s Fastest, Most Efficient Accelerator
Powered by CUDA: World’s Most Pervasive Parallel
2 Programming Model
3 Delivers World Record Performance for Scientific AppsAnnouncing Tesla K20 Accelerator Family
Tesla K20X Tesla K20
Peak Double Precision 1.31 TF 1.17 TF
Peak Single Precision 3.95 TF 3.52 TF
Memory Bandwidth 250 GB/s 208 GB/s
Tesla K20X
Memory size 6 GB 5 GBK20X: 3x Faster Than Fermi
DGEMM
TFlops
1.5
1
1.22
0.5
0.43
0.17
0
Xeon E5-2687Wc Tesla M2090 (Fermi) Tesla K20X
(8 core, 3.1 Ghz)K20X: Most Efficient Accelerator
Linpack
TFlops
4.0
76%
Efficiency
3.0
61%
2.0 Efficiency
1.0 2.25
1.03
0.0
Fermi Server Kepler Server
2x SB CPUs + 2x M2090s 2x SB CPUs + 2x K20X
Server Configuration: Dual socket E5-2680, 2.7 GHz + 2 GPUsTitan: World’s #1 Open Science Supercomputer
18,688 Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on LinpackK20X: Most Energy Efficient Accelerator
Current Green500 List
Titan K20X System
Beats
#1 on Green500:
BlueGene/Q
2142.77 MFLOPS/W30 Petaflops in 30 Days
K20 / K20X Availability
Shipping this week
General Availability: November-DecemberTesla K20 Family
1 World’s Fastest, Most Efficient Accelerator
Powered by CUDA: World’s Most Pervasive Parallel
2 Programming Model
3 Delivers World Record Performance for Scientific AppsCUDA: World’s Most Pervasive Parallel
Programming Model
Institutions with 629 University Courses
8,000 CUDA Developers In 62 Countries
1,500,000 CUDA Downloads
395,000,000 CUDA GPUs ShippedCUDA Apps Grows 60%, Accelerating Key Apps
# of Apps
Top Supercomputing Apps
200 61% Increase AMBER LAMMPS
Computational
CHARMM NAMD
Chemistry GROMACS DL_POLY
150 QMCPACK Gaussian
Material
40% Increase Quantum Espresso NWChem
Science GAMESS VASP
COSMO CAM-SE
100 Climate &
GEOS-5 NIM
Weather WRF
Chroma GTS
50 Physics Denovo ENZO
GTC MILC
ANSYS Mechanical ANSYS Fluent
CAE MSC Nastran OpenFOAM
0 SIMULIA Abaqus LS-DYNA
2010 2011 2012
Accelerated, In DevelopmentLeading Apps Now Accelerated by GPUs
Fluid Dynamics Structual Mechanics Life Sciences
CHARMMTesla K20 Family
1 World’s Fastest, Most Efficient Accelerator
Powered by CUDA: World’s Most Pervasive Parallel
2 Programming Model
3 Delivers World Record Performance for Scientific AppsFastest Performance on Scientific Applications
Tesla K20X Speed-Up over Sandy Bridge CPUs
Higher Ed MATLAB (FFT)*
Physics Chroma
Earth SPECFEM3D
Science
Molecular AMBER
Dynamics
0.0x 5.0x 10.0x 15.0x 20.0x
System Config- CPU results: Dual socket E5-2687w, 3.10 GHz
GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs
*MATLAB results comparing one i7-2600K CPU vs with Tesla K20 GPURecord Breaking Simulation
Discover better materials for
magnetic storage
New Record 10+ PFLOPS
Old Record 3.1 PFLOPS
Effort 2% Lines of Code
2011 Gordon Bell Winner at 3.08 Petaflops on K Computer
WL-LSMS: Material ScienceApplications Scale to 1000s of GPUs
Material Science Molecular Dynamics
Compute QMCPACK, 3x3x1 Graphite NAMD, 100x STMV
ns/day
Efficiency
2.0
1500000
1250000
1.5
1000000
750000 1.0
500000
0.5
250000
0 0.0
0 500 1000 1500 2000 2500 128 256 512 768
# of Compute Nodes # of Compute Nodes
Cray XK7-Tesla K20X Cray XK7-CPU Cray XK7 - K20X Cray XK7 - CPUThe Era of Accelerated Computing is Here
Era of
Accelerated Computing
Era of
Distributed Computing
Era of
Vector Computing
1980 1990 2000 2010 2020SC12 News Summary
1 Introducing the Tesla K20 Accelerator Family
2 New CUDA Accelerated Apps and Growing Ecosystem
3 Record Setting Performance on Scientific Applications
Embargoed Until Nov 12 – 6:00 am US PTCustomers Seeing Impressive K20 Speedups
“Tesla K20 GPU is 2.3x faster than Tesla M2070, and
no change was required in our code!
”
Associate Professor in Mechanical Engineering
Inanc Senocak
“Results are amazing! It is 160x faster than our CPU
code and 2.5x faster than Fermi for our solutions
”
Professor in Computer Science
Estaban Clua
“Tesla K20 is very impressive. Our application
runs 20x faster compared to a Sandy Bridge CPU.
”
Research Scientist
Oreste Villa, Antonino TumeoTeaching Parallel Programming with CUDA
“I have found GPU programming using CUDA to be one of the easiest ways
to introduce students to parallel programming.
” Professor Eric Darve
Stanford University
“My students are amazed to find how easy the parallel programming with
CUDA is and are thrilled by the performance from NVIDIA GPUs.
”
Professor Miaoqing Huang
University of Arkansas
“CUDA allows me to teach students with no prior parallel programming
experience to parallelize real-world apps in just a few weeks.
”
Professor Chris Lupo
Cal Poly San Luis ObispoOpenACC Makes GPU Accelerator Easier
Design alternative fuels with
4x Faster up to 50% higher efficiency
Jaguar Titan
42 days 10 days
Minimal Effort
with OpenACC
ModifiedKepler: GPU Acceleration Made Easier Than Ever
Hyper-Q Dynamic Parallelism
Easy speed-up for legacy MPI codes GPU generates work for itselfKepler: GPU Acceleration Made Easier Than Ever
Hyper-Q: 32 MPI jobs per GPU Dynamic Parallelism: GPU Generates Own Work
Easy Speed-up for Legacy MPI Apps Less Effort, Higher Performance
CP2K- Quantum Chemistry Quicksort
20x 4.0x
3x 2x
Relative Sorting Performance
Speedup vs. Dual K20
15x 3.0x
10x 2.0x
5x 1.0x
0x 0.0x
0 5 10 15 20 0 5 10
Number of GPUs Increasing Problem Size (# of Elements) Millions
K20 with Hyper-Q K20 without Hyper-Q Without Dynamic Parallelism With Dynamic ParallelismAll Accelerators Programmed the Same Way
Method Xeon Phi GPU
Limited Support
Broad Support
Libraries Few functions in Intel MKL for
BLAS, FFT, MAGMA, CULA, …
offload mode
OpenACC
Proprietary
Directives Xeon Phi specific directives
Based on portable, industry
standard
Proprietary CUDA
Language
Vector intrinsics, like assembly Simple C/C++/Fortran
Extensions programming extensionsYou can also read