NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE

Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym
NVIDIA

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.
The modern 3D graphics processing unit (GPU) has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional graphics pipelines consist of separate programmable stages of vertex processors executing vertex shader programs and pixel-fragment processors executing pixel shader programs. (Montrym and Moreton provide additional background on the traditional graphics processor architecture.1)

NVIDIA's Tesla architecture, introduced in November 2006 in the GeForce 8800 GPU, unifies the vertex and pixel processors and extends them, enabling high-performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA2–4) parallel programming model and development tools. The Tesla unified graphics and computing architecture is available in a scalable family of GeForce 8-series GPUs and Quadro GPUs for laptops, desktops, workstations, and servers. It also provides the processing architecture for the Tesla GPU computing platforms introduced in 2007 for high-performance computing.

In this article, we discuss the requirements that drove the unified graphics and parallel computing processor architecture, describe the Tesla architecture, and explain how it is enabling widespread deployment of parallel computing and graphics applications.

The road to unification
The first GPU was the GeForce 256, introduced in 1999. It contained a fixed-function 32-bit floating-point vertex transform and lighting processor and a fixed-function integer pixel-fragment pipeline, which were programmed with OpenGL and the Microsoft DX7 API.5 In 2001, the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 and OpenGL.5,6 The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL.7,8 The GeForce FX added 32-bit floating-point pixel-fragment processors. The Xbox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor.9
Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio.10 All these factors influenced the decision to design a unified architecture.

A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API.10 They developed the GPU's computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture
The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn't resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

At the highest level, the GPU's scalable streaming processor array (SPA) performs all the GPU's programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream. The vertex work distribution block distributes vertex work packets to the various TPCs in the SPA. The TPCs execute vertex shader programs and (if enabled) geometry shader programs. The resulting output data is written to on-chip buffers. These buffers then pass their results to the viewport/clip/setup/raster/zcull block to be rasterized into pixel fragments. The pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.

Command processing
The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.

The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.
The work distribution units forward the input assembler's output stream to the array of processors, which execute vertex, geometry, and pixel shader programs, as well as computing programs. The vertex and compute work distribution units deliver work to processors in a round-robin scheme. Pixel work distribution is based on the pixel location.

Streaming processor array
The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures.1 The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Texture/processor cluster
As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Figure 2. Texture/processor cluster (TPC).

Geometry controller
The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for stencil shadow generation and cube map texture generation. Geometry shader output primitives go to later stages for clipping, viewport transformation, and rasterization into pixel fragments.
Streaming multiprocessor
The SM is a unified graphics and computing multiprocessor that executes vertex, geometry, and pixel-fragment shader programs and parallel computing programs. As Figure 3 shows, the SM consists of eight streaming processor (SP) cores, two special-function units (SFUs), a multithreaded instruction fetch and issue unit (MT Issue), an instruction cache, a read-only constant cache, and a 16-Kbyte read/write shared memory.

Figure 3. Streaming multiprocessor (SM).

The shared memory holds graphics input buffers or shared data for parallel computing. To pipeline graphics workloads through the SM, vertex, geometry, and pixel threads have independent input and output buffers. Workloads can arrive and depart independently of thread execution. Geometry threads, which generate variable amounts of output per thread, use separate output buffers.

Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units. The SM uses its two SFU units for transcendental functions and attribute interpolation—the interpolation of pixel attributes from vertex attributes defining a primitive. Each SFU also contains four floating-point multipliers. The SM uses the TPC texture unit as a third execution unit and uses the SMC and ROP units to implement external memory load, store, and atomic accesses. A low-latency interconnect network between the SPs and the shared-memory banks provides shared-memory access.

The GeForce 8800 Ultra clocks the SPs and SFU units at 1.5 GHz, for a peak of 36 Gflops per SM. To optimize power and area efficiency, some SM non-data-path units operate at half the SP clock rate.
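(A quick accounting of that peak, on the usual convention that a multiply-add counts as two floating-point operations; the breakdown is ours, not the article's: the eight SP MAD units give 8 × 2 × 1.5 GHz = 24 Gflops, and the two SFUs' four multipliers give 2 × 4 × 1.5 GHz = 12 Gflops, for a total of 36 Gflops per SM.)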

SM multithreading. A graphics vertex or pixel shader is a program for a single thread that describes how to process a vertex or a pixel. Similarly, a CUDA kernel is a C program for a single thread that describes how one thread computes a result. Graphics and computing applications instantiate many parallel threads to render complex images and compute large result arrays. To dynamically balance shifting vertex and pixel shader thread workloads, the unified SM concurrently executes different thread programs and different types of shader programs.

To efficiently execute hundreds of threads in parallel while running several different programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead.

To support the independent vertex, primitive, pixel, and thread programming model of graphics shading languages and the CUDA C/C++ language, each SM thread has its own thread execution state and can execute an independent code path. Concurrent threads of computing programs can synchronize at a barrier with a single SM instruction. Lightweight thread creation, zero-overhead thread scheduling, and fast barrier synchronization support very fine-grained parallelism efficiently.
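To make the one-program-per-thread model concrete, here is a minimal CUDA sketch of our own (the kernel name, array sizes, and launch configuration are illustrative, not from the article). The kernel is written for a single thread; the application launches one thread per result element.

// Illustrative kernel: each thread computes exactly one output element.
__global__ void scale_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        c[i] = a[i] * 2.0f + b[i];                  // one scalar multiply-add
}

// Host side: launch enough 256-thread blocks to cover all n elements, e.g.
// scale_add<<<(n + 255) / 256, 256>>>(dev_a, dev_b, dev_c, n);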
Single-instruction, multiple-thread. To manage and execute hundreds of threads running several different programs efficiently, the Tesla SM uses a new processor architecture we call single-instruction, multiple-thread (SIMT). The SM's SIMT multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The term warp originates from weaving, the first parallel-thread technology. Figure 4 illustrates SIMT scheduling. The SIMT warp size of 32 parallel threads provides efficiency on plentiful fine-grained pixel threads and computing threads.

Each SM manages a pool of 24 warps, with a total of 768 threads. Individual threads composing a SIMT warp are of the same type and start together at the same program address, but they are otherwise free to branch and execute independently. At each instruction issue time, the SIMT multithreaded instruction unit selects a warp that is ready to execute and issues the next instruction to that warp's active threads. A SIMT instruction is broadcast synchronously to a warp's active parallel threads; individual threads can be inactive due to independent branching or predication.

Figure 4. Single-instruction, multiple-thread (SIMT) warp scheduling.

The SM maps the warp threads to the SP cores, and each thread executes independently with its own instruction address and register state. A SIMT processor realizes full efficiency and performance when all 32 threads of a warp take the same execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads reconverge to the original execution path. The SM uses a branch synchronization stack to manage independent threads that diverge and converge. Branch divergence only occurs within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths. As a result, Tesla architecture GPUs are dramatically more efficient and flexible on branching code than previous-generation GPUs, as their 32-thread warps are much narrower than the SIMD width of prior GPUs.1
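The divergence behavior can be made concrete with a CUDA sketch of our own (illustrative, not from the article): a data-dependent branch may split a warp, while a warp-uniform condition cannot.

// Illustrative kernel contrasting divergent and warp-uniform branches.
__global__ void divergence_demo(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Data-dependent branch: threads of one 32-thread warp may take both
    // paths, so the warp runs each path serially with the other threads
    // disabled, then reconverges.
    if (in[i] > 0.0f)
        out[i] = in[i] * in[i];
    else
        out[i] = 0.0f;

    // Warp-uniform branch: with a block size that is a multiple of 32,
    // (i / 32) is identical for every thread of a warp, so each warp takes
    // a single path and never diverges.
    if ((i / 32) % 2 == 0)
        out[i] += 1.0f;
}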
SIMT architecture is similar to single-instruction, multiple-data (SIMD) design, which applies one instruction to multiple data lanes. The difference is that SIMT applies one instruction to multiple independent threads in parallel, not just to multiple data lanes. A SIMD instruction controls a vector of multiple data lanes together and exposes the vector width to the software, whereas a SIMT instruction controls the execution and branching behavior of one thread.

In contrast to SIMD vector architectures, SIMT enables programmers to write thread-level parallel code for independent threads as well as data-parallel code for coordinated threads. For program correctness, programmers can essentially ignore SIMT execution attributes such as warps; however, they can achieve substantial performance improvements by writing code that seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional codes: Programmers can safely ignore cache line size when designing for correctness but must consider it in the code structure when designing for peak performance. SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence.

SIMT warp scheduling. The SIMT approach of scheduling independent warps is simpler than previous GPU architectures' complex scheduling. A warp consists of up to 32 threads of the same type—vertex, geometry, pixel, or compute. The basic unit of pixel-fragment shader processing is the 2 × 2 pixel quad. The SM controller groups eight pixel quads into a warp of 32 threads. It similarly groups vertices and primitives into warps and packs 32 computing threads into a warp. The SIMT design shares the SM instruction fetch and issue unit efficiently across 32 threads but requires a full warp of active threads for full performance efficiency.

As a unified graphics processor, the SM schedules and executes multiple warp types concurrently—for example, concurrently executing vertex and pixel warps. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At each cycle, it selects one of the 24 warps to execute a SIMT warp instruction, as Figure 4 shows. An issued warp instruction executes as two sets of 16 threads over four processor cycles. The SP cores and SFU units execute instructions independently, and by issuing instructions between them on alternate cycles, the scheduler can keep both fully occupied.

Implementing zero-overhead warp scheduling for a dynamic mix of different warp programs and program types was a challenging design problem. A scoreboard qualifies each warp for issue each cycle. The instruction scheduler prioritizes all ready warps and selects the one with the highest priority for issue. Prioritization considers warp type, instruction type, and "fairness" to all warps executing in the SM.

SM instructions. The Tesla SM executes scalar instructions, unlike previous GPU vector instruction architectures. Shader programs are becoming longer and more scalar, and it is increasingly difficult to fully occupy even two components of the prior four-component vector architecture. Previous architectures employed vector packing—combining sub-vectors of work to gain efficiency—but that complicated the scheduling hardware as well as the compiler. Scalar instructions are simpler and compiler friendly. Texture instructions remain vector based, taking a source coordinate vector and returning a filtered color vector.

High-level graphics and computing-language compilers generate intermediate instructions, such as DX10 vector or PTX scalar instructions,10,2 which are then optimized and translated to binary GPU instructions. The optimizer readily expands DX10 vector instructions to multiple Tesla SM scalar instructions. PTX scalar instructions optimize to Tesla SM scalar instructions about one to one. PTX provides a stable target ISA for compilers and provides compatibility over several generations of GPUs with evolving binary instruction set architectures. Because the intermediate languages use virtual registers, the optimizer analyzes data dependencies and allocates real registers. It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points.

Instruction set architecture. The Tesla SM has a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers. Floating-point instructions provide source operand modifiers for negation and absolute value. Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. Attribute interpolation instructions provide efficient generation of pixel attributes. Bitwise operators include shift left, shift right, logic operators, and move. Control flow includes branch, call, return, trap, and barrier synchronization.

The floating-point and integer instructions can also set per-thread status flags for zero, negative, carry, and overflow, which the thread program can use for conditional branching.

Memory access instructions. The texture instruction fetches and filters texture samples from memory via the texture unit. The ROP unit writes pixel-fragment output to memory.

To support computing and C/C++ language needs, the Tesla SM implements memory load/store instructions in addition to graphics texture fetch and pixel output. Memory load/store instructions use integer byte addressing with register-plus-offset address arithmetic to facilitate conventional compiler code optimizations.
For computing, the load/store instructions access three read/write memory spaces:

- local memory for per-thread, private, temporary data (implemented in external DRAM);
- shared memory for low-latency access to data shared by cooperating threads in the same SM; and
- global memory for data shared by all threads of a computing application (implemented in external DRAM).

The memory instructions load-global, store-global, load-shared, store-shared, load-local, and store-local access global, shared, and local memory. Computing programs use the fast barrier synchronization instruction to synchronize threads within the SM that communicate with each other via shared and global memory.
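A CUDA sketch of our own showing how source-level constructs conventionally map onto these spaces and onto the barrier (illustrative; whether a per-thread variable lives in a register or spills to DRAM-backed local memory is the compiler's decision):

// Illustrative kernel; launch with 256-thread blocks to match the tile size.
__global__ void reverse_tile(const float *g_in, float *g_out)
{
    __shared__ float tile[256];     // shared memory: on-chip, per-SM

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = g_in[i];              // load-global (register, or local if spilled)
    tile[threadIdx.x] = v;          // store-shared

    __syncthreads();                // the fast SM-wide barrier instruction

    // load-shared from another thread's slot, then store-global:
    // each 256-element output tile is the reverse of the input tile.
    g_out[i] = tile[blockDim.x - 1 - threadIdx.x];
}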
To improve memory bandwidth and reduce overhead, the local and global load/store instructions coalesce individual parallel thread accesses from the same warp into fewer memory block accesses. The addresses must fall in the same block and meet alignment criteria. Coalescing memory requests boosts performance significantly over separate requests. The large thread count, together with support for many outstanding load requests, helps cover load-to-use latency for local and global memory implemented in external DRAM.
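A sketch of our own contrasting access patterns (illustrative): with 4-byte elements, consecutive per-thread addresses fall in the same aligned block and can be merged into a few wide transactions, whereas a large per-thread stride scatters a warp's addresses across many blocks.

// Coalesced: thread k of a warp touches element base + k.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Poorly coalesced: neighboring threads touch addresses stride * 4 bytes
// apart, defeating the merge.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}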
Special-function unit. The SFU supports computation of both transcendental functions and planar attribute interpolation.11 A traditional vertex or pixel shader design contains a functional unit to compute transcendental functions. Pixels also need an attribute-interpolating unit to compute the per-pixel attribute values at the pixel's (x, y) location, given the attribute values at the primitive's vertices.
   For functional evaluation, we use quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2(x), 2^x, and sin/cos functions. Table 1 shows the accuracy of the function estimates. The SFU generates one 32-bit floating-point result per cycle.
Table 1. Function approximation statistics.

 Function     Input interval    Accuracy (good bits)    ULP* error    % exactly rounded    Monotonic
 1/x          [1, 2)            24.02                   0.98          87                   Yes
 1/sqrt(x)    [1, 4)            23.40                   1.52          78                   Yes
 2^x          [0, 1)            22.51                   1.41          74                   Yes
 log2(x)      [1, 2)            22.57                   N/A**         N/A                  Yes
 sin/cos      [0, π/2)          22.47                   N/A           N/A                  No

 * ULP: unit-in-the-last-place.
 ** N/A: not applicable.
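   The mechanics of the quadratic evaluation can be sketched in C. The segmentation, table size, and fitting below are our illustrative choices: a plain three-point quadratic fit stands in for the enhanced minimax fit, and the production coefficient values are not reproduced here. The operand's leading mantissa bits index the coefficient tables; the remaining bits form the interpolation variable h.

    #include <stdio.h>

    #define SEG_BITS 6                /* 64 segments on [1, 2); illustrative */
    #define SEGS (1 << SEG_BITS)

    static double c0[SEGS], c1[SEGS], c2[SEGS];

    /* Build per-segment quadratic coefficients for f(x) = 1/x by fitting
       a parabola through three points of each segment. */
    static void fit_reciprocal(void)
    {
        for (int i = 0; i < SEGS; i++) {
            double x0 = 1.0 + (double)i / SEGS;   /* segment start */
            double w  = 1.0 / SEGS;               /* segment width */
            double f0 = 1.0 / x0;
            double fm = 1.0 / (x0 + 0.5 * w);
            double f1 = 1.0 / (x0 + w);
            c0[i] = f0;
            c1[i] = 4.0 * fm - 3.0 * f0 - f1;
            c2[i] = 2.0 * (f0 - 2.0 * fm + f1);
        }
    }

    static double recip_approx(double x)          /* x in [1, 2) */
    {
        double t = (x - 1.0) * SEGS;
        int    i = (int)t;                        /* upper bits: table index */
        double h = t - i;                         /* lower bits: h in [0, 1) */
        return c0[i] + (c1[i] + c2[i] * h) * h;   /* one quadratic evaluation */
    }

    int main(void)
    {
        fit_reciprocal();
        printf("approx 1/1.37 = %.8f, exact = %.8f\n",
               recip_approx(1.37), 1.0 / 1.37);
        return 0;
    }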

   The SFU also supports attribute interpolation, to enable accurate interpolation of attributes such as color, depth, and texture coordinates. The SFU must interpolate these attributes in the (x, y) screen space to determine the values of the attributes at each pixel location. We express the value of a given attribute U in an (x, y) plane in plane equations of the following form:

   U(x, y) = (A_U × x + B_U × y + C_U) / (A_W × x + B_W × y + C_W)

where A, B, and C are interpolation parameters associated with each attribute U, and W is related to the distance of the pixel from the viewer for perspective projection. The attribute interpolation hardware in the SFU is fully pipelined, and it can interpolate four samples per cycle.
   In a shader program, the SFU can generate perspective-corrected attributes as follows:

   • Interpolate 1/W, and invert to form W.
   • Interpolate U/W.
   • Multiply U/W by W to form perspective-correct U.
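   These three steps can be written out directly. The following C sketch is our illustration (the types and names are ours, not the hardware's); both 1/W and U/W are planar in screen space, so each is a single plane-equation evaluation:

    typedef struct { float A, B, C; } Plane;   /* P(x, y) = A*x + B*y + C */

    static float eval_plane(Plane p, float x, float y)
    {
        return p.A * x + p.B * y + p.C;
    }

    /* The three SFU steps: interpolate 1/W and invert, interpolate U/W,
       then multiply by W to recover perspective-correct U. */
    static float perspective_correct(Plane u_over_w, Plane one_over_w,
                                     float x, float y)
    {
        float w  = 1.0f / eval_plane(one_over_w, x, y);   /* step 1 */
        float uw = eval_plane(u_over_w, x, y);            /* step 2 */
        return uw * w;                                    /* step 3 */
    }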
SM controller. The SMC controls multiple SMs, arbitrating the shared texture unit, load/store path, and I/O path. The SMC serves three graphics workloads simultaneously: vertex, geometry, and pixel. It packs each of these input types into the warp width, initiating shader processing, and unpacks the results.
   Each input type has independent I/O paths, but the SMC is responsible for load balancing among them. The SMC supports static and dynamic load balancing based on driver-recommended allocations, current allocations, and relative difficulty of additional resource allocation. Load balancing of the workloads was one of the more challenging design problems due to its impact on overall SPA efficiency.

Texture unit
   The texture unit processes one group of four threads (vertex, geometry, pixel, or compute) per cycle. Texture instruction sources are texture coordinates, and the outputs are filtered samples, typically a four-component (RGBA) color. Texture is a separate unit external to the SM, connected via the SMC. The issuing SM thread can continue execution until a data dependency stall.
   Each texture unit has four texture address generators and eight filter units, for a peak GeForce 8800 Ultra rate of 38.4 gigabilerps/s (a bilerp is a bilinear interpolation of four samples). Each unit supports full-speed 2:1 anisotropic filtering, as well as high-dynamic-range (HDR) 16-bit and 32-bit floating-point data format filtering.
   The texture unit is deeply pipelined. Although it contains a cache to capture filtering locality, it streams hits mixed with misses without stalling.
Rasterization
   Geometry primitives output from the SMs go in their original round-robin input order to the viewport/clip/setup/raster/zcull block. The viewport and clip units clip the primitives to the standard view frustum and to any enabled user clip planes. They transform postclipping vertices into screen (pixel) space and reject whole primitives outside the view volume as well as back-facing primitives.
   Surviving primitives then go to the setup unit, which generates edge equations for the rasterizer. Attribute plane equations are also generated for linear interpolation of pixel attributes in the pixel shader. A coarse-rasterization stage generates all pixel tiles that are at least partially inside the primitive.
   The zcull unit maintains a hierarchical z surface, rejecting pixel tiles if they are conservatively known to be occluded by previously drawn pixels. The rejection rate is up to 256 pixels per clock. The screen is subdivided into tiles; each TPC processes a predetermined subset. The pixel tile address therefore selects the destination TPC. Pixel tiles that survive zcull then go to a fine-rasterization stage that generates detailed coverage information and depth values for the pixels.
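   Edge equations of the kind the setup unit produces can be sketched in a few lines of C. This is our illustration of the general technique, not the hardware's fixed-point formulation: a sample is covered when it lies on the interior side of all three edges.

    typedef struct { float A, B, C; } Edge;   /* E(x, y) = A*x + B*y + C */

    /* Returns nonzero if sample (x, y) is inside the triangle whose three
       edge equations are e[0..2], with edges oriented so interior points
       yield non-negative values. Illustrative only. */
    static int covered(const Edge e[3], float x, float y)
    {
        for (int i = 0; i < 3; i++)
            if (e[i].A * x + e[i].B * y + e[i].C < 0.0f)
                return 0;
        return 1;
    }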
   OpenGL and Direct3D require that a depth test be performed after the pixel shader has generated final color and depth values. When possible, for certain combinations of API state, the Tesla GPU performs the depth test and update ahead of the fragment shader, possibly saving thousands of cycles of processing time, without violating the API-mandated semantics.
   The SMC assembles surviving pixels into warps to be processed by an SM running the current pixel shader. When the pixel shader has finished, the pixels are optionally depth tested if this was not done ahead of the shader. The SMC then sends surviving pixels and associated data to the ROP.

Raster operations processor
   Each ROP is paired with a specific memory partition. The TPCs feed data to the ROPs via an interconnection network. ROPs handle depth and stencil testing and updates, and color blending and updates. The memory controller uses lossless color compression (up to 8:1) and depth compression (up to 8:1) to reduce bandwidth. Each ROP has a peak rate of four pixels per clock and supports 16-bit and 32-bit floating-point HDR formats. ROPs support double-rate-depth processing when color writes are disabled.
   Each memory partition is 64 bits wide and supports double-data-rate DDR2 and graphics-oriented GDDR3 protocols at up to 1 GHz, yielding a bandwidth of about 16 Gbytes/s.
   Antialiasing support includes up to 16× multisampling and supersampling. HDR formats are fully supported. Both algorithms support 1, 2, 4, 8, or 16 samples per pixel and generate a weighted average of the samples to produce the final pixel color. Multisampling executes the pixel shader once to generate a color shared by all pixel samples, whereas supersampling runs the pixel shader once per sample. In both cases, depth values are correctly evaluated for each sample, as required for correct interpenetration of primitives.
   Because multisampling runs the pixel shader once per pixel (rather than once per sample), multisampling has become the most popular antialiasing method. Beyond four samples, however, storage cost increases faster than image quality improves, especially with HDR formats. For example, a single 1,600 × 1,200 pixel surface, storing 16 four-component, 16-bit floating-point samples, requires 1,600 × 1,200 × 16 × (64 bits of color + 32 bits of depth) = 368 Mbytes.
   For the vast majority of edge pixels, two colors are enough; what matters is more-detailed coverage information. The coverage-sampling antialiasing (CSAA) algorithm provides low-cost coverage samples, allowing upward scaling. By computing and storing Boolean coverage at up to 16 samples and compressing redundant color, depth, and stencil information into the memory footprint and bandwidth of four or eight samples, 16× antialiasing quality can be achieved at 4× antialiasing performance. CSAA is compatible with existing rendering
techniques including HDR and stencil algorithms. Edges defined by the intersection of interpenetrating polygons are rendered at the stored-sample-count quality (4× or 8×). Table 2 summarizes the storage requirements of the three algorithms.

Table 2. Comparison of antialiasing modes.

 Feature                        Brute-force supersampling    Multisampling     Coverage sampling
 Quality level                  1×    4×    16×              1×    4×    16×   1×    4×    16×
 Texture and shader samples     1     4     16               1     1     1     1     1     1
 Stored color and z samples     1     4     16               1     4     16    1     4     4
 Coverage samples               1     4     16               1     4     16    1     4     16
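   The storage arithmetic is easy to check in code. A small C sketch of ours (the helper name is illustrative) reproduces the 368-Mbyte example and shows the CSAA saving:

    #include <stdio.h>

    /* Bytes of sample storage for an antialiased surface: each stored
       sample carries color plus depth. Illustrative helper, not from
       the article. */
    static double surface_mbytes(int width, int height, int stored_samples,
                                 int color_bits, int depth_bits)
    {
        double bits = (double)width * height * stored_samples
                      * (color_bits + depth_bits);
        return bits / 8.0 / 1.0e6;   /* decimal megabytes */
    }

    int main(void)
    {
        /* The article's example: 1,600 x 1,200 with 16 stored FP16 RGBA
           plus depth samples, about 368 Mbytes. */
        printf("16x supersampled HDR: %.1f Mbytes\n",
               surface_mbytes(1600, 1200, 16, 64, 32));
        /* CSAA stores only 4 color/z samples for 16 coverage samples. */
        printf("16x CSAA (4 stored):  %.1f Mbytes\n",
               surface_mbytes(1600, 1200, 4, 64, 32));
        return 0;
    }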

Memory and interconnect
   The DRAM memory data bus width is 384 pins, arranged in six independent partitions of 64 pins each. Each partition owns 1/6 of the physical address space. The memory partition units directly enqueue requests. They arbitrate among hundreds of in-flight requests from the parallel stages of the graphics and computation pipelines. The arbitration seeks to maximize total DRAM transfer efficiency, which favors grouping related requests by DRAM bank and read/write direction, while minimizing latency as far as possible. The memory controllers support a wide range of DRAM clock rates, protocols, device densities, and data bus widths.

Interconnection network. A single hub unit routes requests to the appropriate partition from the nonparallel requesters (PCI-Express, host and command front end, input assembler, and display). Each memory partition has its own depth and color ROP units, so ROP memory traffic originates locally. Texture and load/store requests, however, can occur between any TPC and any memory partition, so an interconnection network routes requests and responses.

Memory management unit. All processing engines generate addresses in a virtual address space. A memory management unit performs virtual-to-physical translation. Hardware reads the page tables from local memory to respond to misses on behalf of a hierarchy of translation look-aside buffers spread out among the rendering engines.
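   The article does not give the interleaving granularity, but a partition map of the following general shape makes "each partition owns 1/6 of the physical address space" concrete. This is a hypothetical sketch of ours; the 256-byte stride is invented for illustration:

    #include <stdint.h>

    enum { NUM_PARTITIONS = 6, INTERLEAVE_BYTES = 256 };  /* stride is made up */

    /* Map a physical address to one of the six memory partitions by
       interleaving fixed-size blocks across partitions. */
    static int partition_of(uint64_t phys_addr)
    {
        return (int)((phys_addr / INTERLEAVE_BYTES) % NUM_PARTITIONS);
    }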
Parallel computing architecture
   The Tesla scalable parallel computing architecture enables the GPU processor array to excel in throughput computing, executing high-performance computing applications as well as graphics applications. Throughput applications have several properties that distinguish them from CPU serial applications:

   • extensive data parallelism—thousands of computations on independent data elements;
   • modest task parallelism—groups of threads execute the same program, and different groups can run different programs;
   • intensive floating-point arithmetic;
   • latency tolerance—performance is the amount of work completed in a given time;
   • streaming data flow—requires high memory bandwidth with relatively little data reuse;
   • modest inter-thread synchronization and communication—graphics threads do not communicate, and parallel computing applications require limited synchronization and communication.

   GPU parallel performance on throughput problems has doubled every 12 to 18 months, pulled by the insatiable demands of the 3D game market. Now, Tesla GPUs in laptops, desktops, workstations, and systems are programmable in C with CUDA tools, using a simple parallel programming model.
Figure 5. Decomposing result data into a grid of blocks partitioned into elements to be computed in parallel.

Data-parallel problem decomposition
   To map a large computing problem effectively to a highly parallel processing architecture, the programmer or compiler decomposes the problem into many small problems that can be solved in parallel. For example, the programmer partitions a large result data array into blocks and further partitions each block into elements, so that the result blocks can be computed independently in parallel, and the elements within each block can be computed cooperatively in parallel. Figure 5 shows the decomposition of a result data array into a 3 × 2 grid of blocks, in which each block is further decomposed into a 5 × 3 array of elements. The two-level parallel decomposition maps naturally to the Tesla architecture: Parallel SMs compute result blocks, and parallel threads compute result elements.
   The programmer or compiler writes a program that computes a sequence of result grids, partitioning each result grid into coarse-grained result blocks that are computed independently in parallel. The program computes each result block with an array of fine-grained parallel threads, partitioning the work among threads that compute result elements.

Cooperative thread array or thread block
   Unlike the graphics programming model, which executes parallel shader threads independently, parallel-computing programming models require that parallel threads synchronize, communicate, share data, and cooperate to efficiently compute a result. To manage large numbers of concurrent threads that can cooperate, the Tesla computing architecture introduces the cooperative thread array (CTA), called a thread block in CUDA terminology.
   A CTA is an array of concurrent threads that execute the same thread program and can cooperate to compute a result. A CTA consists of 1 to 512 concurrent threads, and each thread has a unique thread ID (TID), numbered 0 through m. The programmer declares the 1D, 2D, or 3D CTA shape and dimensions in threads. The TID has one, two, or three dimension indices. Threads of a CTA can share data in global or shared memory and can synchronize with the barrier instruction. CTA thread programs use their TIDs to select work and index shared data arrays. Multidimensional TIDs can eliminate integer divide and remainder operations when indexing arrays.
   Each SM executes up to eight CTAs concurrently, depending on CTA resource demands. The programmer or compiler declares the number of threads, registers, shared memory, and barriers required by the CTA program. When an SM has sufficient available resources, the SMC creates the CTA and assigns TID numbers to each thread. The SM executes the CTA threads concurrently as SIMT warps of 32 parallel threads.
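   In CUDA terms, the Figure 5 decomposition might look like the following sketch of ours (the kernel name and the element computation are stand-ins). Each CTA covers one result block, and each thread uses its multidimensional TID to pick one element:

    // A 3 x 2 grid of CTAs, each a 5 x 3 array of threads, one result
    // element per thread (illustrative computation only).
    __global__ void computeResult(float *result, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // element column
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // element row
        if (x < width && y < height)
            result[y * width + x] = (float)(x + y);     // stand-in work
    }

    // Host-side launch in the shapes of Figure 5:
    //   dim3 grid(3, 2);    // 3 x 2 grid of CTAs (thread blocks)
    //   dim3 block(5, 3);   // each CTA is a 5 x 3 array of threads
    //   computeResult<<<grid, block>>>(d_result, 15, 6);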
Figure 6. Nested granularity levels: thread (a), cooperative thread array (b), and grid (c).
 These have corresponding memory-sharing levels: local per-thread, shared per-CTA, and
 global per-application.

CTA grids
   To implement the coarse-grained block and grid decomposition of Figure 5, the GPU creates CTAs with unique CTA ID and grid ID numbers. The compute work distributor dynamically balances the GPU workload by distributing a stream of CTA work to SMs with sufficient available resources.
   To enable a compiled binary program to run unchanged on large or small GPUs with any number of parallel SM processors, CTAs execute independently and compute result blocks independently of other CTAs in the same grid. Sequentially dependent application steps map to two sequentially dependent grids. The dependent grid waits for the first grid to complete; then the CTAs of the dependent grid read the result blocks written by the first grid.

Parallel granularity
   Figure 6 shows levels of parallel granularity in the GPU computing model. The three levels are

   • thread—computes result elements selected by its TID;
   • CTA—computes result blocks selected by its CTA ID;
   • grid—computes many result blocks, and sequential grids compute sequentially dependent application steps.

   Higher levels of parallelism use multiple GPUs per CPU and clusters of multi-GPU nodes.
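   Two sequentially dependent grids correspond to two kernel launches; in this CUDA model, launches issued to the device execute in order, so the second grid can safely read what the first wrote. A minimal sketch with hypothetical kernel names:

    // First application step: grid 1 writes the result blocks.
    __global__ void produce(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = 2.0f * i;
    }

    // Second application step: grid 2 reads grid 1's results.
    __global__ void consume(const float *data, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = data[i] + 1.0f;
    }

    // Host code:
    //   produce<<<nBlocks, nThreads>>>(d_data, n);
    //   consume<<<nBlocks, nThreads>>>(d_data, d_out, n);  // runs after grid 1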
Parallel memory sharing
   Figure 6 also shows levels of parallel read/write memory sharing:

   • local—each executing thread has a private per-thread local memory for register spill, stack frame, and addressable temporary variables;
   • shared—each executing CTA has a per-CTA shared memory for access to data shared by threads in the same CTA;
   • global—sequential grids communicate and share large data sets in global memory.

   Threads communicating in a CTA use the fast barrier synchronization instruction to wait for writes to shared or global memory to complete before reading data written by other threads in the CTA. The load/store memory system uses a relaxed memory order that preserves the order of reads and writes to the same address from the same issuing thread and from the viewpoint of CTA threads coordinating with the barrier synchronization instruction. Sequentially dependent grids use a global intergrid synchronization barrier between grids to ensure global read/write ordering.

Transparent scaling of GPU computing
   Parallelism varies widely over the range of GPU products developed for various market segments. A small GPU might have one SM with eight SP cores, while a large GPU might have many SMs totaling hundreds of SP cores.
   The GPU computing architecture transparently scales parallel application performance with the number of SMs and SP cores. A GPU computing program executes on any size of GPU without recompiling, and is insensitive to the number of SM multiprocessors and SP cores. The program does not know or care how many processors it uses.
   The key is decomposing the problem into independently computed blocks as described earlier. The GPU compute work distribution unit generates a stream of CTAs and distributes them to available SMs to compute each independent block. Scalable programs do not communicate among CTA blocks of the same grid; the same grid result is obtained if the CTAs execute in parallel on many cores, sequentially on one core, or partially in parallel on a few cores.

CUDA programming model
   CUDA is a minimal extension of the C and C++ programming languages. A programmer writes a serial program that calls parallel kernels, which can be simple functions or full programs. The CUDA program executes serial code on the CPU and executes parallel kernels across a set of parallel threads on the GPU. The programmer organizes these threads into a hierarchy of thread blocks and grids as described earlier. (A CUDA thread block is a GPU CTA.)
   Figure 7 shows a CUDA program executing a series of parallel kernels on a heterogeneous CPU–GPU system. KernelA and KernelB execute on the GPU as grids of nBlkA and nBlkB thread blocks (CTAs), which instantiate nTidA and nTidB threads per CTA.
   The CUDA compiler nvcc compiles an integrated application C/C++ program containing serial CPU code and parallel GPU kernel code. The CUDA runtime API manages the GPU as a computing device that acts as a coprocessor to the host CPU with its own memory system.
   The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model—it expresses parallelism explicitly, and each kernel executes on a fixed number of threads. However, CUDA is more flexible than most SPMD implementations because each kernel call dynamically creates a new grid with the right number of thread blocks and threads for that application step.
   CUDA extends C/C++ with the declaration specifier keywords __global__ for kernel entry functions, __device__ for global variables, and __shared__ for shared-memory variables. A CUDA kernel's text is simply a C function for one sequential thread. The built-in variable threadIdx.{x, y, z} provides the thread ID within a thread block (CTA), while blockIdx.{x, y, z} provides the CTA ID within a grid. The extended function call syntax kernel<<<nBlocks, nThreads>>>(args); invokes a parallel kernel function on a grid of nBlocks thread blocks, where each block instantiates nThreads concurrent threads, and args are ordinary arguments to function kernel().
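   A small kernel of our own ties the keywords together: it stages data in __shared__ memory, then issues __syncthreads(), the CUDA barrier intrinsic that compiles to the fast SM barrier instruction, before any thread reads another thread's slot. A block size of at most 256 threads is assumed:

    __global__ void reverseBlock(const float *in, float *out)
    {
        __shared__ float tile[256];        // per-CTA shared memory
        int t = threadIdx.x;               // TID within the CTA

        tile[t] = in[blockIdx.x * blockDim.x + t];  // each thread writes a slot
        __syncthreads();                   // barrier: all writes now visible

        // Safe to read a slot written by another thread of the same CTA.
        out[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
    }

    // Launch with up to 256 threads per block, for example:
    //   reverseBlock<<<nBlocks, 256>>>(d_in, d_out);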
Figure 7. CUDA program sequence of kernel A followed by kernel B on a heterogeneous
 CPU–GPU system.

   Figure 8 shows an example serial C program and a corresponding CUDA C program. The serial C program uses two nested loops to iterate over each array index and compute c[idx] = a[idx] + b[idx] each trip. The parallel CUDA C program has no loops. It uses parallel threads to compute the same array indices in parallel, and each thread computes only one sum.

 Figure 8. Serial C (a) and CUDA C (b) examples of programs that add arrays.
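   Figure 8's listings are not reproduced in this text, but a reconstruction in their spirit (our sketch, with illustrative names and shapes) looks like this:

    // (a) Serial C: two nested loops, one trip per element.
    void addMatrix(const float *a, const float *b, float *c, int N)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                int idx = j * N + i;
                c[idx] = a[idx] + b[idx];
            }
    }

    // (b) CUDA C: no loops; each thread computes one sum.
    __global__ void addMatrixG(const float *a, const float *b,
                               float *c, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            c[j * N + i] = a[j * N + i] + b[j * N + i];
    }

    // Host launch (illustrative):
    //   dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    //   addMatrixG<<<grid, block>>>(d_a, d_b, d_c, N);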

Scalability and performance
   The Tesla unified architecture is designed for scalability. Varying the number of SMs, TPCs, ROPs, caches, and memory partitions provides the right mix for different performance and cost targets in the value, mainstream, enthusiast, and professional market segments. NVIDIA's Scalable Link Interconnect (SLI) enables multiple GPUs to act together as one, providing further scalability.
                                                                                                                                                                         N     384-pin DRAM interface;
                                                                                                                                                                         N     1.08-GHz DRAM clock;
                                                                                                                                                                         N     104-Gbyte/s peak bandwidth; and
                                                                                                                                                                         N     typical power of 150 W at 1.3 V.

                                                                                                                                                                    T     he Tesla architecture is the first
                                                                                                                                                                          ubiquitous supercomputing platform.
                                                                                                                                                                    NVIDIA has shipped more than 50 million
                                                                                                                                                                    Tesla-based systems. This wide availability,
                                                                                                                                                                    coupled with C programmability and the
                                                                                                                                                                    CUDA software development environment,
                                                                                                                                                                    enables broad deployment of demanding
                                                                                                                                                                    parallel-computing and graphics applications.
                                                                                                                                                                       With future increases in transistor density,
                                                                                                                                                                    the architecture will readily scale processor
                                                                                                                                                                    parallelism, memory partitions, and overall
                                                                                                                                                                    performance. Increased number of multipro-
                                                                                                                                                                    cessors and memory partitions will support
                                                                                                                                                                    larger data sets and richer graphics and
                                                                                                                                                                    computing, without a change to the pro-
                                                                                                                                                                    gramming model.
                                                                                                                                                                       We continue to investigate improved sched-
                       Figure 9. GeForce 8800 Ultra die layout.                                                                                                     uling and load-balancing algorithms for the
                                                                                                                                                                    unified processor. Other areas of improvement
                                                                             market segments. NVIDIA’s Scalable Link                                                are enhanced scalability for derivative products,
                                                                             Interconnect (SLI) enables multiple GPUs                                               reduced synchronization and communication
                                                                             to act together as one, providing further                                              overhead for compute programs, new graphics
                                                                             scalability.                                                                           features, increased realized memory band-
                                                                                CUDA C/C++ applications executing on                                                width, and improved power efficiency.        MICRO
                                                                             Tesla computing platforms, Quadro work-
                                                                             stations, and GeForce GPUs deliver com-                                                Acknowledgments
                                                                             pelling computing performance on a range                                                  We thank the entire NVIDIA GPU deve-
                                                                             of large problems, including more than                                                 lopment team for their extraordinary effort
                                                                             1003 speedups on molecular modeling,                                                   in bringing Tesla-based GPUs to market.
                                                                             more than 200 Gflops on n-body problems,                                               ................................................................................................
                                                                             and real-time 3D magnetic-resonance im-                                                References
                                                                             aging.12–14 For graphics, the GeForce 8800                                               1. J. Montrym and H. Moreton, ‘‘The GeForce
                                                                             GPU delivers high performance and image                                                          6800,’’ IEEE Micro, vol. 25, no. 2, Mar./
                                                                             quality for the most demanding games.15                                                          Apr. 2005, pp. 41-51.
Figure 9 shows the GeForce 8800 Ultra physical die layout implementing the Tesla architecture shown in Figure 1. Implementation specifics include

• 681 million transistors, 470 mm²;
• TSMC 90-nm CMOS;
• 128 SP cores in 16 SMs;
• 12,288 processor threads;
• 1.5-GHz processor clock rate;
• peak 576 Gflops in processors (see the arithmetic note after this list);
• 768-Mbyte GDDR3 DRAM.
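As a consistency check on these figures, the peak rate follows from the core count and clock if each SP is credited with three flops per cycle, a multiply-add (two flops) dual-issued with a multiply (one flop):

   128 SP cores × 3 flops/cycle × 1.5 GHz = 576 Gflops.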
Ongoing work targets reduced overhead for compute programs, new graphics features, increased realized memory bandwidth, and improved power efficiency.   MICRO

Acknowledgments
We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market.

References
 1. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, Mar./Apr. 2005, pp. 41-51.
 2. CUDA Technology, NVIDIA, 2007; http://www.nvidia.com/CUDA.
 3. CUDA Programming Guide 1.1, NVIDIA, 2007; http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
 4. J. Nickolls, I. Buck, K. Skadron, and M. Garland, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
 5. DX Specification, Microsoft; http://msdn.microsoft.com/directx.
 6. E. Lindholm, M.J. Kilgard, and H. Moreton, "A User-Programmable Vertex Engine," Proc. 28th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 01), ACM Press, 2001, pp. 149-158.
 7. G. Elder, "Radeon 9700," Eurographics/Siggraph Workshop Graphics Hardware, Hot 3D Session, 2002; http://www.graphicshardware.org/previous/www_2002/presentations/Hot3D-RADEON9700.ppt.
 8. Microsoft DirectX 9 Programmable Graphics Pipeline, Microsoft Press, 2003.
 9. J. Andrews and N. Baker, "Xbox 360 System Architecture," IEEE Micro, vol. 26, no. 2, Mar./Apr. 2006, pp. 25-37.
10. D. Blythe, "The Direct3D 10 System," ACM Trans. Graphics, vol. 25, no. 3, July 2006, pp. 724-734.
11. S.F. Oberman and M.Y. Siu, "A High-Performance Area-Efficient Multifunction Interpolator," Proc. 17th IEEE Symp. Computer Arithmetic (Arith-17), IEEE Press, 2005, pp. 272-279.
12. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618-2640.
13. L. Nyland, M. Harris, and J. Prins, "Fast N-Body Simulation with CUDA," GPU Gems 3, H. Nguyen, ed., Addison-Wesley, 2007, pp. 677-695.
14. S.S. Stone et al., "How GPUs Can Improve the Quality of Magnetic Resonance Imaging," Proc. 1st Workshop on General Purpose Processing on Graphics Processing Units, 2007; http://www.gigascale.org/pubs/1175.html.
15. A.L. Shimpi and D. Wilson, "NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10," AnandTech, Nov. 2006; http://www.anandtech.com/video/showdoc.aspx?i=2870.

Erik Lindholm is a distinguished engineer at NVIDIA, working in the architecture group. His research interests include graphics processor design and parallel graphics architectures. Lindholm has an MS in electrical engineering from the University of British Columbia.

John Nickolls is director of GPU computing architecture at NVIDIA. His interests include parallel processing systems, languages, and architectures. Nickolls has a BS in electrical engineering and computer science from the University of Illinois and MS and PhD degrees in electrical engineering from Stanford University.

Stuart Oberman is a design manager in the GPU hardware group at NVIDIA. His research interests include computer arithmetic, processor design, and parallel architectures. Oberman has a BS in electrical engineering from the University of Iowa and MS and PhD degrees in electrical engineering from Stanford University. He is a senior member of the IEEE.

John Montrym is a chief architect at NVIDIA, where he has worked in the development of several GPU product families. His research interests include graphics processor design, parallel graphics architectures, and hardware-software interfaces. Montrym has a BS in electrical engineering from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Erik Lindholm or John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; elindholm@nvidia.com or jnickolls@nvidia.com.

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.