An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS

 Sriram R. Vangal, Member, IEEE, Jason Howard, Gregory Ruhl, Member, IEEE, Saurabh Dighe, Member, IEEE,
    Howard Wilson, James Tschanz, Member, IEEE, David Finan, Arvind Singh, Member, IEEE, Tiju Jacob,
      Shailendra Jain, Vasantha Erraguntla, Member, IEEE, Clark Roberts, Yatin Hoskote, Member, IEEE,
                               Nitin Borkar, and Shekhar Borkar, Member, IEEE

Abstract—This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8 × 10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.

Index Terms—CMOS digital integrated circuits, crossbar router and network-on-chip (NoC), floating-point unit, interconnection, leakage reduction, MAC, multiply-accumulate.

Fig. 1. NoC architecture.

I. INTRODUCTION

The scaling of MOS transistors into the nanometer regime opens the possibility for creating large scalable Network-on-Chip (NoC) architectures [1] containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks, are emerging as a scalable and modular solution to global communications within large systems-on-chip. The basic concept is to replace today's shared buses with on-chip packet-switched interconnection networks [2]. NoC architectures use layered protocols and packet-switched networks which consist of on-chip routers, links, and well defined network interfaces. As shown in Fig. 1, the basic building block of the NoC architecture is the "network tile". The tiles are connected to an on-chip network that routes packets between them. Each tile may consist of one or more compute cores and include logic responsible for routing and forwarding the packets, based on the routing policy of the network. The structured network wiring of such a NoC design gives well-controlled electrical parameters, which simplifies timing and allows the use of high-performance circuits to reduce latency and increase bandwidth. Recent tile-based chip multiprocessors include the RAW [3], TRIPS [4], and ASAP [5] projects. These tiled architectures show promise for greater integration, high performance, good scalability and potentially high energy efficiency.

With the increasing demand for interconnect bandwidth, on-chip networks are taking up a substantial portion of the system power budget. The 16-tile MIT RAW on-chip network consumes 36% of total chip power, with each router dissipating 40% of individual tile power [6]. The routers and the links of the Alpha 21364 microprocessor consume about 20% of the total chip power. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low-power routers. At the same time, while applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3-D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. A computational fabric built using these optimized building blocks is expected to provide high levels of performance in an energy-efficient manner. This paper describes design details of an integrated 80-tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0 TFLOPS of average performance while dissipating less than 100 W.

The remainder of the paper is organized as follows. Section II gives an architectural overview of the 80-tile NoC and describes the key building blocks. The section also explains the FPMAC unit pipeline and the design optimizations used to accomplish single-cycle accumulation. Router architecture details, the NoC communication protocol and packet formats are also described. Section III describes chip implementation details, including the high-speed mesochronous clock distribution network used in this design. Details of the circuits used for leakage power management in both logic and memory blocks are also discussed. Section IV presents the chip measurement results. Section V concludes by summarizing the NoC architecture along with key performance and power numbers.

Manuscript received April 16, 2007; revised September 27, 2007. S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar are with the Microprocessor Technology Laboratories, Intel Corporation, Hillsboro, OR 97124 USA (e-mail: sriram.r.vangal@intel.com). A. Singh, T. Jacob, S. Jain, and V. Erraguntla are with the Microprocessor Technology Laboratories, Intel Corporation, Bangalore 560037, India. Digital Object Identifier 10.1109/JSSC.2007.910957

Fig. 2. NoC block diagram and tile architecture.
II. NOC ARCHITECTURE

The NoC architecture (Fig. 2) contains 80 tiles arranged as an 8 × 10 2-D mesh network that is designed to operate at 4 GHz [7]. Each tile consists of a processing engine (PE) connected to a 5-port router with mesochronous interfaces (MSINT), which forwards packets between the tiles. The 80-tile on-chip network enables a bisection bandwidth of 2 Terabits/s. The PE contains two independent fully-pipelined single-precision floating-point multiply-accumulator (FPMAC) units, 3 KB of single-cycle instruction memory (IMEM), and 2 KB of data memory (DMEM). A 96-bit Very Long Instruction Word (VLIW) encodes up to eight operations per cycle. With a 10-port (6-read, 4-write) register file, the architecture allows scheduling to both FPMACs, simultaneous DMEM loads and stores, packet send/receive from the mesh network, program control, and dynamic sleep instructions. A router interface block (RIB) handles packet encapsulation between the PE and router. The fully symmetric architecture allows any PE to send (receive) instruction and data packets to (from) any other tile. The 15 fan-out-of-4 (FO4) design uses a balanced core and router pipeline with critical stages employing performance-setting semi-dynamic flip-flops. In addition, a scalable low-power mesochronous clock distribution is employed in a 65-nm eight-metal CMOS process that enables high integration and single-chip realization of the teraFLOPS processor.

A. FPMAC Architecture

The nine-stage pipelined FPMAC architecture (Fig. 3) uses a single-cycle accumulate algorithm [8] with base 32 and internal carry-save arithmetic with delayed addition. The FPMAC contains a fully pipelined multiplier unit and a single-cycle accumulation loop, followed by pipelined addition and normalization units. Operands A and B are 32-bit inputs in IEEE-754 single-precision format [9]. The design is capable of sustained pipelined performance of one FPMAC instruction every 250 ps. The multiplier is designed using a Wallace tree of 4-2 carry-save adders. The well-matched delays of each Wallace tree stage allow for highly efficient pipelining. Four Wallace tree stages are used to compress the partial product bits to a sum and carry pair. Notice that the multiplier does not use a carry-propagate adder at the final stage. Instead, the multiplier retains the output in carry-save format and converts the result to base 32 prior to accumulation. In an effort to achieve fast single-cycle accumulation, we first analyzed each of the critical operations involved in conventional FPUs with the intent of eliminating, reducing or deferring the logic operations inside the accumulate loop, and identified the following three optimizations [8].
1) The accumulator retains the multiplier output in carry-save format and uses an array of 4-2 carry-save adders to "accumulate" the result in an intermediate format. This removes the need for a carry-propagate adder in the critical path.
2) Accumulation is performed in a base 32 system, converting the expensive variable shifters in the accumulate loop to constant shifters.
3) The costly normalization step is moved outside the accumulate loop, where the accumulation result in carry-save form is added, the sum normalized, and the result converted back to base 2.
These optimizations allow accumulation to be implemented in just 15 FO4 stages. This approach also reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2 FLOPS) every cycle. Careful pipeline re-balancing allows removal of three pipe stages, resulting in a 25% latency improvement over the work in [8]. The dual FPMACs in each PE provide 16 GFLOPS of aggregate performance and are critical to achieving the goal of teraFLOPS performance.

Fig. 3. FPMAC nine-stage pipeline with single-cycle accumulate loop.
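To make the three accumulate-loop optimizations easier to follow, the sketch below models them in software with plain Python integers: the accumulator stays in a redundant sum/carry (carry-save) form, alignment is done on base-32 digit boundaries so in-loop shifts are multiples of 5 bits, and the single carry-propagate addition is deferred until after the loop. This is only an illustrative model under those assumptions (the function names and fixed-point framing are ours), not the hardware pipeline.

# Illustrative software model of the single-cycle accumulate-loop optimizations
# (not the FPMAC hardware): carry-save accumulation, base-32 alignment, and a
# deferred carry-propagate add. All names here are our own.

RADIX_BITS = 5  # base-32 digit width: aligned shifts are multiples of 5 bits


def to_base32(mant, exp):
    """Fold the low exponent bits into the mantissa (done once, in the
    multiplier) so the remaining exponent is a multiple of RADIX_BITS."""
    return mant << (exp % RADIX_BITS), exp - (exp % RADIX_BITS)


def csa_step(acc_sum, acc_carry, addend):
    """One carry-save step: fold 'addend' into the redundant (sum, carry)
    pair without resolving a long carry chain (optimization 1)."""
    s = acc_sum ^ acc_carry ^ addend
    c = ((acc_sum & acc_carry) | (acc_sum & addend) | (acc_carry & addend)) << 1
    return s, c


def accumulate(products):
    """Accumulate (mantissa, exponent) products. In-loop alignment shifts are
    whole base-32 digits (optimization 2); the carry-propagate addition and
    normalization happen once, outside the loop (optimization 3)."""
    terms = [to_base32(m, e) for m, e in products]
    base_exp = min(e for _, e in terms)
    acc_sum = acc_carry = 0
    for mant, exp in terms:
        acc_sum, acc_carry = csa_step(acc_sum, acc_carry, mant << (exp - base_exp))
    return acc_sum + acc_carry, base_exp  # deferred add; normalization would follow


if __name__ == "__main__":
    # 3*2^0 + 5*2^6 + 7*2^12 = 3 + 320 + 28672 = 28995
    value, exp = accumulate([(3, 0), (5, 6), (7, 12)])
    print(value << exp)  # -> 28995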

B. Instruction Set

The architecture defines a 96-bit VLIW which allows up to eight operations to be issued every cycle. The instructions fall into one of six categories (Table I): instruction issue to both floating-point units, simultaneous data memory loads and stores, packet send/receive via the on-die mesh network, program control using jump and branch instructions, synchronization primitives for data transfer between PEs, and dynamic sleep instructions. The data path between the DMEM and the register file supports the transfer of two 32-bit data words per cycle on each load (or store) instruction. The register file issues four 32-bit data words to the dual FPMACs per cycle, while retiring two 32-bit results every cycle. The synchronization instructions aid with data transfer between tiles and allow the PE to stall while waiting for data (WFD) to arrive. To aid with power management, the architecture provides special instructions for dynamic sleep and wakeup of each PE, including independent sleep control of each floating-point unit inside the PE. The architecture allows any PE to issue sleep packets to any other tile or wake it up for processing tasks. With the exception of FPU instructions, which have a pipelined latency of nine cycles, most instructions execute in 1–2 cycles.

TABLE I. Instruction types and latency.
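The exact 96-bit field layout is not given in the text, so the following sketch is only a hypothetical illustration of a bundle carrying up to eight operations drawn from the categories of Table I; the slot names and strings are ours.

# Hypothetical software view of one VLIW bundle (slot layout is an assumption,
# only the operation categories come from Table I).

from dataclasses import dataclass
from typing import Optional

@dataclass
class VliwBundle:                       # one 96-bit instruction word
    fpmac0: Optional[str] = None        # issue to floating-point unit 0
    fpmac1: Optional[str] = None        # issue to floating-point unit 1
    load: Optional[str] = None          # DMEM load (two 32-bit words per cycle)
    store: Optional[str] = None         # DMEM store
    send: Optional[str] = None          # packet send to the mesh
    receive: Optional[str] = None       # packet receive from the mesh
    branch: Optional[str] = None        # jump/branch or synchronization (e.g. WFD stall)
    sleep: Optional[str] = None         # dynamic sleep/wakeup control

    def issue_width(self) -> int:
        ops = [self.fpmac0, self.fpmac1, self.load, self.store,
               self.send, self.receive, self.branch, self.sleep]
        return sum(op is not None for op in ops)   # at most eight per cycle

bundle = VliwBundle(fpmac0="mac r0,r1,r2", fpmac1="mac r3,r4,r5",
                    load="ld r6,[dmem+0]", send="send tile(3,4)")
print(bundle.issue_width())   # 4 operations issued this cycle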
C. NoC Packet Format

Fig. 4 describes the NoC packet structure and routing protocol. The on-chip 2-D mesh topology utilizes a 5-port router based on wormhole switching, where each packet is subdivided into "FLITs", or FLow control unITs. Each FLIT contains six control signals and 32 data bits. The packet header (FLIT_0) allows for a flexible source-directed routing scheme, where a 3-bit destination ID field (DID) specifies the router exit port. This field is updated at each hop. Flow control and buffer management between routers is debit-based using almost-full bits, which the receiver queue signals via two flow control bits when its buffers reach a specified threshold. Each header FLIT supports a maximum of 10 hops. A chained header (CH) bit in the packet provides support for a larger number of hops. Processing engine control information, including sleep and wakeup control bits, is specified in FLIT_1, which follows the header FLIT. The minimum packet size required by the protocol is two FLITs. The router architecture places no restriction on the maximum packet size.

Fig. 4. NoC protocol: packet format and FLIT description.
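As a concrete illustration of the packet structure in Fig. 4, the sketch below assembles a minimal two-FLIT packet in software. Only the field contents mentioned in the text (six control bits and 32 data bits per FLIT, a 3-bit DID per hop with at most 10 hops per header, a chained-header bit, and a two-FLIT minimum) follow the paper; the bit positions, control encoding, and class names are assumptions.

# Schematic model of the NoC packet format; bit positions are assumed.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Flit:
    control: int          # 6 control signals (valid, header, tail, lane ID, ...)
    data: int             # 32 data bits

@dataclass
class Packet:
    route: List[int]                      # source-directed route: one 3-bit exit port per hop
    chained: bool = False                 # CH bit: set when more than 10 hops are needed
    body: List[Flit] = field(default_factory=list)

    def header_flit(self) -> Flit:
        assert len(self.route) <= 10, "one header FLIT encodes at most 10 hops"
        did = 0
        for i, port in enumerate(self.route):     # pack 3-bit DIDs, first hop in the LSBs
            did |= (port & 0x7) << (3 * i)
        # assumed control encoding: bit0 = header, bit1 = chained-header (CH)
        return Flit(control=0b000001 | (int(self.chained) << 1), data=did)

    def flits(self) -> List[Flit]:
        pkt = [self.header_flit()] + self.body    # routers would shift the DID field per hop
        assert len(pkt) >= 2, "protocol minimum packet size is two FLITs"
        return pkt

# Example: a 2-FLIT packet routed over three hops (exit ports are illustrative).
p = Packet(route=[2, 2, 4], body=[Flit(control=0, data=0xDEADBEEF)])
print([hex(f.data) for f in p.flits()])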

D. Router Architecture

A 4 GHz five-port two-lane pipelined packet-switched router core (Fig. 5) with phase-tolerant mesochronous links forms the key communication fabric for the 80-tile NoC architecture. Each port has two 39-bit unidirectional point-to-point links. The input-buffered wormhole-switched router [18] uses two logical lanes (lanes 0–1) for dead-lock free routing and a fully non-blocking crossbar switch with a total bandwidth of 80 GB/s (32 bits × 4 GHz × 5 ports). Each lane has a 16-FLIT queue, an arbiter and flow control logic. The router uses a 5-stage pipeline with a two-stage round-robin arbitration scheme that first binds an input port to an output port in each lane and then selects a pending FLIT from one of the two lanes. A shared data path architecture allows crossbar switch re-use across both lanes on a per-FLIT basis. The router links implement a mesochronous interface with first-in-first-out (FIFO) based synchronization at the receiver.

The router core features a double-pumped crossbar switch [10] to mitigate crossbar interconnect routing area. The schematic in Fig. 6(a) shows the 36-bit crossbar data bus double-pumped at the fourth pipe-stage of the router by interleaving alternate data bits using dual edge-triggered flip-flops, reducing crossbar area by 50%. In addition, the proposed router architecture shares the crossbar switch across both lanes on an individual FLIT basis. Combined application of both ideas enables a compact 0.34 mm² design, resulting in a 34% reduction in router layout area as shown in Fig. 6(b), 26% fewer devices, a 13% improvement in average power, and a one-cycle latency reduction (from 6 to 5 cycles) over the router design in [11] when ported and compared in the same 65-nm process [12]. Results from the comparison are summarized in Table II.

Fig. 5. Five-port two-lane shared crossbar router architecture.

Fig. 6. (a) Double-pumped crossbar switch schematic. (b) Area benefit over work in [11].

TABLE II. Router comparison with the work in [11].
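The two-stage round-robin arbitration described above can be pictured with the small software model below: stage 1 binds an input port to this output port within each lane, and stage 2 picks one of the two lanes for the shared crossbar on a per-FLIT basis. The data structures and pointer handling are assumptions made for illustration, not the router RTL.

# Sketch of the two-stage round-robin arbitration for one output port.

NUM_PORTS, NUM_LANES = 5, 2

def round_robin(requests, last):
    """Grant the first requester after 'last' (classic rotating priority)."""
    n = len(requests)
    for i in range(1, n + 1):
        idx = (last + i) % n
        if requests[idx]:
            return idx
    return None

class OutputPortArbiter:
    def __init__(self):
        self.last_input = [0] * NUM_LANES   # per-lane rotating pointer (stage 1)
        self.last_lane = 0                  # lane pointer (stage 2)

    def arbitrate(self, requests):
        """requests[lane][input_port] is True when that queue holds a FLIT
        destined for this output port. Returns (lane, input_port) or None."""
        winners = {}
        for lane in range(NUM_LANES):                        # stage 1: port binding per lane
            w = round_robin(requests[lane], self.last_input[lane])
            if w is not None:
                winners[lane] = w
        if not winners:
            return None
        lane_req = [lane in winners for lane in range(NUM_LANES)]
        lane = round_robin(lane_req, self.last_lane)          # stage 2: lane selection
        self.last_lane = lane
        self.last_input[lane] = winners[lane]
        return lane, winners[lane]

arb = OutputPortArbiter()
reqs = [[False, True, False, True, False],    # lane 0 inputs requesting this output
        [True, False, False, False, False]]   # lane 1
print(arb.arbitrate(reqs))   # -> (1, 0) on the first call with these pointers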
E. Mesochronous Communication

The 2-mm-long point-to-point unidirectional router links implement a phase-tolerant mesochronous interface (Fig. 7). Four of the five router links are source synchronous, each providing a strobe (Tx_clk) with 38 bits of data. To reduce active power, Tx_clk is driven at half the clock rate. A 4-deep circular FIFO, built using transparent latches, captures data on both edges of the delayed link strobe at the receiver. The strobe delay and duty-cycle can be digitally programmed using the on-chip scan chain. A synchronizer circuit sets the latency between the FIFO write and read pointers to 1 or 2 cycles at each port, depending on the phase of the arriving strobe with respect to the local clock. A more aggressive low-latency setting reduces the synchronization penalty by one cycle. The interface includes the first stage of the router pipeline.

Fig. 7. Phase-tolerant mesochronous interface and timing diagram.
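A behavioral sketch of the FIFO-based synchronizer is given below: FLITs are written with the (delayed) transmit strobe and read with the local clock, the read pointer trailing the write pointer by the programmed one or two cycles. This is a simplified software model under those assumptions, not the latch-level circuit; double-edge capture and strobe-delay programming are not modeled.

# Behavioral model of the 4-deep mesochronous synchronization FIFO.

class MesochronousFifo:
    DEPTH = 4

    def __init__(self, latency_cycles=2):
        # latency_cycles = 1 corresponds to the more aggressive low-latency setting.
        self.entries = [None] * self.DEPTH
        self.wr = 0
        self.rd = (0 - latency_cycles) % self.DEPTH   # read pointer trails write pointer

    def write(self, flit):
        """Called on an edge of the delayed Tx strobe (transmitter phase)."""
        self.entries[self.wr] = flit
        self.wr = (self.wr + 1) % self.DEPTH

    def read(self):
        """Called on the local receive clock; returns the oldest synchronized FLIT."""
        flit = self.entries[self.rd]
        self.rd = (self.rd + 1) % self.DEPTH
        return flit

fifo = MesochronousFifo(latency_cycles=2)
for cycle, flit in enumerate(["F0", "F1", "F2", "F3"]):
    fifo.write(flit)                  # transmit-domain edge
    print(cycle, fifo.read())         # receive-domain edge: F0 appears after 2 cycles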

F. Router Interface Block (RIB)

The RIB is responsible for message-passing and aids with synchronization of data transfer between the tiles and with power management at the PE level. Incoming 38-bit-wide FLITs are buffered in a 16-entry queue, where demultiplexing based on the lane ID and framing to 64 bits for data packets (DMEM) and 96 bits for instruction packets (IMEM) are accomplished. The buffering is required during program execution, since DMEM stores from the 10-port register file have priority over data packets received by the RIB. The unit decodes FLIT_1 (Fig. 4) of an incoming instruction packet and generates several PE control signals. This allows the PE to start execution (REN) at a specified IMEM address (PCA) and is enabled by the new program counter (NPC) bit. After receiving a full data packet, the RIB generates a break signal to continue execution if the IMEM is in a stalled (WFD) mode. Upon receipt of a sleep packet via the mesh network, the RIB unit can also dynamically put the entire PE to sleep or wake it up for processing tasks on demand.

III. DESIGN DETAILS

To allow 4 GHz operation, the entire core is designed using hand-optimized data path macros. CMOS static gates are used to implement most of the logic. However, critical registers in the FPMAC and router logic utilize implicit-pulsed semi-dynamic flip-flops (SDFF) [13], [14]. The SDFF (Fig. 8) has a dynamic master stage coupled to a pseudo-static slave stage.

Fig. 8. Semi-dynamic flip-flop (SDFF) schematic.
The FPMAC accumulator register is built using data-inverting rising edge-triggered SDFFs with synchronous reset and enable. The negative setup time of the flip-flop is exploited in the critical path. When compared to a conventional static master–slave flip-flop, the SDFF provides both shorter latency and the capability of incorporating logic functions with minimum delay penalty, properties which are desirable in high-performance digital designs.

The chip uses a scalable global mesochronous clocking technique, which allows for clock-phase-insensitive communication across tiles and synchronous operation within each tile. The on-chip phase-locked loop (PLL) output [Fig. 9(a)] is routed using horizontal M8 and vertical M7 spines. Each spine consists of differential clocks for low duty-cycle variation along the worst-case clock route of 26 mm. An opamp at each tile converts the differential clocks to a single-ended clock with a 50% duty cycle prior to distributing the clock across the tile using a balanced H-tree. This clock distribution scales well as tiles are added or removed. The worst-case simulated global duty-cycle variation is 3 ps and the local clock skew within the tile is 4 ps. Fig. 9(b) shows simulated clock arrival times for all 80 tiles at 4 GHz operation. Note that multiple cycles are required for the global clock to propagate to all 80 tiles. The systematic clock skews inherent in the distribution help spread peak currents due to simultaneous clock switching over the entire cycle.

Fig. 9. (a) Global mesochronous clocking and (b) simulated clock arrival times.

Fine-grained clock gating [Fig. 9(a)], sleep transistor and body-bias circuits [15] are used to reduce active and standby leakage power, and are controlled at full-chip, tile-slice, and individual tile levels based on workload. Each tile is partitioned into 21 smaller sleep regions with dynamic control of individual blocks in the PE and router units based on instruction type. The router is partitioned into 10 smaller sleep regions with control of individual router ports, depending on network traffic patterns. The design uses nMOS sleep transistors to reduce frequency penalty and area overhead. Fig. 10 shows the router and on-die network power management scheme. The enable signals gate the clock to each port, the MSINT and the links. In addition, the enable signals also activate the nMOS sleep transistors in the input queue arrays of both lanes. The 360 µm nMOS sleep device in the register file is sized to provide a 4.3X reduction in array leakage power with a 4% frequency impact. The global clock buffer feeding the router is finally gated at the tile level based on port activity.

Fig. 10. Router and on-die network power management.

Each FPMAC implements unregulated sleep transistors with no data retention [Fig. 11(a)]. A 6-cycle pipelined wakeup sequence largely mitigates current spikes compared to a single-cycle re-activation scheme, while allowing floating-point unit execution to start one cycle into wakeup. A faster 3-cycle fast-wake option is also supported. On the other hand, memory arrays use a regulated, actively clamped sleep transistor circuit [Fig. 11(b)] that ensures data retention and minimizes standby leakage power [16]. The closed-loop opamp configuration ensures that the virtual ground voltage is no greater than a reference input voltage under PVT variations. The reference is set based on the memory cell standby voltage. The average sleep transistor area overhead is 5.4% with a 4% frequency penalty. About 90% of the FPU logic and 74% of each PE is sleep-enabled. In addition, forward body bias can be externally applied to all nMOS devices during active mode to increase the operating frequency, and reverse body bias can be applied during idle mode for further leakage savings.

Fig. 11. (a) FPMAC pipelined wakeup diagram and (b) state-retentive memory clamp circuit.
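To make the port-level scheme of Fig. 10 concrete, the toy model below gates the clock and asserts the queue sleep transistors for any router port with no traffic. It is a software illustration of the control intent only; the names and the traffic test are assumptions, and none of the circuit behavior (wakeup sequencing, clamping, body bias) is modeled.

# Toy model of per-port clock gating and queue sleep control (assumed API).

class RouterPortPower:
    def __init__(self, name):
        self.name = name
        self.clock_enabled = True
        self.queues_asleep = False

    def update(self, has_traffic: bool):
        # Gate the clock to the port, MSINT and link, and assert the sleep
        # transistors of the input-queue arrays, whenever the port is idle.
        self.clock_enabled = has_traffic
        self.queues_asleep = not has_traffic

ports = {p: RouterPortPower(p) for p in ("north", "south", "east", "west", "local")}
traffic = {"north": True, "south": False, "east": False, "west": False, "local": True}
for name, port in ports.items():
    port.update(traffic[name])
print([p.name for p in ports.values() if p.queues_asleep])   # idle ports put to sleep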

IV. EXPERIMENTAL RESULTS

The teraFLOPS processor is fabricated in a 65-nm process technology [12] with a 1.2-nm gate-oxide thickness, nickel salicide for lower resistance, and a second-generation strained silicon technology. The interconnect uses eight copper layers and a low-k carbon-doped oxide inter-layer dielectric. The functional blocks of the chip and of an individual tile are identified in the die photographs in Fig. 12. The 275 mm² fully custom design contains 100 million transistors.

Fig. 12. Full-chip and tile micrograph and characteristics.

Using a fully-tiled approach, each 3 mm² tile is drawn complete with C4 bumps, power, global clock and signal routing, which are seamlessly arrayed by abutment. Each tile contains 1.2 million transistors, with the processing engine accounting for 1 million (83%) and the router for 17% of the total tile device count. De-coupling capacitors occupy about 20% of the total logic area. The chip has three independent voltage regions: one for the tiles, a separate supply for the PLL, and a third one for the I/O circuits. Test and debug features include a TAP controller and full-scan support for all memory blocks on chip.

The evaluation board with the packaged chip is shown in Fig. 13. The die has 8390 C4 solder bumps, arrayed with a single uniform bump pitch across the entire die. The chip-level power distribution consists of a uniform M8-M7 grid aligned with the C4 power and ground bump array. The package is a 66 mm × 66 mm flip-chip LGA (land grid array) and includes an integrated heat spreader. The package has a 14-layer stack-up (5-4-5) to meet the various power plane and signal requirements and has a total of 1248 pins, of which 343 are signal pins. Decoupling capacitors are mounted on the land-side of the package as shown in Fig. 13(b). A PC running custom software is used to apply test vectors and observe results through the on-chip scan chain. First silicon has been validated to be fully functional.

Fig. 13. (a) Package die-side. (b) Package land-side. (c) Evaluation board.

Frequency versus power supply on a typical part is shown in Fig. 14. Silicon chip measurements at a case temperature of 80 °C demonstrate a chip maximum frequency of 1 GHz at 670 mV and 3.16 GHz at 950 mV, with the frequency increasing to 5.1 GHz at 1.2 V and 5.67 GHz at 1.35 V. With all 80 tiles actively performing single-precision block-matrix operations, the chip achieves a peak performance of 0.32 TFLOPS (670 mV), 1.0 TFLOPS (950 mV), 1.63 TFLOPS (1.2 V) and 1.81 TFLOPS (1.35 V).

Fig. 14. Measured chip maximum frequency and peak performance.

TABLE III. Application performance measured at 1.07 V and 4.27 GHz operation.
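These peak figures are consistent with the architecture's issue rate of 80 tiles x 2 FPMACs x 2 FLOPS per cycle, scaled by the clock frequency; the short check below reproduces them (the formula is implied by the text, the script itself is ours).

# Peak-performance check: 80 tiles x 2 FPMACs x 2 FLOPS per cycle x frequency.

TILES, FPMACS_PER_TILE, FLOPS_PER_FPMAC = 80, 2, 2   # multiply + add each cycle

def peak_tflops(freq_ghz):
    return TILES * FPMACS_PER_TILE * FLOPS_PER_FPMAC * freq_ghz / 1e3

for freq in (1.0, 3.16, 4.27, 5.1, 5.67):
    print(f"{freq:5.2f} GHz -> {peak_tflops(freq):.2f} TFLOPS peak")
# 4.27 GHz gives ~1.37 TFLOPS peak; the measured 1.0 TFLOPS stencil average
# is the 73.3% of peak reported below.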

Several application kernels have been mapped to the design, and their performance is summarized in Table III. The table shows the single-precision floating-point operation count, the number of active tiles, and the average performance in TFLOPS for each application, reported as a percentage of the peak performance achievable with the design. In each case, task mapping was hand optimized and communication was overlapped with computation as much as possible to increase efficiency. The stencil code solves a steady-state 2-D heat diffusion equation with periodic boundary conditions on the left and right boundaries of a rectilinear grid, and prescribed temperatures on the top and bottom boundaries. For the stencil kernel, chip measurements indicate an average performance of 1.0 TFLOPS at 4.27 GHz and 1.07 V supply with 358K floating-point operations, achieving 73.3% of the peak performance. This result is particularly impressive because the execution is entirely overlapped with local loads and stores and with communication between neighboring tiles. The SGEMM matrix multiplication code operates on two 100 × 100 matrices with 2.63 million floating-point operations, corresponding to an average performance of 0.51 TFLOPS. It is important to note that the read bandwidth from local data memory limits the performance to half the peak rate. The spreadsheet kernel applies reductions to tables of data consisting of pairs of values and weights. For each table, the weighted sum of each row and each column is computed. A 64-point 2-D FFT which implements the Cooley–Tukey algorithm [17] using 64 tiles has also been successfully mapped to the design, with an average performance of 27.3 GFLOPS. It first computes 8-point FFTs in each tile, which in turn passes results to 63 other tiles for the 2-D FFT computation. The complex communication pattern results in high overhead and low efficiency.
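The stencil kernel itself is not listed in the paper; the snippet below is a standard 5-point Jacobi sweep for the steady-state 2-D heat problem described above (periodic left/right boundaries, prescribed top/bottom temperatures), written in plain NumPy as a reference for the kind of computation each tile performs on its sub-grid. The grid size and iteration count are arbitrary.

# Reference 2-D heat-diffusion stencil (Jacobi iteration), periodic in x.

import numpy as np

def heat_step(t):
    """One Jacobi relaxation sweep on the interior rows; columns wrap around."""
    new = t.copy()
    up, down = t[:-2, :], t[2:, :]
    left = np.roll(t, 1, axis=1)[1:-1, :]
    right = np.roll(t, -1, axis=1)[1:-1, :]
    new[1:-1, :] = 0.25 * (up + down + left + right)
    return new                       # top and bottom rows keep their prescribed values

grid = np.zeros((64, 64))
grid[0, :], grid[-1, :] = 100.0, 0.0          # prescribed boundary temperatures
for _ in range(500):
    grid = heat_step(grid)
print(round(float(grid[32, 0]), 2))           # mid-grid temperature approaching steady state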
Fig. 15. Stencil application performance measured at 1.07 V and 4.27 GHz operation.

Fig. 15 shows the total chip power dissipation, with the active and leakage power components separated, as a function of frequency and power supply, with the case temperature maintained at 80 °C. We report measured power for the stencil application kernel, since it is the most computationally intensive. The chip power consumption ranges from 15.6 W at 670 mV to 230 W at 1.35 V. With all 80 tiles actively executing stencil code, the chip achieves 1.0 TFLOPS of average performance at 4.27 GHz and 1.07 V supply with a total power dissipation of 97 W. The total power consumed increases to 230 W at 1.35 V and 5.67 GHz operation, delivering 1.33 TFLOPS of average performance.

Fig. 16 plots the measured energy efficiency in GFLOPS/W for the stencil application with power supply and frequency scaling. As expected, the chip energy efficiency increases with power supply reduction, from 5.8 GFLOPS/W at 1.35 V supply, to 10.5 GFLOPS/W at the 1.0 TFLOPS goal, to a maximum of 19.4 GFLOPS/W at 750 mV supply. Below 750 mV, the chip maximum frequency degrades faster than the power saved by lowering the tile supply voltage, resulting in an overall performance reduction and a consequent drop in the processor energy efficiency. The chip provides up to 394 GFLOPS of aggregate performance at 750 mV with a measured total power dissipation of just 20 W.

Fig. 16. Measured chip energy efficiency for the stencil application.

Fig. 17 presents the estimated power breakdown at the tile and router levels, simulated at 4 GHz, 1.2 V supply and 110 °C. The processing engine with the dual FPMACs, instruction and data memory, and the register file accounts for 61% of the total tile power [Fig. 17(a)]. The communication power is significant at 28% of the tile power, and the synchronous tile-level clock distribution accounts for 11% of the total. Fig. 17(b) shows a more detailed tile-to-tile communication power breakdown, which includes the router, mesochronous interfaces and links. Clocking power is the largest component, accounting for 33% of the communication power. The input queues on both lanes and the data path circuits are the second major component, dissipating 22% of the communication power.

Fig. 17. Estimated (a) tile power profile and (b) communication power breakdown.

Fig. 18 shows the output differential and single-ended clock waveforms measured at the farthest clock buffer from the phase-locked loop (PLL) at a frequency of 5 GHz. Notice that the duty cycles of the clocks are close to 50%. Fig. 19 plots the global clock distribution power as a function of frequency and power supply. This is the switching power dissipated in the clock spines from the PLL to the opamp at the center of all 80 tiles.

Fig. 18. Measured global clock distribution waveforms.

Fig. 19. Measured global clock distribution power.

Measured silicon data at 80 °C show that this power is 80 mW at 0.8 V and 1.7 GHz, increasing by 10X to 800 mW at 1 V and 3.8 GHz. The global clock distribution power is 2 W at 1.2 V and 5.1 GHz and accounts for just 1.3% of the total chip power. Fig. 20 plots the chip leakage power as a percentage of the total power with all 80 processing engines and routers awake and with all the clocks disabled. Measurements show that the worst-case leakage power in active mode varies from a minimum of 9.6% to a maximum of 15.7% of the total power when measured over the power supply range of 670 mV to 1.35 V. In sleep mode, the nMOS sleep transistors are turned off, reducing chip leakage by 2X while preserving the logic state in all memory arrays.

Fig. 20. Measured chip leakage power as a percentage of total power versus Vcc. A 2X reduction is obtained by turning off sleep transistors.

Fig. 21 shows the active and leakage power reduction due to a combination of the selective router port activation, clock gating and sleep transistor techniques described in Section III. Measured at 1.2 V, 80 °C and 5.1 GHz operation, the total network power per tile can be lowered from a maximum of 924 mW with all router ports active to 126 mW, resulting in a 7.3X reduction. The network leakage power per tile, with all ports and the global clock buffers feeding the router disabled, is 126 mW. This number includes the power dissipated in the router, MSINT and the links.

Fig. 21. On-die network power reduction benefit.

V. CONCLUSION

In this paper, we have presented an 80-tile high-performance NoC architecture implemented in a 65-nm process technology.

The prototype contains 160 lower-latency FPMAC cores and features a single-cycle accumulator architecture for high throughput. Each tile also contains a fast and compact router operating at core speed; the 80 tiles are interconnected using a 2-D mesh topology providing a high bisection bandwidth of over 2 Terabits/s. The design uses a combination of micro-architecture, logic, circuits and a 65-nm process to reach the target performance. Silicon operates over a wide voltage and frequency range, and delivers teraFLOPS performance with high power efficiency. For the most computationally intensive application kernel, the chip achieves an average performance of 1.0 TFLOPS, while dissipating 97 W at 4.27 GHz and 1.07 V supply, corresponding to an energy efficiency of 10.5 GFLOPS/W. Average performance scales to 1.33 TFLOPS at a maximum operational frequency of 5.67 GHz and 1.35 V supply. These results demonstrate the feasibility of high-performance and energy-efficient building blocks for peta-scale computing in the near future.

ACKNOWLEDGMENT

The authors would like to thank V. De, Prof. A. Alvandpour, D. Somasekhar, D. Jenkins, P. Aseron, J. Collias, B. Nefcy, P. Iyer, S. Venkataraman, S. Saha, M. Haycock, J. Schutz, and J. Rattner for help, encouragement, and support; T. Mattson, R. Wijngaart, and M. Frumkin from the SSG and ARL teams at Intel for assistance with mapping the kernels to the design; the LTD and ATD teams for PLL and package design and assembly; the entire mask design team for chip layout; and the reviewers for their useful remarks.

REFERENCES

[1] L. Benini and G. De Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
[2] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proc. 38th Design Automation Conf., Jun. 2001, pp. 681–689.
[3] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw microprocessor: A computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, Mar.–Apr. 2002.
[4] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, "Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture," in Proc. 30th Annu. Int. Symp. Computer Architecture, 2003, pp. 422–433.
[5] Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work, T. Mohsenin, M. Singh, and B. Baas, "An asynchronous array of simple processors for DSP applications," in IEEE ISSCC Dig. Tech. Papers, Feb. 2006, pp. 428–429.
[6] H. Wang, L. Peh, and S. Malik, "Power-driven design of router microarchitectures in on-chip networks," in Proc. 36th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO-36), 2003, pp. 105–116.
[7] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, "An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 98–99.
[8] S. Vangal, Y. Hoskote, N. Borkar, and A. Alvandpour, "A 6.2-GFLOPS floating-point multiply-accumulator with conditional normalization," IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2314–2323, Oct. 2006.
[9] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std. 754-1985, IEEE Standards Board, New York, 1985.
[10] S. Vangal, N. Borkar, and A. Alvandpour, "A six-port 57 GB/s double-pumped non-blocking router core," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2005, pp. 268–269.
[11] H. Wilson and M. Haycock, "A six-port 30-GB/s non-blocking router component using point-to-point simultaneous bidirectional signaling for high-bandwidth interconnects," IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1954–1963, Dec. 2001.
[12] P. Bai, C. Auth, S. Balakrishnan, M. Bost, R. Brain, V. Chikarmane, R. Heussner, M. Hussein, J. Hwang, D. Ingerly, R. James, J. Jeong, C. Kenyon, E. Lee, S.-H. Lee, N. Lindert, M. Liu, Z. Ma, T. Marieb, A. Murthy, R. Nagisetty, S. Natarajan, J. Neirynck, A. Ott, C. Parker, J. Sebastian, R. Shaheed, S. Sivakumar, J. Steigerwald, S. Tyagi, C. Weber, B. Woolery, A. Yeoh, K. Zhang, and M. Bohr, "A 65 nm logic technology featuring 35 nm gate lengths, enhanced channel strain, 8 Cu interconnect layers, low-k ILD and 0.57 µm² SRAM cell," in IEDM Tech. Dig., Dec. 2004, pp. 657–660.
[13] F. Klass, "Semi-dynamic and dynamic flip-flops with embedded logic," in Symp. VLSI Circuits Dig. Tech. Papers, 1998, pp. 108–109.
[14] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, "Comparative delay and energy of single edge-triggered and dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. ISLPED, 2001, pp. 147–151.
[15] J. Tschanz, S. Narendra, Y. Ye, B. Bloechel, S. Borkar, and V. De, "Dynamic sleep transistor and body bias for active leakage power control of microprocessors," IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1838–1845, Nov. 2003.
[16] M. Khellah, D. Somasekhar, Y. Ye, N. Kim, J. Howard, G. Ruhl, M. Sunna, J. Tschanz, N. Borkar, F. Hamzaoglu, G. Pandya, A. Farhang, K. Zhang, and V. De, "A 256-Kb dual-Vcc SRAM building block in 65-nm CMOS process with actively clamped sleep transistor," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 233–242, Jan. 2007.
[17] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297–301, 1965.
[18] S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar, and A. Alvandpour, "A 5.1 GHz 0.34 mm² router for network-on-chip applications," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 42–43.

Sriram R. Vangal (S'90–M'98) received the B.S. degree from Bangalore University, India, the M.S. degree from the University of Nebraska, Lincoln, and the Ph.D. degree from Linköping University, Sweden, all in electrical engineering. His Ph.D. research focused on energy-efficient network-on-chip (NoC) designs.
With Intel since 1995, he is currently a Senior Research Scientist at Microprocessor Technology Labs, Hillsboro, OR. He was the technical lead for the advanced prototype team that designed the industry's first single-chip Teraflops processor. His research interests are in the areas of low-power high-performance circuits, power-aware computing and NoC architectures. He has published 16 papers and has 16 issued patents with 7 pending in these areas.

Jason Howard received the M.S.E.E. degree from Brigham Young University, Provo, UT, in 2000.
He was an Intern with Intel Corporation during the summers of 1998 and 1999, working on the Pentium 4 microprocessor. In 2000, he formally joined Intel, working in the Oregon Rotation Engineers Program. After two successful rotations through NCG and CRL, he officially joined Intel Laboratories, Hillsboro, OR, in 2001. He is currently working for the CRL Prototype Design team. He has co-authored several papers and has patents pending.

     Authorized licensed use limited to: Washington State University. Downloaded on September 17, 2009 at 18:58 from IEEE Xplore. Restrictions apply.
40                                                                                                 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008

                         Gregory Ruhl (M’07) received the B.S. degree in                                               Arvind Singh (M’05) received the B.Tech. degree in
                         computer engineering and the M.S. degree in elec-                                             electronics and communication engineering from the
                         trical and computer engineering from the Georgia In-                                          Institute of Engineering and Technology, Lucknow,
                         stitute of Technology, Atlanta, in 1998 and 1999, re-                                         India, in 2000, and the M.Tech. degree in VLSI de-
                         spectively.                                                                                   sign from the Indian Institute of Technology, Delhi,
                            He joined Intel Corporation, Hillsboro, OR, in                                             India, in 2001. He holds the University Gold Medal
                         1999 as a part of the Rotation Engineering Program                                            for his B.Tech. degree.
                         where he worked on the PCI-X I/O switch, Gigabit                                                 He is a Research Scientist at the Microprocessor
                         Ethernet validation, and individual circuit research                                          Technology Labs, Intel Bangalore, India, working
                         projects. After completing the REP program, he                                                on scalable on-die interconnect fabric for terascale
                         joined Intel’s Circuits Research Lab where he has                                             research and prototyping. His research interests
been working on design, research and validation on a variety of topics ranging                include network-on-chip technologies and energy-efficient building blocks.
from SRAMs and signaling to terascale computing.                                              Prior to Intel, he worked on high-performance network search engines and
                                                                                              low-power SRAMs in Cypress Semiconductors.
                                                                                                 Mr. Singh is a member of the IEEE and TiE.

Saurabh Dighe (M’05) received the B.E. degree in electronics engineering from the University of Mumbai, India, in 2001, and the M.S. degree in computer engineering from the University of Minnesota, Minneapolis, in 2003. His graduate work focused on computer architecture and the design, modeling, and simulation of digital systems.
   He was previously with Intel Corporation, Santa Clara, CA, working on front-end logic and validation methodologies for the Itanium processor and the Core processor design team. Currently, he is with Intel Corporation’s Circuit Research Labs, Microprocessor Technology Labs, Hillsboro, OR, and is a member of the prototype design team involved in the definition, implementation, and validation of future terascale computing technologies.

Tiju Jacob received the B.Tech. degree in electronics and communication engineering from NIT, Calicut, India, in 2000, and the M.Tech. degree in microelectronics and VLSI design from IIT Madras, India, in 2003.
   He joined Intel Corporation, Bangalore, India, in February 2003 and is currently a member of the Circuit Research Labs. His areas of interest include low-power high-performance circuits, many-core processors, and advanced prototyping.

Howard Wilson was born in Chicago, IL, in 1957. He received the B.S. degree in electrical engineering from Southern Illinois University, Carbondale, in 1979.
   From 1979 to 1984, he worked at Rockwell-Collins, Cedar Rapids, IA, where he designed navigation and electronic flight display systems. From 1984 to 1991, he worked at National Semiconductor, Santa Clara, CA, designing telecom components for ISDN. With Intel since 1992, he is currently a member of the Circuits Research Laboratory, Hillsboro, OR, engaged in a variety of advanced prototype design activities.

James Tschanz (M’97) received the B.S. degree in computer engineering and the M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign in 1997 and 1999, respectively.
   Since 1999, he has been a Circuits Researcher with the Intel Circuit Research Lab, Hillsboro, OR. His research interests include low-power digital circuits, design techniques, and methods for tolerating parameter variations. He is also an Adjunct Faculty Member at the Oregon Graduate Institute, Beaverton, OR, where he teaches digital VLSI design.

Shailendra Jain received the B.E. degree in electronics engineering from Devi Ahilya Vishwavidyalaya, Indore, India, in 1999, and the M.Tech. degree in electrical engineering from IIT Madras, India, in 2001.
   Since 2004, he has been with the Intel Bangalore Design Lab, India. His work includes research and advanced prototyping in the areas of low-power high-performance digital circuits and the physical design of multi-million-transistor chips. He has coauthored two papers in these areas.

Vasantha Erraguntla (M’92) received the Bachelor’s degree in electrical engineering from Osmania University, India, and the Master’s degree in computer engineering from the University of Louisiana, Lafayette.
   She joined Intel in Hillsboro, OR, in 1991 and worked on the high-speed router technology for the Intel Teraflop machine (ASCI Red). For over 10 years, she was engaged in a variety of advanced prototype design activities at Intel Laboratories, implementing and validating research ideas in the areas of high-performance and low-power circuits and high-speed signaling. She relocated to India in June 2004 to start up the Bangalore Design Lab to help facilitate circuit research and prototype development. She has co-authored seven papers and has two patents issued and four pending.

David Finan received the A.S. degree in electronic engineering technology from Portland Community College, Portland, OR, in 1989.
   He joined Intel Corporation, Hillsboro, OR, in 1988, working on the iWarp project. He went on to work on the Intel i486DX2 project in 1990, the Intel Teraflop project in 1993, for Intel Design Laboratories in 1995, for Intel Advanced Methodology and Engineering in 1996, on the Intel Timna project in 1998, and for Intel Laboratories in 2000.

Clark Roberts received the A.S.E.E.T. degree from Mt. Hood Community College, Gresham, OR, in 1985. While pursuing his education, he held several technical positions at Tektronix Inc., Beaverton, OR, from 1978 to 1992.
   In 1992, he joined Intel Corporation’s Supercomputer Systems Division, where he worked on the research and design of clock distribution systems for massively parallel supercomputers. He worked on the system clock design for the ASCI Red TeraFlop Supercomputer (Intel, DOE, and Sandia). From 1996 to 2000, he worked as a Signal Integrity Engineer in Intel Corporation’s Microprocessor Division, focusing on clock distribution for Intel microprocessor package and motherboard designs. In 2001, he joined the circuit research area of Intel Labs. He has been working on physical prototype design and measurement systems for high-speed I/O and terascale microprocessor research.

Yatin Hoskote (M’96) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, and the M.S. and Ph.D. degrees in computer engineering from the University of Texas at Austin.
   He joined Intel in 1995 as a member of Strategic CAD Labs doing research in verification technologies. He is currently a Principal Engineer in the Advanced Prototype design team in Intel’s Microprocessor Technology Lab, working on next-generation network-on-chip technologies.
   Dr. Hoskote received the Best Paper Award at the Design Automation Conference in 1999 and an Intel Achievement Award in 2006. He is Chair of the program committee for the 2007 High Level Design Validation and Test Workshop and a guest editor for the IEEE Design & Test Special Issue on Multicore Interconnects.

Nitin Borkar received the M.Sc. degree in physics from the University of Bombay, India, in 1982, and the M.S.E.E. degree from Louisiana State University in 1985.
   He joined Intel Corporation in Portland, OR, in 1986, where he worked on the design of the i960 family of embedded microcontrollers. In 1990, he joined the i486DX2 microprocessor design team and led the design and the performance verification program. After successful completion of the i486DX2 development, he worked on high-speed signaling technology for the Teraflop machine. He now leads the prototype design team in the Circuit Research Laboratory, developing novel technologies in the high-performance low-power circuit areas and applying them toward terascale computing research.

Shekhar Borkar (M’97) was born in Mumbai, India. He received the B.S. and M.S. degrees in physics from the University of Bombay, India, in 1979, and the M.S. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1981.
   He is an Intel Fellow and Director of the Circuit Research Laboratories at Intel Corporation, Hillsboro, OR. He joined Intel in 1981, where he has worked on the design of the 8051 family of microcontrollers, the iWarp multi-computer, and high-speed signaling technology for Intel supercomputers. He is an adjunct member of the faculty of the Oregon Graduate Institute, Beaverton. He has published 10 articles and holds 11 patents.
