Measuring the Gap Between FPGAs and ASICs


                                                           Ian Kuon and Jonathan Rose
                          The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
                                                    University of Toronto
                                                        Toronto, ON
                                                                {ikuon,jayar}@eecg.utoronto.ca

ABSTRACT
This paper presents experimental measurements of the differences between a 90 nm CMOS FPGA and 90 nm CMOS Standard Cell ASICs in terms of logic density, circuit speed and power consumption. We are motivated to make these measurements to enable system designers to make better informed choices between these two media and to give insight to FPGA makers on the deficiencies to attack and thereby improve FPGAs. In the paper, we describe the methodology by which the measurements were obtained and we show that, for circuits containing only combinational logic and flip-flops, the ratio of silicon area required to implement them in FPGAs and ASICs is on average 40. Modern FPGAs also contain “hard” blocks such as multiplier/accumulators and block memories and we find that these blocks reduce this average area gap significantly to as little as 21. The ratio of critical path delay, from FPGA to ASIC, is roughly 3 to 4, with less influence from block memory and hard multipliers. The dynamic power consumption ratio is approximately 12 times and, with hard blocks, this gap generally becomes smaller.

Categories and Subject Descriptors
B.7 [Integrated Circuits]: Types and Design Styles

General Terms
Design, Performance, Measurement

Keywords
FPGA, ASIC, Area Comparison, Delay Comparison, Power Comparison

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FPGA’06, February 22–24, 2006, Monterey, California, USA.
Copyright 2006 ACM 1-59593-292-5/06/0002 ...$5.00.

1. INTRODUCTION
We were motivated to measure the area, performance and power consumption gap between field-programmable gate arrays (FPGAs) and standard cell application-specific integrated circuits (ASICs) for the following reasons:

1. In the early stages of system design, when system architects choose their implementation medium, they often choose between FPGAs and ASICs. Such decisions are based on the differences in cost (which is related to area), performance and power consumption between these implementation media but to date there have been few attempts to quantify these differences. A system architect can use these measurements to assess whether implementation in an FPGA is feasible. These measurements can also be useful for those building ASICs that contain programmable logic, by quantifying the impact of leaving part of a design to be implemented in the programmable fabric.

2. FPGA makers seeking to improve FPGAs can gain insight by quantitative measurements of these metrics, particularly when it comes to understanding the benefit of less programmable (but more efficient) hard heterogeneous blocks such as block memory [3, 17, 28], multipliers/accumulators [3, 17, 28] and multiplexers [28] that modern FPGAs often employ.

In this paper we focus on a comparison between a 90 nm CMOS SRAM-programmable FPGA and a 90 nm CMOS standard cell technology. We chose an SRAM-based FPGA because that approach by far dominates the market, and it was necessary to limit the scope of comparison in order to make this work tractable. Similarly, standard cells [8, 21] are currently the dominant choice in ASIC implementations versus pure gate arrays and the newer “structured ASIC” platforms [18, 19].

We present these measurements knowing that some of the methodology used will be controversial. We will carefully describe the comparison process so that readers can form their own opinions of the validity of the result. As always, the set of benchmarks we use is highly influential on the results, and indeed any given FPGA vs. ASIC comparison can vary significantly based on the application, as our results show. Since we perform measurements using a large set of designs, it was not feasible to individually optimize each design and it is likely that manual optimizations or greater tuning of the tools could yield improved results for any individual design; however, this is true for both the ASIC and FPGA platforms. We believe our results are more meaningful than past comparisons because we do consider a range of benchmarks instead of focusing on just a single design as has been done in most past analyses.

This paper is organized as follows: Section 2 describes previous work on measuring the gap between FPGAs and ASICs. Section 3 details the experimental methodology we
use in this work. The approach is a fundamentally empirical one in which the same circuits are implemented through two computer-aided design (CAD) flows which are described in Sections 4 and 5. Section 6 gives a precise definition of the comparison metrics. Section 7 presents the comparison results, and Section 8 concludes the paper.

2. PAST FPGA TO ASIC COMPARISONS
There have been a small number of past attempts to quantify the gap between FPGAs and ASICs which we will review here.

One of the earliest statements quantifying the gap between FPGAs and pre-fabricated media was by Brown et al. [4]. That work reported the logic density gap between FPGAs and Mask-programmable Gate Arrays (MPGAs) to be between 8 and 12 times, and the circuit performance gap to be approximately a factor of 3. The basis for these numbers was a cursory comparison of the largest available gate counts in each technology, and the anecdotal reports of the approximate operating frequencies in the two technologies at the time. While the latter may have been reasonable, the former suffered from optimistic gate counting in FPGAs.

In this paper we are seeking to measure the gap against standard cell implementations, rather than the less common MPGA. MPGAs have lower density relative to standard cells, which are on the order of 33% to 62% smaller and 9% to 13% faster than MPGA implementations [15]. Aside from the reliance on anecdotal evidence, the analysis in [4] is dated since it does not include the impact of hard dedicated circuit structures such as multipliers and block memories that are now common [3, 28]. In this work, we address this issue by explicitly considering the incremental impact of such blocks.

More recently, a detailed comparison of FPGA and ASIC implementations was performed by Zuchowski et al. [30]. They found that the delay of an FPGA lookup table (LUT) was approximately 12 to 14 times the delay of an ASIC gate. Their work found that this ratio has remained relatively constant across CMOS process generations from 0.25 µm to 90 nm. ASIC gate density was found to be approximately 45 times greater than that possible in FPGAs when measured in terms of kilo-gates per square micron. Finally, the dynamic power consumption of a LUT was found to be over 500 times greater than the power of an ASIC gate. Both the density and the power consumption exhibited variability across process generations but the cause of such variability was unclear. The main issue with this work is that it also depends on the number of gates that can be implemented by a LUT. In our work, we remove this issue by instead focusing on the area, speed and power consumption of application circuits.

Wilton et al. [27] also examined the area and delay penalty of using programmable logic. The approach taken for the analysis was to replace part of a non-programmable design with programmable logic. They examined the area and delay of the programmable implementation relative to the non-programmable circuitry it replaced. This was only performed for a single module in the design consisting of the next state logic for a chip testing interface. They estimated that when the same logic is implemented on an FPGA fabric and directly in standard cells, the FPGA implementation is 88 times larger. They measured the delay ratio of FPGAs to ASICs to be 2.0 times. Our work improves on this by comparing more circuits and using an actual commercial FPGA for the comparison.

Compton and Hauck [11] have also measured the area differences between FPGA and standard cell designs. They implemented multiple circuits from eight different application domains, including areas such as radar and image processing, on the Xilinx Virtex-II FPGA, in standard cells on a 0.18 µm CMOS process from TSMC, and on a custom configurable platform. Since the Xilinx Virtex-II is designed in 0.15 µm CMOS technology, the area results are scaled up to allow direct comparison with 0.18 µm CMOS. Using this approach, they found that the FPGA implementation is only 7.2 times larger on average than a standard cell implementation. The authors believe that one of the key factors in narrowing this gap is the availability of heterogeneous blocks such as memory and multipliers in modern FPGAs and, in our work, we quantify these claims.

While the present work aims to measure the gap between FPGAs and ASICs, it is noteworthy that the area, speed and power penalties of FPGAs are even larger when compared to the best possible custom implementation using full-custom design. It has been observed that full-custom designs tend to be 3 to 8 times faster than comparable standard cell ASIC designs [8]. In terms of area, a full-custom design methodology has been found to achieve 14.5 times greater density than a standard cell ASIC methodology [12]. Finally, the power consumption of standard cell designs has been observed as being between 3 and 10 times greater than full-custom designs [7, 9].

3. NEW FPGA TO ASIC COMPARISON
The measurements of the gaps between FPGAs and ASICs described in the previous section were generally based on simple estimates or single-point comparisons. To provide a more definitive measurement, our approach is to implement a range of benchmark circuits in FPGAs and standard cells with both designed using the same IC fabrication process geometry. The two implementations are then compared in terms of silicon area, maximum operating frequency and power consumption.

This comparison was performed using 90 nm CMOS technologies to implement a large set of benchmarks. We selected the Altera Stratix II [3] FPGA based on the availability of specific device data [10]. This device is fabricated using TSMC’s Nexsys 90 nm process [1]. The IC process we use for the standard cells is STMicroelectronics’ CMOS090 Design Platform [22]. This platform offers standard cell libraries optimized for speed or density and both high-VT and standard-VT versions are available. While the TSMC and STMicroelectronics processes are not identical, we believe they are sufficiently similar to allow them to be compared in this work. The results from both platforms will assume a nominal supply voltage of 1.2 V.

3.1 Benchmark Selection
The selection of benchmarks can significantly impact the results and, therefore, before considering how these benchmarks are implemented, we describe how the benchmarks were initially selected. We considered a variety of benchmarks (coded in either Verilog or VHDL) from a range of sources including publicly available designs from Opencores (http://www.opencores.org/) and designs developed for projects at the University of Toronto.
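
As a minimal sketch of the comparison outlined at the start of Section 3, the following Python fragment computes FPGA-to-ASIC ratios for area, delay and power for one benchmark and summarizes a set of such ratios. The record layout, the field and function names, and the use of a geometric mean as the summary statistic are assumptions made for illustration only; the numbers in the usage lines are placeholders, not measurements from this work.

```python
from dataclasses import dataclass
from math import exp, log
from typing import Dict, Iterable


@dataclass
class Measurement:
    """Results for one implementation of a benchmark (hypothetical layout)."""
    area_um2: float   # silicon area
    fmax_mhz: float   # maximum operating frequency
    power_mw: float   # power consumption


def gap(fpga: Measurement, asic: Measurement) -> Dict[str, float]:
    """Return FPGA-to-ASIC ratios; a value above 1 means the FPGA is worse."""
    return {
        "area": fpga.area_um2 / asic.area_um2,
        # Delay is the reciprocal of frequency, so the delay ratio
        # is the ASIC frequency divided by the FPGA frequency.
        "delay": asic.fmax_mhz / fpga.fmax_mhz,
        "power": fpga.power_mw / asic.power_mw,
    }


def geometric_mean(values: Iterable[float]) -> float:
    """Summarize ratios across benchmarks (assumed summary statistic)."""
    vals = list(values)
    if not vals:
        raise ValueError("need at least one ratio")
    return exp(sum(log(v) for v in vals) / len(vals))


if __name__ == "__main__":
    # Placeholder values chosen only to exercise the functions.
    fpga = Measurement(area_um2=500.0, fmax_mhz=100.0, power_mw=30.0)
    asic = Measurement(area_um2=20.0, fmax_mhz=400.0, power_mw=3.0)
    print(gap(fpga, asic))
```

Expressing every metric as an FPGA-to-ASIC ratio keeps the three gaps directly comparable, and a multiplicative mean is one reasonable way to average ratios that span a wide range across benchmarks.
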
There were two critical factors that had to be considered in benchmark selection. The first was ensuring that the Verilog or VHDL RTL was synthesized similarly by the different tools used for FPGA and ASIC implementation. We did not have access to a single synthesis tool that could adequately target both platforms. Therefore, we had to ensure that the results from the two synthesis tools were sufficiently similar. To check this, we compared the number of registers inferred by the two synthesis processes, which we describe in Sections 4 and 5.1. We rejected any design in which the register counts deviated by more than 5%. Some differences in the register count are expected because different implementations are appropriate on the different platforms. For example, FPGA designs tend to use one-hot encodings for state machines because of the low incremental cost for flip-flops.

The other issue impacting benchmark selection was ensuring that the designs can make use of the block memories and dedicated multipliers on the Stratix II. This is important because one of the aims of our work is analyzing the improvements possible when these hard dedicated blocks are used. However, not all designs will use such features, which made it important to ensure that the set of benchmarks includes both cases when these hard structures are used and not used.

Based on these two factors, the set of benchmarks in Table 1 was selected for use in this work. To provide an indication of the size of the benchmarks, the table also lists the number of Altera Stratix II ALUTs, 9x9 multipliers and memory bits used by each design. The ALUT is slightly more powerful than the traditional 4-input LUT-based logic block [3]. The 9x9 multipliers are the smallest possible division of the Stratix II’s DSP block. These basic blocks can be combined to form larger multipliers (two can be used to make an 18x18 multiplier and eight are needed to make a 36x36 multiplier). While all the benchmarks are relatively modest in size, we believe that the circuits are sufficiently large to give us an accurate measure of the gap between FPGAs and ASICs.

Table 1: Benchmark Summary

Design          ALUTs   Total 9x9 Multipliers   Memory Bits
booth              68                       0             0
rs encoder        703                       0             0
cordic18        2 105                       0             0
cordic8           455                       0             0
des area          595                       0             0
des perf        2 604                       0             0
fir restruct      673                       0             0
mac1            1 885                       0             0
aes192          1 456                       0             0
fir3               84                       4             0
diffeq            192                      24             0
diffeq2           288                      24             0
molecular       8 965                     128             0
rs decoder1       706                      13             0
rs decoder2       946                       9             0
atm            16 544                       0         3 204
aes               809                       0        32 768
aes inv           943                       0        34 176
ethernet        2 122                       0         9 216
serialproc        680                       0         2 880
fir24           1 235                      50            96
pipe5proc         837                       8         2 304
raytracer      16 346                     171        54 758

4. FPGA CAD FLOW
The Altera Quartus II v5.0SP1 FPGA software was used for all stages of the CAD flow. Synthesis is performed using Quartus II Integrated Synthesis (QIS). Quartus II was configured to perform balanced optimization which optimizes the speed of timing critical portions of the design and area for the remainder of the design. When large memories were required, they were coded by explicit instantiation in the RTL using an appropriate configuration of Altera’s altsyncram design library function. To enable further optimization of the design, QIS was left in its default configuration in which it automatically instantiates ROMs, RAMs and DSP blocks (the latter of which contains hard multiply-accumulate circuits) when needed. All other options were also left at their default setting.

Placement and routing with the Quartus II “fitter” was performed using the “Standard Fit” effort level. This is the highest effort level and the tool attempts to obtain the best possible timing results irrespective of any timing constraints [2]. We rely on this high effort level to produce the fastest design possible, since we do not constrain the design with any timing constraints. Similar results were obtained when the design was constrained to an unattainable 1 GHz. After placement and routing, the Quartus Timing Analyzer was used for static timing analysis.

In this flow, we allow the fitter to select the specific Stratix II device used; however, we restrict the selection process to use only the fastest speed grade parts. FPGA manufacturers test the speed of their parts after manufacturing and then bin the parts into typically three different speed grades which capture different portions of the manufacturing range of the process. An ASIC’s delay is based on the worst case temperature, voltage and process since ASIC parts generally are not speed binned; therefore, using the fastest speed grade devices could arguably favour the FPGA. To address this issue, we will also present results using the slowest speed grade parts. Device selection is also important because it affects the available resources. For industrial FPGA designs, generally the smallest (and cheapest) part would be selected. As will be described later, our FPGA to ASIC comparison optimistically ignores the issue of device size granularity.

It is also important to note that the final operating frequency of the design can vary depending on the random seed given to the placement tool. Therefore, we repeated the entire FPGA CAD flow five times using five different seeds. Any results we report are derived from the placement that resulted in the fastest operating frequency.

5. ASIC CAD FLOW
The standard cell CAD flow is significantly more complicated than the relatively push-button approach for FPGAs. The flow was built around tools from Cadence and Synopsys that were provided by CMC Microsystems (http://www.cmc.ca). We relied on vendor documentation, tutorials created by CMC Microsystems and tool demonstration
sessions provided by the vendors to determine how best to use these tools. Figure 1 illustrates the steps in the CAD flow and this section describes each in greater detail.

[Figure 1: ASIC CAD Flow. RTL Design Description -> Synthesis (Synopsys Design Compiler) -> Placement and Routing (Cadence SOC Encounter; yields Area) -> Extraction (Synopsys Star-RCXT) -> Timing Analysis (Synopsys PrimeTime; yields Delay) -> Simulation (Cadence NC-Sim) -> Power Analysis (Synopsys PrimePower; yields Power)]

5.1 ASIC Synthesis
In the ASIC flow, synthesis was performed using Synopsys Design Compiler V-2004.06-SP1. A common compile script was used for all the benchmarks. The approach for the compilation was a top-down approach [23] in which all the modules starting from the top-level module down are compiled together. This preserves the design hierarchy. It is a reasonable approach because individually the benchmarks have relatively modest sizes and, therefore, neither CPU time nor memory size is a significant issue during compilation. The script starts by first analyzing the hardware description language (HDL) source files for the benchmark and then elaborating and linking the top level module. Next the constraints for the compilation are applied.

As a starting point, the clocks in the design are initially constrained to operate at 2 GHz. This unrealistic constraint ensures that the compilation attempts to achieve the fastest clock frequency possible for each of the circuits. To ensure the area of the design remains reasonable, the maximum area is constrained to 0. This is a standard approach [23] for ensuring area optimization despite the fact that it too is an unreasonable constraint.

The ST 90 nm process design kit available to us includes four different standard cell libraries. Two of the libraries are designed for area efficiency but there are relatively few cells in these libraries. The other two libraries are optimized for speed. In each of the two cases, area or speed optimization, one of the libraries uses a low leakage high-VT implementation while the other library uses higher performing standard-VT transistors. In Synopsys Design Compiler, we set all four libraries as the target libraries which means

pilation either maintains or improves performance by performing gate-level optimizations [24].

For any modern design, Design for Testability (DFT) techniques are necessary to enable testing for manufacturing defects. In standard cell ASICs, it is customary to use scan chains to facilitate testing [26]. These scan chains require that all sequential cells in a design are replaced with their scan-equivalent implementation and, therefore, in all the compilations performed with Design Compiler we make use of its Test Ready Compile option which performs this replacement automatically. For the FPGA-based implementation, testing is performed by the manufacturer and the inherent programmability of the FPGA generally means no extra circuitry is required.

Once these two compilations have been performed, a reasonable operating frequency for the clocks in the design should be available. The desired clock period for each clock in the design is then adjusted from the unrealistic 0.5 ns constraint to the critical path delay that was obtained in the compilation. A final high effort compilation is performed using these realistic clock constraints. For this final compilation, we enable sequential area recovery optimizations which allow Design Compiler to remap sequential elements that are not on the critical path in order to save area. Following this compilation, scan chains are inserted to connect the scan-enabled flip-flops. Finally, once the scan chains have been inserted, the final netlist and the associated constraints are saved for use by the placement and routing tools.

In cases where the benchmark circuits required memories, the appropriate memory cores were generated by STMicroelectronics’ memory compilers. CMC Microsystems and Circuits Multi-Projets (CMP) (http://cmp.imag.fr) coordinated the generation of these memory cores with STMicroelectronics. We chose to use low power 1.2 V memories, which resulted in memories that were significantly slower than regular memories. (We were not able to obtain higher speed memories in time for this work.) Within this low power class of memories, we selected compilers for higher speed over higher density or further reduced power consumption. We also opted to make the memories as square as possible. The models provided for the memories did not exactly match our voltage and temperature analysis conditions and, to account for this, we scaled the delay and power measurements using scaling factors determined by performing HSPICE simulations.

In all of the compilations, no effort was made to optimize the power consumption of the design. This is likely atypical for modern designs but we believe it ensures a fair comparison with the FPGA implementation. With the current FPGA CAD tools, power optimization is not an option and, therefore, using tools such as Synopsys Power Compiler to optimize the standard cell designs would demonstrate an excessively large disparity between the approaches.

5.2 ASIC Placement and Routing
The netlist and constraints produced by synthesis were placed and routed with Cadence SOC Encounter GPS v4.1.5. The flow was adapted from the one described in the En-
that it can select cells from any of these libraries as it sees        counter Design Flow Guide and Tutorial [5] and will be de-
fit.                                                                    scribed below.
   After setting the optimization constraints and target cells,           The sizes of our benchmark designs allow us to avoid the
the design is compiled using Design Compiler’s high-effort             hierarchical chip floor-planning steps required for large de-
compilation setting. After this full compilation is complete,          signs. Instead, we implement each design as an individual
a high-effort incremental mapping is performed. This com-               block and we do not perform any design partitioning. We

                                                                  24
found this approach to be reasonable in terms of run time and memory size for our benchmarks.

   The first step was to create a floorplan; a key decision here was to set the target row utilization to 85% and the target aspect ratio to 1.0. Row utilization is the percentage of the area required for the standard cells relative to the total row area allocated for placement. A high row utilization minimizes wasted area but makes routing more difficult. We selected a target of 85% to minimize any routing problems. This is intentionally below higher utilizations of >85% that make placement and routing more challenging [29]. We encountered difficulty placing and routing the circuits with the large memory macro blocks; therefore, the target row utilization in those benchmarks was reduced to 75%. After the floorplanning with these constraints, placement is performed. This placement is timing-driven using the worst-case timing models. After placement, scan chain reordering is performed to reduce the wirelength required for the scan chain.

   Next, the placement is refined using a built-in congestion optimization command and Encounter’s optDesign macro command. This macro command performs optimizations such as buffer additions, gate resizing and netlist restructuring. After these optimizations, the clock tree is inserted. Setup and hold time violations are then corrected using the new true clock tree delay information. Once the violations are fixed, filler cells are added to the placement in preparation for routing.

   Routing is performed using Encounter’s Nanoroute engine. We allow the router to use the seven metal layers available in the STMicroelectronics process. After routing completes, we add any metal fill required to satisfy metal density requirements. A detailed extraction is then performed. This extraction is not of the same quality as the sign-off extraction but is sufficient for guiding the timing-driven optimizations. The extracted information is used to perform post-routing optimizations that are focused on improving the critical paths. These optimizations include drive strength adjustments. After these in-place optimizations, routing is again performed and the output is again checked for any connectivity or design rule violations. The final netlist is then saved in various forms as required for the subsequent steps in the CAD flow.

5.3 Extraction and Timing Analysis

   With our current tool and technology kit setup, the RC information we provide to SOC Encounter GPS is not suitable for the final timing and power analysis. Therefore, after the final placement and routing is complete, a final sign-off quality extraction is performed using Synopsys Star-RCXT V-2004.06. This final RC extraction is saved for use in final timing and power analysis by Synopsys PrimeTime SI version X-2005.06 and Synopsys PrimePower version V-2004.06SP1 respectively.

6. COMPARISON METRICS AND MEASUREMENT METHOD

   Once the designs were implemented using both the ASIC and FPGA approaches and we were confident that the implementations were directly comparable, the area, delay and power of the designs were compared. In this section, we give a precise definition of each metric and the method used for measurement.

6.1 Area

   Determining the area of the standard cell implementation is straightforward as it is simply the final core area of the placed and routed design. For the FPGA, the area is calculated using the actual silicon area of each of the resources used by the design. This means that we take the final number of Altera Stratix II resources including the basic logic LABs, the M512, M4K, MRAM memories and DSP blocks and multiply each by the silicon area of that specific block [10]. This includes the area for the routing surrounding each of the blocks. The entire area of a block is counted even if only a portion of the block is used. For example, if only a single memory bit were used in one of the large 589,824-bit MRAM blocks we would include the entire area of the MRAM block. We recognize that this approach may be considered optimistic for a few reasons. First, it ignores the fact that FPGAs, unlike ASICs, are not available in arbitrary sizes. A designer is forced to select one particular discrete size even if it is larger than required for the design. While this is an important factor, our goal is to focus on the cost of programmable fabric itself; therefore, we believe it is acceptable to ignore any area wasted due to the discrete nature of FPGA device families. Related to this is the fact that we are also handling the heterogeneity of the FPGA optimistically. With commercial FPGAs, a designer is forced to tolerate fixed ratios of logic, memories and multipliers. Again, since our focus is on the cost of programmability itself, we consider it acceptable to ignore the impact of the fixed heterogeneous block ratios.

   For both implementation media, we do not consider the impact of any input or output cells. As well, to avoid disclosing any proprietary information, no absolute areas will be reported in this work; instead, we will only report the ratio of the FPGA area to the ASIC area.

6.2 Speed

   Static timing analysis was used to measure the critical path of each design. This timing analysis determines the maximum clock frequencies for each design. In the case of the eth_top benchmark which contains multiple clocks, we compare the geometric average of all the clocks in each implementation. Timing analysis for the FPGA was performed using the timing analysis integrated in Quartus II. For the standard cell implementation, Synopsys PrimeTime SI (which accounts for delay due to cross-talk) was used with the worst-case timing models.

6.3 Power

   Power has become one of the most important issues separating FPGA and ASIC designs but it is one of the most challenging metrics to compare. In this section, we first describe how we measure the static and dynamic components of a design’s power consumption. The two contributions are separated both to simplify the analysis and because we are only able to report meaningful results for the dynamic power consumption comparison. In an attempt to ensure a fair and useful comparison, we adjusted the measurements of the static power and we describe our adjustments later in this section so as to explain the limited static power consumption results we are able to report.

6.3.1 Dynamic and Static Power Measurement

   The preferred measurement approach, particularly for dynamic power measurements, is to stimulate the post-placed
and routed design with designer-created testbench vectors. For the present work, we take this approach when an appropriate testbench is available for a benchmark, and the result tables will indicate if this was possible. Useful testbenches are generally not available and, in those cases, we use a less accurate approach that relies on arbitrary settings of toggle rates and static probabilities for the nets in the designs.

   All the power measurements were taken at a junction temperature of 25 °C using typical silicon. Both the FPGA and ASIC implementations are simulated at the same operating frequency to allow us to directly compare the dynamic power consumptions. The operating frequency for all the designs was kept constant at 33 MHz. This frequency was selected since it was a valid operating frequency for all the benchmark designs on both platforms. We now describe the process used to generate the power consumption estimates for the FPGA and ASIC designs.

   For the FPGA implementation, the placed and routed design is exported as a netlist along with the appropriate delay annotations from Quartus II. If there are test bench vectors available for the benchmark, then digital simulation is performed using Mentor Modelsim 6.0c, which creates a Value Change Dump (VCD) file containing the switching activities on all circuit nodes. The Quartus Power Analyzer reads this file and determines the static and dynamic power consumption. The activities are computed with glitch filtering enabled so that transitions that do not fully propagate through the routing network are ignored. Since we are only focused on the programmable fabric in this investigation, only core power (supplied by VCCINT) reported by the power analyzer is considered. The power analyzer breaks this power consumption down into static and dynamic components.

   For the standard cell design, simulation of the placed and routed netlist with back-annotated timing is performed using Cadence NC-Sim 5.40. This also produces a VCD file capturing the states and transitions for all circuit nodes in the design. This file, along with the parasitic information extracted by Star-RCXT, is used to perform power analysis with the Synopsys PrimePower tool, version V-2004.06SP1. For this dynamic analysis, PrimePower automatically handles glitches by scaling the power when the interval between toggles is less than the rise and fall delays of the net. PrimePower also divides the power consumption into the static and dynamic components.

   For most designs, proper testbenches were not available. In such cases, power measurements were taken by assuming all the nets in all designs toggle at the same frequency and that all the nets have the same static probability. While this is not realistic, it provides a rough estimate of the power consumption differences between implementations. When this approach is used for measurements, it is noted. We chose this approach over statistical vectorless estimation techniques that use toggle rates and static probabilities at input nodes to estimate the toggle rates and probabilities throughout the design because the two power estimation tools produced significantly different activity estimates.

6.3.2 Dynamic and Static Power Comparison Methodology

   We believe that the ASIC and FPGA dynamic power consumption measurements can directly be compared but the static power consumption requires adjustment before a reliable comparison is possible. This adjustment is necessary because many of the benchmarks do not fully utilize a specific FPGA device. To account for this, the static power consumption reported by the Quartus Power Analyzer is scaled by the fraction of the core FPGA area that the circuit uses. This decision is arguable as any purchaser of an FPGA is necessarily limited to specific devices and, therefore, would indeed incur the extra static power consumption. However, this device quantization effect obscures the underlying properties that we seek to measure, and changes depending on an FPGA vendor’s decision on how many devices to put in an FPGA family. We also anticipate that future generations of FPGAs will allow the power shutdown of unused portions of the devices.

   To be clear, we give a hypothetical example of the fractional static power calculation: if a circuit used 1 LAB and 1 MRAM block occupying a hypothetical area of 101 µm² on an FPGA that contained a total of 10 LABs and 2 MRAM blocks occupying an area of 210 µm², we would multiply the reported static power consumption by 101/210 = 0.48 to obtain the static power consumption used for comparison purposes. This approach assumes the leakage power is approximately proportional to the total transistor width of a design which is reasonable based on [14] and that the area of a design is a linear function of the total transistor width.

   It is important to note that these measurements compare the power consumption gap as opposed to energy consumption gap. An analysis of the energy consumption gap would have to reflect the slower operating frequencies of the FPGA. The slower frequency means that more time or more parallelism would be required to perform the same amount of work as the ASIC design. To simplify the analysis in this work, only the power consumption gap will be considered.

7. RESULTS

   The measurement methodology described above was applied to each of the benchmarks listed in Table 1, and the metrics were compared. In the following sections, the area, delay and power gap between FPGAs and ASICs will be reported and discussed.

7.1 Area

   The area gap between FPGAs and ASICs for the benchmark circuits is summarized in Table 2. The gap is reported as the factor by which the area of the FPGA implementation is larger than the ASIC implementation. As described previously, this gap is sensitive to the benchmarks’ use of heterogeneous blocks (memory and multipliers) and the results in the table are categorized in four ways: Those benchmarks that used only the basic logic fabric of clusters of LUTs are labelled “Logic Only.” Those that used logic clusters and hard DSP blocks containing multiplier-accumulators are labelled “Logic and DSP.” Those that used clusters and memory blocks are labelled “Logic and Memory,” and finally those that used all three are labelled “Logic, DSP and Memory”. We implemented the benchmarks that contained multiplication operations with and without the hard DSP blocks so results for these benchmarks appear in two columns, and allow the direct measurement of the benefit of these blocks.

   First, consider those circuits that only use the basic logic LUT clusters: the area required to implement these circuits in FPGAs compared to standard cell ASICs is on average a factor of 40 times larger, with the different designs covering a range from 23 to 55 times. This is significantly larger than
the area gap suggested by [4], which used extant gate counts as its source. It is much closer to the numbers suggested by [30].

Table 2: Area Ratio (FPGA/ASIC)
Name           Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth          33
rs_encoder     36
cordic18       26
cordic8        29
des_area       43
des_perf       23
fir_restruct   34
mac1           50
aes192         49
fir3           45           20
diffeq         44           13
diffeq2        43           15
molecular      55           45
rs_decoder1    55           61
rs_decoder2    48           43
atm                                       93
aes                                       27
aes_inv                                   21
ethernet                                  34
serialproc                                42
fir24                                                      9.8
pipe5proc                                                  25
raytracer                                                  36
Geomean        40           28            37               21

   We can confirm the plausibility of this larger number based on our recent experience in designing and building complete FPGAs [16, 20]. As part of this work, we created a design similar to the Xilinx Virtex-E, a relatively modern commercial architecture. If we consider such a design, only the lookup tables and flip-flops perform the basic logic operations that would also be necessary in a standard cell design. The FPGA however also requires additional circuitry to enable programmable connections between these lookup tables and flip-flops. This excess circuitry is the fundamental reason for the area gap. Using our model of the Virtex-E, we calculated that the LUT and flip-flop only take up 3.4% of the total area for a Virtex-E cluster and its neighbouring routing. The absolute area in the standard cell design required to implement the functionality implemented by the LUT and flip-flop will be similar to the area of the FPGA’s LUT and flip-flop. This suggests the area gap should be at least 100%/3.4% = 29. This is similar to our experimental measurement.

   The hard heterogeneous blocks do significantly reduce this area gap. As shown in Table 2, the benchmarks that make use of the hard multiplier-accumulators and logic clusters are on average only 28 times larger than an ASIC. When hard memories are used, the average of 37 times larger is slightly lower than the average for regular logic and when both multiplier-accumulators and memories are used, we find the average is 21 times. Comparing the area gap between the benchmarks that make use of the hard multiplier-accumulator blocks and those same benchmarks when the hard blocks are not used best demonstrates the significant reduction in FPGA area when such hard blocks are available. In all but one case the area gap is significantly reduced¹. This reduced area gap was expected because these heterogeneous blocks are fundamentally similar to an ASIC implementation with the only difference being that the FPGA implementation requires a programmable interface to the outside blocks and routing.

¹ The area gap of the rs_decoder1 increases when the multiplier-accumulator blocks are used. This is atypical and it appears to occur because the 5 bit by 5 bit multiplications in the benchmark are more efficiently implemented in regular logic instead of the Stratix II’s 9x9 multiplier blocks.

   These results demonstrate the importance of the introduction of these heterogeneous blocks in improving the competitiveness of FPGAs. It is important to recall that for these heterogeneous blocks, the analysis is somewhat optimistic for the FPGAs. As described earlier, we only consider the area of blocks that are used, and we do not consider the fixed ratio of logic to heterogeneous blocks that a user is forced to tolerate and pay for.

   It is noteworthy that significant variability in the area gap is observed in the benchmarks that make use of the heterogeneous blocks. One contributor to this variability is the varying amounts of heterogeneous content. Our classification system is binary in that a benchmark either makes use of a hard structure or it does not but this fails to recognize the varying amounts of heterogeneity in the benchmarks. To address this, we can consider the fraction of a design’s area that is used by heterogeneous blocks. If we consider only the benchmarks that employ DSP blocks, we find that the percentage of the total area which is used by DSP blocks exhibits a correlation of -0.87 with the area gap measurement. This relatively strong inverse correlation corresponds with our expectations since as the DSP area content is increased the design becomes more like a standard cell design thereby resulting in a reduced area gap.

7.2 Speed

   The speed gap for the benchmarks used in this work is given in Table 3. The table reports the ratio between the FPGA’s critical path delay relative to the ASIC for each of the benchmark circuits. As was done for the area comparison, the results are categorized according to the types of heterogeneous blocks that were used on the FPGA.

   Table 3 shows that, for circuits with logic only, the average FPGA circuit is 3.2 times slower than the ASIC implementation. This generally confirms the earlier estimates from [4], which were based on anecdotal evidence of circa-1991 maximum operating speeds of the two approaches. However, these results deviate substantially from those reported in [30], which is based on an apples-to-oranges LUT-to-gate comparison.

   For circuits that make use of the hard DSP multiplier-accumulator blocks, the average circuit was 3.4 times slower in the FPGA than in an ASIC, and in general the use of the hard block actually slowed down the design as can be seen by comparing the second and third column of Table 3. This result is surprising since one would expect the faster hard multipliers to result in faster overall circuits. We examined each of the circuits that did not benefit from the hard multipliers to determine the reason this occurred. For the

Table 3: Critical Path Delay Ratio (FPGA/ASIC) - Fastest Speed Grade
Name           Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth          4.8
rs_encoder     3.5
cordic18       3.6
cordic8        1.8
des_area       1.8
des_perf       2.8
fir_restruct   3.5
mac1           3.5
aes192         4.0
fir3           3.9          3.4
diffeq         4.0          4.1
diffeq2        3.9          4.0
molecular      4.4          4.5
rs_decoder1    2.2          2.7
rs_decoder2    2.0          2.2
atm                                       2.7
aes                                       3.7
aes_inv                                   4.0
ethernet                                  1.6
serialproc                                1.0
fir24                                                      2.5
pipe5proc                                                  2.5
raytracer                                                  1.4
Geomean        3.2          3.4           2.3              2.1

Table 4: Critical Path Delay Ratio (FPGA/ASIC) - Slowest Speed Grade
Name           Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth          6.6
rs_encoder     4.7
cordic18       4.9
cordic8        2.5
des_area       2.6
des_perf       3.8
fir_restruct   5.0
mac1           4.6
aes192         5.4
fir3           5.4          4.6
diffeq         5.4          5.5
diffeq2        5.2          5.4
molecular      6.0          6.1
rs_decoder1    3.0          3.6
rs_decoder2    2.7          3.0
atm                                       3.6
aes                                       4.9
aes_inv                                   5.5
ethernet                                  2.2
serialproc                                1.4
fir24                                                      3.3
pipe5proc                                                  3.5
raytracer                                                  2.0
Geomean        4.3          4.5           3.1              2.8
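
As an illustrative cross-check that is not part of the original methodology, the Geomean row of Table 3 can be recomputed from the per-benchmark ratios. The Python sketch below does this. The function name `geomean` and the dictionary layout are our own illustrative choices, and because the tabulated ratios are already rounded, the recomputed means should be expected to match only to the precision shown in the table.

```python
import math

def geomean(values):
    """Geometric mean: exponential of the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Critical path delay ratios (FPGA/ASIC), fastest speed grade, copied from Table 3.
# Each list holds the entries of one column, in row order.
table3 = {
    "Logic Only": [4.8, 3.5, 3.6, 1.8, 1.8, 2.8, 3.5, 3.5, 4.0,
                   3.9, 4.0, 3.9, 4.4, 2.2, 2.0],
    "Logic & DSP": [3.4, 4.1, 4.0, 4.5, 2.7, 2.2],
    "Logic & Memory": [2.7, 3.7, 4.0, 1.6, 1.0],
    "Logic, Memory & DSP": [2.5, 2.5, 1.4],
}

# Recompute the summary row and print it to one decimal place, as in the table.
recomputed = {name: geomean(vals) for name, vals in table3.items()}
for name, g in recomputed.items():
    print(f"{name}: {g:.1f}")
```

A geometric rather than arithmetic mean is the natural summary here because the entries are ratios: it treats a benchmark that is twice as slow and one that is twice as fast symmetrically.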

molecular benchmark, the delays with and without the DSP blocks were similar because
there are more multipliers in the benchmark than there are DSP blocks. As a result, even
when DSP blocks are used the critical path on the FPGA is through a multiplier
implemented using regular logic blocks. For the rs decoder1 and rs decoder2 benchmarks,
only small 5x5 bit and 8x8 bit multiplications are performed and the DSP blocks which
are based on 9x9 bit multipliers do not significantly speed up such small
multiplications. In such cases where the speed improvement is minor, the extra routing
that can be necessary to accommodate the fixed positions of the hard multiplier blocks
can eliminate the speed advantage of the hard multipliers. Finally, the diffeq and
diffeq2 benchmarks perform slower when the DSP blocks are used because the 32x32 bit
multiplications performed in the benchmarks are not able to fully take advantage of the
hard multipliers which were designed for 36x36 bit multiplication. As well, those two
benchmarks contain two unpipelined stages of multiplication and it appears that
implementation in the regular logic clusters is efficient in such a case. We believe
that with a larger set of benchmark circuits we would have encountered more benchmarks
that could benefit from the use of the hard multipliers, particularly if any designs
were more tailored to the DSP block’s functionality. However, as these results
demonstrated, the major benefit of these hard DSP blocks is not the performance
improvement, if any, but rather the significant improvement in area efficiency.
   For the circuits that make use of the block memory the FPGA-based designs are on
average 2.3 times slower and for the few circuits using both memory and multipliers the
FPGA is on average 2.1 times slower. The use of memory blocks does appear to offer a
performance advantage; however, this effect is exaggerated because of the slow low power
memories used for the standard cell design as described in Section 5.1. We believe that,
if higher speed memories were used instead for the ASIC, the performance advantage of
the block memories would be relatively minor since, based on gate delays, the speed can
be improved by over 20% [25, 6]. Therefore, our conclusion for the memory blocks is the
same as for the DSP blocks, which is that the primary benefit from such blocks is
improved area efficiency.
   As described earlier, the FPGA delay measurements assume the fastest speed grade part
is used. Comparing to the fastest speed grade is useful for understanding the best case
disparity between FPGAs and ASICs but it is not entirely fair. ASICs are generally
designed for the worst case process and it may be fairer to compare the ASIC performance
to that of the slowest FPGA speed grade. Table 4 presents this comparison. For logic
only circuits, the ASIC performance is now 4.3 times greater than the FPGA. When the
circuits make use of the DSP blocks the gap is 4.5 times and when memory blocks are used
the performance difference is 3.1 times. For the circuits that use both the memory and
the multipliers, the average is 2.8 times. As expected, the slower speed grade parts
cause a larger performance gap between ASICs and FPGAs.

7.3    Power Consumption
   In Table 5, we list the ratio of FPGA dynamic power consumption to ASIC power
consumption for the benchmark

circuits. Again, we categorize the results based on which hard FPGA blocks were used. As
described earlier, two approaches are used for power consumption measurements and the
table indicates which method was used. “Sim” means that the simulation-based method
(with full simulation vectors) was used and “Const” indicates that a constant toggle
rate and static probability was applied to all nets in the design. Static power results
are not presented for reasons that will be described later.

Table 5: Dynamic Power Consumption Ratio (FPGA/ASIC)

     Name           Method    Logic    Logic &    Logic &    Logic, Memory
                              Only     DSP        Memory     & DSP
     booth          Sim       16
     rs encoder     Sim       7.2
     cordic18       Const     6.3
     cordic8        Const     6.0
     des area       Const     26
     des perf       Const     9.3
     fir restruct   Const     9.0
     mac1           Const     18
     aes192         Sim       12
     fir3           Const     12       7.4
     diffeq         Const     15       12
     diffeq2        Const     16       12
     molecular      Const     15       15
     rs decoder1    Const     13       16
     rs decoder2    Const     11       11
     atm            Const                         11
     aes            Sim                           4.0
     aes inv        Sim                           3.9
     ethernet       Const                         15
     serialproc     Const                         24
     fir24          Const                                    5.2
     pipe5proc      Const                                    12
     raytracer      Const                                    12
     Geomean                  12       12         9.2        9.0

   The results indicate that on average FPGAs consume 12 times more dynamic power than
ASICs when the circuits contain only logic. If we consider the subset of designs for
which the simulation-based power measurements were used we observe that the results are
on par with the results from the constant toggle rate method. We are more confident in
the results when this technique is used. However, the results using the constant toggle
rate approach are relatively similar and the simulation-based outcome is within the
range of the results seen with the constant toggle rate method.
   When we consider designs that include hard blocks such as DSP blocks and memory
blocks, we observe that the gap is 12, 9.2 and 9.0 times for the cases when multipliers,
memories and both memories and multipliers are used, respectively. The area savings that
these hard blocks enabled suggested that some power savings should occur because a
smaller area difference implies fewer excess transistors which in turn means that the
capacitive load on the signals in the design will be less. With a lower load, dynamic
power consumption is reduced and we observe this in general. In particular, we note that
the circuits that use DSP blocks consume equal or less power when the area efficient DSP
blocks are used as compared to when those same circuits are implemented without the DSP
blocks. The one exception is again rs decoder1 which suffered from an inefficient use of
the DSP blocks.
   In addition to the dynamic power, we measured the static power consumption of the
designs for both the FPGA and the ASIC implementations; however, as will be described,
we were unable to draw any useful conclusions. We performed these measurements for both
typical silicon at 25 °C and worst-case silicon at 85 °C. To account for the fact that
the provided worst case standard cell libraries were characterized for a higher
temperature, the standard cell results were scaled by a factor determined from HSPICE
simulations of a small sample of cells. We did not need to scale the results for typical
silicon. The results we observed for these two cases deviated significantly. For logic
only designs, on average the FPGA-based implementations consumed 87 times more static
power than the equivalent ASIC when measured for typical conditions and typical silicon
but this difference was only 5.4 times under worst case conditions for worst case
silicon.
   The usefulness of either of these results is unclear. Designers are generally most
concerned about worst-case conditions which makes the typical-case measurements
uninformative and potentially subject to error since more time is spent ensuring the
accuracy of the worst-case models. The worst-case results measured in this work suffer
from error introduced by our temperature scaling. As well, static power, which is
predominantly due to sub-threshold leakage for current technologies [13], is very
process dependent and this makes it difficult to ensure a fair comparison given the
available information. In particular, we do not know the confidence level of either
worst-case leakage estimate. These estimates are influenced by a variety of factors
including the maturity of a process and, therefore, a comparison of leakage estimates
from two different foundries, as we attempt to do here, may reflect the underlying
differences between the foundries and not the differences between FPGAs and ASICs that
we seek to measure. Another issue that makes comparison difficult is that, if static
power is a concern for either FPGAs or ASICs, manufacturers may opt to test the power
consumption and eliminate any parts which exceed a fixed limit. Both business and
technical factors could impact those fixed limits. Given all these factors, to perform a
comparison in which we could be confident, we would need to perform HSPICE simulations
using identical process models. We did not have these same concerns about dynamic power
because process and temperature variations have significantly less impact on dynamic
power.
   Despite our inability to reliably measure the absolute static power consumption gap,
we did find that, as expected, the static power gap and the area gap are somewhat
correlated. (The correlation coefficient of the area gap to the static power gap was
0.80 and 0.81 for the typical and worst case measurements respectively.) This was
expected because transistor width is generally proportional to the static power
consumption [14] and the area gap partially reflects the difference in total transistor
width between an FPGA and an ASIC. This relationship is important because it
demonstrates that hard blocks such as multipliers and block memories, which reduced the
area gap, reduce the static power consumption gap as well.
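
   As an illustrative check (not part of the original measurement flow), the Geomean row
of Table 5 can be approximately reproduced from the per-benchmark ratios. The sketch
below assumes the column assignment implied by the table layout; the names TABLE5 and
geomean are hypothetical and introduced only for this example. Because the table entries
are printed to two significant figures, the recomputed means can differ slightly from
the published Geomean values.

```python
import math

# FPGA/ASIC dynamic power ratios transcribed from Table 5, grouped by column.
# The grouping is an assumption based on the table layout.
TABLE5 = {
    "Logic Only": [16, 7.2, 6.3, 6.0, 26, 9.3, 9.0, 18, 12,
                   12, 15, 16, 15, 13, 11],
    "Logic & DSP": [7.4, 12, 12, 15, 16, 11],
    "Logic & Memory": [11, 4.0, 3.9, 15, 24],
    "Logic, Memory & DSP": [5.2, 12, 12],
}


def geomean(values):
    """Geometric mean: exp of the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Print one recomputed geometric mean per circuit category.
for category, ratios in TABLE5.items():
    print(f"{category:20s} {geomean(ratios):.1f}")
```

The geometric mean suits ratio data such as these because it treats a benchmark that is
twice the typical gap and one that is half the typical gap symmetrically, whereas an
arithmetic mean would be pulled toward outliers such as the ratio of 26 for des area.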

8. CONCLUSION
   This paper has presented empirical measurements quantifying the gap between FPGAs and
ASICs. We observed that for circuits implemented entirely using LUTs and flip-flops
(logic-only), an FPGA is on average 40 times larger and 3.2 times slower than a standard
cell implementation. An FPGA also consumes 12 times more dynamic power than an
equivalent ASIC on average. We confirmed that the use of hard multipliers and dedicated
memories enables a substantial reduction in area and power consumption but these blocks
have a relatively minor impact on the delay differences between ASICs and FPGAs.

9. ACKNOWLEDGEMENTS
   We are indebted to Jaro Pristupa for the extensive support he provided for both the
technology kits and the numerous CAD tools required for this work. This comparison would
not have been possible without the area measurements of the Stratix II provided by
Richard Cliff from Altera and the technology files and memory cores provided by CMC
Microsystems and Circuits Multi-Projets. Paul Chow, Peter Jamieson, Alex Rodionov, and
Peter Yiannacouras provided some of the benchmarks we used in this work. Ian Kuon
received financial support from NSERC and this research project was also supported by an
NSERC Discovery Grant.

10. REFERENCES
 [1] Altera Corporation. Partnership with TSMC yields first silicon success on Altera’s
     90-nm, low-k products, June 2004.
     http://www.altera.com/corporate/news_room/releases/releases_archive/2004/products/nr-tsmc_partnership.html.
 [2] Altera Corporation. Quartus II Development Software Handbook, 5.0 edition, May 2005.
 [3] Altera Corporation. Stratix II Device Handbook, 3.0 edition, May 2005.
 [4] S. D. Brown, R. Francis, J. Rose, and Z. Vranesic. Field-programmable gate arrays.
     Kluwer Academic Publishers, 1992.
 [5] Cadence. Encounter Design Flow Guide and Tutorial, Product Version 3.3.1,
     February 2004.
 [6] Cadence Design Systems. TSMC Standard Cell Libraries, 2003. Available online at
     http://www.cadence.com/partners/tsmc/SC_Brochure_9.pdf.
 [7] A. Chang and W. J. Dally. Explaining the gap between ASIC and custom power: a custom
     perspective. In DAC ’05, pages 281–284, New York, NY, USA, 2005. ACM Press.
 [8] D. Chinnery and K. Keutzer. Closing the Gap Between ASIC & Custom Tools and
     Techniques for High-Performance ASIC Design. Kluwer Academic Publishers, 2002.
 [9] D. G. Chinnery and K. Keutzer. Closing the power gap between ASIC and custom: an
     ASIC perspective. In DAC ’05, pages 275–280, New York, NY, USA, 2005. ACM Press.
[10] R. Cliff. Altera Corporation. Private Communication.
[11] K. Compton and S. Hauck. Automatic design of area-efficient configurable ASIC cores.
     IEEE Transactions on Computers, submitted.
[12] W. J. Dally and A. Chang. The role of custom design in ASIC chips. In DAC ’00,
     pages 643–647, 2000.
[13] V. De and S. Borkar. Technology and design challenges for low power and high
     performance. In ISLPED ’99, pages 163–168, New York, NY, USA, 1999. ACM Press.
[14] W. Jiang, V. Tiwari, E. de la Iglesia, and A. Sinha. Topological analysis for
     leakage prediction of digital circuits. In ASP-DAC ’02, page 39, Washington, DC,
     USA, 2002. IEEE Computer Society.
[15] H. S. Jones Jr., P. R. Nagle, and H. T. Nguyen. A comparison of standard cell and
     gate array implementations in a common CAD system. In IEEE 1986 CICC,
     pages 228–232, 1986.
[16] I. Kuon, A. Egier, and J. Rose. Design, layout and verification of an FPGA using
     automated tools. In FPGA ’05, pages 215–226, New York, NY, USA, 2005. ACM Press.
[17] Lattice Semiconductor Corporation. LatticeECP/EC Family Data Sheet, May 2005.
     Version 01.6.
[18] LSI Logic. RapidChip Platform ASIC, 2005.
     http://www.lsilogic.com/products/rapidchip_platform_asic/index.html.
[19] NEC Electronics. ISSP (Structured ASIC), 2005. http://www.necel.com/issp/english/.
[20] K. Padalia, R. Fung, M. Bourgeault, A. Egier, and J. Rose. Automatic transistor and
     physical design of FPGA tiles from an architectural specification. In FPGA ’03,
     pages 164–172, New York, NY, USA, 2003. ACM Press.
[21] M. J. S. Smith. Application-Specific Integrated Circuits. Addison-Wesley, 1997.
[22] STMicroelectronics. 90nm CMOS090 Design Platform, 2005.
     http://www.st.com/stonline/prodpres/dedicate/soc/asic/90plat.htm.
[23] Synopsys. Design Compiler Reference Manual: Constraints and Timing, version
     v-2004.06 edition, June 2004.
[24] Synopsys. Design Compiler User Guide, version v-2004.06 edition, June 2004.
[25] Toshiba Corporation. 90nm (Ldrawn=70nm) CMOS ASIC TC300 Family, BCE0012A, 2003.
     Available online at
     http://www.semicon.toshiba.co.jp/eng/prd/asic/doc/pdf/bce0012a.pdf.
[26] N. H. E. Weste and D. Harris. CMOS VLSI Design A Circuits and Systems Perspective.
     Pearson Addison-Wesley, 2005.
[27] S. J. Wilton, N. Kafafi, J. C. H. Wu, K. A. Bozman, V. Aken’Ova, and R. Saleh.
     Design considerations for soft embedded programmable logic cores. IEEE JSSC,
     40(2):485–497, February 2005.
[28] Xilinx. Virtex-4 Family Overview, 1.4 edition, June 2005.
[29] X. Yang, B.-K. Choi, and M. Sarrafzadeh. Routability-driven white space allocation
     for fixed-die standard-cell placement. IEEE Trans. Computer-Aided Design,
     22(4):410–419, April 2003.
[30] P. S. Zuchowski, C. B. Reynolds, R. J. Grupp, S. G. Davis, B. Cremen, and B. Troxel.
     A hybrid ASIC and FPGA architecture. In ICCAD ’02, pages 187–194, November 2002.
