A LANDSCAPE OF THE NEW DARK SILICON DESIGN REGIME

Page created by Troy Ward
 
The rise of dark silicon is driving a new class of architectural techniques that "spend" area to "buy" energy efficiency. This article examines four recently proposed directions ("the four horsemen") for adapting to dark silicon, outlines a set of evolutionary dark silicon design principles, and shows how one of the darkest computing architectures, the human brain, offers insights into more revolutionary directions for computer architecture.

Michael B. Taylor
University of California, San Diego

Recent VLSI technology trends have led to a disruptive new regime for digital chip designers, where Moore's law continues but CMOS scaling provides increasingly diminished fruits. As in prior years, the computational capabilities of chips are still increasing by 2.8× per process generation. However, a utilization wall [1] limits us to only 1.4× of this benefit, causing large underclocked swaths of silicon area; hence the term dark silicon [2,3].

Fortunately, simple scaling theory makes the utilization wall easy to derive, helping us to think intuitively about the problem. Transistor density continues to improve by 2× every two years, and native transistor speeds improve by 1.4×. But transistor energy efficiency improves by only 1.4×, which, under constant power budgets, causes a 2× shortfall in energy budget to power a chip at its native frequency. Therefore, our utilization of a chip's potential is falling exponentially by a jaw-dropping 2× per generation. Thus, if we are just bumping up against power limitations in the current generation, then in eight years, designs will be 93.75 percent dark!

A recent paper refers to this widespread disruptive factor informally as the "dark silicon apocalypse" [4], because it officially marks the end of one reality (Dennard scaling [5]), where progress could be measured by improvements in transistor speed and count, and the beginning of a new reality (post-Dennard scaling), where progress is measured by improvements in transistor energy efficiency. Previously, we tweaked our circuits to reduce transistor delays and turbo-charged them with dual-rail domino to reduce fan-out-of-4 (FO4) delays. From now on, we will tweak our circuits to minimize capacitance switched per function; we will strip our circuits down and starve them of voltage to squeeze out every femtojoule. Whereas once we would spend exponentially increasing quantities of transistors to buy performance, now we will spend these transistors to buy energy efficiency.

The CMOS scaling breakdown was the direct cause of industry's transition to multicore in 2005. Because filling chips with cores does not fundamentally circumvent utilization wall limits, multicore is not the final
Published by the IEEE Computer Society. 0272-1732/13/$31.00 © 2013 IEEE
solution to dark silicon [3]; it is merely industry's initial, transitional response to the shocking onset of the dark silicon age. Increasingly over time, the semiconductor industry is adapting to this new design regime, realizing that multicore chips will not scale as transistors shrink and that the fraction of a chip that can be filled with cores running at full frequency is dropping exponentially with each process generation [1,3]. This reality forces designers to ensure that, at any point in time, large fractions of their chips are effectively dark: either idle for long periods of time or significantly underclocked. As exponentially larger fractions of a chip's transistors become darker, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that "spend" area to "buy" energy efficiency. This saved energy can then be applied to increase performance, or to have longer battery life or lower operating temperatures.

The utilization wall that causes dark silicon

Table 1 shows the derivation of the utilization wall [1] that causes dark silicon [2,3]. It employs a scaling factor, $S$, which is the ratio between the feature sizes of two processes (for example, $S = 32/22 \approx 1.4\times$ between 32 and 22 nm). In both Dennard and post-Dennard scaling, the transistor count scales by $S^2$, and the transistor switching frequency scales by $S$. Thus, our net increase in computing performance is $S^3$, or 2.8×.

Table 1. Dennard vs. post-Dennard (leakage-limited) scaling [1]. In contrast to Dennard scaling [5], which held until 2005, under the post-Dennard regime, the total chip utilization for a fixed power budget drops by $S^2$ with each process generation. The result is an exponential increase in dark silicon for a fixed-sized chip under a fixed area budget.

Transistor property                               Dennard      Post-Dennard
$\Delta$ Quantity                                 $S^2$        $S^2$
$\Delta$ Frequency                                $S$          $S$
$\Delta$ Capacitance                              $1/S$        $1/S$
$\Delta V_{DD}^2$                                 $1/S^2$      $1$
$\Rightarrow \Delta$ Power $= \Delta QFCV^2$      $1$          $S^2$
$\Rightarrow \Delta$ Utilization $= 1/$Power      $1$          $1/S^2$

However, to maintain a constant power envelope, these gains must be offset by a corresponding reduction in transistor switching energy. In both cases, scaling reduces transistor capacitance by $S$, improving energy efficiency by $S$. In Dennard scaling, we can scale the threshold voltage and thus the operating voltage, which yields another $S^2$ energy-efficiency improvement. However, in today's post-Dennard, leakage-limited regime, we cannot scale threshold voltage without exponentially increasing leakage, and as a result, we must hold operating voltage roughly constant. The end result is a shortfall of $S^2$, or 2× per process generation. This shortfall multiplies with each process generation, resulting in exponentially darker silicon over time.

This shortfall prevents multicore from being the solution to scaling [1,3]. Although advancing a single process generation would allow enough transistors to increase core count by 2×, and frequency could be 1.4× faster, the energy budget permits only a 1.4× total improvement. Per Figure 1, across two process generations ($S = 2$), designers could increase core count by 2× while leaving frequency constant, or they could increase frequency by 2× while leaving core count constant, or they could choose some middle ground between the two. The remaining 4× potential remains inaccessible.

More positively stated, the true new potential of Moore's law is a 1.4× energy-efficiency improvement per generation, which could be used to increase performance by 1.4×. Additionally, if we could somehow make use of dark silicon, we could do even better.

Although the utilization wall is based on a first-order model that simplifies many factors, it has proved to be an effective tool for designers to gain intuition about the future, and has proven remarkably accurate (see the sidebar "Is Dark Silicon Real? A Reality Check"). Follow-up work [6-8] has looked at extending this early work [1,3] on dark silicon and multicore scaling with more sophisticated models that incorporate factors such as application space and cache size.

Dark silicon misconceptions

Let's clear up a few misconceptions before proceeding. First, dark silicon does not mean blank, useless, or unused silicon; it's just
                                                                                                 SEPTEMBER/OCTOBER 2013                          9
Is Dark Silicon Real? A Reality Check

A quick survey of recent designs from multicore outfits such as Tilera, Intel, and AMD indicates that industry has pursued core count and frequency combinations consistent with the utilization wall. For instance, Intel's 90-nm single-core Prescott chip ran at 3.8 GHz in 2004. Dennard scaling would suggest that a 22-nm multicore version should run at 15.5 GHz, and contain 17 superscalar cores, for a total improvement of 69× in instruction throughput. Instead, the upcoming 2013 22-nm Intel Core i7 4960X runs at 3.6 GHz and has six superscalar cores, a 5.7× peak serial instruction throughput improvement. The darkness ratio is thus 91.74 percent versus the 93.75 percent predicted by the utilization wall. The latest 2012 International Technology Roadmap for Semiconductors also shows that scaling has proceeded consistently with post-Dennard predictions.
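The sidebar's arithmetic can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: the function names are ours, peak throughput is approximated as cores times frequency, and intermediate values are rounded the way the sidebar reports them (15.5 GHz, 17 cores, 69×, 5.7×), which is our guess at how the published figures were obtained.

```python
def dennard_projection(old_nm, new_nm, old_ghz, old_cores=1):
    """Project frequency and core count under ideal Dennard scaling."""
    s = old_nm / new_nm                      # scaling factor S between the two nodes
    return old_ghz * s, old_cores * s ** 2   # frequency scales by S, core count by S^2


def throughput_gain(cores, ghz, base_cores=1, base_ghz=3.8):
    """Peak instruction throughput relative to the baseline (cores x frequency)."""
    return (cores * ghz) / (base_cores * base_ghz)


# Baseline: 90-nm single-core Prescott at 3.8 GHz (2004), projected to 22 nm.
proj_ghz, proj_cores = dennard_projection(90, 22, 3.8)
ideal_gain = throughput_gain(round(proj_cores), round(proj_ghz, 1))

# Actual 22-nm part cited in the sidebar: six cores at 3.6 GHz.
actual_gain = throughput_gain(6, 3.6)

# Darkness ratio, using the rounded gains as the sidebar reports them (assumption).
darkness = 1 - round(actual_gain, 1) / round(ideal_gain)

# Utilization-wall prediction: utilization halves in each of four generations
# (90 -> 65 -> 45 -> 32 -> 22 nm).
predicted = 1 - 0.5 ** 4

print(f"Dennard projection: {proj_ghz:.1f} GHz, {proj_cores:.0f} cores, {ideal_gain:.0f}x")
print(f"Actual gain: {actual_gain:.1f}x")
print(f"Darkness: {darkness:.2%} observed vs {predicted:.2%} predicted")
```

Without the intermediate rounding the observed darkness comes out slightly different in the second decimal place, so the close match to the sidebar depends on that rounding assumption rather than on the model itself.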

[Figure 1 diagram: spectrum of trade-offs between number of cores and frequency. Example: 65 nm → 32 nm (S = 2). Starting point at 65 nm: 4 cores at 1.8 GHz. Options at 32 nm: 2×4 cores at 1.8 GHz (8 cores dark, 8 dim; industry's choice), or 4 cores at 2×1.8 GHz (12 cores dark). 75% dark after two generations; 93% dark after four generations.]

Figure 1. Multicore scaling leads to large amounts of dark silicon [3]. Across two process generations, there is a spectrum of trade-offs between frequency and core count; these include increasing core count by 2× but leaving frequency constant (top), and increasing frequency by 2× but leaving core count constant (bottom). Any of these trade-off points will have large amounts of dark silicon.
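Table 1 and Figure 1 amount to a small calculation, sketched below in Python. The function names and the use of the cores-times-frequency product as a stand-in for the power budget are illustrative assumptions built on the first-order model in the text; this is not code from the cited work.

```python
def relative_power(s, dennard):
    """Chip power after scaling by S with every transistor at full frequency (Table 1)."""
    quantity = s ** 2                                # transistor count grows by S^2
    frequency = s                                    # switching frequency grows by S
    capacitance = 1 / s                              # per-transistor capacitance shrinks by S
    vdd_squared = 1 / s ** 2 if dennard else 1.0     # voltage is held constant post-Dennard
    return quantity * frequency * capacitance * vdd_squared


def dark_fraction(s):
    """Share of a fixed-area chip's potential left unused under a fixed power budget."""
    return 1 - 1 / relative_power(s, dennard=False)


def figure1_endpoints(base_cores, base_ghz, s):
    """The two extreme trade-off points of Figure 1 for a fixed power budget."""
    total_cores = base_cores * s ** 2                  # cores that fit in the same area
    budget = base_cores * base_ghz * s                 # core-GHz the power budget sustains
    more_cores = (budget / base_ghz, base_ghz)         # hold frequency, add cores
    faster_cores = (base_cores, budget / base_cores)   # hold core count, raise frequency
    return total_cores, more_cores, faster_cores


# Figure 1 example: 65 nm -> 32 nm is two process generations, so S = 2.
total, more_cores, faster_cores = figure1_endpoints(4, 1.8, 2)
print(f"Dennard power: {relative_power(2, dennard=True)}x, "
      f"post-Dennard power: {relative_power(2, dennard=False)}x")
print(f"Dark after two generations: {dark_fraction(2):.2%}; after four: {dark_fraction(4):.2%}")
print(f"{total} cores fit; options: {more_cores[0]:.0f} cores at {more_cores[1]} GHz "
      f"or {faster_cores[0]} cores at {faster_cores[1]} GHz")
```

Any point between the two endpoints keeps the same cores-times-frequency product, which is why every choice along the spectrum leaves the same 4× of potential unused.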

silicon that is not used all the time, or at its full frequency. Even during the best days of CMOS scaling, microprocessors and other circuits were chock full of "dark logic" used infrequently or for only some applications. For instance, caches are inherently dark because the average cache transistor is switched for far less than one percent of cycles, and FPUs remain dark in integer codes.

Soon, the exponential growth of dark silicon area will push us beyond logic targeted for direct performance benefits toward swaths of low-duty-cycle logic that exists, not for direct performance benefit, but for improving energy efficiency. This improved energy efficiency can then allow an indirect performance improvement because it frees up more of the fixed power budget to be used for even more computation.

The four horsemen

Recently, researchers proposed a taxonomy, the four horsemen, that identifies four promising directions for dealing with dark silicon that have emerged as we transition beyond the initial multicore stop-gap solution. These responses originally appeared to be unlikely candidates, carrying unwelcome burdens in design, manufacturing, or programming. None is ideal from an aesthetic engineering
                   10                      IEEE MICRO
point of view. But the success of complex multiregime devices such as metal-oxide-semiconductor field-effect transistors (MOSFETs) has shown that engineers can tolerate complexity if the end result is better. Future chips are likely to employ not just one horseman, but all of them, in interesting and unique combinations.

The shrinking horseman

When confronted with the possibility of dark silicon, many chip designers insist that area is expensive, and that they would just build smaller chips instead of having dark silicon in their designs. Among the four horsemen, these "shrinking chips" are the most pessimistic outcome. Although all chips may eventually shrink somewhat, the ones that shrink the most will be those for which dark silicon cannot be applied fruitfully to improve the product. These chips will rapidly turn into low-margin businesses for which further generations of Moore's law provide small benefit. Below is an examination of the spectrum of second-order effects associated with shrinking chips.

Cost side of shrinking silicon. Understanding shrinking chips requires considering semiconductor economics. The "build smaller chips" argument has a ring of truth; after all, designers spend much of their time trying to meet area budgets for existing chip designs. But exponentially smaller chips are not exponentially cheaper; even if silicon begins as 50 percent of system cost, after a few process generations, it will be a tiny fraction. Mask costs, design costs, and I/O pad area will fail to be amortized, leading to rising costs per mm² of silicon, which ultimately will eliminate incentives to move the design to the next process generation. These designs will be "left behind" on older generations.

Revenue side of shrinking silicon. Shrinking silicon can also shrink the chip selling price. In a competitive market, if there is a way to use the next process generation's bounty of dark silicon to attain a benefit to the end product, then competition will force companies to do so. Otherwise, they will generally be forced into low-end, low-margin, high-competition markets, and their competitor will take the high end and enjoy high margins. Thus, in scenarios where dark silicon could be used profitably, decreasing area in lieu of exploiting it would certainly decrease system costs, but would catastrophically decrease sale price. Hence, the shrinking-chips scenario is likely to happen only if we can find no practical use for dark silicon.

Power and packaging issues with shrinking chips. A major consequence of exponentially shrinking chips is a corresponding exponential rise in power density. Recent analysis of many-core thermal characteristics has shown that peak hotspot temperature rise can be modeled as $T_{max} = TDP \times (R_{conv} + k/A)$, where $T_{max}$ is the rise in temperature, $TDP$ is the target chip thermal design power, $R_{conv}$ is the heat sink thermal convection resistance (lower is a better heat sink), $k$ incorporates many-core design properties, and $A$ is chip area [8]. If area drops exponentially, the second term dominates and chip temperatures rise exponentially. This in turn will force a lower TDP so that temperature limits are met, and reduce scaling below even the nominal 1.4× expected energy-efficiency gain. Thus, if thermals drive your shrinking-chip strategy, it is much better to hold your frequency constant and increase cores by 1.4× with a net area decrease of 1.4× than it is to increase your frequency by 1.4× and shrink your chip by 2×.

The dim horseman

As exponentially larger fractions of a chip's transistors become dark transistors, silicon area becomes an exponentially cheaper resource relative to power and energy consumption. This shift calls for new architectural techniques that spend area to buy energy efficiency. If we move past unhappy thoughts of shrinking silicon and consider populating dark silicon area with logic that we use only part of the time, then we are led to some interesting new design possibilities.

The term dim silicon refers to techniques that put large amounts of otherwise-dark silicon area to productive use by employing heavy underclocking or infrequent use to meet the power budget; that is, the
                                                                   architecture is strategically managing the                                explored wide-SIMD NTV processors,12
                                                                   chip-wide transistor duty cycle to enforce                                which seek to exploit data parallelism,
                                                                   the overall power constraint.8,9 Whereas                                  along with NTV many-core processors13
                                                                   early 90-nm designs such as Cell and Pre-                                 and an NTV x86 processor.14
                                                                   scott were dimmed because actual power                                        Although NTV per-processor performance
                                                                   exceeded design-estimated power, we are                                   drops faster than the corresponding savings in
                                                                   converging on increasingly more elegant                                   energy-per-instruction (5 energy improve-
                                                                   methods that make better trade-offs.                                      ment for an 8 performance cost), the perfor-
                                                                       Dim silicon techniques include dynami-                                mance loss can be offset by using 8 more
                                                                   cally varying the frequency with the number                               processors in parallel if the workload allows
                                                                   of cores being used, scaling up the amount of                             it. Then, an additional 5 processors could
                                                                   cache logic, employing near-threshold volt-                               turn the energy efficiency gains into additional
                                                                   age (NTV) processor designs, and redesign-                                performance. So, with ideal parallelization,
                                                                   ing the architecture to accommodate bursts                                NTV could offer 5 the throughput im-
                                                                   that temporarily allow the power budget to                                provement by absorbing 40 the area. But
                                                                   be exceeded, such as Turbo Boost and com-                                 this would also require 40 more free paral-
                                                                   putational sprinting.10                                                   lelism in the workload relative to the parallel-
                                                                                                                                             ism consumed by an equivalent energy-
                                                                   Turbo Boost 1.0. Although first-generation                                limited super-threshold many-core processor.
                                                                   multicores had a ship-time-determined top                                     In practice, for many applications, 40
                                                                   frequency that was invariant of the number                                additional parallelism can be elusive. For
                                                                   of currently active cores, Intel’s Turbo                                  chips with large power budgets that can al-
                                                                   Boost 1.0 enabled second-generation multi-                                ready sustain hundreds of cores, applications
                                                                   cores to make real-time trade-offs between                                that have this much spare parallelism are rel-
                                                                   active core count and the frequency the                                   atively rare. Interestingly, because of this ef-
                                                                   cores ran at: the fewer the cores, the higher                             fect, NTV’s applicability across applications
                                                                   the frequency. When Turbo Boost is enabled,                               increases in low-energy environments because
                                                                   it uses the energy gained from turning off                                the energy-limited baseline super-threshold
                                                                   cores to increase the voltage and then the fre-                           design has consumed less of the available par-
                                                                   quency of the active cores. This technique,                               allelism. Furthermore, NTV clearly becomes
                                                                   known as dynamic voltage and frequency                                    more applicable for workloads with extremely
                                                                   scaling (DVFS), increases power proportional                              large amounts of parallelism.
                                                                   to the cube of the increase in frequency.                                     NTV presents several circuit-related chal-
                                                                                                                                             lenges that have seen active investigation, es-
                                                                   NTV processors. In the past, DVFS was also                                pecially because technology scaling will
                                                                   used to save cubic power when frequencies                                 exacerbate rather than ameliorate these factors.
                                                                   were decreased. However, today, processor                                 A significant NTV challenge has been suscep-
                                                                   manufacturers operate transistors at reduced                              tibility to process variability. As operating vol-
                                                                   voltages, around 2.5× the threshold volt-                                 tages drop, variation in transistor threshold
                                                                   age, an energy-delay optimal point. This                                  due to random dopant fluctuation is propor-
                                                                   point is right at the edge of an operating re-                            tionally higher, and leakage and operating fre-
                                                                   gime where frequency starts to drop precipi-                              quency can vary greatly. Because NTV
                                                                   tously as voltage is reduced, which makes                                 designs can expand the area consumption by
                                                                   downward-DVFS much less effective.                                        approximately 8× or more, variation issues
                                                                       Nonetheless, researchers have begun to                                are exacerbated. Other challenges include the
                                                                   explore this regime. One recent approach is                               penalties involved in designing low-operating
                                                                   Near-Threshold Voltage (NTV) logic,11                                     voltage static RAMs (SRAMs) and the
                                                                   which operates transistors in the near-thres-                             increased interconnection energy consump-
                                                                   hold regime slightly above the threshold volt-                            tion due to greater spreading across cores.
                                                                   age, providing more palatable trade-offs
                                                                   between energy and delay than subthreshold                                Bigger caches. An often-proposed dim-silicon
                                                                   circuits, for which frequency drops exponen-                              alternative is to simply allocate otherwise
                                                                   tially with voltage decreases. Researchers have                           dark silicon area for caches. Because only a
subset of cache transistors (such as a wordline) is accessed each cycle, cache memories
have low duty cycles and thus are inherently dark. Compared to general-purpose logic, a
level-1 (L1) cache clocked at its maximum frequency can be about 10× darker per square
millimeter, and larger caches can be even darker. Thus, adding cache is one way to
simultaneously increase performance and lower power density per square millimeter. We can
imagine, for instance, expanding per-core cache at a rate that soaks up the remaining dark
silicon area: 1.4× to 2× more cache per core per generation. However, many applications do
not benefit much from additional cache, and upcoming TSV-integrated DRAM will reduce the
cache benefit for those applications that do.

Computational sprinting and Turbo Boost. Other techniques employ “temporal dimness” as
opposed to “spatial dimness,” temporarily exceeding the nominal thermal budget but relying
on thermal capacitance to buffer against temperature increases, and then ramping back to a
comparatively dark state. Intel’s Turbo Boost 2.0 uses this approach to boost performance
up until the processor reaches nominal temperature, relying on the heat sink’s innate
thermal capacitance. ARM’s big.LITTLE employs four A15 cores until the thermal envelope is
exceeded (anecdotally, about 10 seconds), then switches over to four lower-energy,
lower-performance A7 cores. Computational sprinting carries this a step further, employing
phase-change materials that let chips exceed their sustainable thermal budget by an order
of magnitude for several seconds, providing a short but substantial computational boost.
These modes are especially useful for “race to finish” computations, such as webpage
rendering, for which response latency is important, or for which speeding up the
transition of both the processor and its support logic to a low-power state reduces energy
consumption.

The specialized horseman
    The specialized horseman uses dark silicon to implement a host of specialized
coprocessors, each either much faster or much more energy efficient (100× to 1,000×) than
a general-purpose processor.1 Execution hops between coprocessors and general-purpose
cores, executing where it is most efficient. The unused cores are power- and clock-gated
to keep them from consuming precious energy. Unlike dim silicon, which tends to focus on
manipulating voltages, frequencies, and duty cycles as ways to manage power, specialized
logic focuses on reducing the amount of capacitance that needs to be switched to perform a
particular operation.
    The promise for a future of widespread specialization is already being realized: we
are seeing a proliferation of specialized accelerators that span diverse areas such as
baseband processing, graphics, computer vision, and media coding. These accelerators
enable orders-of-magnitude improvements in energy efficiency and performance, especially
for computations that are highly parallel. Recent proposals have extrapolated this trend
and anticipate that the near future will see systems comprising more coprocessors than
general-purpose processors.1,7 This article refers to these systems as
coprocessor-dominated architectures, or CoDAs.
    As specialization usage grows to combat the dark silicon problem, we are faced with a
modern-day specialization “Tower of Babel” crisis that fragments our notion of
general-purpose computation and eliminates the traditional clear lines of communication
between programmers and software and the underlying hardware. Already, we see the
deployment of specialized languages such as CUDA that are not usable between similar
architectures (for example, AMD and Nvidia). We see overspecialization problems between
accelerators that cause them to become inapplicable to closely related classes of
computations (such as double-precision scientific codes running incorrectly on a GPU’s
non-IEEE-compliant floating-point hardware). Adoption problems are also caused by the
excessive costs of programming heterogeneous hardware (such as the slow uptake of Sony
PlayStation 3 versus Xbox). Moreover, specialized hardware risks obsolescence as standards
are revised (for example, a JPEG standard revision).

Insulating humans from complexity. These factors speak to potential exponential
increases in design, verification, and programming effort for these CoDAs. Combating the
Tower of Babel problem requires defining a new paradigm for how specialization is
expressed and exploited in future processing systems. We need new scalable architectural
schemas that employ pervasively specialized hardware to minimize energy and maximize
performance while at the same time insulating the hardware designer and programmer from
such systems’ underlying complexity.

Overcoming Amdahl-imposed limits on specialization. Amdahl’s law provides an additional
roadblock for specialization. To save energy across the majority of the computation, we
must find broad-based specialization approaches that apply to both regular, parallel code
and irregular code. We must also ensure that communicating specialized processors don’t
fritter away their energy savings on costly cross-chip communication or shared-memory
accesses.

Recent efforts. The UCSD GreenDroid processor (see Figure 2)3,15 is one such CoDA-based
system that seeks to address both complexity issues and Amdahl limits. GreenDroid is a
mobile application processor that implements Android mobile environment hotspots using
hundreds of specialized cores called conservation cores, or c-cores.1,9 C-cores, which
target both irregular and regular code, are automatically generated from C or C++ source
code, and support a patching mechanism that lets them track software changes. They attain
an estimated 8× to 10× energy-efficiency improvement, at no loss in serial performance,
even on nonparallel code, and without any user or programmer intervention.
    Unlike NTV processors, c-cores need not find additional parallelism in the workload to
cover a serial performance loss. Thus, c-cores are likely to work across a wider range of
workloads, including collections of serial programs. However, for highly parallel
workloads in which execution time is loosely concentrated, NTV processors might hold an
area advantage because of their reconfigurability.
    Other specialized processors such as the University of Wisconsin-Madison’s DySER16 and
the University of Michigan’s Beret17 propose alternative architectures that exploit
specialization like c-cores, but focus on improving reconfigurability at the cost of
energy savings. Recent efforts have also examined the use of approximate
neural-network-based computing as an elegant way to package programmability,
reconfigurability, and specialization.18

The “deus ex machina” horseman
    Of the four horsemen, this is by far the most unpredictable. “Deus ex machina” refers
to a plot device in literature or theater in which the protagonists seem increasingly
doomed until the very last moment, when something completely unexpected comes out of
nowhere to save the day. For dark silicon, one deus ex machina would be a breakthrough in
semiconductor devices. However, as we shall see, the breakthroughs that would be required
would have to be quite fundamental; in fact, we most likely would have to build
transistors out of devices other than MOSFETs. Why? Because MOSFET leakage is set by
fundamental principles of device physics, and is limited to a subthreshold slope of
60 mV/decade at room temperature; this corresponds to a 10× reduction in leakage current
for every 60 mV that the threshold voltage is above Vss, which is determined by properties
of thermionic emission of carriers across a potential well. Thus, although innovations
such as Intel’s FinFET/TriGate transistor and high-K dielectrics represent significant
achievements in maintaining subthreshold slopes close to their historical values, they
still remain within the scope of the MOSFET-imposed limits and are one-time improvements
rather than scalable changes.
    Two VLSI candidates that bypass these limits because they are not based on thermal
injection are tunnel field-effect transistors (TFETs),19 which are based on tunneling
effects, and nanoelectromechanical system (NEMS) switches,20 which are based on physical
relays. TFETs are reputed to have subthreshold slopes on the order of 30 mV/decade, twice
as good as the ideal MOSFET, but with lower on-currents than MOSFETs, limiting their use
in high-performance circuits. NEMS devices have essentially a near-zero subthreshold
[Figure 2 graphic not reproduced. Panels: (a) array of tiles, each with a CPU and L1
cache; (b) a single tile with 1 mm scale markers, showing the CPU, I$, D$, OCN, and
c-cores (C); (c) in-tile networks linking the OCN, I-cache, D-cache, CPU, FPU, internal
state interface, and c-cores.]

Figure 2. The GreenDroid architecture, an example of a coprocessor-dominated architecture
(CoDA). The GreenDroid Mobile Application Processor comprises 16 nonidentical tiles (a).
Each tile (b) holds components common to every tile—the CPU, on-chip network (OCN),
and shared level-1 (L1) data cache—and provides space for multiple conservation cores, or
c-cores, of various sizes. A variety of in-tile networks (c) connect components and c-cores.
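
The subthreshold-slope figures quoted for MOSFETs (60 mV/decade) and TFETs (on the order of 30 mV/decade) can be turned into leakage ratios with a one-line formula. The following Python sketch assumes the idealized model implied by the text, in which leakage falls by 10× for every slope's worth of voltage margin. The function name and the 300 mV margin are illustrative choices, not values from this article.

```python
# Idealized subthreshold model: leakage drops by one decade (10x) for every
# `slope_mv_per_decade` millivolts of margin. Real devices deviate from this.

def leakage_reduction(margin_mv: float, slope_mv_per_decade: float) -> float:
    """Return the factor by which leakage falls for a given voltage margin."""
    return 10.0 ** (margin_mv / slope_mv_per_decade)

MOSFET_SLOPE_MV = 60.0  # room-temperature MOSFET limit quoted in the text
TFET_SLOPE_MV = 30.0    # order-of-magnitude figure quoted for TFETs

margin_mv = 300.0  # hypothetical margin, chosen only for illustration

mosfet_factor = leakage_reduction(margin_mv, MOSFET_SLOPE_MV)  # 5 decades
tfet_factor = leakage_reduction(margin_mv, TFET_SLOPE_MV)      # 10 decades

print(f"MOSFET: leakage reduced by {mosfet_factor:.0e}")
print(f"TFET:   leakage reduced by {tfet_factor:.0e}")
```

Under this model, halving the slope doubles the number of decades of leakage reduction bought by the same margin, or equivalently lets a device reach the same leakage with half the margin. That is the sense in which a lower subthreshold slope permits lower operating voltages.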

slope but slow switching times. Both TFETs and NEMS devices thus hint at
orders-of-magnitude improvements in leakage but remain untamed and fall short of being
integrated into real chips.
    Realizing the importance of the fourth horseman, a recent $194 million DARPA/MARCO
STARnet program is funding four centers, each focusing on a key direction for beyond-CMOS
approaches: developing electron spin-based memory computation devices (C-SPIN),
formulating new information-processing models that can leverage statistical (that is,
nondeterministic)
beyond-CMOS devices (SONIC), engineering nonconventional atomic scale engineered materials
(FAME), and creating new devices that extend prior work on TFETs to operate at even lower
voltages (LEAST).

Evolutionary design principles for dark silicon
    While researchers work to mature the new ideas represented by the four horsemen, what
principles should guide today’s designs that must tackle dark silicon? Listed below is a
set of evolutionary, rather than revolutionary, dark silicon design principles that are
motivated by changing trade-offs created by dark silicon:

  • Moving to the next generation will provide an automatic 1.4× energy-efficiency
    increase. Figure out how you will use it. As a baseline, chip capabilities will scale
    with energy, whether it is allocated to frequency or more cores. You can increase or
    decrease frequency or transistor counts, but transistors switched per unit time can
    increase by only 1.4×.
  • The next generation will create a large amount of dark area. Determine, for your
    domain, how to trade mostly dark area for energy. If the die area is fixed, any
    scaling is going to have a surplus of transistors. Which combination of the four
    horsemen is most effective in your domain? Should you go dim: more caches?
    Underclocked arrays of cores? NTV on top of that? Add accelerators or c-cores? Use new
    kinds of devices? Shrink your chip?
  • Pipelining makes less sense than it used to. Figure out if faster transistor delays
    will allow you to fit more in a pipeline stage without reducing frequency. Pipelining
    increases duty cycle and introduces additional capacitance in circuits (registers,
    prediction circuits, bypassing, and clock tree fan out), neither of which is dark
    silicon friendly. Reducing pipeline depth and increasing FO4 depths reduce capacitive
    overhead. Note, too, that excessive pipelining and frequency exacerbate the gap
    between processing and memory.
  • Architectural multiplexing and logic sharing are becoming increasingly questionable
    optimizations. See if they still make sense. Sharing introduces additional energy
    consumption because it requires sharers to have longer wires to the shared logic, and
    it introduces additional performance and energy overheads from the control logic that
    manages the sharing. For example, architectures that have repositories of nonshared
    state that share physical pipelines (such as large-scale multithreading) pay large
    wire capacitances inside these memories to share that state. As area gets cheaper, it
    will make less sense to pay these overheads, and the degree of sharing will decrease
    so that the energy cost of pulling state out of these state repositories will be
    reduced.
  • Multiplexing and RAMs that facilitate sharing of program data are still a good idea.
    Keep them. If different threads of control are truly sharing data, multiplexed
    structures, such as shared RAM, or crossbars, are often still more efficient than
    coherence protocols or other schemes.
  • Architectural techniques for saving transistors should only be applied if they do not
    worsen energy efficiency. Transistors are getting exponentially cheaper, and we can’t
    use them all at once. Why are we trying to save transistors? Locally,
    transistor-saving optimizations make sense, but an exponential wind is blowing against
    these optimizations in the long run.
  • Power rails are the new clocks. Design with them in mind. Ten years ago, it was a big
    step to move beyond a few clock domains. Now, chips can have hundreds of clock
    domains, all with their own clock gates. With dark silicon, we will see the same
    effect with power rails; we will have hundreds and maybe thousands of power rails in
    the future, all with their own power gates, to manage the leakage for the many
    heterogeneous system components.
  • Heterogeneity results from the shift from a 1D objective function (performance) to a
    2D objective function (performance and energy). Design with the shape of this
    function in mind. The past lacked in
  heterogeneity, because designs were largely measured according to a single axis:
  performance. To first order, there was a single optimal design point. Now that
  performance and energy are both important, a Pareto curve trades off performance and
  energy, and there is no one optimal design across that curve; there are many optimal
  points. Optimal designs will incorporate several such points across these curves.

These rules of thumb will guide our existing designs along an evolutionary path to become
increasingly dark silicon friendly. But what then of more revolutionary approaches?

Insights from the brain: a dark technology

Perhaps one promising indicator that low-duty-cycle “dark technology” can be mastered,
unlocking new application domains, is the efficiency and density of the human brain. The
brain, even today, can perform many tasks that computers cannot, especially vision-related
tasks. With 80 billion neurons and 100 trillion synapses operating at less than 100 mV, the
brain embodies an existence proof of highly parallel, reliable, and dark operation, and
embodies three of the horsemen: dim, specialized, and deus ex machina. Neurons operate with
extremely low duty cycles compared to processors: at best, 1 kilohertz. Although computing
with silicon-simulated neurons introduces excessive “interpretive” overheads (neurons and
transistors have fundamentally different properties), the brain can offer us insight and
long-term ideas about how we can redesign systems for the extremely low duty cycles and low
voltages called for by dark silicon. Here are some of these properties, which may give us
insight into more revolutionary extensions to the evolutionary principles proposed in the
last section:

• Specialization. As with the specialized horseman, different groups of neurons serve
  different functions in cognitive processing, connect to different sensory organs, and
  allow reconfiguration, evolving over time synaptic connections that are customized to
  the computation.
• Very dark operation. Neurons fire at a maximum rate of approximately 1,000 switches per
  second. Compare this to arithmetic logic unit (ALU) transistors that toggle three
  billion times per second. The most active neuron’s activity is on the order of a
  millionth of that of processing transistors in today’s processors.
• Low-voltage operation. Brain cells operate at approximately 100 mV, yielding CV² energy
  savings of 100× versus 1-V operation, in a clear parallel to the dim horseman’s NTV
  circuits. Communication is low swing and low voltage, saving large amounts of energy.
• Limited sharing and memory multiplexing. Any given neuron can switch only 1,000 times
  per second, by definition, so it must have extremely limited sharing, because a point of
  multiplexing would be a bottleneck in parallel processing. The human visual system
  starts with 6M cones in the retina, similar to a 2-megapixel display, processes their
  output with local neurons, and then sends the result on the 1M-neuron optic nerve to the
  visual cortex. There is no central memory store; each pixel has a set of its own ALUs,
  so to speak, so energy waste due to multiplexing is minimal.
• Data decimation. The human brain reduces the data size at each step and operates on
  concise but approximate representations. If using 2 megapixels suffices to handle
  color-related vision tasks, why use more than that? Larger sensors would just require
  more neurons to store and compute on the data. We should ensure that we are processing
  no more data than necessary to achieve the final outcome.
• Analog operation. The neuron performs a more complex basic operation than the typical
  digital transistor. On the input side, neurons combine information from many other
  neurons; on the output side, despite producing rail-to-rail digital pulses, they encode
  multiple bits of information via spike timings. Could this suggest that there are more
  efficient ways to map operations onto silicon-based technologies? In RF wireless
  front-end communications, analog processing enables computations that would be
  impossible to do at speed
  digitally. However, analog techniques might not scale well to deep nanometer
  technology.
• Fast, static, “gather, reduce, and broadcast” operators. Neurons have a fan out and fan
  in of approximately 7,000 to other neurons that are located significant distances away.
  Effectively, they can perform efficient operations that combine vector-style gather
  memory accesses to large numbers of static-memory locations with a vector-style
  reduction operator and a broadcast. Do more efficient ways exist for implementing these
  operations in silicon? Such operators could be useful for computations that operate on
  finite-sized static graphs.
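
As a rough software analogy for the “gather, reduce, and broadcast” operator in the last
item, the following Python sketch evaluates one step over a small static graph. The names
(`fan_in`, `gather_reduce_broadcast`), the weighted-sum reduction, and the firing threshold
are illustrative assumptions; they are not taken from the article or from any particular
neural model.

```python
from typing import Dict, List, Tuple

# Hypothetical static graph: for each node, the fixed list of (source, weight)
# pairs it gathers from. The wiring is assumed never to change between steps.
FanIn = Dict[int, List[Tuple[int, float]]]


def gather_reduce_broadcast(values: List[float], fan_in: FanIn,
                            threshold: float = 1.0) -> List[float]:
    """One synchronous step: gather inputs, reduce them, broadcast the result."""
    new_values = list(values)
    for node, sources in fan_in.items():
        # Gather: read the values stored at this node's static source locations.
        gathered = [values[src] * weight for src, weight in sources]
        # Reduce: collapse the gathered vector to one number (here, a sum).
        total = sum(gathered)
        # Broadcast: the node publishes a single output that every downstream
        # reader sees on the next step (1.0 = fired, 0.0 = silent).
        new_values[node] = 1.0 if total >= threshold else 0.0
    return new_values


if __name__ == "__main__":
    # Nodes 0-2 act as inputs; node 3 gathers from all three with fixed weights.
    fan_in = {3: [(0, 0.5), (1, 0.5), (2, 0.5)]}
    print(gather_reduce_broadcast([1.0, 1.0, 0.0, 0.0], fan_in))
```

In software each of the three phases costs memory traffic and instructions per edge; the
question raised in the text is whether silicon could implement the same pattern more
directly when the graph is fixed.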

Recently, both the EU and US governments have proposed initiatives to enable greater
studies of the computational capabilities of the brain. Although brain-inspired computing
has already come and gone several times in the brief history of manmade computers, dark
silicon may cause these approaches to become increasingly relevant.
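
The figures quoted in this section can be checked with a few lines of arithmetic. The
sketch below uses only numbers stated in the text, with one labeled assumption: that a
display pixel is counted as three color subpixels, which is one way to relate 6M cones to
roughly 2 megapixels. The function names are hypothetical.

```python
# Figures quoted in the text.
NEURON_MAX_RATE_HZ = 1_000          # "approximately 1,000 switches per second"
ALU_TOGGLE_RATE_HZ = 3_000_000_000  # "three billion times per second"
BRAIN_VOLTAGE_MV = 100              # "approximately 100 mV"
SILICON_VOLTAGE_MV = 1_000          # "1-V operation"
RETINA_CONES = 6_000_000            # "6M cones"

# Assumption (not stated in the article): one pixel = three color subpixels.
SUBPIXELS_PER_PIXEL = 3


def activity_ratio(neuron_hz: float, transistor_hz: float) -> float:
    """Fraction of a transistor's switching activity that a neuron exhibits."""
    return neuron_hz / transistor_hz


def cv2_savings(v_high_mv: float, v_low_mv: float) -> float:
    """Switching energy scales as C*V^2, so at equal capacitance the
    savings factor is the square of the voltage ratio."""
    return (v_high_mv / v_low_mv) ** 2


def equivalent_megapixels(cones: int, subpixels_per_pixel: int) -> float:
    """Cone count expressed as full-color pixels, in millions."""
    return cones / subpixels_per_pixel / 1e6


if __name__ == "__main__":
    print("neuron/ALU activity:", activity_ratio(NEURON_MAX_RATE_HZ, ALU_TOGGLE_RATE_HZ))
    print("CV^2 savings factor:", cv2_savings(SILICON_VOLTAGE_MV, BRAIN_VOLTAGE_MV))
    print("equivalent megapixels:", equivalent_megapixels(RETINA_CONES, SUBPIXELS_PER_PIXEL))
```

With these inputs the voltage ratio of 10 gives a savings factor of 100, and the activity
ratio comes out at about one part in three million, the same order of magnitude as the
“millionth” quoted for the most active neurons.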

Although silicon is getting darker, for researchers the future is bright and exciting.
Dark silicon will cause a transformation of the computational stack and provide many
opportunities for investigation.

Acknowledgments
This work was partially supported by NSF awards 0846152, 1018850, 0811794, and 1228992,
Nokia and AMD gifts, and by STARnet, an SRC program sponsored by MARCO and DARPA. I thank
the anonymous reviewers for their valuable insights and suggestions.
Michael B. Taylor is an associate professor in the Department of Computer Science and
Engineering at the University of California, San Diego, where he leads the Center for Dark
Silicon. His research interests include dark silicon, chip design, parallelization tools,
and Bitcoin computing systems. Taylor has a PhD in electrical engineering and computer
science from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Michael B. Taylor, 9500 Gilman Drive,
MC 0404 EBU 3B 3202, La Jolla, CA 92093-0404; mbtaylor@ucsd.edu.
