Cross-layer Codesign for Resilient Hardware - Xinfei Guo, Ph.D. U of Virginia & NVIDIA Corporation - GitHub Pages

Boston (and Beyond) Area Architecture Workshop (BARC)

                Cross-layer Codesign for Resilient Hardware

                                            Xinfei Guo, Ph.D.
                                      U of Virginia & NVIDIA Corporation
                                               Westborough, MA

                                                 Jan 29th, 2021

                                              xfguo@ieee.org
                                             www.xinfeiguo.com
© Xinfei Guo | Confidential
Outline
• Why?
• What?
• How?
• Now what?

Starting from Transistors – Aging/Wearout
• Time-dependent device degradations
    • Transistors (P/NMOS)
    • Metal layers (Interconnect, PDN)

[Figure: circuit schematic marking where BTI and EM act on the PMOS, NMOS, interconnect and PDN, and the parameters they shift (Vth, Ids, R, td)]

BTI → Bias Temperature Instability
EM → Electromigration
PDN → Power Delivery Network
Vth → Threshold Voltage
Ids → Current
R → Resistance
td → Propagation Delay

Back to applications – lifetime is critical!

• Longer expected lifetime
• Extreme environmental conditions
• Longer run time (higher utilization)
• Higher susceptibility to aging
• Replacement costs
• …

Source: https://semiengineering.com/making-chips-to-last-their-lifetime/
Aging – Industry Attention

Source: [semiengineering.com]
Why do we care about aging now?
• Reliability threat!
    • Permanent errors
    • Shortened lifetime
    • Worse metrics such as performance, power and area
• Getting worse with technology scaling!
    • Increased power density → Heat
    • Increased effective electric field → More stress
    • More components → Require a lower failure rate per component
    • Advanced nodes → New stress and issues, e.g. self-heating

[Figure: Wire broke due to EM. Figure: N. Cheung et al., UC Berkeley]
[Figure: FinFETs, more aging. Source: S.M. Ramey, et al. (Intel), 2018]

A cross-layer effect
• Device level
    • Threshold voltage Vth increase (BTI)
    • Resistance increase (EM)
• Circuit level
    • Performance degradation
    • Timing failures
    • Leakage power
• Architecture level
    • Failures
    • Errors
• System level
    • MTTF
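
The link from the device level (Vth increase) to the circuit level (performance degradation) can be illustrated with the alpha-power-law delay model. This is a minimal sketch, not taken from the slides: the model choice, the exponent `alpha`, and the example voltages are all assumptions made for illustration.

```python
def gate_delay(vdd: float, vth: float, alpha: float = 1.3, k: float = 1.0) -> float:
    """First-order alpha-power-law delay: td = k * vdd / (vdd - vth) ** alpha.

    alpha and k are illustrative fitting parameters (assumed, not from the slides).
    """
    if vdd <= vth:
        raise ValueError("vdd must exceed vth for the gate to switch")
    return k * vdd / (vdd - vth) ** alpha


def delay_degradation(vdd: float, vth_fresh: float, delta_vth: float,
                      alpha: float = 1.3) -> float:
    """Fractional delay increase caused by a BTI-induced Vth shift."""
    fresh = gate_delay(vdd, vth_fresh, alpha)
    aged = gate_delay(vdd, vth_fresh + delta_vth, alpha)
    return aged / fresh - 1.0


# Hypothetical operating point: 0.9 V supply, 0.3 V fresh Vth, 30 mV shift.
slowdown = delay_degradation(vdd=0.9, vth_fresh=0.3, delta_vth=0.03)
print(f"delay increase: {slowdown:.1%}")
```

With these assumed numbers the path slows down by roughly 7%, which is the kind of shift that turns into a timing failure once it exceeds the design margin.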

Traditional solutions

Adding Margins
• Over-estimation
• Under-estimation
• Uncertain operation conditions
• e.g. 10% for a 3-year lifetime constraint

Adapting (Sensing + Actuation)
• The worst case is getting worse
• Aging is unchecked
• Tracking power (over 10 sensors per partition)

Passive Recovery (More idle time)
• Very slow
• Unpredictable
• Permanent part will keep accumulating

A single-layer solution is no longer adequate to deal with aging, which is becoming a bottleneck!

Introducing Accelerated Self-Healing
• Fact: Aging is partially recoverable under passive recovery, but it is very slow.
• Key Idea: Reverse the direction of aging and enable active recovery

[Figure: two cycles compared: Active alternating with Sleep/Rejuvenation, and Active/Aging alternating with Healing/Recovery (marked "??")]

Accelerated Self-Healing
Key Idea: Recover by reversing the direction of aging

BTI Accelerated Self-Healing
    Condition                                       Vsg         Temperature
    1  Passive Recovery (baseline)                  0           room
    2  Active Recovery: activate the recovery       negative    room
    3  Accelerated Recovery                         0           high
    4  Accelerated & Active Recovery                negative    high

EM Accelerated Self-Healing
    Condition                                       I           Temperature
    1  Passive Recovery (baseline)                  0           room
    2  Active Recovery: activate the recovery       negative    room
    3  Accelerated Recovery                         0           high
    4  Accelerated & Active Recovery                negative    high
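
The four conditions above reduce to two independent knobs: whether the stress is reversed (negative Vsg for BTI, negative current for EM) and whether the temperature is raised. A minimal sketch of that mapping follows; the names `RecoveryMode` and `classify_recovery` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class RecoveryMode(Enum):
    """The four recovery conditions, numbered as on the slide."""
    PASSIVE = 1                 # zero bias/current, room temperature
    ACTIVE = 2                  # reversed bias/current, room temperature
    ACCELERATED = 3             # zero bias/current, high temperature
    ACCELERATED_AND_ACTIVE = 4  # reversed bias/current, high temperature


def classify_recovery(reversed_stress: bool, high_temperature: bool) -> RecoveryMode:
    """Map the two knobs to a recovery condition.

    reversed_stress: True for negative Vsg (BTI) or negative current (EM).
    high_temperature: True when the block is heated above room temperature.
    """
    if reversed_stress and high_temperature:
        return RecoveryMode.ACCELERATED_AND_ACTIVE
    if high_temperature:
        return RecoveryMode.ACCELERATED
    if reversed_stress:
        return RecoveryMode.ACTIVE
    return RecoveryMode.PASSIVE


print(classify_recovery(reversed_stress=True, high_temperature=True).name)
```

The same classification serves both the BTI and the EM case, since only the physical meaning of "reversed stress" differs.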

Experiments for demonstration
BTI Test Setup (40nm FPGA)
[Figure (a): Circuit Under Test (CUT) of 75 LUTs with enable (En), feeding a 16-b counter (inputs in, rst, clk from fref; 16-bit output, Cout)]
[Figure (b): chip in a thermal chamber, interface board, data sampling]

EM Test Setup (On-chip Metal Wires)
[Figure: metal wire with probe pads; device (wire) under test in a thermal chamber, constant current supply, resistance recording]

    Technology          180nm
    Material            Copper
    Thickness           0.8um
    Length              2.673mm
    Width               1.57um
    Resistance (@rt)    35.76Ω
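
As a sanity check on the EM test wire, its resistance can be estimated from the listed dimensions with R = ρL/(W·t). The sketch below assumes the bulk copper resistivity near room temperature (1.68e-8 Ω·m), which is not stated on the slide; the stress current passed to `current_density` is purely hypothetical.

```python
RHO_CU = 1.68e-8  # ohm*m, bulk copper near room temperature (assumed value)


def wire_resistance(length_m: float, width_m: float, thickness_m: float,
                    resistivity: float = RHO_CU) -> float:
    """Resistance of a rectangular wire: R = rho * L / (W * t)."""
    return resistivity * length_m / (width_m * thickness_m)


def current_density(current_a: float, width_m: float, thickness_m: float) -> float:
    """Current density in A/m^2, the quantity that drives electromigration."""
    return current_a / (width_m * thickness_m)


# Dimensions from the slide: 2.673 mm long, 1.57 um wide, 0.8 um thick.
r = wire_resistance(2.673e-3, 1.57e-6, 0.8e-6)
print(f"estimated resistance: {r:.2f} ohm")

# Hypothetical 10 mA stress current, for illustration only.
j = current_density(10e-3, 1.57e-6, 0.8e-6)
print(f"current density: {j / 1e10:.3f} MA/cm^2")
```

The estimate comes out at about 35.75 Ω, close to the 35.76 Ω listed for room temperature, which suggests the table values are mutually consistent.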

Measurement results summary
• Recovery from aging can be made active and accelerated. Even the irreversible component can be fully eliminated or avoided through techniques such as higher temperatures, negative voltages, active vs. sleep ratio …
• What does this mean for chip designers and architects?
  A: Cross-layer Accelerated Self-Healing

X. Guo et al., DAC '14, ASP-DAC '16, DSN '17
Implication – Metric Improvements
• >60x reduction of necessary margin for all cases
• The average performance stays close to that of a fresh chip during the whole lifetime
• Neither metric scales with the increase of the lifetime constraint

[Figure: two panels, "Reduction of Necessary Design Margin" and "Average Performance Improvement", annotated ~2X and ~1X]

X. Guo et al., ASP-DAC '16
Cross-layer Accelerated Self-Healing

[Figure: three-layer stack]
• System: Proactive Scheduler, Load Balancer, Virtual Sensors
• Architecture: Dark Silicon, Redundant Resources, Program Counters (+1)
• Circuit: Negative Voltage Generator (Active Recovery, EN, -∆V), Heating Elements (Accelerated Recovery, EN), Wearout Sensors

X. Guo et al., VLSI, Integration '17
Circuit Components for Self-Healing
[Figure: circuit schematics in three groups; recoverable labels listed below]

• BTI Accelerated Self-Healing Circuits
    • On-Chip Negative Voltage Generator: non-overlapping clock generator (clk, clk1, clk2), charging and charge redistribution between C1 and C2; clock frequency = 66.7MHz; Vout = -300.6mV; Ripple: 1.45%; waveform annotations 638ns, 4.36mV, 4.33mV
    • Power Gating enables Recovery: power gating block (Vdd, Boosted Vdd / Vdd_high, Sleep, Sleep_buf, Vddv), logic blocks, always-ON retention registers, negative voltage generator with enable, connection to other power gating blocks
• EM Accelerated Self-Healing Circuits
    • EM and BTI Recovery Assist Circuitry: global PDN (C4 bump, VDD_PAD, metal layers M2 to M10, VDD via, VDD grid with EM hazards, connections to VSS grid, switches P1 to P4, load)
    • On-Chip Heater: reconfigurable heating element for accelerated recovery
• Wearout Sensing Circuits
    • BTI Sensing: RO-based sensing for N/PBTI (reconfigurable ROs of length L, L/2, L/4 with MUX output); metastable element based sensor
    • EM Sensing: panels (a) and (b) with track and poll paths (path0 to path3, poll1 to poll3), reference and discharge nodes, annotated 2%, 5% and 10%

X. Guo et al., Springer '20
Costs
• Area ↓ Power ↓ Extra Heat ↓
    • Optimal ways of distributing circuit IPs in a large system
    • Avoid unwanted heat
    • Trigger only when necessary

    Design Name                    Leakage Power   Dynamic Power   Area        Performance
    Neg. Voltage Generator         68.85nW         64.47uW         4300um2     >66.7MHz
    On-Chip Heater                 16.8nW          75uW            16um2       -
    Multi-mode Recovery Circuit    -               -               58.24um2    Wakeup time ~170ns

                                    Are there any other opportunities beyond circuit level?

Architectural Simulation Framework for Architecture Level Exploration – “OldSpot”

[Figure: a CPU layout and an example output of the tool, an aging hotspot map in which shading marks the more aging-critical units]

https://github.com/hplp/oldspot

A. Roelke, X. Guo et al., ICCD '18
Unit-level Accelerated Self-Healing
• Goal
    • Less area and power overhead
• Solution
    • Placing self-healing IPs only for aging-critical units

[Figure: Heat Map (from “HotSpot”) and Wearout Map (from “OldSpot*”); labels: rob, heaters, Neg. Voltage]

X. Guo et al., Springer '20
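
Placing self-healing IPs only at aging-critical units amounts to a selection problem over the wearout map. The sketch below is one possible greedy policy, not the method from the cited work: the function name, the score threshold, the area budget, and the per-unit numbers are all hypothetical.

```python
from typing import Dict, List


def pick_aging_critical_units(wearout: Dict[str, float],
                              ip_area: Dict[str, float],
                              area_budget: float,
                              min_score: float = 0.5) -> List[str]:
    """Choose units that receive self-healing IPs.

    wearout: per-unit aging score (higher is more critical), e.g. taken
             from a wearout map; the scale here is arbitrary.
    ip_area: area cost of adding self-healing IPs to each unit.
    Units below min_score are ignored; the rest are taken most critical
    first for as long as the area budget allows.
    """
    candidates = sorted((u for u, score in wearout.items() if score >= min_score),
                        key=lambda u: wearout[u], reverse=True)
    chosen: List[str] = []
    used = 0.0
    for unit in candidates:
        if used + ip_area[unit] <= area_budget:
            chosen.append(unit)
            used += ip_area[unit]
    return chosen


# Hypothetical scores and costs for four units.
scores = {"rob": 0.9, "alu": 0.7, "fpu": 0.6, "l2": 0.2}
costs = {"rob": 10.0, "alu": 10.0, "fpu": 10.0, "l2": 10.0}
print(pick_aging_critical_units(scores, costs, area_budget=20.0))
```

With a budget for two units, the two highest-scoring units are selected and the lightly aged cache is skipped.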
Utilize Intrinsic Heat
• Goal
    • Avoid power overhead for generating extra heat
• Solution
    • Take advantage of dark silicon or core redundancy
    • Utilize intrinsic sleep behaviors

[Figure: two core arrays, each with Shared Memory, shown at times t1 and t2]

X. Guo et al., Springer '20
Scheduling for Recovery
• Goal
    • Recover effectively only when necessary

[Figure, left panel "Proactive Recovery": Performance (a.u.) and clock frequency vs. Time, with accumulated wearout. Legend: Normal Operation, Passive Recovery, Accelerated and Active Recovery, Wearout Sensors Trigger Point. Annotations: Reactive Recovery (a% preset threshold), Proactive Recovery (preset schedule).]
[Figure, right panel "Application-dependent Scheduling": full recovery time after 12-hour constant stress under normal condition, with AC stress; values shown: 23.2min, 57.8min, 104.5min, 344.3min.]
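
The two triggers in the figure can be sketched as a simple decision rule: reactive recovery fires when the sensed degradation crosses the preset threshold, and proactive recovery fires on a fixed schedule regardless of the sensor reading. This is only an illustrative reading of the slide; the function name, the periodic-window form of the schedule, and all numbers are assumptions.

```python
def recovery_decision(time_min: float, degradation_pct: float,
                      threshold_pct: float, period_min: float,
                      window_min: float) -> str:
    """Decide what the block should do at a given moment.

    degradation_pct: performance loss reported by the wearout sensors.
    threshold_pct:   the preset a% threshold that triggers reactive recovery.
    period_min:      length of one schedule period (hypothetical parameter).
    window_min:      recovery window at the end of each period (hypothetical).
    Returns "reactive", "proactive", or "run".
    """
    if degradation_pct >= threshold_pct:
        return "reactive"
    if period_min > 0 and (time_min % period_min) >= period_min - window_min:
        return "proactive"
    return "run"


# Hypothetical schedule: recover during the last 5 minutes of every hour,
# or immediately if the sensors report 5% or more degradation.
for t, d in [(10.0, 1.0), (57.0, 1.0), (10.0, 6.0)]:
    print(t, d, recovery_decision(t, d, threshold_pct=5.0,
                                  period_min=60.0, window_min=5.0))
```

Giving the sensor-driven branch priority means the schedule never delays a recovery that the wearout sensors have already requested.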

Putting it All Together: CLASH - Cross-layer Accelerated Self-Healing System
                              •   Application-dependent scheduling

                              •   Scheduling for Recovery
                              •   Proactive Recovery
                              •   Unit-level Accelerated Self-Healing
                              •   Take advantage of dark silicon or core redundancy
                              •   Utilize intrinsic sleep behaviors

                              •   Accelerated Self-Healing Assist Circuitry
                                   • On-chip negative voltage generator
                                   • Power gating recovery enabler
                                   • On-chip Heater
                              •   Aging and Recovery Sensing Circuitry
                                   • BTI Sensor
                                        • RO-based
                                        • Metastable-element-based
                                   • EM Sensor
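
As a rough illustration of how an RO-based BTI sensor reading might feed a recovery decision, the sketch below compares a stressed ring oscillator against an unstressed reference. The slides do not describe the sensor internals, so the reference-RO arrangement, the function names, and the example frequencies and threshold are all assumptions made for illustration.

```python
def ro_degradation_percent(f_ref_hz, f_stressed_hz):
    """Estimate wearout as the relative frequency drop of a stressed RO.

    Assumes (hypothetically) that an unstressed reference RO is available.
    """
    if f_ref_hz <= 0:
        raise ValueError("reference frequency must be positive")
    return 100.0 * (f_ref_hz - f_stressed_hz) / f_ref_hz


def needs_recovery(f_ref_hz, f_stressed_hz, threshold_percent):
    """Return True when the sensed degradation reaches the preset threshold."""
    return ro_degradation_percent(f_ref_hz, f_stressed_hz) >= threshold_percent


# Hypothetical readings: a 1 GHz reference RO and a stressed RO at 980 MHz.
degradation = ro_degradation_percent(1.0e9, 0.98e9)
print(f"sensed degradation: {degradation:.2f}%")
print("trigger recovery:", needs_recovery(1.0e9, 0.98e9, threshold_percent=1.5))
```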

CLASH System – Hardware view
           [Figure: hardware view of the CLASH system.
            Legend: Normal Operation; EM Active Recovery; BTI Active Recovery; Recovery Circuits; Heat Flow.
            Labeled blocks: Wearout Sensors & Recovery Circuitry; EM/BTI Assist Circuitry.]

What is the key benefit of doing cross-layer codesign here?

                                           Before…                               Accelerated Self-Healing
                              • Margin (e.g. 10 – 20 %)                    • Margin (0.21%)
                              • Track and Adaptation (Track                • Only track the reversible part (~ 8X
                                during the entire lifetime)                  tracking power reduction)
                              • Passive recovery (
Sleep for rejuvenating and healing neurons?

                              https://www.scientificamerican.com/article/lack-of-sleep-could-be-a-problem-for-ais/
Key takeaways
       n     Device-level behaviors will have a lasting impact on all
             upper layers
       n     Cross-layer codesign is an essential way of enlarging the
             search space
       n     Challenges co-exist with opportunities
             q     Infrastructures
             q     Transparency
             q     Design cycle
             q     …

References
       1.    X. Guo, W. Burleson, M. Stan, “Modeling and Experimental Demonstration of
             Accelerated Self-Healing Techniques,” ACM/IEEE Design Automation
             Conference (DAC), 2014.
       2.    X. Guo, M. Stan, “Work hard, sleep well - Avoid irreversible IC wearout with
             proactive rejuvenation,” ACM/IEEE Asia and South Pacific Design Automation
             Conference (ASP-DAC), 2016.
       3.    X. Guo, M. Stan, “Deep Healing: Ease the BTI and EM Wearout Crisis by
             Activating Recovery,” International Conference on Dependable Systems and
             Networks (DSN), 2017.
       4.    X. Guo, M. Stan, “Implications of Accelerated Self-Healing as a Key Design
             Knob for Cross-Layer Resilience,” INTEGRATION, the VLSI journal (VLSI),
             vol. 56, pp. 167-180, 2017.
       5.    A. Roelke, X. Guo, M. Stan, “OldSpot: A Pre-RTL Model for Aging and
             Lifetime Optimization,” ICCD, 2018.
       6.    X. Guo, V. Verma, P. Guerrero, M. Stan, “When things get older - Exploring
             Circuit Aging in IoT Applications,” International Symposium on Quality
             Electronic Design (ISQED), 2018.
       7.    X. Guo, M. Stan, “Circadian Rhythms for Future Resilient Electronic Systems:
             Accelerated Active Self-Healing for Integrated Circuits,” Springer, 2020.

Stay Safe and Healthy!

                                      xfguo@ieee.org
