Algorithms for VLSI physical design automation 046918

Page created by Erin Schroeder
 

Partly adapted from:
Eduardo Maayan
Rajeev Murgai
Andrew B. Kahng
Jens Lienig
Igor L. Markov
Jin Hu
D. Pan
Shiyan Hu
Rupesh Shelar
VLSI design flow overview
Introduction to algorithms and optimization
Backend CAD optimization problems:
◦   Design partitioning
◦   Technology mapping
◦   Floorplanning
◦   Placement
◦   Clock tree synthesis
◦   Routing
Introduction to layout
Layout optimization and verification
◦ Layout analysis
◦ DRC and LVS checks
◦ Finding objects in the layout
Intro to clock networks
◦ Basic definitions
◦ Clock topologies
   Tree
   Mesh
   Mixed topologies
◦ Design considerations
Zero-skew clock network generation
◦ Means and medians
◦ Recursive geometric matching
Low-power clock tree synthesis
[Figure: VLSI design flow. System Specification → Architectural Design → Functional Design and Logic Design (illustrated by a VHDL entity declaration) → Circuit Design → Physical Design → Physical Verification and Signoff (DRC, LVS, ERC) → Fabrication → Packaging and Testing → Chip. Physical Design consists of Partitioning → Floor Planning → Placement → Clock Tree Synthesis → Signal Routing → Timing Closure.]
Unlike other signal nets, clock and power nets
pose special routing problems
◦ For clock nets, need to consider clock skew as
  well as delay.
◦ For power nets, need to consider current density
  (IR drop)
=> specialized routers for these nets.
Automatic tools for ASICs
Often manually routed and optimized for
microprocessors, with help from automatic
tools
In synchronous designs, data transfer between
functional elements is synchronized by clock
signals
Clock signals are generated externally (e.g., by a PLL)
Two types of synchronizing elements:
◦ FLIP-FLOPS
    Edge sensitive – capture data on the switching clock edge and are
    closed the rest of the time

◦ LATCHES
    Level sensitive – transparent during one clock phase, capture data
    on the clock edge that ends it, and closed during the other phase
Clock skew is the maximum difference in the arrival
time of a clock signal at two different components.
Clock skew forces designers to use a large time
period between clock pulses. This makes the
system slower.
So, in addition to other objectives, clock skew
should be minimized during clock routing.
What are the main concerns for clock design?
Skew
◦ No. 1 concern for clock networks
◦ For increased clock frequency, skew may contribute over
  10% of the system cycle time
Power
◦ very important, as clock is a major power consumer!
◦ It switches at every clock cycle!
Noise
◦ Clock is often a very strong aggressor
◦ May need shielding
Delay
◦ Not really important
◦ But slew rate is important (sharp transition)
Given a source and n sinks.
Connect all sinks to the source by an
interconnect network (tree or non-tree) so
as to minimize:
◦   Clock skew = $\max_{i,j} |t_i - t_j|$, where $t_i$ is the clock arrival time at sink $i$
◦   Delay = $\max_i t_i$
◦   Total wirelength
◦   Noise and coupling effect
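The skew and delay objectives above can be evaluated directly from a list of sink arrival times. The following is a minimal sketch; the function names and the picosecond values are illustrative assumptions, not taken from the slides.

```python
def clock_skew(arrival_times):
    """Skew = max over all sink pairs of |t_i - t_j|, i.e. max(t) - min(t)."""
    return max(arrival_times) - min(arrival_times)


def clock_delay(arrival_times):
    """Delay = latest arrival time over all sinks."""
    return max(arrival_times)


# Hypothetical arrival times (in ps) at four sinks.
arrivals_ps = [102.0, 98.5, 101.0, 99.0]
print("skew :", clock_skew(arrivals_ps))
print("delay:", clock_delay(arrivals_ps))
```

Because the maximum pairwise difference is always attained by the earliest and latest sinks, the skew can be computed in linear time without enumerating pairs.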
Clock signal is global in nature, so clock nets
are usually very big
◦ Significant interconnect capacitance and resistance
So what are the techniques?
◦ Routing
   Clock tree versus clock mesh (non-tree or grid)
   Balance skew and total wire length
◦ Buffer insertion
   Clock buffers to reduce clock skew, delay, and distortion in
   waveform.
◦ Wire sizing
   To further tune the clock tree/mesh
Clock tree: a path from the clock source to the clock sinks
[Figure: clock source driving flip-flop (FF) sinks through a tree]
H-tree
[Figure: H-tree from the clock source to the flip-flop (FF) sinks]
Spines [Su et al., ICCAD'01]
◦ Drive clock sinks or local sub-networks
◦ Applied in Pentium processor [Kurd et al., JSSC'01]
[Restle et al., JSSC'01]
◦ Applied in IBM microprocessor
◦ Very effective, huge wire
◦ Drives clock sinks or local sub-networks
Non-tree = tree + links
How to select link pairs is the key problem
Link = link capacitors + link resistor
Key issue: find the links that reduce skew variation the most!
[Figure: a link between nodes u and w, modeled as a resistor Rl with a capacitor C/2 at each end] [Rajaram et al., DAC'04]
[Figure: clock source driving a mesh with flip-flops attached]
n x n uniform mesh
Distributed array of k x k buffers drives the mesh.
Buffers driven by global H-tree.
Flip-flops directly connected to the nearest mesh segment
Advantages
◦ Excellent for low skew
◦ Robust to variations
Disadvantages
◦ Higher wiring area, capacitance, power
◦ Difficult to analyze
   Loops and redundancy
Hybrid Structured Clock Network Construction [Hu & Sapatnekar, ICCAD 01]
◦ Hybrid clock topology
   simple top-level global mesh
   zero-skew local trees at bottom
◦ Presents wire sizing scheme to achieve latency and skew reduction.
   iterative LP to minimize wire width (area) of top-level mesh, given delay bound
     uses Elmore delay $\tau = G^{-1}C$
   sensitivity-based post-layout clock tree tuning to reduce skew.
[Figure: source driving a top-level mesh with nodes a, b, c, d]
Tree
-- low cost (wiring, power, cap)
-- higher skew, jitter than mesh
-- widely used in ASIC designs
-- clock gating easy to incorporate
Mesh
-- excellent for low skew, jitter
-- high power, area, capacitance
-- difficult to analyze
-- clock gating not easy
Hybrid: tree + cross-links
-- low cost (wiring, power, cap)
-- smaller skew, jitter than tree
-- difficult to analyze
Hybrid: mesh + local trees
-- suitable for coarse mesh
Best architecture depends on the application
[Figure: the four topologies, each shown from the clock source to the flip-flops]
H-tree

 Exact zero skew due to the symmetry of the H-tree
 Used for top-level clock distribution, not for the entire clock tree
 ◦   Blockages can spoil the symmetry of an H-tree
 ◦   Non-uniform sink locations and varying sink capacitances
     also complicate the design of H-trees
                                                                        © 2011 Springer Verlag
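The symmetry argument can be checked with a small recursive construction. This is a sketch of an idealized H-tree only (no blockages, uniform sinks); the function name `h_tree_sinks` and its parameters are assumptions made for illustration.

```python
def h_tree_sinks(cx, cy, half, levels, path_len=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an ideal H-tree.

    At each level the wire runs horizontally by `half` and then vertically
    by `half` to reach the four corners of the 'H'; the next level uses
    half the dimension.
    """
    if levels == 0:
        return [(cx, cy, path_len)]
    sinks = []
    for dx in (-half, half):
        for dy in (-half, half):
            sinks += h_tree_sinks(cx + dx, cy + dy, half / 2,
                                  levels - 1, path_len + 2 * half)
    return sinks


sinks = h_tree_sinks(0.0, 0.0, 4.0, 2)
print(len(sinks), "sinks; distinct path lengths:",
      {length for _, _, length in sinks})
```

Every leaf sees the same wire length from the root, which is why the ideal H-tree has zero skew under any delay model that depends only on the path; blockages or unequal sink loads break exactly this assumption.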
Method of Means and Medians (MMM)

Can deal with arbitrary locations of clock sinks
Basic idea:
◦ Recursively partition the set of terminals into two subsets of equal
  size (median)
◦ Connect the center of gravity (COG) of the set to the centers of
  gravity of the two subsets (the mean)
Method of Means and Medians (MMM)

[Figure: the steps of MMM on a sink set S]
1. Find the center of gravity
2. Partition S by the median
3. Find the center of gravity for the left and right subsets of S
4. Connect the center of gravity of S with the centers of gravity of the left and right subsets
5. Final result after recursively performing MMM on each subset
Method of Means and Medians (MMM)
Input: set of sinks S, empty tree T
Output: clock tree T

if (|S| ≤ 1)
    return
(x0,y0) = (xc(S),yc(S))       //   center of mass for S
(SA,SB) = PARTITION(S)        //   median to determine SA and SB
(xA,yA) = (xc(SA),yc(SA))     //   center of mass for SA
(xB,yB) = (xc(SB),yc(SB))     //   center of mass for SB
ROUTE(T,x0,y0,xA,yA)          //   connect center of mass of S to
ROUTE(T,x0,y0,xB,yB)          //    center of mass of SA and SB
BASIC_MMM(SA,T)               //   recursively route SA
BASIC_MMM(SB,T)               //   recursively route SB
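The pseudocode above can be sketched in Python as follows. The slides do not specify how PARTITION picks the cut direction, so alternating between x and y cuts is an assumption here, as are the function names and the representation of the tree as a list of (parent point, child point) edges.

```python
def center_of_mass(sinks):
    """Mean x and mean y of a non-empty list of (x, y) points."""
    n = len(sinks)
    return (sum(x for x, _ in sinks) / n, sum(y for _, y in sinks) / n)


def basic_mmm(sinks, tree, cut_x=True):
    """Append (parent_point, child_point) edges for the sink set to `tree`."""
    if len(sinks) <= 1:
        return
    root = center_of_mass(sinks)
    # Median partition: sort along the current cut axis and split in half.
    axis = 0 if cut_x else 1
    ordered = sorted(sinks, key=lambda p: p[axis])
    mid = len(ordered) // 2
    for subset in (ordered[:mid], ordered[mid:]):
        # Connect the center of mass of S to that of the subset, then recurse.
        tree.append((root, center_of_mass(subset)))
        basic_mmm(subset, tree, not cut_x)  # assumed: alternate cut direction


sinks = [(0, 0), (0, 2), (4, 0), (4, 2)]
tree = []
basic_mmm(sinks, tree)
for edge in tree:
    print(edge)
```

For a subset containing a single sink, the center of mass is the sink itself, so the last level of edges ends at the sinks.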

Recursive Geometric Matching (RGM)

  RGM proceeds in a bottom-up fashion
  ◦ Compare to MMM, which is a top-down algorithm
  Basic idea:
  ◦ Recursively determine a minimum-cost geometric matching
    of n sinks
  ◦ Find a set of n / 2 line segments that match n endpoints and
    minimize total length (subject to the matching constraint)
  ◦ After each matching step, a balance or tapping point is found
    on each matching segment to preserve zero skew to the
    associated sinks
  ◦ The set of n / 2 tapping points then forms the input to the
    next matching step

Recursive Geometric Matching (RGM)

[Figure: the steps of RGM]
1. Set of n sinks S
2. Min-cost geometric matching
3. Find balance or tapping points (point that achieves zero skew in the subtree, not always the midpoint)
4. Min-cost geometric matching
5. Final result after recursively performing RGM on each subset
Recursive Geometric Matching (RGM)
Input: set of sinks S, empty tree T
Output: clock tree T

if (|S| ≤ 1)
    return
M = min-cost geometric matching over S
S’ = Ø
foreach (<Pi,Pj> ∈ M)
    TPi = subtree of T rooted at Pi
    TPj = subtree of T rooted at Pj
    tp = tapping point on (Pi,Pj) // point that minimizes the skew of
                                   // the tree Ttp = TPi U TPj U (Pi,Pj)
    ADD(S’,tp)                     // add tp to S’
    ADD(T,(Pi,Pj))                 // add matching segment (Pi,Pj) to T
if (|S| % 2 == 1)                  // if |S| is odd, add unmatched node
    ADD(S’, unmatched node)
RGM(S’,T)                          // recursively call RGM
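A compact Python sketch of the same bottom-up recursion is given below. Two simplifications are assumptions of this sketch and not part of the slides: the matching is a greedy closest-pair heuristic standing in for a true min-cost geometric matching, and tapping points are balanced under the linear (length-proportional) delay model. Nodes are hypothetical `(x, y, delay)` tuples.

```python
import math


def greedy_matching(nodes):
    """Pair nodes by repeatedly taking the closest remaining pair.

    Heuristic stand-in for a min-cost geometric matching.
    """
    remaining = list(nodes)
    pairs = []
    while len(remaining) >= 2:
        a, b = min(
            ((p, q) for i, p in enumerate(remaining) for q in remaining[i + 1:]),
            key=lambda pq: math.dist(pq[0][:2], pq[1][:2]),
        )
        remaining.remove(a)
        remaining.remove(b)
        pairs.append((a, b))
    return pairs, remaining  # `remaining` holds the unmatched node, if any


def tapping_point(a, b):
    """Point on segment a-b that balances delay under the linear model."""
    (xa, ya, ta), (xb, yb, tb) = a, b
    length = math.dist((xa, ya), (xb, yb))
    # Solve ta + z*length == tb + (1 - z)*length for z, then clamp to the
    # segment; when clamping happens the delays are not fully balanced.
    z = 0.5 if length == 0 else (tb - ta + length) / (2 * length)
    z = min(1.0, max(0.0, z))
    delay = max(ta + z * length, tb + (1 - z) * length)
    return (xa + z * (xb - xa), ya + z * (yb - ya), delay)


def rgm(nodes, tree):
    """Return the root node list; matching segments are appended to `tree`."""
    if len(nodes) <= 1:
        return nodes
    pairs, unmatched = greedy_matching(nodes)
    tree.extend((a[:2], b[:2]) for a, b in pairs)
    return rgm([tapping_point(a, b) for a, b in pairs] + unmatched, tree)


tree = []
root = rgm([(0, 0, 0), (2, 0, 0), (0, 4, 0), (2, 4, 0)], tree)
print(root, len(tree))
```

Carrying the accumulated delay in each node is what lets the tapping point move away from the midpoint when the two subtrees are unbalanced.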

Exact Zero Skew

    Adopts a bottom-up process of matching subtree roots and
    merging the corresponding subtrees, similar to RGM
    Two important improvements:
   ◦ Finds exact zero-skew tapping points with respect to the Elmore
     delay model rather than the linear delay model
   ◦ Maintains exact delay balance even when two subtrees with very
     different source-sink delays are matched (by wire elongation)
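Exact Zero Skew

[Figure: tapping point tp on the wire between s1 and s2, at fraction z from s1 and 1–z from s2, where the Elmore delay to the sinks is equalized. Wire segment w1 is modeled as resistance R(w1) with capacitance C(w1)/2 at each end, driving subtree Ts1 with capacitance C(s1) and delay t(Ts1); wire segment w2 is modeled likewise for subtree Ts2 with C(s2) and t(Ts2).]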
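Equating the Elmore delays of the two branches gives a closed form for z. The sketch below assumes uniform wire with per-unit-length resistance `r` and capacitance `c`, and each wire segment modeled with half its capacitance at each end; the parameter names and numeric values are illustrative assumptions.

```python
def branch_delay(t_sub, c_sub, seg_len, r, c):
    """Elmore delay through a wire segment into a subtree.

    Segment resistance r*seg_len sees half the segment capacitance plus
    the subtree capacitance; t_sub is the delay inside the subtree.
    """
    return r * seg_len * (c * seg_len / 2 + c_sub) + t_sub


def elmore_tapping_point(t1, c1, t2, c2, length, r, c):
    """Fraction z of the wire, measured from subtree 1, at which
    branch_delay towards subtree 1 equals branch_delay towards subtree 2."""
    numerator = t2 - t1 + r * length * (c2 + c * length / 2)
    denominator = r * length * (c * length + c1 + c2)
    return numerator / denominator


# Hypothetical units: subtree 2 is slower and heavier than subtree 1.
r, c, length = 0.1, 0.2, 10.0
z = elmore_tapping_point(t1=0.0, c1=1.0, t2=1.0, c2=3.0, length=length, r=r, c=c)
d1 = branch_delay(0.0, 1.0, z * length, r, c)
d2 = branch_delay(1.0, 3.0, (1 - z) * length, r, c)
print(z, d1, d2)
```

When the formula yields z outside the interval from 0 to 1, no point on the straight segment balances the delays, which is the case the slides handle by wire elongation.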
Local Clock Capacitance Distribution in a Microprocessor

[Figure: pie chart of local clock capacitance: Sequentials 48%, Interconnect 42%, Buffers 10%. Distribution generated from several blocks in a microprocessor.]

•Interconnects contribute a major portion of total capacitance
•Clocks are the most active nets in the design
•Minimizing interconnect capacitance in clocks leads to reduction in dynamic power
Local Clock Network: CTS Solution Space

[Figure: processor clock network. Labels: PLL, global clock distribution using multiple spines, tunable grid buffers, clock grid, regional clock buffers (RCBs), local clock buffers (LCBs), to state elements.]

•   Clock network in a processor:
        Distributed as a grid followed by tree
CTS (Simplified version)

[Figure: design flow RTL → Logic Synthesis → Physical Synthesis → CTS → Routing. CTS takes the logical clock tree and the sequentials with their (x,y) locations and sizes, and performs clock buffer duplication, routing of clock nets, and sizing of clock buffers.]

•   Performed after the placement/sizing of sequentials
•   Converts logical clock tree into physical one
•   Flow employed in several microprocessor designs
Clock Buffer Duplication

[Figure: K-stage buffers duplicated to drive the K-stage receivers]

•   Given a clock buffer, duplicate it to meet delay, slope, RC, skew constraints
        Decides
          receivers driven by the same driver
          the clock tree topology
•   Applied recursively in reverse topological order
•   Driven by clustering or partitioning
        Often intractable when capacity constraints specified
        Many heuristics available
Effect of Clustering on Capacitance

[Figure: 4 placed sequentials and three different clusterings of them (Solution 1, Solution 2, Solution 3)]

  • A cluster implies a clock buffer
  • Interconnect capacitance varies significantly for
    different solutions even with the same number of clusters
Clustering Targeting Power

•Find the clusters such that total local clock power is minimum
  –   Power in local clock, $P_{\text{Local Clock}} = P_{\text{Dynamic}} + P_{\text{Leakage}}$
  –   $P_{\text{Dynamic}} = P_{\text{Sequential Cap}} + P_{\text{Buffer Cap}} + P_{\text{Routing Cap}}$
  –   $P_{\text{Leakage}}$ and $P_{\text{Buffer Cap}}$ can be shown proportional to total cap
  –   $P_{\text{Sequential Cap}}$ is fixed for CTS purposes
  –   Reducing $P_{\text{Local Clock}}$ is equivalent to minimizing interconnect cap

•Find the clusters such that total interconnect capacitance is minimum
Routing-aware Clustering: Chicken-and-Egg Problem

Routing cap is unknown till the clustering is performed
Clustering cannot be performed till routing cap is known
•   Let’s assume minimum spanning tree (MST)
    routing estimates
       Other candidates: metrics we saw in placement lecture
       Correlated with actual clock tree wirelength
       MST possesses submodularity property suitable for greedy
       optimization
•   Can the problem be solved optimally, i.e., can
    we perform clustering such that the routing
    cap./overall power is minimum?
    • Yes, it can be (if capacity constraints are dropped)
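To make the MST estimate concrete, the sketch below computes the MST length of a cluster with Prim's algorithm and compares two clusterings of the same four receivers. Manhattan distance as the edge weight, the function name `mst_length`, and the coordinates are assumptions chosen for illustration.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def mst_length(points):
    """Total MST length over `points` using Prim's algorithm, O(n^2)."""
    if len(points) < 2:
        return 0
    # best[i] = cheapest connection from point i to the tree built so far.
    best = {i: manhattan(points[i], points[0]) for i in range(1, len(points))}
    total = 0
    while best:
        nxt = min(best, key=best.get)
        total += best.pop(nxt)
        for i in best:
            best[i] = min(best[i], manhattan(points[i], points[nxt]))
    return total


# Two ways to split the same four hypothetical receivers into two clusters.
near = mst_length([(0, 0), (1, 0)]) + mst_length([(10, 0), (12, 0)])
far = mst_length([(0, 0), (10, 0)]) + mst_length([(1, 0), (12, 0)])
print(near, far)
```

Both clusterings use two clusters (two buffers), yet the estimated wiring differs by a large factor, which is the effect the clustering step tries to exploit.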

•   Given: Set of receivers $S = \{s_1, \dots, s_n\}$, their loads ($c_{s_i}$), and
    locations ($x_{s_i}, y_{s_i}$)
•   Find: A set of clusters, $S_{\text{clusters}} = \{c_1, \dots, c_m\}$ such that
    $\sum_i (\alpha + \mathrm{MST}(c_i))$ is minimum
•   Subject to Constraints (or Design Parameters):
        Maximum # of receivers
          Due to process, routing, etc.
        Maximum load in a cluster
          Due to library
        Bounding box width/height
          To control RC delay and variations in it
•   Similar to Kruskal’s MST construction algorithm
•   Steps in algorithm:
       Create complete graph G(S, E, W)
       Assign each edge estimated capacitance as the weight
       Create trivial solution with each cluster containing a
       receiver
       For each edge, in ascending order of weights
         Merge clusters till the cost function is minimized
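The steps above can be sketched with a union-find structure, in the style of Kruskal's algorithm. Several details are assumptions of this sketch rather than statements from the slides: edge weights are Manhattan distances standing in for estimated capacitance, `alpha` is treated as a fixed cost per cluster so a merge pays off only while the edge weight is below `alpha`, and only the maximum-receivers constraint is modeled.

```python
from itertools import combinations


def power_aware_clusters(points, alpha, max_receivers):
    """Return (clusters as lists of point indices, total MST weight)."""
    n = len(points)
    parent = list(range(n))
    size = [1] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Complete graph, edges in ascending order of weight.
    edges = sorted(
        (abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1]), i, j)
        for i, j in combinations(range(n), 2)
    )
    total_mst = 0
    for w, i, j in edges:
        if w >= alpha:
            break  # assumed stop rule: merging no longer lowers the cost
        ri, rj = find(i), find(j)
        if ri != rj and size[ri] + size[rj] <= max_receivers:
            parent[rj] = ri
            size[ri] += size[rj]
            total_mst += w
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values()), total_mst


receivers = [(0, 0), (1, 0), (10, 0), (12, 0)]  # hypothetical placement
print(power_aware_clusters(receivers, alpha=5, max_receivers=3))
```

Sorting all pairwise edges dominates the run time; without the size check the loop is exactly a minimum spanning forest construction.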

Example

[Figure: four-step clustering example on a graph whose edges carry the weights 1, 2, 4, 4, 5, and 5. Legend: a cluster, an edge, the weight.]

 •Constraint: maximum # of receivers is 3
 •Power-aware clustering results in clusters with total MST value of 3, which is
 optimal in this case
•   Ensures optimality when no capacity constraints (max.
    load, # of receivers) specified
       Reduces to minimum spanning forest problem
•   Runs in $O(n^2 \log n)$ time in the number of receivers
       Handles blocks with ~5K sequentials easily
       1.34 seconds for clustering of 1037 sequentials
•   Run-times practical and comparable to competitive
    algorithms
       Clock buffer duplication takes minutes on ~5K sequential
       blocks
