Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019

Page created by Ricardo Williams
 
CONTINUE READING
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
Optics for disaggregating Datacenters and
                  disintegrating Computing
    N.Terzenidis, M.Moralis-Pegios, S. Pitris, G.Mourgias-Alexandris, A. Tsakyridis,
    C. Vagionas, K. Vyrsokinos, T. Alexoudi, C. Mitsolidou, N. Pleros

ARISTOTLE UNIV. OF THESSALONIKI                    Wireless and Photonics Systems and Networks (WinPhos) Lab
                                                              Dept. of Informatics, Aristotle Univ. of Thessaloniki,
                                                Center for Interdisciplinary Research & Innovation (CIRI), Greece
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
The energy problem: World’s No. 1 HPC
No.#1: IBM Summit (top500.org Nov. 2018 List)
                                                   200 Pflops (20%
                                                    of Exascale target)
                                                   10MW (50% of the
                                                    20MW limit)

   Energy in DataCenters (% of global power) :
     1.6%            >3%(420TWh)                    ~20%
     2014           2016                              2025
                                                                      Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
The energy efficiency problem

                                                       Intel Whitepaper, 2015

   High diversity of workloads with different resource requirements
   up to 50% underutilized resources..impacting cost and energy !
                                                                          Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
The way-out
     Energy        Resource utilization

      Technology       Architecture

      Photonics      Disaggregation

                                          Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
Challenges across the hierarchy

                                                                            hierarchy
   >2μsec latency in    < 8-node connectivity   Non-modular settings with
   high-port switches   in p2p QPI-based        >40% on-die caches
   Energy increases     High-latency in non-    Energy dominated by
   with capacity        p2p switched-based      memory access

                                                                            Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
Our work
T. Alexoudi et al, “Optics in Computing: from Photonic Network-on-Chip to Chip-to-Chip Interconnects and Disintegrated Architectures “, JLT 2019

                                                                                      disaggregate at rack-level…
                                                                                       down to disintegration at
                                                                                       chip-level

                                                                                                                                                  hierarchy
     Hipoλaos Optical Packet                           On-board                                 Optical Static RAM                                     1.48 cm

      Switch                                             disaggegration with                       technology
                                                                                                                                      Test
                                                                                                                                   structures

                                                                                                                                                18 SOAs

                                                         >8 sockets in p2p
                                                                                                                                                 44 I/O

     1024x1024 ports, 10T                                                                        Disintegrate via off-
      capacity scalable to >40T                         Low-energy, low-                          chip high-speed
     sub-μsec latency                                   latency with Si-Pho                       optical caches
                                                                                                                                                          Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
The Hipoλaos Optical Packet Switch
     1024x1024 -port                              Multicasting                          Si-integration ?

 N. Terzenidis et al, ECOC, Sept. 2018   N. Terzenidis et al, IEEE PTL, pp. 1535-   M. Moralis-Pegios et al, IEEE PTL
                                         1538, Sept. 2018                           pp. 712-715, April, 2018

 N. Terzenidis et al, ONDM 2019 to demonstrate performance in a 256-
   node network
                                                                                                             Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
Disaggregate at board-level

Beating QPI: a p2p board-
level optical interconnect
for >8 sockets

                              Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
The inner-anatomy: board-level
       Switch-based              Direct P2P - High-End Servers
                                                          CISC processors INTEL (x86)
                                                                Intel® Xeon®
                                                              processor E5-4600

          •   PCIe
                               INTEL QPI, AMD HyperTransport, IBM POWER8 SMP
          •   RapidIO          Links, Oracle SPARK Coherence links, NVIDIA NVlink

  Unlimited connectivity          Low-latency, low-power
   Increased latency, power,        Limited connectivity
   cost, size
                                                                           Nikos Pleros
Optics for disaggregating Datacenters and disintegrating Computing - ONDM 2019
QPI
                                                                    102.4Gb/s
                              4s-“glueless”
…the dominant P2P link
                              architecture

 4 sockets P2P connected
 9.6 Gb/s line rate          8s- “glueless”
 38.4 GB/s BW (307.2 Gb/s     architecture
                                up to 2-hop
  or 102.4 Gb/s /direction)
                                Increased latency
 60ns-240ns Latency
 16 pJ/bit TxRx power
  efficiency
                                       Intel Xeon E7-8800v4: 1st Intel Processor to support
                                       up to 8-socket ! (released June 2016)
                                                                                Nikos Pleros
Going beyond 8 sockets?
             Bixby (BX) switches
                                          January 2017

 Price/performance drastically decreases as #processors increases
 Latency increases
 65% of QPI Bandwidth wasted for cache coherency updates
                                                              Nikos Pleros
The ICT-STREAMS O-band technology
                                    www.ict-streams.eu

  16x50Gb/s WDM
  TxRx SiP O-band

                                  16x16 AWGR- O-band
    1300nm SM EOPCB with 50G RF                Nikos Pleros
The ICT-STREAMS p2MP architecture

  Strictly non-blocking                     High bandwidth optical links
  All-to-All connectivity                   No O/E/O conversions
  Suitable for broadcasting/multicasting    Passive (no power consumption)
  High number of ports (up to 32)           Time of flight latency
                                                                             Nikos Pleros
STREAMS vs QPI
                       QPI        STREAMS       gain

        Sockets/
                     Up to 8        >16         x2
         nodes

         #hops       up to 2          1         ÷2

        Line rate    9.6 Gb/s     50 Gb/s       x 10

        Link BW     307.2 Gb/s   16x50 Gb/s     x 2.6

       Throughput   2.45 Tb/s     12.8 Tb/s     x 5.2

         Power
                     16 pj/bit   < 3.5 pj/bit   ÷4
       efficiency

                                                        Nikos Pleros
The on-board routing platform

                     8×8 O-band Si-AWGR

                 S. Pitris et al., Opt. Express 26(5), 2018
                                                              Nikos Pleros
Si-based 8x8 O-band AWGR technology

                                      S. Pitris et. al., OFC 2018
                                      S. Pitris et. al., OpEx 2018

     10nm ch. spacing
     channel crosstalk : 11 dB
     3-dB bandwidth 5.7nm
     non-uniformity : 3.5 dB loss,
     insertion losses: 2.5 dB loss
                                                                     Nikos Pleros
Si-AWGR on board

                                                      AWGR chip

                                                                    18mm
                                               30mm

 Si-AWGR onto an EOPCB for on-board routing
 Testing currently on-going
                                                          Nikos Pleros
Multisocket routing @40Gb/s

     40 Gb/s O-band RM                    8×8 O-band Si-AWGR                  40G PD & BICMOS TIA

     S. Pitris et al., IEEE PJ 2018   S. Pitris et al., Opt. Express, 2018.     S. Pitris et al., PJ 2018

                                                                                                            Nikos Pleros
8x40Gb/s multi-socket Tx/Rx/routing

S. Pitris et al., IEEE Photon. J., Oct. 2018

                                                    RM: 1.7pJ/bit @40Gbps
                                                    TIA: 4 pJ/bit @40Gbps

                                                5.7pJ/bit for Tx/Rx link
                                                  (incl. thermal tuning)

                                                                           Nikos Pleros
The WDM Transceiver engine

      4ch Si-pho WDM TxRx

     S. Pitris et al., OFC 2019

                                  Nikos Pleros
4x40Gb/s O-band Si WDM transmitter
                                                                S. Pitris et. al., OFC 2019

       Mask Layout

                ER = 4.4 dB    ER = 4.1 dB    ER = 4 dB       ER = 4.2 dB
                      Tx1         Tx2            Tx3              Tx4
2mV 10ps/div

               λ1=1310.25 nm   λ2=1303.23    λ3=1317.28 nm   λ4=1324.59 nm
                                                                               Nikos Pleros
4x50Gb/s on-board WDM transmitter
                                            4-channel 55nm BiCMOS DR   Tx out 1 @ 50 Gb/s NRZ
                                                                               (λ=1314.9 nm)
        High Speed TX RF traces
                                                                               ER = 4.3 dB

 HF                                                                           4mV-10ps/div
Board                         DC traces                                  RM-driving voltage = 1.9 Vpp
                                                                           EE=1.525pJ/bit/RM +
                                                                              0.8 pJ/bit (tun)

                                                                       Rx TIA out 1 @ 30 Gb/s NRZ
                                                                                (λ=1318.4 nm)

        High Speed RX RF traces
                                          4-channel 55nm BiCMOS TIA

                                                                             10mV-20ps/div

 All 4-channels capable of >50G operation                             TIA output voltage = 120 mVpp
                                                                             PCTIA=130mW/ch

                                                                                               Nikos Pleros
1-Volt 50Gb/s x 52km transmission
                    FDSOI CMOS DR         H. Ramon et al., IEEE PTL 30, 2018.

               Micro-ring Modulator                            M. Moralis-Pegios et al., OECC/PSC 2019

                  BER measurements @ 40 Gb/s
                                                         1V-driven 50 Gb/s NRZ operation &
                                                          transmission at 52 km through SMF
                                                         EE = 0.8 pJ/bit @ 50 Gb/s

                                                            Record-high BxL
                                                             = 2600 Gb-km/s

                    Power penalty = 0.2 dB @10-9

                                                                                            Nikos Pleros
The energy-latency gain
                                                              Power
                                 Device        Ref.                       EE@40Gb/s     EE@50Gb/s
                                                           consumtpion

                                   LD           [1]           6.1 dBm      1 pJ/bit      0.8 pJ/bit

                               RM DRIVER +
                                             This work    40 mW + 40 mW    2 pJ/bit      1.6 pJ/bit
                                 TUNING

                                 PD TIA      This work        130 mW      3.25 pJ/bit    2.6 pJ/bit

                                 SerDes         [2]           11.6 mW     0.29 pJ/bit   0.232 pJ/bit
1   S. Pitris et al, PJ 2018
2 V.   Stojanovic, OPEX 2018                             Totals           6.54 pJ/bit   5.23 pJ/bit

                                        %Reduction to QPI                  ~60%          ~68%

                           ~68% energy savings compared to QPI
                           with >4-node direct connectivity !
                                                                                                       Nikos Pleros
Disintegrate at chip-level

Optical RAMs: disintegrate
via off-chip optical caching

                               Nikos Pleros
Why to disintegrate?

Yigit Demir et al, “Galaxy: A High-Performance Energy-Efficient   ORACLE, Pranay Koka et al, “Silicon-Photonic Network Architectures
Multi-Chip Architecture Using Photonic Interconnects”, ICS 2014   for Scalable, Power-Efficient Multi-Chip Systems”, ISCA 2010

  Smaller dies (chiplets) interconnected through optics
  Relieve from area & power constraints – ease scaling to many-core
  >2x speed-up in512-core (Oracle) and 4k-core (Galaxy) settings
                                                                                                                             Nikos Pleros
On-chip caches and memory bandwidth
     Evolution of on-die caches
                                                    40%

[Source: S. Borkar and A.A.Chien, Com. ACM, 2011]                     [Source] INTEL

     Memory speed-up mainly through hierarchical levels
             Large and complex two or three level cache hierarchies
             Up to 40% of chip real-estate consumed by caches and X-bar
             Still
Our goal
Now!

CMP:
cores                        p-NoC
only!                  o/e                           o/e
                                     optical
                                     cache

 Modular and granular setting with cache-free cores and a unified,
  shared pool of cache resources
 Needs an ultra-fast optical cache for time-sharing between lower-
  clock cores                                                   Nikos Pleros
10 Gbps Optical RAM
                      A. Tsakyridis et
                      al, CLEO 2019       First to exceed electronics
                                          2x the Speed: 10Gbps

                                            Up to 5GHz RAM speed
                                            Fail to exceed electronics

                                           Give up access times for
                                           lower footprint/power

                                                              Nikos Pleros
10 Gbps Optical RAM
10 Gb/s monolithic InP Flip-Flop

  Packaged Optical Memory

                                                                                 BER penalties: Write 6.2 dB, Read 0.4 dB
                                   A. Tsakyridis et al, CLEO 2019, A. Tsakyridis et al, OSA Optics Letters 2019
                                                                                                                  Nikos Pleros
10 Gbps Optical RAM

                      Nikos Pleros
Conclusions

                                                                                   hierarchy

     Hipoλaos Switch        On-board p2p >8 sockets     10GHz Optical SRAM: 2x
                             5.2pJ/bit link @50Gb/s       the speed of electronics
      1024x1024-port
                             Synergy between             Si-based RAM layout +
      sub-μsec latency
                              technology+architecture:     Optical cache design
      Scalable to 10Tbps
                              68% energy save over QPI    Modular, granular, flexible
       with 113pJ/bit
                                                                                   Nikos Pleros
www.ict-streams.eu
                     Contacts:
                     Prof. Nikos Pleros : npleros@csd.auth.gr
                     Dr. Theoni Alexoudi : theonial@csd.auth.gr

                                                       Nikos Pleros
Back-up slides

                 Nikos Pleros
The Si-based Optical SRAM
Hybrid PhC-based laser Flip Flop with only 13fJ/bit
  *In collaboration with F. Raineri and R. Raj, C2N/CNRS
                                                                         Lowest power consumption FF

                                              Switching energy 6.4 fJ/bit@ 5 GHz    3.2 fJ/bit@ 10 GHz

                                                                        Input

 *T. Alexoudi et al, JSTQE (2016)
 * D. Fitsios et al, Optics Express, (2016)

  6.2μm2 footprint                                                     Output

 
The first optical cache design
* P. Maniotis et.al., "Optical Buffering for Chip Multiprocessors: A 16GHz Optical Cache Memory Architecture,” JLT, pp.4175-4191, 2013

 Optical cache simulations @ 16GHz                                                                Optical RAM
                                                                                                  Peripheral Circuits
 WDM                                                                                               Row Access Gate +
                                                     Access Gate + Col. Decoder
optical                                                                                             Column Decoder
memory                      Row Decoder                                                            Row Decoder
address                                                                                            Tag Comparator
                                                                                                  Experiments @10Gb/s

                          Tag Comparator
 WDM
optical
word /                                                          Fast Optical RAM cell:
 data
                                                                 design+theory > 40GHz
                                                                 experiment @10GHz
                                                                 Record-low power
                                                                                                                           Nikos Pleros
Caching with light at 16Gbps
 VPI   simulations with exp.verified SOA model
 Write operation
                           Write to row “00”: Only when both λ1,λ2=0 &
                           RW=0 (green area) the FF contents follow the
                           Bit contents (blue area)

                                 average ER  6.2 dB
                                 average AM  1.9 dB
                                 close to experimental values!
                                                                     Nikos Pleros
And compare…
 CONVENTIONAL TOPOLOGY     OPTICAL TOPOLOGY

  8 cores, 2GHz clock    8 cores, 2GHz clock
  Dedicated L1           16GHz cache clock
  Shared L2=8xL1         Off-die shared L1 – no L2

                                                  Nikos Pleros
Execution times for PARSEC                                                                       L1 Conventional
                                                                                                 L1+L2 Conventional
                                                                                                 2xN GHz Optical

               MEAN                     63.4               62.7
 *P. Maniotis et al."High-Speed Optical Cache Memory as Single-Level Shared
                                                                              19.4                10.7
  Cache in Chip-Multiprocessor architectures," in Proceedings of HIPEAC15, 19-21 January, 2015
                                                                                                             Nikos Pleros
You can also read