Realizing One Big Switch Performance Abstraction using P4

Jeongkeun “JK” Lee, Principal Engineer, Intel Barefoot
Petr Lapukhov, Network Engineer, Facebook
Agenda
•   Need for a high-performance cluster for ML training
•   Vision: One Big Switch with Distributed VoQ
•   Problem: network congestion
•   Switch-side building blocks and P4 constructs

• Contributions from
     • Anurag Agrawal, Jeremias Blendin, Andy Fingerhut, Grzegorz Jereczek, Yanfang Le, Georgios
        Nikolaidis, Rong Pan, Mickey Spiegel

A Training Cluster Primer
[Figure: training cluster topology. Each server has frontend NICs (NIC1–NIC4) and CPUs (CPU1–CPU4) behind a Frontend (FE) ToR connected to the Storage Fabric, and accelerators (Accel1–Accel8) with RDMA NICs (RNIC1–RNIC8) behind a Backend (BE) ToR connected to the Compute Fabric. BE – Backend, FE – Frontend, ToR – Top of Rack.]
Distributed ML training

•   Combines data & model parallelism
•   Bulk-synchronous parallel realization (BSP)
•   Collective-style communications
•   All trainers arrive at a barrier in each cycle

Ref: https://arxiv.org/pdf/2104.05158.pdf

The problem statement

•   Synchronous training cycle = [Compute + Network]
•   Network Collectives: All-to-all, All-Reduce,…
•   Collective produces a combination of flows
•   Objective: min(collective completion time)
•   Large messages – O(1MB)

What’s the challenge?

•   Accelerators require HW offloaded transport
•   All-to-all is bisection BW hungry
•   All-reduce may cause incast; flow entropy varies
•   Large Clos fabric requires tuning with RoCE
•   Small scale is fine – a single large crossbar switch suffices

Vision: One Big Switch with Distributed VoQ

• “Network as OBS” model has been used in
     • SDN for network-wide control plane and policy programs
     • OVS/OVN network virtualization
     • Service Mesh for uServices
• OBS performance model = hose model (condition spelled out below)
     • Non-blocking fabric
     • Congestion only at network egress (output queue)
• Distributed VoQ (Virtual Output Queue)
     • VoQ reflects egress congestion & BW towards receivers
     • Move queueing from egress to ingress
     • OBS ingress = 1st-hop switches or senders

[Figure: hose model. Hosts X, Y, Z, each with hose bandwidth BX, BY, BZ, attached to a non-blocking fabric. Source: http://lib.cnfolio.com/]
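For reference, the hose-model admissibility condition in its standard form (my notation, following the figure's BX/BY/BZ; not spelled out on the slide): a traffic matrix is feasible on the non-blocking fabric as long as no endpoint sends or receives more than its hose bandwidth.

    \sum_{j \neq i} r_{ij} \le B_i  \quad\text{and}\quad  \sum_{j \neq i} r_{ji} \le B_i  \qquad \text{for every endpoint } i

where r_{ij} is the offered rate from endpoint i to endpoint j and B_i is i's hose bandwidth. Under this model, congestion can appear only where a receiver's hose is oversubscribed, i.e., at the network egress queue, which is what the distributed VoQ targets.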
Reality: Congestion in the DC Clos fabric
•   Uplink/core ← challenge for the non-blocking fabric
     •   Cause: ECMP/LAG hash collision
     •   Worse in oversubscribed networks

•   Incast ← challenge for VoQ
     •   Cause: many-to-one traffic pattern
     •   Congestion surge at the last hop
     •   Slows down the e2e signal loop

•   Receiver NIC ← challenge for VoQ
     •   Cause: slow software/CPU, PCIe bottleneck
Switch-side building blocks
1. Provide rich congestion & BW metrics
    → for applications to acquire available BW asap while controlling congestion
2. Fine-grained load-balancing w/ minimal or no out-of-order delivery
    → for a non-blocking, full-bisection fabric
3. Cut-payload signaling to receivers
    → for NDP-style receiver pulling
4. Sub-RTT signaling back to senders
    → for sudden changes of congestion/BW state
5. React to rx NIC congestion
    → leverage switch programmability where a smart NIC is not available or applicable
1. Provide Congestion and BW metrics
• Goal: the switch provides rich info for senders/receivers to consume and act on

• For whom: incoming data pkts, control pkts (RTS/CTS solicitation)
• What to provide: queue depth, drain time, TX rate or available BW, arrival rate
  (incast rate), # of flows, congestion locator (node/port/Q IDs) – illustrative header sketch below
• How: in-band on forwarded pkts, a separate signal pkt back to the sender, or a signal
  pkt to the receiver
• Where: last-hop or core switch, ingress or egress
• When: threshold crossing, bloom-filter suppression, when the switch metric deviates
  from the in-pkt metric, when the NIC sends PFC, upon pkt drops, …
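As a concrete illustration, a P4_16 header that could carry these metrics, either in-band on a data packet or in a dedicated signal packet. Every field name and width below is an assumption made for this sketch, not something defined in the talk.

// Hypothetical congestion-metric header; field names and widths are assumptions.
header congestion_signal_h {
    bit<32> node_id;            // congestion locator: switch/node ID
    bit<9>  port_id;            // congestion locator: egress port
    bit<7>  queue_id;           // congestion locator: queue
    bit<32> queue_depth_bytes;  // current queue depth
    bit<32> drain_time_ns;      // queue depth divided by the drain rate
    bit<32> tx_rate_kbps;       // measured TX rate or available BW
    bit<32> arrival_rate_kbps;  // arrival (incast) rate
    bit<16> num_flows;          // estimate of the number of active flows
}

A parser state keyed on a dedicated EtherType or UDP port (the "P4 parser" bullet on the next slide) would extract this header from incoming control packets, and the deparser would emit it on outgoing signal packets.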
P4 Primitives (w/ PSA/TNA/T2NA)
•   For whom: incoming data pkts, control pkts (RTS/CTS solicitation)
           → P4 parser
•   What to provide: queue depth, drain time, per-port or per-queue TX rate, arrival rate (incast rate), # of flows,
    congestion locator (node/port/Q IDs)
           → standard/intrinsic metadata, register with HW reset, meter or LPF extern
•   How: in-band on forwarded pkts, a separate signal pkt back to the sender, or a signal pkt to the receiver
    (at high priority, like NDP cut-payload)
           → modify, mirror, recirculate, multicast actions
•   Where: last-hop or core switch, ingress or egress, depending on use case
           → T2NA: ingress visibility of egress queue status (P4 Expert Round Table 2020)
•   When: threshold crossing, bloom-filter suppression, when the switch metric deviates from the in-pkt metric,
    when the NIC sends PFC, upon pkt drops, … (threshold-crossing sketch below)
           → register, meter, P4 processing of RX PFC frames, TNA: Deflect on Drop
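A minimal sketch of the "When: threshold crossing" case, written against the open-source v1model architecture for readability; the talk targets PSA/TNA/T2NA, whose queue metadata and mirroring externs differ. The threshold value, mirror-session ID, and per-port register layout are assumptions, and the parser, deparser, and the rest of the pipeline are omitted.

#include <core.p4>
#include <v1model.p4>

const bit<19> QDEPTH_THRESHOLD      = 1000;  // assumed threshold, in deq_qdepth units
const bit<32> SIGNAL_MIRROR_SESSION = 250;   // assumed mirror session for the signal copy

// One bit per egress port: is this port's queue currently above the threshold?
register<bit<1>>(512) above_threshold;

control SignalOnCrossing(inout standard_metadata_t standard_metadata) {
    apply {
        bit<1> was_above;
        above_threshold.read(was_above, (bit<32>)standard_metadata.egress_port);

        if (standard_metadata.deq_qdepth > QDEPTH_THRESHOLD && was_above == 0) {
            // Upward crossing: clone the packet to a mirror session; the cloned copy
            // can then be rewritten into a congestion signal (e.g., the header above).
            clone(CloneType.E2E, SIGNAL_MIRROR_SESSION);
            above_threshold.write((bit<32>)standard_metadata.egress_port, 1);
        } else if (standard_metadata.deq_qdepth <= QDEPTH_THRESHOLD && was_above == 1) {
            // Queue drained back below the threshold: re-arm for the next crossing.
            above_threshold.write((bit<32>)standard_metadata.egress_port, 0);
        }
    }
}

Applied from the egress control, this generates one signal per crossing rather than per packet; the bloom-filter and metric-deviation triggers listed above would replace the one-bit register with richer per-flow or per-metric state.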
2. Fine-grained Load Balancing
• For a non-blocking fabric
• Sender NIC may spray packets by changing the L4 src port number (e.g., AWS SRD)
    • Pros: ECMP switches don’t need to change; handles brownfield better w/ path tracking at the NIC
    • Cons: transport & congestion control change; the 5-tuple connection ID may change
• Switch alternative 1: flowlet switching (e.g., Conga) – sketch below
    • Pros: no change to transport, no OOO delivery
    • Cons: flow scale is limited
• Switch alternative 2: DRR (Deficit Round Robin) packet spraying
    • Pros: DRR minimizes load imbalance, hence the OOO delivery window; DRR is possible in P4
    • Cons: needs greenfield (at least at the ToR layer); needs packet re-ordering at the RX NIC/host

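A minimal sketch of flowlet switching (switch alternative 1), again v1model-flavored. The flowlet timeout, table size, header and metadata field names (hdr.ipv4, hdr.tcp, meta.ecmp_select), and the placeholder path choice are all assumptions; a real design would pick the least-loaded or a random uplink where the comment indicates.

const bit<48> FLOWLET_TIMEOUT_US = 200;      // assumed idle gap that opens a new flowlet
const bit<32> FLOWLET_TABLE_SIZE = 65536;

register<bit<48>>(FLOWLET_TABLE_SIZE) last_seen;    // last packet timestamp per flowlet slot
register<bit<16>>(FLOWLET_TABLE_SIZE) chosen_path;  // uplink pinned to the flowlet

control FlowletSelect(inout headers_t hdr,
                      inout metadata_t meta,
                      inout standard_metadata_t standard_metadata) {
    apply {
        // Hash the 5-tuple into a flowlet slot.
        bit<32> idx;
        hash(idx, HashAlgorithm.crc32, 32w0,
             { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.ipv4.protocol,
               hdr.tcp.srcPort, hdr.tcp.dstPort },
             FLOWLET_TABLE_SIZE);

        bit<48> last;
        bit<16> path;
        last_seen.read(last, idx);
        chosen_path.read(path, idx);

        // ingress_global_timestamp is in microseconds on the v1model/BMv2 target.
        if (standard_metadata.ingress_global_timestamp - last > FLOWLET_TIMEOUT_US) {
            // The idle gap is long enough that reordering is unlikely:
            // start a new flowlet and pick a new uplink (placeholder: one of 4).
            path = (bit<16>)(standard_metadata.ingress_global_timestamp[1:0]);
            chosen_path.write(idx, path);
        }
        last_seen.write(idx, standard_metadata.ingress_global_timestamp);
        meta.ecmp_select = path;  // consumed by the next-hop/ECMP table (not shown)
    }
}

Because the path changes only after an idle gap longer than the worst-case path-delay difference, packets within a flowlet stay in order, which is the "no OOO delivery" property called out above.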
3. Cut-Payload to Receiver
• Signal last-hop switch congestion & drop events to receiver at high-priority
• NDP P4 design by Correct Networks and Intel, using
    • ingress meter + mirror
    • deflect-on-drop + multicast
    • on Tofino1
• P4 source will be posted soon at https://github.com/p4lang/p4-applications

Check out “NDP with SONiC-PINS: A low latency and high performance data-center transport architecture integrated into SONiC” by Rong P. and Reshma S.

4. Sub-RTT, L3 Signaling back to sender
•   Edge-to-edge signaling of congestion info
•   Also covers building block 5: able to carry the rcv NIC congestion signal (e.g., NIC-to-switch PFC) to sender hosts
•   Senders can use the signal (VoQ) in various ways, e.g., instant flow control to ‘flatten the curve’
         •   Example below: SFC (Source Flow Control) = new L3 switch-to-switch signaling + PFC as the existing flow control at
             the sender NIC (no PFC between switches)
         •   Future work: fine-grained per-receiver VoQ at the sender
[Figure: SFC operation. (1) Incast traffic from sender hosts 1 and 2 flows through the 1st-hop switch and the Agg + Core switches to the last-hop switch and the receiving host. (2) The last-hop switch sends a congestion signal back with the IP addresses reversed and a pause time. The 1st-hop switch caches [dstIP: pause time] and issues PFC (3, 5) to the sending hosts’ queues. ECN (6) is marked after the queue delay and the ACK (7) returns from the receiver. A rough P4 sketch of the signal construction follows.]
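A rough sketch of step 2, turning a mirrored copy of an incast packet into the congestion signal: reverse the IP addresses so the packet routes back to the sender, and attach a pause time derived from how long the egress queue needs to drain. The sfc header, the metadata field, both constants, and the power-of-two drain-rate shift are assumptions for this sketch; the actual SFC packet format is not given in these slides.

const bit<32> SFC_DRAIN_TARGET_BYTES = 10240;  // 10KB drain target (from the simulation setup)
const bit<8>  DRAIN_RATE_LOG2        = 13;     // assume ~8KB/us drain rate so the divide is a shift

// Runs on the mirrored copy, only when meta.queue_depth_bytes exceeds the drain target.
// IPv4 checksum update and L2/L4 rewrites are omitted.
action build_sfc_signal() {
    // Route the signal back to the sender: swap source and destination IPs.
    bit<32> tmp = hdr.ipv4.srcAddr;
    hdr.ipv4.srcAddr = hdr.ipv4.dstAddr;
    hdr.ipv4.dstAddr = tmp;

    // Pause roughly long enough to drain the queue down to the target depth.
    hdr.sfc.setValid();
    hdr.sfc.pause_time_us =
        (meta.queue_depth_bytes - SFC_DRAIN_TARGET_BYTES) >> DRAIN_RATE_LOG2;
}

The 1st-hop switch caches the resulting [dstIP: pause time] entry and translates it into PFC toward the matching host queue, so the sender NIC itself needs no changes.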
SFC: System Demo

[Figure: demo topology. Sender hosts H0, H1, H3, H4, H5 attach to SW5 over 100G links and each issues 160 flows (RDMA Reliable Connection); the SW5–SW6 link runs at 100G and carries 640 flows; SW6 connects to a victim host and a normal host over 25G links.]
Flow Completion Time
[Figure: CDF of flow completion time (ms), comparing PFC and SFC.]
Queue Depth

[Figure: switch queue depth for the same demo topology (H0, H1, H3, H4, H5 senders at 100G through SW5 and SW6; victim and normal receivers at 25G).]
Summary
• A true One-Big-Switch performance abstraction is possible
• P4 primitives can build
    • a non-blocking Clos fabric
    • sub-RTT signaling of congestion & BW metrics

• Research questions
    • What is the scope of OBS; granularity of distributed VoQ
         • From last-hop switch queue to rxNIC port to service node (ML worker or µService)
         • Should consider major incast bottleneck point and VoQ scale
    • NIC-side building blocks to enable distributed VoQ at senders
    • How QoS infra and congestion control can benefit from VoQ

Thank You
 jk.lee@intel.com
   petr@fb.com
Simulation setup

Cluster: 3-tier, 320 servers, full bisection, 12us base RTT
Switch buffer: 16MB, Dynamic Threshold
Congestion control: DCQCN+window, HPCC
SFC Parameters
 •   SFC trigger threshold = ECN threshold = 100KB, SFC drain target = 10KB
Workload: RDMA writes
 •   50% Background load: shuffle, msg size follows public traces from RPC, Hadoop, DCTCP
 •   8% incast bursts: 120-to-1, msg size 250KB, synchronized starts within 145us
Metrics
 •   FCT slowdown: FCT normalized to the FCT of a same-size flow at line rate (spelled out below)
 •   Goodput, switch buffer occupancy
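Spelling out the slowdown metric (my formulation; whether the ideal FCT includes the base RTT is an assumption), with message size S_f, line rate R, and base RTT T_0:

    \mathrm{slowdown}(f) = \frac{\mathrm{FCT}(f)}{\mathrm{FCT}_{\mathrm{ideal}}(f)},
    \qquad
    \mathrm{FCT}_{\mathrm{ideal}}(f) \approx \frac{S_f}{R} + T_0

so a slowdown of 1 means the flow finished as fast as an unloaded network would allow.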
Simulation with RPC-inspired msg size dist.

[Figure: 50th, 95th, and 99th percentile FCT slowdown vs. message size (Bytes), plus CDFs of buffer occupancy (MB) and goodput (Gbps), comparing ToR-PFC, Full-PFC, ToR-SFC, and No-PFC; congestion control: DCQCN+Window.]

[Figure: the same set of plots (FCT slowdown percentiles vs. message size, buffer occupancy CDF, goodput CDF) with HPCC as the congestion control.]