Accelerating Spark with RDMA for Big Data Processing: Early Experiences


Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam,
Dipti Shankar, and Dhabaleswar K. (DK) Panda

Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
  
Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Digital Data Explosion in the Society

Figure source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode
Big Data Technology - Hadoop

• Apache Hadoop is one of the most popular Big Data technologies
   – Provides frameworks for large-scale, distributed data storage and processing
   – MapReduce, HDFS, YARN, RPC, etc.

[Diagram: the Hadoop 1.x stack layers MapReduce (cluster resource management & data processing) over HDFS and Hadoop Common/Core (RPC, ...); Hadoop 2.x inserts YARN (cluster resource management & job scheduling) under MapReduce and other data-processing models, again over HDFS and Hadoop Common/Core]
Big Data Technology - Spark

• An emerging in-memory processing framework
   – Iterative machine learning
   – Interactive data analytics
   – Scala-based implementation
   – Master-slave architecture; relies on HDFS and Zookeeper
• Scalable and communication intensive
   – Wide dependencies between Resilient Distributed Datasets (RDDs)
   – MapReduce-like shuffle operations to repartition RDDs
   – Sockets-based communication, same as Hadoop

[Diagram: a Driver's SparkContext talks to the Master (coordinated via Zookeeper), which schedules Workers backed by HDFS; RDD lineage graphs illustrate narrow vs. wide dependencies]
Common Protocols using Open Fabrics

Interconnect/Protocol   App. Interface   Protocol                    Adapter      Switch
1/10/40 GigE            Sockets          TCP/IP (kernel)             Ethernet     Ethernet
IPoIB                   Sockets          TCP/IP (kernel)             InfiniBand   InfiniBand
10/40 GigE-TOE          Sockets          TCP/IP (hardware offload)   Ethernet     Ethernet
RSockets                Sockets          RSockets                    InfiniBand   InfiniBand
SDP                     Sockets          SDP                         InfiniBand   InfiniBand
iWARP                   Verbs            RDMA                        iWARP        Ethernet
RoCE                    Verbs            RDMA                        RoCE         Ethernet
IB Verbs                Verbs            RDMA                        InfiniBand   InfiniBand
Previous Studies

• Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
   – Hadoop acceleration with RDMA
      • N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC'14
      • N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC'12
      • M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS'14
      • M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC'13
      • X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP'13
   – HBase acceleration with RDMA
      • J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
   – Memcached acceleration with RDMA
      • J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11
The High-Performance Big Data (HiBD) Project

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop

• High-performance design of Hadoop over RDMA-enabled interconnects
   – High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
   – Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
   – Based on Apache Hadoop 1.2.1
   – Compliant with Apache Hadoop 1.2.1 APIs and applications
   – Tested with
      • Mellanox InfiniBand adapters (DDR, QDR, and FDR)
      • RoCE support with Mellanox adapters
      • Various multi-core platforms
      • Different file systems with disks and SSDs
   – http://hibd.cse.ohio-state.edu
• RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!
Performance Benefits - RandomWriter & Sort on SDSC Gordon

[Charts: job execution time, IPoIB (QDR) vs. OSU-IB (QDR), for 50-300 GB on clusters of 16-64 nodes; RandomWriter reduced by up to 20%, Sort by up to 36%]

• RandomWriter in a cluster with a single SSD per node
   – 16% improvement over IPoIB for 50 GB in a cluster of 16 nodes
   – 20% improvement over IPoIB for 300 GB in a cluster of 64 nodes
• Sort in a cluster with a single SSD per node
   – 20% improvement over IPoIB for 50 GB in a cluster of 16 nodes
   – 36% improvement over IPoIB for 300 GB in a cluster of 64 nodes

Can RDMA benefit Apache Spark on High-Performance Networks also?
Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Problem Statement

• Is it worth it?
   – Is the performance improvement potential high enough, if we can successfully adapt RDMA to Spark?
   – A few percentage points, or orders of magnitude?
• How difficult is it to adapt RDMA to Spark?
   – Can RDMA be adapted to suit the communication needs of Spark?
   – Is it viable to rewrite portions of the Spark code to use RDMA?
• Can Spark applications benefit from an RDMA-enhanced design?
   – What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
   – Can an RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential

• How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?

• Assessment methodology
   – Evaluation on primitive-level micro-benchmarks (a sketch follows this list)
      • Latency, bandwidth
      • 1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
      • Java/Scala-based environment
   – Evaluation on typical workloads
      • GroupByTest
      • 1GigE, 10GigE, IPoIB (32Gbps)
      • Spark 0.9.1
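As an illustration of the primitive-level methodology, here is a minimal Scala sketch of a socket ping-pong latency micro-benchmark, assuming a plain echo server on the other side; the host, port, message size, and iteration count are placeholders, not the authors' actual benchmark harness.

```scala
import java.io.{DataInputStream, DataOutputStream}
import java.net.Socket

object PingPongLatency {
  def main(args: Array[String]): Unit = {
    val sock = new Socket("server-host", 9999) // hypothetical echo server
    sock.setTcpNoDelay(true)                   // disable Nagle batching for latency tests
    val out  = new DataOutputStream(sock.getOutputStream)
    val in   = new DataInputStream(sock.getInputStream)
    val size  = 1024                           // message size under test
    val iters = 10000
    val buf   = new Array[Byte](size)
    val start = System.nanoTime
    for (_ <- 0 until iters) {
      out.write(buf); out.flush()              // ping
      in.readFully(buf)                        // wait for the full echo (pong)
    }
    val avgRttNs = (System.nanoTime - start) / iters
    println(s"$size bytes: avg one-way latency ${avgRttNs / 2000.0} us")
    sock.close()
  }
}
```

The same loop, pointed at 1GigE, 10GigE, or IPoIB interfaces, gives the sockets-based numbers; the RDMA datapoint would use a verbs-level equivalent instead of `java.net.Socket`.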

Evaluation on Primitive-level Micro-benchmarks

[Charts: latency for small (1 byte - 2 KB) and large (4 KB - 4 MB) messages, and peak bandwidth, over 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]

• Obviously, IPoIB and 10GigE are much better than 1GigE.
• Compared to other interconnects/protocols, RDMA can significantly
   – reduce the latencies for all message sizes
   – improve the peak bandwidth

Can these benefits of High-Performance Networks be achieved in Apache Spark?
Evaluation on Typical Workloads

[Charts: GroupBy time for 4-10 GB over 1GigE, 10GigE, and IPoIB (32Gbps), annotated with 65% and 22% improvements; network throughput over time in the receive and send directions]

• For High-Performance Networks,
   – the execution time of GroupBy is significantly improved
   – the network throughputs are much higher, for both send and receive
• Spark can benefit from high-performance interconnects to a greater extent!

Can RDMA further benefit Spark performance compared with IPoIB and 10GigE? (A sketch of the GroupBy-style workload follows.)
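For concreteness, here is a minimal Scala sketch of a GroupBy-style shuffle workload in the spirit of Spark's GroupByTest example; the master URL, pair counts, and value sizes are illustrative assumptions, not the exact parameters of these experiments.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD operations (Spark 0.9.x style)

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "GroupBySketch") // placeholder URL
    val numMappers = 64; val numKVPairs = 10000; val valSize = 1000
    // Generate random (key, value) pairs on each mapper and cache them,
    // so the timed phase measures the shuffle rather than input generation.
    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new java.util.Random
      Array.fill(numKVPairs)((rand.nextInt(Int.MaxValue), new Array[Byte](valSize)))
    }.cache()
    pairs.count()                          // materialize input before timing
    val start = System.currentTimeMillis
    pairs.groupByKey(numMappers).count()   // wide dependency => all-to-all shuffle
    println(s"GroupBy time: ${System.currentTimeMillis - start} ms")
    sc.stop()
  }
}
```

Because `groupByKey` repartitions every record, its running time is dominated by shuffle communication, which is what makes it a good probe of the interconnect.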
Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Architecture Overview

• Design goals
   – High performance
   – Keeping the existing Spark architecture and interface intact
   – Minimal code changes

[Diagram: Spark applications (Scala/Java/Python) run tasks over the Spark core (Scala/Java); the BlockManager hosts Java NIO (default), Netty (optional), and RDMA (plug-in) Shuffle Servers, and the BlockFetcherIterator hosts the matching Shuffle Fetchers; the default and optional paths use Java sockets over 1/10 GigE or IPoIB (QDR/FDR), while the RDMA plug-ins sit on an RDMA-based Shuffle Engine (Java/JNI) over native InfiniBand (QDR/FDR)]

• Approaches
   – Plug-in based approach to extend the shuffle framework in Spark (see the sketch after this list)
      • RDMA Shuffle Server + RDMA Shuffle Fetcher
      • ~100 lines of code changed inside the original Spark files
   – RDMA-based Shuffle Engine
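To illustrate the plug-in idea, here is a minimal Scala sketch of selecting a shuffle fetcher implementation through a configuration flag; the trait, class, and property names are hypothetical stand-ins, not the actual Spark or RDMA-Spark code.

```scala
// All names below (ShuffleFetcher, "spark.shuffle.transport", the three
// fetcher classes) are illustrative assumptions for this sketch.
trait ShuffleFetcher {
  def fetch(blockIds: Seq[String]): Iterator[Array[Byte]]
}

class NioShuffleFetcher   extends ShuffleFetcher { def fetch(ids: Seq[String]) = Iterator.empty } // default Java NIO path
class NettyShuffleFetcher extends ShuffleFetcher { def fetch(ids: Seq[String]) = Iterator.empty } // optional Netty path
class RdmaShuffleFetcher  extends ShuffleFetcher { def fetch(ids: Seq[String]) = Iterator.empty } // plug-in backed by the JNI engine

object ShuffleFetcherFactory {
  // Keeping the selection behind one factory is what limits the footprint
  // inside Spark's original files to a handful of call sites.
  def create(props: java.util.Properties): ShuffleFetcher =
    props.getProperty("spark.shuffle.transport", "nio") match {
      case "rdma"  => new RdmaShuffleFetcher
      case "netty" => new NettyShuffleFetcher
      case _       => new NioShuffleFetcher
    }
}
```

This is how the design can keep the existing Spark architecture and interface intact: callers only ever see `ShuffleFetcher`, regardless of transport.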
SEDA-based Data Shuffle Plug-ins

• SEDA - Staged Event-Driven Architecture
• A set of stages connected by queues
• A dedicated thread pool is in charge of processing the events on each queue
   – Listener, Receiver, Handlers, Responders
• Performs admission control on these event queues
• High throughput, by maximally overlapping the different processing stages while maintaining Spark's default task-level parallelism

[Diagram: inside the RDMA-based Shuffle Engine, the Listener accepts connections into a connection pool; the Receiver enqueues incoming requests on the request queue; Handlers dequeue requests, read block data, and enqueue responses; Responders dequeue responses and send them]

A minimal sketch of such a queue-connected stage follows.
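Here is a minimal Scala sketch of one SEDA stage pair, assuming hypothetical `ShuffleRequest`/`ShuffleResponse` event types; bounded queues provide the admission control mentioned above, and the per-stage thread pools are what let stages overlap.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}

// Hypothetical event types for this sketch.
case class ShuffleRequest(blockId: String)
case class ShuffleResponse(blockId: String, data: Array[Byte])

// A generic stage: a dedicated thread pool drains an input queue and
// pushes results onto an output queue, so stages run concurrently.
class Stage[I, O](in: LinkedBlockingQueue[I], out: LinkedBlockingQueue[O],
                  threads: Int)(process: I => O) {
  private val pool = Executors.newFixedThreadPool(threads)
  def start(): Unit = for (_ <- 0 until threads) pool.submit(new Runnable {
    def run(): Unit = while (true) out.put(process(in.take())) // take() blocks until an event arrives
  })
}

object ShuffleEngineSketch {
  // Bounded queues throttle producers when full: admission control.
  val requests  = new LinkedBlockingQueue[ShuffleRequest](1024)
  val responses = new LinkedBlockingQueue[ShuffleResponse](1024)
  def readBlock(id: String): Array[Byte] = new Array[Byte](0) // placeholder for disk/cache read

  def main(args: Array[String]): Unit = {
    // Handler stage: turn each request into a response carrying block data.
    new Stage(requests, responses, threads = 4)(r =>
      ShuffleResponse(r.blockId, readBlock(r.blockId))).start()
    // A Responder stage would drain `responses` and send them over RDMA.
  }
}
```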
RDMA-based Shuffle Engine

• Connection management
   – Alternative designs
      • Pre-connection
         – Hides the overhead in the initialization stage
         – Processes pre-connect to each other before the actual communication
         – Sacrifices more resources to keep all of these connections alive
      • Dynamic connection establishment and sharing
         – A connection is established if and only if an actual data exchange is going to take place
         – A naive dynamic connection design is not optimal, because it allocates resources for every data block transfer
         – An advanced dynamic connection scheme reduces the number of connection establishments
         – Spark uses a multi-threading approach to support multi-task execution in a single JVM → a good chance to share connections!
         – How long should a connection be kept alive for possible re-use?
         – Timeout mechanism for connection destruction (a sketch follows this list)
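A minimal Scala sketch of the dynamic-connection-sharing idea, assuming a hypothetical `RdmaConnection` wrapper over a JNI/verbs connection; the class names, timeout value, and reaping policy are illustrative, not the engine's actual code.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for a JNI wrapper around a verbs connection.
class RdmaConnection(val peer: String) {
  @volatile var lastUsed: Long = System.currentTimeMillis
  def close(): Unit = () // would tear down the native queue pair
}

class ConnectionManager(idleTimeoutMs: Long = 30000L) {
  private val conns = new ConcurrentHashMap[String, RdmaConnection]()

  // Establish a connection only on first use; all tasks (threads) in the
  // same JVM that target the same peer then share that one connection.
  def get(peer: String): RdmaConnection = {
    var c = conns.get(peer)
    if (c == null) {
      val fresh = new RdmaConnection(peer)
      val prev = conns.putIfAbsent(peer, fresh) // another thread may win the race
      c = if (prev != null) prev else fresh
    }
    c.lastUsed = System.currentTimeMillis
    c
  }

  // Called periodically: destroy connections idle longer than the timeout,
  // trading re-connection cost against the memory held by idle connections.
  def reap(): Unit = {
    val now = System.currentTimeMillis
    val it = conns.entrySet.iterator
    while (it.hasNext) {
      val e = it.next()
      if (now - e.getValue.lastUsed > idleTimeoutMs) { e.getValue.close(); it.remove() }
    }
  }
}
```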
RDMA-based Shuffle Engine

• Data transfer
   – Each connection is used by multiple tasks (threads) to transfer data concurrently
   – Packets over the same communication lane must go to different entities on both the server and fetcher sides
   – Alternative designs
      • Sequential transfers of complete blocks over a communication lane → keeps the order
         – Causes long wait times for tasks that are ready to transfer data over the same connection
      • Non-blocking and out-of-order data transfer (sketched below)
         – Chunks data blocks
         – Non-blocking sending over shared connections
         – Out-of-order packet communication
         – Guarantees both performance and ordering
         – Works efficiently with the dynamic connection management and sharing mechanism
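A minimal Scala sketch of the receive side of such a scheme: each chunk carries a request ID and sequence number so interleaved packets from different block transfers sharing one connection can be demultiplexed and re-ordered. The `Chunk` layout and the `totalChunks` field are assumptions of this sketch.

```scala
import java.io.ByteArrayOutputStream
import java.util.concurrent.ConcurrentHashMap

// Assumed chunk header: which transfer (requestId), where in the block
// (seqNo), and how many chunks the block was split into (totalChunks).
case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

class Reassembler {
  // requestId -> (seqNo -> payload); a TreeMap restores sequence order.
  private val pending = new ConcurrentHashMap[Long, java.util.TreeMap[Int, Array[Byte]]]()

  // Called for every arriving chunk, in whatever order the network delivers
  // them; returns the complete block once its final chunk has arrived.
  def onChunk(c: Chunk): Option[Array[Byte]] = {
    var m = pending.get(c.requestId)
    if (m == null) {
      val fresh = new java.util.TreeMap[Int, Array[Byte]]()
      val prev = pending.putIfAbsent(c.requestId, fresh)
      m = if (prev != null) prev else fresh
    }
    m.synchronized {
      m.put(c.seqNo, c.payload)
      if (m.size == c.totalChunks) {       // every chunk of this block is here
        pending.remove(c.requestId)
        val out = new ByteArrayOutputStream()
        val it = m.values.iterator         // iterates in ascending seqNo
        while (it.hasNext) out.write(it.next())
        Some(out.toByteArray)
      } else None
    }
  }
}
```

Because ordering is restored per request rather than per connection, one slow block transfer never stalls the other tasks sharing the same lane.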
RDMA-based Shuffle Engine

• Buffer management
   – On-JVM-heap vs. off-JVM-heap buffer management
   – Off-JVM-heap
      • High performance through native I/O
      • Shadow buffers in Java/Scala
      • Registered for RDMA communication
   – Flexibility for upper-layer design choices
      • Supports the connection sharing mechanism → request ID
      • Supports in-order packet processing → sequence number
      • Supports non-blocking send → buffer flag + callback

A sketch of an off-heap buffer pool follows.
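A minimal Scala sketch of an off-JVM-heap buffer pool, assuming a hypothetical JNI registration call; `allocateDirect` gives memory outside the Java heap that native code can address without extra copies.

```scala
import java.nio.ByteBuffer
import java.util.concurrent.LinkedBlockingQueue

class DirectBufferPool(numBuffers: Int, bufSize: Int) {
  private val free = new LinkedBlockingQueue[ByteBuffer](numBuffers)

  // Hypothetical JNI hook that would register the buffer's memory with the
  // HCA so it can serve directly as an RDMA source/target.
  private def registerWithHca(buf: ByteBuffer): Unit = ()

  for (_ <- 0 until numBuffers) {
    val buf = ByteBuffer.allocateDirect(bufSize) // off-JVM-heap allocation
    registerWithHca(buf)
    free.put(buf)
  }

  def acquire(): ByteBuffer = free.take()        // blocks when the pool is empty
  def release(buf: ByteBuffer): Unit = { buf.clear(); free.put(buf) }
}
```

Pre-registering a fixed pool amortizes the (expensive) RDMA registration cost across many transfers, which is the usual motivation for this design.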
Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Experimental Setup

• Hardware
   – Intel Westmere Cluster (A)
      • Up to 9 nodes
      • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 24 GB main memory
      • Mellanox QDR HCAs (32Gbps) + 10GigE
   – TACC Stampede Cluster (B)
      • Up to 17 nodes
      • Intel Sandy Bridge (E5-2680) dual octa-core processors, running at 2.70 GHz, 32 GB main memory
      • Mellanox FDR HCAs (56Gbps)
• Software
   – Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
   – GroupBy Test
Performance Evaluation on Cluster A

[Charts: GroupBy time for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps); left: 4 worker nodes, 32 cores (32M 32R), 4-10 GB, up to 18% improvement; right: 8 worker nodes, 64 cores (64M 64R), 8-20 GB, up to 20% improvement]

• For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
• For 64 cores, up to 20% improvement over IPoIB (32Gbps) and 34% over 10GigE
Performance Evaluation on Cluster B

[Charts: GroupBy time for IPoIB (56Gbps) vs. RDMA (56Gbps); left: 8 worker nodes, 128 cores (128M 128R), 8-20 GB, up to 83% improvement; right: 16 worker nodes, 256 cores (256M 256R), 16-40 GB, up to 79% improvement]

• For 128 cores, up to 83% improvement over IPoIB (56Gbps)
• For 256 cores, up to 79% improvement over IPoIB (56Gbps)
Performance Analysis on Cluster B

• Java-based micro-benchmark comparison

                     IPoIB (56Gbps)    RDMA (56Gbps)
   Peak bandwidth    1741.46 MBps      5612.55 MBps

• Benefit of the RDMA connection sharing design

[Chart: GroupBy time for RDMA without vs. with connection sharing (56Gbps), 16-40 GB on 16 nodes]

   – By enabling connection sharing, we achieve a 55% performance benefit for the 40 GB data size on 16 nodes
Outline

• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion & Future Work
Conclusion and Future Work

• Three major conclusions
   – RDMA and high-performance interconnects can benefit Spark.
   – The plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.
   – Spark applications can benefit from an RDMA-enhanced design.
• Future work
   – Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
   – Make this design publicly available through the HiBD project
 	
  	
  Thank	
  You!	
  
   {luxi,	
  rahmanmd,	
  islamn,	
  shankard,	
  panda}	
  
                 @cse.ohio-­‐state.edu	
  
                              	
  
 Network-­‐Based	
  Compu7ng	
  Laboratory	
  
   hrp://nowlab.cse.ohio-­‐state.edu/
The	
  High-­‐Performance	
  Big	
  Data	
  Project	
  
    hrp://hibd.cse.ohio-­‐state.edu/

                          HotI 2014