ELF: Efficient Lightweight Fast Stream Processing at Scale

Liting Hu, Karsten Schwan, Hrishikesh Amur, Xin Chen
Georgia Institute of Technology
{foxting,amur,xchen384}@gatech.edu, schwan@cc.gatech.edu

This paper is included in the Proceedings of USENIX ATC ’14: 2014 USENIX Annual Technical Conference, June 19–20, 2014, Philadelphia, PA. ISBN 978-1-931971-10-2. Open access to the Proceedings of USENIX ATC ’14 is sponsored by USENIX.
https://www.usenix.org/conference/atc14/technical-sessions/presentation/hu

Abstract

Stream processing has become a key means for gaining rapid insights from webserver-captured data. Challenges include how to scale to numerous, concurrently running streaming jobs, to coordinate across those jobs to share insights, to make online changes to job functions to adapt to new requirements or data characteristics, and, for each job, to efficiently operate over different time windows.

The ELF stream processing system addresses these new challenges. Implemented over a set of agents enriching the web tier of datacenter systems, ELF obtains scalability by using a decentralized "many masters" architecture where, for each job, live data is extracted directly from webservers and placed into memory-efficient compressed buffer trees (CBTs) for local parsing and temporary storage, followed by subsequent aggregation using shared reducer trees (SRTs) mapped to sets of worker processes. Job masters at the roots of SRTs can dynamically customize worker actions, obtain aggregated results for end-user delivery, and/or coordinate with other jobs.

An ELF prototype implemented and evaluated for a larger-scale configuration demonstrates scalability, high per-node throughput, sub-second job latency, and a sub-second ability to adjust the actions of jobs being run.

1 Introduction

Stream processing of live data is widely used for applications that include generating business-critical decisions from marketing streams, identifying spam campaigns in social networks, performing datacenter intrusion detection, etc. Such diversity engenders differences in how streaming jobs must be run, requiring synchronous batch processing, asynchronous stream processing, or a combination of both. Further, jobs may need to dynamically adjust their behavior to new data content and/or new user needs, and to coordinate with other concurrently running jobs to share insights. Figure 1 exemplifies these requirements, where job inputs are user activity logs, e.g., clicks, likes, and buys, continuously generated from, say, the Video Games directory of an e-commerce company.

Figure 1: Examples of diverse concurrent applications.

In this figure, the micro-promotion application extracts user clicks per product for the past 300 s and lists the top-k products that have the most clicks. It can then dispatch coupons to those "popular" products so as to increase sales. A suitable model for this job is synchronous batch processing, as all log data for the past 300 s has to be on hand before grouping clicks per product and calculating the top-k products being viewed.

For the same set of inputs, a concurrent job performs product-bundling, by extracting user likes and buys from logs and then creating 'edges' and 'vertices' linking those video games that are typically bought together. One purpose is to provide online recommendations for other users. For this job, since user activity logs are generated in realtime, we will want to update these connected components whenever possible, by fitting them into a graph and iteratively updating the graph over some sliding time window. For this usage, an asynchronous stream processing model is preferred, to provide low-latency updates.

The third, sale-prediction job states a product name, e.g., Angry Birds, which is joined with the product-bundling application to find out what products are similar to Angry Birds (indicated by the 'typically bought together' set). The result is then joined with the micro-promotion application to determine whether Angry Birds and its peers are currently "popular". This final result can be used to predict the likely market success of launching a new product like Angry Birds, and obtaining it requires interacting with the first and second applications.

Finally, all of these applications will run for some considerable amount of time, possibly for days. This makes it natural for the application creator to wish to update job functions or parameters during ongoing runs, e.g., to change the batching intervals to adjust to high vs. low traffic periods, to flush sliding windows to 'reset' job results after some abnormal period (e.g., a flash mob), etc.
Distributed streaming systems are challenged by the requirements articulated above. First, concerning flexibility, existing systems typically employ some fixed execution model: e.g., Spark Streaming [28] and others [11, 12, 17, 22] treat streaming computations as a series of batch computations, whereas Storm [4] and others [7, 21, 24] structure streaming computations as a dataflow graph whose vertices asynchronously process incoming records. Further, these systems are not designed to be naturally composable, so as to simultaneously provide both of their execution models, and they do not offer functionality to coordinate their execution. As a result, applications desiring the combined use of their execution models must use multiple platforms governed via external controls, at the expense of simplicity.

A second challenge is scaling with job diversity. Many existing systems inherit MapReduce's "single master/many workers" infrastructure, where the centralized master is responsible for all scheduling activities. How can such systems scale to hundreds of parallel jobs, particularly jobs with diverse execution logic (e.g., pipelined or cyclic dataflows), different batch sizes, differently sized sliding windows, etc.? A single master governing different per-job scheduling methods and carrying out cross-job coordination would be complex, creating a potential bottleneck.

A third challenge is obtaining high performance for incremental updates. This is difficult for most current streaming systems, as they use an in-memory, hashtable-like data structure to store and aggressively integrate past states with new state, incurring substantial memory consumption and limited throughput when operating across large-sized history windows.

The ELF (Efficient, Lightweight, Flexible) stream processing system presented in this paper implements novel functionality to meet the challenges listed above within a single framework: to efficiently run hundreds of concurrent and potentially interacting applications, with diverse per-application execution models, at levels of performance equal to those of less flexible systems.

As shown in Figure 2, each ELF node resides in a webserver. Logically, these nodes are structured as a million-node overlay built with the Pastry DHT [25], where each ELF application has its own respective set of master and worker processes mapped to ELF nodes, self-constructed as a shared reducer tree (SRT) for data aggregation. The system operates as follows: (1) each line of logs received from a webserver is parsed into a key-value pair and continuously inserted into ELF's local in-memory compressed buffer tree (CBT [9]) for pre-reducing; (2) the distributed key-value pairs from CBTs "roll up" along the SRT, which progressively reduces them until they reach the root to output the final result. ELF's operation, therefore, entirely bypasses the storage-centric data path, so as to rapidly process live data. Intuitively, with a DHT, the masters of different applications are mapped to different nodes, thus offering scalability by avoiding the potential bottleneck created by many masters running on the same node.

Figure 2: Dataflow of ELF vs. a typical realtime web log analysis system, composed of Flume, HBase, HDFS, Hadoop MapReduce, and Spark/Storm.

ELF is evaluated experimentally over 1000 logical webservers running on 50 server-class machines, using both batched and continuously streaming workloads. For batched workloads, ELF can process millions of records per second, outperforming general batch processing systems. For a realistic social networking application, ELF can respond to queries with latencies of tens of milliseconds, equaling the performance of state-of-the-art asynchronous streaming systems. New functionality offered by ELF is its ability to dynamically change job functions at sub-second latency, while running hundreds of jobs subject to runtime coordination.

This paper makes the following technical contributions:

  1. A decentralized 'many masters' architecture assigning each application its own master capable of individually controlling its workers. To the best of our knowledge, ELF is the first to use a decentralized architecture for scalable stream processing (Sec. 2).
  2. A memory-efficient approach for incremental updates, using a B-tree-like in-memory data structure to store and manage large stored states (Sec. 2.2).
  3. Abstractions permitting cyclic dataflows via feedback loops, with additional uses of these abstractions including the ability to rapidly and dynamically change job behavior (Sec. 2.3).
  4. Support for cross-job coordination, enabling interactive processing that can utilize and combine multiple jobs' intermediate results (Sec. 2.3).
  5. An open-source implementation of ELF and a comprehensive evaluation of its performance and functionality on a large cluster using real-world webserver logs (Sec. 3, Sec. 4).
2 Design

This section describes ELF's basic workflow, introduces ELF's components, shows how applications are implemented with its APIs, and explains the performance, scalability, and flexibility benefits of using ELF.

2.1 Overview

As shown in Figure 3, the ELF streaming system runs across a set of agents structured as an overlay built using the Pastry DHT. There are three basic components. First, on each webserver producing data required by the application, there is an agent (see Figure 3, bottom) locally parsing live data logs into application-specific key-value pairs. For example, for the micro-promotion application, batches of logs are parsed as a map from product name (key) to the number of clicks (value), and groupby-aggregated like ⟨a, 1⟩, ⟨b, 1⟩, ⟨a, 1⟩, ⟨b, 1⟩, ⟨b, 1⟩ → ⟨a, 2⟩, ⟨b, 3⟩, labelled with an integer-valued timestamp for each batch.

Figure 3: High-level overview of the ELF system.

The second component is the middle-level, application-specific group (Figure 3, middle), composed of a master and a set of agents as workers that jointly implement (1) the data plane: a scalable aggregation tree that progressively 'rolls up' and reduces those local key-value pairs from distributed agents within the group, e.g., ⟨a, 2⟩, ⟨b, 3⟩, ..., ⟨a, 5⟩, ..., ⟨b, 2⟩, ⟨c, 2⟩, ... from tree leaves are reduced to ⟨a, 7⟩, ⟨b, 5⟩, ⟨c, 2⟩, ... at the root; and (2) the control plane: a scalable multicast tree used by the master to control the application's execution, e.g., when necessary, the master can multicast to its workers within the group, to notify them to empty their sliding windows and/or synchronously start a new batch. Further, different applications' masters can exchange queries and results using the DHT's routing substrate, so that given any application's name as a key, queries or results can be efficiently routed to that application's master (within O(logN) hops), without the need for coordination via some global entity. The resulting model supports the concurrent execution of diverse applications and flexible coordination between those applications.
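For concreteness, the following is a minimal, illustrative sketch of these two reduction steps; the class and method names are ours and do not appear in the ELF implementation. An agent groupby-aggregates its local clicks per product for one epoch, and an internal tree vertex merges the partial maps received from its children.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: per-epoch click counts keyed by product name.
class EpochCounts {
    final long epoch;                         // integer-valued timestamp of the batch
    final Map<String, Long> counts = new HashMap<>();

    EpochCounts(long epoch) { this.epoch = epoch; }

    // Local pre-reduce on an agent: (a,1),(b,1),(a,1),(b,1),(b,1) -> (a,2),(b,3)
    void add(String product, long clicks) {
        counts.merge(product, clicks, Long::sum);
    }

    // Roll-up at an internal SRT vertex: merge the children's partial results,
    // e.g., (a,2),(b,3) + (a,5) + (b,2),(c,2) -> (a,7),(b,5),(c,2)
    static EpochCounts reduce(long epoch, List<EpochCounts> children) {
        EpochCounts merged = new EpochCounts(epoch);
        for (EpochCounts child : children)
            child.counts.forEach(merged::add);
        return merged;
    }
}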
The third component is the high-level ELF programming API (Figure 3, top) exposed to programmers for implementing a variety of diverse, complex jobs, e.g., streaming analysis, batch analysis, and interactive queries. We next describe these components in more detail.

2.2 CBT-based Agent

Existing streaming systems like Spark [28] and Storm [4] typically consume data from distributed storage like HDFS or HBase, incurring cross-machine data movement. This means that data might be somewhat stale by the time it arrives at the streaming system. Further, for most real-world jobs, the 'map' tasks could be 'pre-reduced' locally on the webservers with maximum parallelism, so that only the intermediate results need to be transmitted over the network for data shuffling, thus decreasing processing latency and much of the unnecessary bandwidth overhead.

ELF adopts an 'in-situ' approach to data access in which incoming data is injected into the streaming system directly from its sources. ELF agents residing in each webserver consume live web logs to produce succinct key-value pairs, where a typical log event is a 4-tuple of ⟨timestamp, src_ip, priority, body⟩: the timestamp is used to divide input event streams into batches of different epochs, and the body is the log entry body, formatted as a map from a string attribute name (key) to an arbitrary array of bytes (value).

Each agent exposes a simple HiveQL-like [26] query interface with which an application can define how to filter and parse live web logs. Figure 4 shows how the micro-promotion application uses ELF QL to define the top-k function, which calculates the top-10 popular products that have the most clicks at each epoch (30 s), in the Video Games directory of the e-commerce site.

Example log event:
  {"created_at":"23:48:22 +0000 2013",
   "id":299665941824950273,
   "product":"Angry Birds Season",
   "clicks_count":2,
   "buys_count":0,
   "user":{"id":343284040, "name":"@Curry", "location":"Ohio", ...} ...}

ELF QL:
  SELECT product, SUM(clicks_count)
  FROM *
  WHERE store == `video_games'
  GROUP BY product
  SORT BY SUM(clicks_count) DESC
  LIMIT 10
  WINDOWING 30 SECONDS;

Figure 4: Example of ELF QL query.

Each ELF agent is designed to be capable of holding a considerable number of 'past' key-value pairs, by storing such data in compressed form, using a space-efficient, in-memory data structure termed a compressed buffer tree (CBT) [9]. Its in-memory design uses an (a, b)-tree with each internal node augmented by a memory buffer. Inserts and deletes are not immediately performed, but buffered in successive levels in the tree, allowing better I/O performance.
As shown in Figure 5(a), first, each newly parsed key-value pair is represented as a partial aggregation object (PAO). Second, the PAO is serialized, and the tuple ⟨hash, size, serializedPAO⟩ is appended to the root node's buffer, where hash is a hash of the key and size is the size of the serialized PAO. Unlike a binary search tree, in which inserting a value requires traversing the tree and then performing the insert at the right place, the CBT simply defers the insert. Then, when a buffer reaches some threshold (e.g., half its capacity), it is flushed into the buffers of the nodes in the next level. To flush the buffer, the system sorts the tuples by hash value, decompresses the buffers of the nodes in the next level, and partitions the tuples into those receiving buffers based on the hash value. Such an insertion behaves like a B-tree: a full leaf is split into a new leaf, and the new leaf is added to the parent of the leaf. More detail about the CBT and its properties appears in [9].

Figure 5: Between intervals, new key-value pairs are inserted into the CBT; the root buffer is sorted and aggregated; the buffer is then split into fragments according to the hash ranges of the children, and each fragment is compressed and copied into the respective child node; at each interval, the CBT is flushed.
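The deferred-insert idea can be sketched as follows. This is a heavily simplified, illustrative structure, with two levels, no compression, and no PAO aggregation, and is not the actual CBT of [9]: inserts are appended to a root buffer and are only sorted and partitioned into the children's buffers once the root buffer passes a threshold.

import java.util.ArrayList;
import java.util.List;

// Simplified two-level buffer tree: inserts are appended to the root buffer
// and pushed down in bulk, rather than being aggregated one at a time.
class MiniBufferTree {
    static class Entry {
        final int hash;
        final byte[] serializedPao;
        Entry(int hash, byte[] pao) { this.hash = hash; this.serializedPao = pao; }
    }

    private final List<Entry> rootBuffer = new ArrayList<>();
    private final List<List<Entry>> childBuffers;   // one buffer per child
    private final int threshold;

    MiniBufferTree(int fanout, int threshold) {
        this.threshold = threshold;
        childBuffers = new ArrayList<>();
        for (int i = 0; i < fanout; i++) childBuffers.add(new ArrayList<>());
    }

    // "insert": defer the work, just append <hash, serializedPAO> to the root buffer.
    void insert(String key, byte[] serializedPao) {
        rootBuffer.add(new Entry(key.hashCode(), serializedPao));
        if (rootBuffer.size() >= threshold) spill();
    }

    // Spill: sort by hash, then partition the entries among the children's buffers.
    // (The real CBT also decompresses/recompresses child buffers and keeps
    // splitting nodes, B-tree style, as buffers fill.)
    private void spill() {
        rootBuffer.sort((a, b) -> Integer.compare(a.hash, b.hash));
        int fanout = childBuffers.size();
        for (Entry e : rootBuffer)
            childBuffers.get(Math.floorMod(e.hash, fanout)).add(e);
        rootBuffer.clear();
    }
}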
Key for ELF is that the CBT makes possible rapid incremental updates over considerable time windows, i.e., extensive sets of historical records. Toward that end, the following APIs are exposed for controlling the CBT: (i) "insert" to fill, (ii) "flush" to fetch the tree's entire groupby-aggregate results, and (iii) "empty" to empty the tree's buffers, which is necessary when the application wants to start a new batch. By using a series of "insert", "flush", and "empty" operations, ELF can implement many of the standard operations in streaming systems, such as sliding windows, incremental processing, and synchronous batching.

For example, as shown in Figure 5(b), let the interval be 5 s; a sale-prediction application tracks the up-to-date #clicks for the product Angry Birds by inserting new key-value pairs and periodically flushing the CBT. The application obtains the local agent's results for the intervals [0,5), [0,10), [0,15), etc. as PAO_5, PAO_10, PAO_15, etc. If the application needs a windowing value for [5,15), rather than repeatedly adding the counts in [5,10) with multiple operations, it can simply perform the single operation PAO_15 ⊖ PAO_5, where ⊖ is an "invertible reduce". In another example, using synchronous batching, an application can start a new batch by erasing past records, e.g., to track the promotion effect when a new advertisement is launched. In this case, all agents' CBTs coordinate to perform a simultaneous "empty" operation via a multicast protocol from the middle level's DHT, as described in more detail in Sec. 2.3.
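The windowing arithmetic above can be sketched with cumulative count snapshots; the class and method names are ours, and a real PAO would carry the full groupby state rather than a single counter.

import java.util.HashMap;
import java.util.Map;

// Cumulative per-interval snapshots: PAO_5 covers [0,5), PAO_10 covers [0,10), ...
// For counts, the "invertible reduce" is simply a subtraction of two snapshots.
class ClickSnapshots {
    private final Map<Long, Long> cumulativeClicks = new HashMap<>();

    // Record the flushed cumulative count at the end of an interval.
    void flushed(long intervalEnd, long totalClicks) {
        cumulativeClicks.put(intervalEnd, totalClicks);
    }

    // Sliding window [from, to): PAO_to "minus" PAO_from, one operation, no rescan.
    long window(long from, long to) {
        return cumulativeClicks.getOrDefault(to, 0L)
             - cumulativeClicks.getOrDefault(from, 0L);
    }
}

// Using the numbers of Figure 5(b): flushed(5,100); flushed(10,150); flushed(15,250);
// then window(5,15) == 150 and window(5,10) == 50.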
Why CBTs? Our concern is performance. Consider using an in-memory binary search tree to maintain key-value pairs as the application's state, without buffering and compression. In this case, inserting an element into the tree requires traversing the tree and performing the insert — a read and a write operation per update, leading to poor performance. It is not necessary, however, to aggregate each new element in such an aggressive fashion: integration can occur lazily. Consider, for instance, an application that determines the top-10 most popular items, updated every 30 s, by monitoring streams of data from some e-commerce site. The incoming rate can be as high as millions of events per second, but the CBTs need only be flushed every 30 s to obtain the up-to-date top-10 items. The key to efficiency lies in the fact that a "flush" is performed in relatively large chunks, amortizing its cost across a series of small "insert new data" operations: decompression of the buffer is deferred until enough inserts have been batched in the buffer, thus enhancing throughput.

2.3 DHT-based SRT

ELF's agents are analogous to stateful vertices in dataflow systems, constructed into a directed graph in which data passes along directed edges. Using the terms vertex and agent interchangeably, this subsection describes how we leverage DHTs to construct these dataflow graphs, thus obtaining unique benefits in flexibility and scalability. To restate our goal, we seek a design that meets the following criteria:

  1. the capability to host hundreds of concurrently running applications' dataflow graphs;
  2. with each dataflow graph potentially including millions of vertices, as our vertices reside in distributed webservers; and
  3. where each dataflow graph can flexibly interact with others for runtime coordination of its execution.

ELF leverages DHTs to create a novel 'many masters' decentralized infrastructure. As shown in Figure 6, all agents are structured into a P2P overlay with DHT-based routing.
Figure 6: Shared Reducer Tree construction for many jobs.

Each agent has a unique, 128-bit nodeId in a circular nodeId space ranging from 0 to 2^128 − 1. The set of nodeIds is uniformly distributed; this is achieved by basing the nodeId on a secure hash (SHA-1) of the node's IP address. Given a message and a key, it is guaranteed that the message is reliably routed to the node with the nodeId numerically closest to that key, within ⌈log_{2^b} N⌉ hops, where b is a base with a typical value of 4. SRTs for many applications (jobs) are constructed as follows.

The first step is to construct application-based groups of agents and to ensure that these groups are well balanced over the network. For each job's group, this is done as depicted in Figure 6 (left): each agent parsing the application's stream routes a JOIN message using the appId as the key. The appId is the hash of the application's textual name concatenated with its creator's name. The hash is computed using the same collision-resistant SHA-1 hash function, ensuring a uniform distribution of appIds. Since all agents belonging to the same application use the same key, their JOIN messages will eventually arrive at a rendezvous node with a nodeId numerically close to the appId. The rendezvous node is set as the job's master. The unions of all the messages' paths are registered to construct the group, in which each internal node, as a forwarder, maintains a children table for the group containing an entry (IP address and appId) for each child. Note that the uniformly distributed appIds ensure an even distribution of groups across all agents.
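As an illustration of this first step, the sketch below derives an appId and routes a JOIN toward it, assuming a Pastry-like overlay with a route(key, message) primitive; the interfaces and names here are hypothetical, not ELF's API.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Assumed Pastry-like routing primitive: deliver msg to the live node whose
// nodeId is numerically closest to key, within O(log N) hops.
interface Overlay {
    void route(BigInteger key, Object msg);
}

class JoinMessage {
    final String appName, creator;
    JoinMessage(String appName, String creator) { this.appName = appName; this.creator = creator; }
}

class SrtJoin {
    // appId = SHA-1(application name + creator name), truncated to 128 bits so
    // that it falls in the same circular id space as the nodeIds.
    static BigInteger appId(String appName, String creator) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest((appName + creator).getBytes(StandardCharsets.UTF_8));
            byte[] id = new byte[16];                 // keep the first 128 bits
            System.arraycopy(digest, 0, id, 0, 16);
            return new BigInteger(1, id);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    // Every agent holding the application's stream routes a JOIN toward appId.
    // All JOINs converge on the same rendezvous node, which becomes the job's
    // master; forwarders on the paths record their children, forming the SRT.
    static void join(Overlay overlay, String appName, String creator) {
        overlay.route(appId(appName, creator), new JoinMessage(appName, creator));
    }
}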
The second step is to "draw" a directed graph within each group to guide the dataflow computation. Like other streaming systems, an application specifies its dataflow graph as a logical graph of stages linked by connectors. Each connector may simply transfer data to the next stage, e.g., a filter function, or shuffle the data using a partitioning function between stages, e.g., a reduce function. In this fashion, one can construct the pipeline structures used in most stream processing systems, but by using feedback, we can also create nested cycles in which a new epoch's input is based on the last epoch's feedback result, explained in more detail next.

Pipeline structures. We build aggregation trees using DHTs for pipeline dataflows, in which each level of the tree progressively 'aggregates' the data until the result arrives at the root. For a non-partitioning function, the agent as a vertex simply processes the data stream locally using the CBT. For a partitioning function like TopKCount, in which the key-value pairs are shuffled and gradually truncated, we build a single aggregation tree, e.g., Figure 6 (middle) shows how the groupby, aggregate, and sort functions are applied at each level-i subtree's root for the micro-promotion job. For partitioning functions like WordCount, we build m aggregation trees to divide the keys into m ranges, where each tree is responsible for the reduce function of one range, thus avoiding overload at the root when aggregating a large key space. Figure 6 (right) shows how the 'fat-tree'-like aggregation tree is built for the product-bundling job.
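For the m-tree case, one plausible arrangement, assumed here rather than specified by the paper, is to derive one rendezvous id per range from the application's name, reusing the hypothetical SrtJoin.appId helper from the sketch above, and to assign each key to a tree by hashing.

import java.math.BigInteger;

// For partitioning functions over a large key space (e.g., WordCount), m
// aggregation trees each reduce one key range, avoiding a single overloaded root.
class RangePartitioner {
    private final int m;
    private final BigInteger[] treeRootIds;

    RangePartitioner(String appName, String creator, int m) {
        this.m = m;
        treeRootIds = new BigInteger[m];
        // One hypothetical rendezvous id per range, derived from the app's name.
        for (int i = 0; i < m; i++)
            treeRootIds[i] = SrtJoin.appId(appName + "#range" + i, creator);
    }

    // The tree responsible for this key's reduce function.
    BigInteger rootFor(String key) {
        return treeRootIds[Math.floorMod(key.hashCode(), m)];
    }
}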
Cycles. Naiad [20] uses timestamp vectors to realize dataflow cycles, whereas ELF employs multicast services operating on a job's aggregation tree to create feedback loops in which the results obtained for a job's last epoch are re-injected into its sources. Each job's master has complete control over the contents of feedback messages and how often they are sent. Feedback messages, therefore, can be used to go beyond supporting cyclic jobs, to also exert application-specific controls, e.g., to set a new threshold, synchronize a new batch, install new job functionality for agents to use, etc.

Why SRTs? The use of DHTs affords the efficient construction of aggregation trees and multicast services, as their converging properties guarantee that aggregation or multicast is fulfilled within only O(logN) hops. Further, a single overlay can support many different independent groups, so that the overheads of maintaining a proximity-aware overlay network can be amortized over all of those groups' spanning trees. Finally, because all of these trees share the same set of underlying agents, each agent can be an input leaf, an internal node, the root, or any combination of the above, keeping the computation load well balanced. This is why we term these structures "shared reducer trees" (SRTs).

Implementing feedback loops using DHT-based multicast benefits load and bandwidth usage: each message is replicated at the internal nodes of the tree, at each level,
so that only m copies are sent to each internal node's m children, rather than having the tree root broadcast N copies to N total nodes. Similarly, coordination across jobs via the DHT's routing methods is entirely decentralized, benefiting scalability and flexibility, the latter because concurrent ELF jobs use event-based methods to remain responsive to other jobs and/or to user interaction.

2.4 ELF API

Table 1 and Table 2 show ELF's data plane and control plane APIs, respectively. The data plane APIs concern data processing within a single application. The control plane APIs are for coordination between different applications.

Subscribe(Id appid)
  Vertex sends a JOIN message to construct the SRT whose root's nodeId equals appid.
OnTimer()
  Callback. Invoked periodically. This handler has no return value. The master uses it for its periodic activities.
SendTo(Id nodeid, PAO paos)
  Vertex sends the key-value pairs to the parent vertex with nodeid, resulting in a corresponding invocation of OnRecv.
OnRecv(Id nodeid, PAO paos)
  Callback. Invoked when a vertex receives serialized key-value pairs from the child vertex with nodeid.
Multicast(Id appid, Message message)
  The application's master publishes control messages to the vertices, e.g., to synchronize the emptying of CBTs; publishes the last epoch's result, encapsulated into a message, to all vertices for iterative loops; or publishes new functions, encapsulated into a message, to all vertices for updating job functions.
OnMulticast(Id appid, Message message)
  Callback. Invoked when a vertex receives a multicast message from the application's master.

Table 1: Data plane API

Route(Id appid, Message message)
  A vertex or master sends a message to another application. The appid is the hash value of the target application's name concatenated with its creator's name.
Deliver(Id appid, Message message)
  Callback. Invoked when the application's master receives an outside message from another application with appid. This outside message is usually a query for the application's status, such as its results.

Table 2: Control plane API

ArrayList<String> topk;

void OnTimer() {
  if (this.isRoot()) {
    this.Multicast(hash("micro-promotion"), new topk(topk));
    this.Multicast(hash("micro-promotion"), new update());
  }
}

void OnMulticast(Id appid, Message message) {
  // a topk message: show the extra discount for any listed product hosted here
  if (message instanceof topk) {
    for (String product : ((topk) message).topk) {
      if (this.hasProduct(product)) {
        // ... display the discount on the product's web page
      }
    }
  // an update message: start a new batch
  } else if (message instanceof update) {
    // leaves flush their CBT and send the result up to the parent vertex
    if (!this.containsChild(appid)) {
      PAO paos = cbt.get(appid).flush();
      this.SendTo(this.getParent(appid), paos);
      cbt.get(appid).empty();
    }
  }
}

Figure 7: ELF implementation of the micro-promotion application.

A sample use shown in Figure 7 contains partial code for the micro-promotion application. It multicasts update messages periodically to empty the agents' CBTs for synchronous batch processing. It also multicasts top-k results periodically to the agents. Upon receiving the results, each agent checks whether it hosts a top-k product, and if so, an extra discount appears on that product's web page. To implement the product-bundling application, the agents subscribe to multiple SRTs to separately aggregate key-value pairs, and the agents' associated CBTs are only flushed (without being synchronously emptied), sending a sliding-window value to the parent vertices for asynchronous processing. To implement the sale-prediction application, the master encapsulates its query and routes it to the other two applications to get their intermediate results using Route.
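As an illustration of such cross-job coordination with the control plane API of Table 2, the sketch below shows how a sale-prediction master might query its two peer jobs; the message classes (ResultQuery, ResultReply) and the combine() helper are hypothetical, not code from the paper.

// Sale-prediction master: periodically query the two peer jobs (control plane).
void OnTimer() {
  if (this.isRoot()) {
    // appid is the hash of the target application's name plus its creator's name
    this.Route(hash("product-bundling"), new ResultQuery("Angry Birds"));
    this.Route(hash("micro-promotion"), new ResultQuery("top-k"));
  }
}

// Replies arrive through Deliver; the peer masters answer queries in their own
// Deliver callbacks in the same way, via Route.
void Deliver(Id appid, Message message) {
  if (message instanceof ResultReply) {
    this.combine(appid, ((ResultReply) message).result);   // join the two results
  }
}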
3 Implementation

This section describes ELF's architecture and prototype implementation, including its methods for dealing with faults and with stragglers.

3.1 System Architecture

Figure 8: Components of ELF.

Figure 8 shows ELF's architecture. We see that, unlike other streaming systems with static assignments of nodes to act as masters vs. workers, all ELF agents are treated equally. They are structured into a P2P overlay, in which each agent has a unique nodeId in the same flat, circular 128-bit node identifier space. After an application is launched, the agents that have the target streams required by the application are automatically assembled into a shared reducer tree (SRT) via their routing substrates. It is only at that point that ELF assigns one or more of the following roles to each participating agent:

Job master is the SRT's root, which tracks its own job's execution and coordinates with other jobs' masters. It has four components:
  • Job builder constructs the SRT to roll up and aggregate the distributed PAO snapshots processed by local CBTs.
  • Job tracker detects key-value errors, recovers from faults, and mitigates stragglers.
  • Job feedback is continuously sent to agents for iterative loops, including the last epoch's results to be iterated over, new job functions for agents to be updated on-the-fly, application-specific control messages like 'new discount', etc.
  • Job coordinator dynamically interacts with other jobs to carry out interactive queries.

Job worker uses a local CBT to implement some application-specific execution model, e.g., asynchronous stream processing with a sliding window, synchronous incremental batch processing with historical records, etc. For consistency, job workers are synchronized by the job master to 'roll up' the intermediate results to the SRT for global aggregation. Each worker has five components:
  • Input receiver observes streams. Its current implementation assumes logs are collected with Flume [1], so it employs an interceptor to copy stream events and then parses each event into a job-specified key-value pair. A typical Flume event is a tuple with a timestamp, source IP, and event body that can be split into columns based on different key-based attributes.
  • Job parser converts a job's SQL description into a workflow of operator functions f, e.g., aggregations, grouping, and filters.
  • PAOs execution: each key-value pair is represented as a partial aggregation object (PAO) [9]. New PAOs are inserted into and accumulated in the CBT. When the CBT is "flushed", new and past PAOs are aggregated and returned, e.g., ⟨argu, 2, f:count()⟩ merges with ⟨argu, 5, f:count()⟩ to become the PAO ⟨argu, 7, f:count()⟩ (see the sketch after this list).
  • CBT resides in the local agent's memory, but can be externalized to SSD or disk, if desired.
  • SRT proxy is analogous to a socket, used to join the P2P overlay and link with other SRT proxies to construct each job's SRT.
    tamp, source IP, and event body that can be split into
                                                                      3.3       Fault Recovery
    columns based on different key-based attributes.
  • Job parser converts a job’s SQL description into a                E LF handles transmission faults and agent failures, tran-
    workflow of operator functions f, e.g., aggregations,             sient or permanent. For the former, the current implemen-
    grouping, and filters.                                            tation uses a simple XOR protocol to detect the integrity
  • PAOs execution: each key-value pair is represented                of records transferred between each source and destina-
    as a partial aggregation object (PAO) [9]. New PAOs               tion agent. Upon an XOR error, records will be resent.
    are inserted into and accumulated in the CBT. When                We deal with agent failures by leveraging CBTs and the
    the CBT is “flushed”, new and past PAOs are ag-                   robust nature of the P2P overlay. Upon an agent’s failure,
    gregated and returned, e.g., argu, 2, f : count()               the dataset cached in the agent’s CBT is re-issued, the
    merges with argu, 5, f : count() to be a PAO                    SRT is re-constructed, and all PAOs are recomputed using
    agru, 7, f : count().                                           a new SRT from the last checkpoint.
  • CBT resides in local agent’s memory, but can be                      In an ongoing implementation, we also use hot repli-
    externalized to SSD or disk, if desired.                          cation to support live recovery. Here, each agent in
  • SRT proxy is analogous to a socket, to join the P2P               the overlay maintains a routing table, a neighborhood
    overlay and link with other SRT proxies to construct              set, and a leaf set. The neighborhood set contains the
    each job’s SRT.                                                   nodeIds hashed from the webservers’ IP addresses that
A key difference from other streaming systems is that ELF seeks to obtain scalability by changing the system architecture from 1 : n to m : n, where each job has its own master and appropriate set of workers, all of which are mapped to a shared set of agents. With many jobs, therefore, an agent acts as one job's master and another job's worker, or any combination thereof. Further, using DHTs, jobs' reducing paths are constructed with few overlaps, resulting in ELF's management being fully decentralized and load balanced. The outcome is straightforward scaling to large numbers of concurrently running jobs, with each master controlling its own job's execution, including reacting to failures, mitigating stragglers, altering a job as it is running, and coordinating with other jobs at runtime.

3.2   Consistency

The consistency of states across nodes is an issue in streaming systems that eagerly process incoming records. For instance, in a system counting page views from male users on one node and females on another, if one of the nodes is backlogged, the ratio of their counts will be wrong [28]. Some systems, like Borealis [6], synchronize nodes to avoid this problem, while others, like Storm [4], ignore it.

ELF's consistency semantics are straightforward, leveraging the fact that each CBT's intermediate results (PAOs snapshots) are uniquely named for different timestamped intervals. Like a software combining tree barrier, each leaf uploads the first interval's snapshot to its parent. If the parent discovers that it is the last one in its direct list of children to do so, it continues up the tree by aggregating the first interval's snapshots from all branches, else it blocks. Figure 9 shows an example in which agent1 loses snapshot0, and thus blocks agent5 and agent7. Proceeding in this fashion, a late-coming snapshot eventually blocks the entire upstream path to the root. All snapshots from distributed CBTs are thus sequentially aggregated.
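As a rough illustration of this combining-tree behavior, the sketch below shows how a parent agent might buffer per-child snapshots and aggregate an interval only once every child has reported it, so that a missing snapshot blocks that interval. The class and method names, and the use of plain counts as snapshot contents, are assumptions made for illustration rather than ELF's implementation.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: a parent in the SRT aggregates interval i only after
    // all of its children have uploaded their snapshot for interval i, mirroring
    // a software combining tree barrier. A straggling child therefore blocks
    // that interval and, transitively, the path toward the root.
    final class ParentAggregator {
        private final int numChildren;
        // interval -> (childId -> partial value reported for that interval)
        private final Map<Integer, Map<Integer, Long>> pending = new HashMap<>();

        ParentAggregator(int numChildren) { this.numChildren = numChildren; }

        // Called when a child's snapshot arrives; returns the aggregate once the
        // interval is complete, or null while the parent must keep blocking.
        Long onSnapshot(int interval, int childId, long partial) {
            Map<Integer, Long> parts =
                pending.computeIfAbsent(interval, k -> new HashMap<>());
            parts.put(childId, partial);
            if (parts.size() < numChildren) return null;        // still blocked
            long sum = parts.values().stream().mapToLong(Long::longValue).sum();
            pending.remove(interval);                            // interval done
            return sum;                                          // forward upward
        }

        public static void main(String[] args) {
            ParentAggregator parent = new ParentAggregator(2);
            System.out.println(parent.onSnapshot(0, 1, 10));  // null: still waiting
            System.out.println(parent.onSnapshot(0, 2, 32));  // 42: interval 0 done
        }
    }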
3.3   Fault Recovery

ELF handles transmission faults and agent failures, transient or permanent. For the former, the current implementation uses a simple XOR protocol to detect the integrity of records transferred between each source and destination agent. Upon an XOR error, records will be resent. We deal with agent failures by leveraging CBTs and the robust nature of the P2P overlay. Upon an agent's failure, the dataset cached in the agent's CBT is re-issued, the SRT is re-constructed, and all PAOs are recomputed using a new SRT from the last checkpoint.

In an ongoing implementation, we also use hot replication to support live recovery. Here, each agent in the overlay maintains a routing table, a neighborhood set, and a leaf set. The neighborhood set contains the nodeIds hashed from the webservers' IP addresses that are closest to the local agent. The job master periodically checkpoints each agent's snapshots in the CBT, by asynchronously replicating them to the agent's neighbors.

Figure 9: Example of the leaping straggler approach. Agent1 notifies all members to discard snapshot0.
E LF’s approach to failure handling is similar to that             • What is the overheads of E LF in terms of CPU, mem-
of Spark and other streaming systems, but has potential                 ory, and network load (Sec.4.3)?
advantages in data locality because the neighborhood
set maintains geographically close (i.e., within the rack)          4.1    Testbed and Application Scenarios
agents, which in turn can reduce synchronization over-              Experiments are conducted on a testbed of 1280 agents
heads and speed up the rebuild process, particularly in             hosted by 60 server blades running Linux 2.6.32, all con-
datacenter networks with limited bi-section bandwidths.             nected via Gigabit Ethernet. Each server has 12 cores
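The XOR protocol mentioned in Section 3.3 is not detailed further; the snippet below gives one plausible, simplified reading in which the sender attaches a running XOR over the record bytes and the receiver recomputes it, requesting a resend on mismatch. The framing and names are assumptions, not ELF's wire format.

    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch of an XOR-based integrity check for records exchanged
    // between a source and a destination agent: a mismatch triggers a resend.
    final class XorCheck {
        static byte xorOf(byte[] data) {
            byte x = 0;
            for (byte b : data) x ^= b;   // running XOR over all record bytes
            return x;
        }

        public static void main(String[] args) {
            byte[] record = "user42,sales,1".getBytes(StandardCharsets.UTF_8);
            byte sent = xorOf(record);            // sender-side checksum
            record[3] ^= 0x1;                     // simulate a transmission error
            boolean ok = xorOf(record) == sent;   // receiver-side verification
            System.out.println(ok ? "accept" : "XOR error: resend records");
        }
    }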
3.4   Straggler Mitigation

Straggler mitigation, including dealing with transient slowdown, is important for maintaining low end-to-end delays for time-critical streaming jobs. Users can instruct ELF jobs to exercise two possible mitigation options. First, as in other stream processing systems, speculative backup copies of slow tasks could be run in neighboring agents, termed the "mirroring straggler" option. The second option, in actual current use by ELF, is the "leaping straggler" approach, which skips the delayed snapshot and simply jumps to the next interval to continue the stream computation.

Straggler mitigation is enabled by the fact that each agent's CBT states are periodically checkpointed, with a timestamp at every interval. When a CBT's PAO snapshots are rolled up from leaves to root, the straggler will cause all of its upstream agents to be blocked. In the example shown in Figure 9, agent1 has a transient failure and fails to resend the first checkpoint's data for some short duration, blocking the computations in agent5 and agent7. Using a simple threshold to identify it as a straggler – whenever its parent determines it to have fallen two intervals behind its siblings – agent1 is marked as a straggler. Agent5 can then use the leaping straggler approach: it invalidates the first interval's checkpoints on all agents via multicast, and then jumps to the second interval.

The leaping straggler approach leverages the streaming nature of ELF, maintaining timeliness at reduced levels of result accuracy. This is critical for streaming jobs operating on realtime data, as when reacting quickly to changes in web user behavior or when dealing with realtime sensor inputs, e.g., indicating time-critical business decisions or analyzing weather changes, stock ups and downs, etc.
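A hypothetical sketch of the leaping-straggler decision follows: if a child has fallen at least two intervals behind its siblings, the lagging interval is dropped (via multicast in ELF) and processing jumps to the next one. This is illustrative only, not ELF's implementation, and all names are invented.

    // Hypothetical sketch of the leaping-straggler decision at a parent agent.
    final class LeapingStraggler {
        static final int LAG_THRESHOLD = 2;   // intervals behind siblings

        // latestInterval[i] = newest interval for which child i has reported.
        static int nextIntervalToAggregate(int[] latestInterval, int current) {
            int fastest = Integer.MIN_VALUE, slowest = Integer.MAX_VALUE;
            for (int v : latestInterval) {
                fastest = Math.max(fastest, v);
                slowest = Math.min(slowest, v);
            }
            if (fastest - slowest >= LAG_THRESHOLD) {
                // In ELF this would be a multicast telling all agents to discard
                // the snapshot of 'current'; here we simply skip it.
                System.out.println("leap: discard snapshot" + current);
                return current + 1;
            }
            return current;   // no straggler detected; keep waiting
        }

        public static void main(String[] args) {
            // child 0 (agent1) is stuck at interval 0 while siblings are at 2.
            System.out.println(nextIntervalToAggregate(new int[] {0, 2, 2}, 0));
        }
    }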
4   Evaluation

ELF is evaluated with an online social network (OSN) monitoring application and with the well-known WordCount benchmark application. Experimental evaluations answer the following questions:

  • What performance and functionality benefits does ELF provide for realistic streaming applications (Sec. 4.1)?
  • What is the throughput and processing latency seen for ELF jobs, and how does ELF scale with the number of nodes and number of concurrent jobs (Sec. 4.2)?
  • What are the overheads of ELF in terms of CPU, memory, and network load (Sec. 4.3)?

4.1   Testbed and Application Scenarios

Experiments are conducted on a testbed of 1280 agents hosted by 60 server blades running Linux 2.6.32, all connected via Gigabit Ethernet. Each server has 12 cores (two 2.66 GHz six-core Intel X5650 processors), 48 GB of DDR3 RAM, and one 1 TB SATA disk.

ELF's functionality is evaluated by running an actual application requiring both batch and stream processing. The application's purpose is to identify social spam campaigns, such as compromised or fake OSN accounts used by malicious entities to execute spam and spread malware [13]. We use the most straightforward approach to identify them – by clustering all spam containing the same label, such as a URL or account, into a campaign. The application consumes the events, all labeled as "sales", from the replay of a previously captured data stream from Twitter's public API [3], to determine the top-k most frequently twittering users publishing "sales". After listing them as suspects, job functions are changed dynamically, to investigate further, by setting different filtering conditions and zooming in to different attributes, e.g., locations or number of followers.

The application is implemented to obtain online results from live data streams via ELF's agents, while in addition also obtaining offline results via a Hadoop/HBase backend. Having both live and historical data is crucial for understanding the accuracy and relevance of online results, e.g., to debug or improve the online code. ELF makes it straightforward to mix online with offline processing, as it operates in ways that bypass the storage tier used by Hadoop/HBase.

Specifically, live data streams flow from webservers to ELF and to HBase. For the web tier, there are 1280 emulated webservers generating Twitter streams at a rate of 50 events/s each. Those streams are directly intercepted by ELF's 1280 agents, which filter tuples for processing; concurrently, unchanged streams are gathered by Flume to be moved to the HBase store. The storage tier has 20 servers, in which the name node and job tracker run on a master server, and the data node and task trackers run on the remaining machines. Task trackers are configured to use two map and two reduce slots per worker node. HBase coprocessor, which is analogous to Google's BigTable coprocessor, is used for offline batch processing.
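As a self-contained illustration of the query this application runs over the Twitter stream (filter events labeled "sales", count per user, report the top-k publishers), the sketch below uses plain Java collections. The event format and names are assumptions; the actual ELF job expresses the same logic as filter and count PAOs rolled up through CBTs and the SRT.

    import java.util.*;
    import java.util.stream.*;

    // Hypothetical sketch of the monitoring query: count "sales" events per user
    // and report the k most frequent publishers.
    final class TopKSalesPublishers {
        public static void main(String[] args) {
            // (userId, label) pairs standing in for parsed Twitter events.
            String[][] events = {
                {"alice", "sales"}, {"bob", "sales"}, {"alice", "sales"},
                {"carol", "news"},  {"bob", "sales"}, {"alice", "ads"}
            };
            int k = 2;
            Map<String, Long> counts = Arrays.stream(events)
                .filter(e -> e[1].equals("sales"))               // keep "sales" only
                .collect(Collectors.groupingBy(e -> e[0], Collectors.counting()));
            counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
        }
    }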
Comparison with Muppet and Storm. With ad-hoc queries sent from a shell via ZeroMQ [5], comparative results are obtained for ELF, Muppet, and Storm, with varying sizes of sliding windows. Figure 10a shows that ELF consistently outperforms Muppet. It achieves performance superior to Storm for large window sizes, e.g., 300 s, because the CBT's data structure stabilizes 'flush' cost by organizing the compressed historical records in an (a, b) tree, enabling fast merging with large numbers of past records.

Figure 10: Processing times, latency of function changes, and recovery results of ELF on the Twitter application. (a) Comparison of processing times (Storm, Muppet, ELF; 5 s to 300 s windows). (b) Latency for job function changes (message, filter function, map/reduce function). (c) Fault and straggler recovery latency versus the number of fault or straggler agents.
Job function changes. Novel in ELF is the ability to change job functions at runtime. As Figure 10b shows, it takes less than 10 ms for the job master to notify all agents about some change, e.g., to publish discounts in the microsale example. To zoom in on different attributes, the job master can update the filter functions on all agents, which takes less than 210 ms, and it takes less than 600 ms to update user-specified map/reduce functions.
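One plausible way to picture such an on-the-fly change is sketched below: the job master pushes a new filter predicate and each agent swaps it in between events, so no job restart is needed. The message form and class names are assumptions, not ELF's interfaces.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.Predicate;

    // Hypothetical sketch of an on-the-fly filter-function change at an agent.
    final class HotSwappableFilter {
        private final AtomicReference<Predicate<String>> filter =
            new AtomicReference<>(event -> event.contains("sales"));

        void onUpdateMessage(Predicate<String> newFilter) {   // from the job master
            filter.set(newFilter);
        }

        boolean accept(String event) {                        // applied per event
            return filter.get().test(event);
        }

        public static void main(String[] args) {
            HotSwappableFilter agent = new HotSwappableFilter();
            System.out.println(agent.accept("user7 sales atlanta"));   // true
            agent.onUpdateMessage(e -> e.contains("atlanta"));         // zoom in
            System.out.println(agent.accept("user9 sales newyork"));   // false
        }
    }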
Fault and Straggler Recovery. Fault and straggler recovery are evaluated with methods that use human intervention. To cause permanent failures, we deliberately remove some working agents from the datacenter to evaluate how fast ELF can recover. The time cost includes recomputing the routing table entries, rebuilding the SRT links, synchronizing CBTs across the network, and resuming the computation. To cause stragglers via transient failures, we deliberately slow down some working agents, by collocating them with other CPU-intensive and bandwidth-aggressive applications. The leap ahead method for straggler mitigation is fast, as it only requires the job master to send a multicast message to notify everyone to drop the intervals in question.

Figure 10c reports recovery times with varying numbers of agents. The top curve shows that the delay for fault recovery is about 7 s, with a very small rate of increase with increasing numbers of agents. This is due to the DHT overlay's internally parallel nature of repairing the SRT and routing table. The bottom curve shows that the delay for ELF's leap ahead approach to dealing with stragglers is less than 100 ms, because multicast and subsequent skipping time costs are trivial compared to the cost of full recovery.
4.2   Performance

Data streams propagate from ELF's distributed CBTs as leaves, to the SRT for aggregation, until the job master at the SRT's root has the final results. Generated live streams are first consumed by CBTs, and the SRT only picks up truncated key-value pairs from CBTs for subsequent shuffling. Therefore, the CBT, as the starting point for parallel streaming computations, directly influences ELF's overall throughput. The SRT, as the tree structure for shuffling key-value pairs, directly influences total job processing latency, which is the time from when records are sent to the system to when results incorporating them appear at the root. We first report the per-node throughput of ELF in Figure 11, then report data shuffling times for different operators in Figures 12a and 12b. The degree to which loads are balanced, an important ELF property when running a large number of concurrent streaming jobs, is reported in Figure 12c.

Throughput. ELF's high throughput for local aggregation, even with substantial amounts of local state, is based in part on the efficiency of the CBT data structure used for this purpose. Figure 11 compares the aggregation performance and memory consumption of the Compressed Buffer Tree (CBT) with a state-of-the-art concurrent hashtable implementation from Google's sparsehash [2]. The experiment uses a microbenchmark running the WordCount application on a set of input files containing varying numbers of unique keys. We measure the per-unique-key memory consumption and throughput of the two data structures. Results show that the CBT consumes significantly less memory per key, while yielding similar throughput compared to the hashtable. Tests are run with equal numbers of CPUs (12 cores), and hashtable performance scales linearly with the number of cores.

Figure 11: Comparison of CBT with Google Sparsehash: memory per key (B) and throughput (x10^6 keys/s) versus the number of unique keys (x10^6).
Figure 12: Performance evaluation of ELF on data-shuffling time and load balance. (a) Data-shuffling time (ms) versus the number of agents in the datacenter when running the max, min, sum, and avg operators separately. (b) The same operators running simultaneously. (c) Normal probability plot of the average number of SRT roots per agent, for 500, 1000, and 2000 deployed SRTs.

Figure 13: Overheads evaluation of ELF on runtime cost and network cost. (a) Runtime overheads of ELF vs. others (Table 13a, below). (b) Additional packets per agent per second versus the number of agents, for 5 s, 10 s, and 30 s update intervals. (c) Additional bytes per agent per second, same configurations.

Table 13a:
  SPS        CPU (%used)   Memory (%used)   I/O (wtps)   C-switch (cswsh/s)
  ELF        2.96%         5.73%            3.39         780.44
  Flume      0.14%         5.48%            2.84         259.23
  S-master   0.06%         9.63%            2.96         652.22
  S-worker   1.17%         15.91%           11.47        11198.96
  SPS: stream processing system; wtps: write transactions per second; cswsh/s: context switches per second.
ELF's per-node throughput of over 1,000,000 keys/s is in a range similar to Spark Streaming's reported best throughput (640,000 records/s) for Grep, WordCount, and TopKCount when running on 4-core nodes. It is also comparable to the speeds reported for commercial single-node streaming systems, e.g., Oracle CEP reports a throughput of 1 million records/s on a 16-core server and StreamBase reports 245,000 records/s on an 8-core server.

Operators. Figures 12a and 12b report ELF's data-shuffling time when running four operators separately vs. simultaneously. By data-shuffling time, we mean the time from when the SRT fetches a CBT's snapshot to when the result incorporating it appears at the root. max sorts key-value pairs in a descending order of value, and min sorts in an ascending order. sum is similar to WordCount, and avg refers to the frequency of words divided by the occurrence of key-value pairs. As sum does not truncate key-value pairs like max or min, and avg is based on sum, naturally, sum and avg take more time than max and min.
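As a small illustration of why avg can be computed in the same merge-as-you-go style as sum, the hypothetical sketch below keeps (sum, count) partials that merge associatively and only materializes the average at the root; it is not ELF's operator code and the names are invented.

    // Hypothetical sketch of an avg operator built from mergeable partials.
    final class AvgPartial {
        final double sum;
        final long count;

        AvgPartial(double sum, long count) { this.sum = sum; this.count = count; }

        AvgPartial merge(AvgPartial other) {          // associative, order-free
            return new AvgPartial(sum + other.sum, count + other.count);
        }

        double average() { return count == 0 ? 0.0 : sum / count; }

        public static void main(String[] args) {
            AvgPartial left = new AvgPartial(10.0, 4);    // from one subtree
            AvgPartial right = new AvgPartial(26.0, 8);   // from another subtree
            System.out.println(left.merge(right).average());   // prints 3.0
        }
    }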
Figures 12a and 12b demonstrate low performance interference between concurrent operators, because the data-shuffling times seen for both separate operators and concurrent operators are less than 100 ms. Given the fact that concurrent jobs reuse operators if processing logic is duplicated, the interference between concurrent jobs is also low. Finally, these results also demonstrate that the SRT scales well with the datacenter size, i.e., the number of webservers, as the reduce times increase only linearly with an exponential increase in the number of agents. This is because reduce times are strictly governed by an SRT's depth O(log16 N), where N is the number of agents in the datacenter (e.g., with N = 1280 agents, the depth is about log16 1280, roughly 2.6, i.e., at most 3 levels).

Balanced Load. Figure 12c shows the normal probability plot for the expected number of roots per agent. These results illustrate a good load balance among participating agents when running a large number of concurrent jobs. Specifically, assuming the root is the agent with the highest load, results show that 99.5% of the agents are the roots of less than 3 trees when there are 500 SRTs total; 99.5% of the agents are the roots of less than 5 trees when there are 1000 SRTs total; and 95% of the agents are the roots of less than 5 trees when there are 2000 SRTs total. This is because of the independent nature of the trees' root IDs that are mapped to specific locations in the overlay.

4.3   Overheads

We evaluate ELF's basic runtime overheads, particularly those pertaining to its CBT and SRT abstractions, and compare them with Flume and Storm. The CBT requires additional memory for maintaining intermediate results, and the SRT generates additional network traffic to maintain the overlay and its tree structure. Table 13a and Figures 13b and 13c present these costs, explained next.
Runtime overheads. Table 13a shows the per-node runtime overheads of ELF, Flume, Storm master, and Storm worker. Experiments are run on 60 nodes, each with 12 cores and 48 GB RAM. As Table 13a shows, ELF's runtime overhead is small, comparable to Flume, and much less than that of Storm master and Storm worker. This is because both ELF and Flume use a decentralized architecture that distributes the management load across the datacenter, which is not the case for Storm master. Compared to Flume, which only collects and aggregates streams, ELF offers the additional functionality of providing fast, general stream processing along with per-job management mechanisms.

Network overheads. Figures 13b and 13c show the additional network traffic imposed by ELF with varying update intervals, when running the Twitter application. We see that the number of packets and number of bytes sent per agent increase only linearly, with an exponential increase in the number of agents, at a rate less than the increase in update frequency (from 1/30 to 1/5). This is because most packets are ping-pong messages used for overlay and SRT maintenance (initialization and keep-alive), for which any agent pings only a limited set of neighboring agents. We estimate from Figure 13b that when scaling to millions of agents, the additional number of packets per agent per second is still bounded by 10.

5   Related Work

Streaming Databases. Early systems for stream processing developed in the database community include Aurora [29], Borealis [6], and STREAM [10]. Here, a query is composed of fixed operators, and a global scheduler decides which tuples and which operators to prioritize in execution based on different policies, e.g., interesting tuple content, QoS values for tuples, etc. SPADE [14] provides a toolkit of built-in operators and a set of adapters, targeting the System S runtime. Unlike SPADE or STREAM, which use SQL-style declarative query interfaces, Aurora allows query activity to be interspersed with message processing. Borealis inherits its core stream processing functionality from Aurora.

MapReduce-style Systems. Recent work extends the batch-oriented MapReduce model to support continuous stream processing, using techniques like pipelined parallelism, incremental processing for map and reduce, etc.

MapReduce Online [12] pipelines data between map and reduce operators, by calling reduce with partial data for early results. Nova [22] runs as a workflow manager on top of an unmodified Pig/Hadoop software stack, with data passing in a continuous fashion. Nova claims itself as a tool more suitable for large batched incremental processing than for small data increments. Incoop [11] applies memoization to the results of partial computations, so that subsequent computations can reuse previous results for unchanged inputs. One-Pass Analytics [17] optimizes MapReduce jobs by avoiding expensive I/O blocking operations such as reloading map output.

iMR [19] offers the MapReduce API for continuous log processing, and similar to ELF's agent, mines data locally first, so as to reduce the volume of data crossing the network. CBP [18] and Comet [15] run MapReduce jobs on new data every few minutes for "bulk incremental processing", with all states stored in on-disk filesystems, thus incurring latencies as high as tens of seconds. Spark Streaming [28] divides input data streams into batches and stores them in memory as RDDs [27]. By adopting a batch-computation model, it inherits powerful fault tolerance via parallel recovery, but any dataflow modification, e.g., from pipeline to cyclic, has to be done via the single master, thus introducing overheads avoided by ELF's decentralized approach. For example, it takes Spark Streaming seconds to iterate and perform incremental updates, but milliseconds for ELF.

All of the above systems inherit MapReduce's "single master" infrastructure, in which parallel jobs consist of hundreds of tasks, and each single task is a pipeline of map and reduce operators. The single master node places those tasks, launches those tasks, maybe synchronizes them, and keeps track of their status for fault recovery or straggler mitigation. The approach works well when the number of parallel jobs is small, but does not scale to hundreds of concurrent jobs, particularly when these jobs differ in their execution models and/or require customized management.

Large-scale Streaming Systems. Streaming systems like S4 [21], Storm [4], Flume [1], and Muppet [16] use a message passing model in which a stream computation is structured as a static dataflow graph, and vertices run stateful code to asynchronously process records as they traverse the graph. There are limited optimizations on how past states are stored and how new states are integrated with past data, thus incurring high overheads in memory usage and low throughput when operating over larger time windows. For example, Storm asks users to write code to implement sliding windows for trending topics, e.g., using Map or HashMap data structures. Muppet uses an in-memory hashtable-like data structure, termed a slate, to store past keys and their associated values. Each key-value entry has an update trigger that is run when new records arrive and aggressively inserts new values into the slate. This creates performance issues when the key space is large or when the historical window size is large. ELF, instead, structures sets of key-value pairs as compressed buffer trees (CBTs) in memory, and uses lazy aggregation, so as to achieve high memory efficiency and throughput.

There are also systems that use persistent storage to provide full fault tolerance. MillWheel [7] writes all states contained in vertices to some distributed storage system like BigTable or Spanner. Percolator [23] structures a web indexing computation as triggers that run when new values are written into a distributed key-value store, but does not offer consistency guarantees across nodes. TimeStream [24] runs the continuous, stateful operators in Microsoft's StreamInsight [8] on a cluster, scaling with load swings through