Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems


Luo Mai[1], Kai Zeng[2], Rahul Potharaju[2], Le Xu[3], Steve Suh[2], Shivaram Venkataraman[2], Paolo Costa[1,2], Terry Kim[2], Saravanan Muthukrishnan[2], Vamsi Kuppa[2], Sudheer Dhulipalla[2], Sriram Rao[2]

[1] Imperial College London, [2] Microsoft, [3] UIUC
[1] luo.mai11@imperial.ac.uk, [2] {kaizeng, rapoth, stsuh, shivaram.venkataraman, pcosta, terryk, sarmut, vamsik, sudheerd, sriramra}@microsoft.com, [3] lexu1@illinois.edu

ABSTRACT

Stream-processing workloads and modern shared cluster environments exhibit high variability and unpredictability. Combined with the large parameter space and the diverse set of user SLOs, this makes modern streaming systems very challenging to statically configure and tune. To address these issues, in this paper we investigate a novel control-plane design, Chi, which supports continuous monitoring and feedback, and enables dynamic re-configuration. Chi leverages the key insight of embedding control-plane messages in the data-plane channels to achieve a low-latency and flexible control plane for stream-processing systems.

Chi introduces a new reactive programming model and design mechanisms to asynchronously execute control policies, thus avoiding global synchronization. We show how this allows us to easily implement a wide spectrum of control policies targeting different use cases observed in production. Large-scale experiments using production workloads from a popular cloud provider demonstrate the flexibility and efficiency of our approach.

PVLDB Reference Format:
Luo Mai, Kai Zeng, Rahul Potharaju, Le Xu, Steve Suh, Shivaram Venkataraman, Paolo Costa, Terry Kim, Saravanan Muthukrishnan, Vamsi Kuppa, Sudheer Dhulipalla, Sriram Rao. Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems. PVLDB, 11(10): 1303-1316, 2018.
DOI: https://doi.org/10.14778/3231751.3231765

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil.
Proceedings of the VLDB Endowment, Vol. 11, No. 10
Copyright 2018 VLDB Endowment 2150-8097/18/06.
DOI: https://doi.org/10.14778/3231751.3231765

1. INTRODUCTION

Large-scale Internet-service providers such as Amazon, Facebook, Google, and Microsoft generate tens of millions of data events per second [5]. To handle such high throughput, they have traditionally resorted to offline batch systems [21, 13, 4]. More recently, however, there has been an increasing trend towards using streaming systems [3, 10, 27, 31] to ensure timely processing and avoid the delays typically incurred by offline batch systems.

Fully achieving the benefits promised by these online systems, however, is particularly challenging. First, streaming workloads exhibit high temporal and spatial variability, up to an order of magnitude compared to the average load [27, 31]. Second, large shared clusters exhibit high hardware heterogeneity and unpredictable concurrent usage. Third, modern streaming systems expose a large parameter space, which makes tuning hard, even for the most experienced engineers. Further, different users and jobs have a diverse set of Service Level Objectives (SLOs), leading to divergent configuration settings. Addressing these issues requires introducing continuous monitoring and feedback as well as dynamic re-configuration into all aspects of streaming systems, ranging from query planning and resource allocation/scheduling to parameter tuning. We call the system layer in charge of such control mechanisms the control plane, to distinguish it from the data plane (the layer in charge of data processing).

Through our interaction with product teams and cloud operators, we identified the following requirements that a control plane should satisfy. First, it should be possible to define new custom control operations, tailored to different scenarios [24]. Second, this should be achieved with only minimal effort by the developers, through a simple and intuitive API, to minimize the chance of bugs, which are particularly challenging to detect in a distributed environment. Finally, the control overhead should be kept to a minimum, even in the presence of high event throughput and large computation graphs. This is particularly critical in today's landscape, with different cloud providers competing hard to offer the best SLOs to their customers. Ideally, the control plane should match the data-plane SLO (usually on the order of seconds or less).

Unfortunately, to the best of our knowledge, none of the existing streaming systems can satisfy all these requirements. Heron [27] and Flink [10] have a monolithic control plane that supports only a limited set of predefined control policies (e.g., dynamic scaling and back pressure) and lacks a clean API, which makes it hard for users to define custom policies. Spark Streaming [41] adopts a Bulk-Synchronous Parallel (BSP) model [38] in which a set of events is buffered and processed as a batch. While this allows the system to modify a dataflow between batches, it has limited flexibility due to the hard batch boundaries, and it incurs high overhead due to the synchronization and scheduling operations required.

To overcome the shortcomings of today's systems and meet the aforementioned requirements, we explore a novel control-plane design for stream processing systems. Inspired by the idea of punctuations [37] being used for data operators, we propose embedding control-plane messages into the data stream.

By leveraging the existing data pipeline, control messages can be streamed at low latency and in a scalable fashion, without requiring any ad-hoc mechanism (§7.2). This seemingly simple approach, however, requires support from the underlying streaming infrastructure to execute distributed control operations with consistency requirements while minimizing synchronization overhead and enabling an extensible control programming model.

In this paper, we describe Chi, a control plane built on this idea of embedding control messages in the dataflow to execute low-latency control operations. We introduce a reactive programming model for handling control messages that allows users to encode a number of complex control policies. This hides the complex distributed nature of the underlying system and provides developers an intuitive model to understand the boundaries of data events to which the control events are applied. We then design a mechanism to execute these control policies in an asynchronous manner at each operator, avoiding synchronization and high runtime overhead. We show that by using our programming model and execution mechanism we can efficiently implement a number of control policies, ranging from basic functionalities such as checkpointing and replay, to advanced ones with strong global consistency requirements, such as continuous monitoring, plan re-optimization, and parameter tuning (§5). Finally, we also discuss how our control-plane infrastructure can be easily parallelized by having a separate control loop per operation, making the control plane scalable.

To validate our claims and evaluate the runtime performance of our new control-plane design, we implement Chi on top of Flare, one of the internal stream processing systems used in our clusters. Flare is built on top of Orleans [7], a highly efficient distributed actor framework, and uses Trill [16] as the underlying stream processing engine. While we choose to showcase our approach on top of Flare for ease of implementation and deployment on our internal clusters, our design is not tied to a specific platform, and with some additional engineering effort, it can be applied to existing systems, including Heron and Flink. Our experimental evaluation, based on production workloads and industry-standard benchmarks, shows that Chi is able to perform reconfiguration of a large stateful dataflow in 5.8 seconds on a 32-server cluster. Real-world use cases further show the effectiveness of Chi in helping developers automatically tune critical system parameters and reduce latency by 61% by reducing workload skew during runtime.

In summary, this paper makes the following contributions:

• An extensive study of today's stream-processing workloads through the logs collected from more than 200,000 production servers.
• A scalable and efficient control plane that allows a streaming system to efficiently adapt to changes in workloads or environment.
• A flexible control-plane API that enables developers to easily implement custom control operations.
• Evaluation of our prototype implementation on a 32-server Azure VM cluster using production workloads from a cloud service provider and across a large set of control operations, including dynamic scaling, failure recovery, auto-tuning, and data skew management.

2. MOTIVATION

We performed an extensive measurement study by analysing the logs generated by more than 200,000 servers of a data-analytics cluster of a popular cloud provider. These clusters generate a significant amount of log data (10s of PB/day), and queries on this data are executed by developers for debugging, monitoring, etc. Given the data size, we require a large cluster with adequate network and compute resources to process data in real time. Our setup is consistent with a recent analysis of Google production logs [19].

We begin by summarizing the main results of our analysis and discuss the opportunities that arise.

Workload Unpredictability: The load on servers ingesting log events is highly variable. Fig. 1(a) shows the heat-map of the normalized number of tuples produced per minute across a random subset of streams in our cluster. This shows two important phenomena. First, as evidenced by the different color patterns across horizontal lines, each stream exhibits a unique workload characteristic (spatial variability). Second, while some streams are dark (high volume) for most of the time, several streams exhibit burstiness (temporal variability), as shown by color spots along the same horizontal line.

There are three main reasons for such variability in the log events:

• Heterogeneous Workloads: The clusters generating these logs handle a variety of workloads, ranging from typical big data jobs (e.g., filter, transform, join) to iterative machine learning workloads, on behalf of hundreds of teams.
• Failures & Debugging: Failures are common at large scale. These failures (e.g., networking issues, power issues, etc.) generate a large amount of error logs that lead to traffic bursts. In addition, when ingestion data is not sufficient, developers temporarily activate more verbose logs to perform in-depth debugging, which results in a higher-volume stream.
• Diverse Stream Semantics: The logs we ingest have diverse semantics (e.g., info, verbose, debug, error, etc.), with various components in the service emitting logs with different characteristics. A service at the lowest level (e.g., the storage layer) naturally produces the most logs, since most requests involve store I/O transactions.

To quantify the degree of variability, in Fig. 1(b) we show a box-and-whiskers plot with the Y-axis representing the delta in terms of event count per minute on a subset of incoming data streams. The height of the box shows the difference in counts between the 25th and 75th percentiles. Besides the significant difference in the behavior of each stream, it is worth noting the high range of change observed within the same stream, denoted by the large number of outliers (blue dots in the figure). For example, for certain event streams, the event count can increase by up to tens of millions in just one minute, indicating that a timely response is critical for sustaining load spikes.

Data Diversity: Another key element in determining the optimal query plan and resource allocation is the data distribution. To analyze its dynamics, we focus on the key selectivity, defined as the number of tuples that fall into a particular bucket. In Fig. 1(c) we plot the selectivity of various grouping keys across time, while in Fig. 1(d) we plot the selectivity over time across multiple grouping keys. The wide skew in the selectivity observed in both plots indicates that one-shot query planning (e.g., a traditional cardinality-based optimizer), either globally or on a per-key basis, will likely be sub-optimal over time.

Multi-tenant control policies: In our production environment, the same streaming system is used by many teams. Thus, there is a big diversity of SLOs across queries from different teams, or even across different queries from the same team, leading to the need for multiple control policies to be active at the same time. For instance, we have many customers who are interested in consuming verbose, info, and error logs. While verbose and error logs are used for debugging, info logs are often used to compute important metrics (e.g., billing metrics). One popular request made by our developers has been to provide stronger delivery semantics (e.g., exactly-once) for info logs and weaker delivery semantics (e.g., best-effort, at-least-once) for verbose/error logs. This highlights the importance of supporting multiple control policies either on a per-stream level or a per-tenant level.

Figure 1: (a) Heatmap of a subset of data streams in our ingestion workload; each horizontal band signifies a unique data stream being ingested. (b) Box plot representing the variability of delta count (order of millions) per minute (Y-axis) for a subset of incoming data streams (X-axis). (c), (d) Box plots depicting the selectivity properties of grouping keys used in a streaming join query from production traces: (c) shows the selectivity of various grouping keys across time, while (d) shows selectivity variability across the grouping-key space.
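As a concrete illustration of the selectivity metric used here — the number of tuples that fall into a particular grouping-key bucket per time window — the following Python sketch computes it from a stream of (timestamp, key) tuples. The names and the toy event stream are hypothetical, not taken from our production pipeline:

```python
from collections import Counter, defaultdict

def selectivity_per_window(events, window_secs=60):
    """Count tuples per (window, grouping key): the 'selectivity' of each
    key bucket per time window, as plotted across time and across keys."""
    windows = defaultdict(Counter)
    for ts, key in events:
        windows[ts // window_secs][key] += 1
    return windows

# Hypothetical event stream of (unix_ts, grouping_key) pairs.
events = [(0, "a"), (10, "a"), (50, "b"), (70, "a"), (95, "b"), (95, "b")]
sel = selectivity_per_window(events, window_secs=60)
# Window 0 sees key 'a' twice and 'b' once; window 1 sees 'a' once and 'b' twice,
# i.e., the per-key selectivity shifts from one window to the next.
```

Tracking how these per-window counts drift is what reveals the skew shown in the plots: a plan chosen for one window's key distribution may be far from optimal for the next.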

Multi-tenancy also introduces new opportunities in system management and optimization. In our production traces we observe queries that share a significant overlap in terms of data sources and operators (e.g., users parsing logs in an identical way). Thus, control policies could look at sharing computation across users or materializing intermediate data for later reuse. Further, with insight into the data streams and workloads, control policies could be used to automatically and transparently choose new data layouts (e.g., with partitioning) and storage engines (e.g., in-memory, database) that can improve execution efficiency. While all these optimizations are well understood in isolation [33, 34, 40, 32], applying them in an integrated manner to optimize streaming workloads requires flexibility and scalability in the control plane.

Takeaway: Our trace analysis uncovered several interesting characteristics of our workloads: high volume (10s of PB/day), low latency requirements (SLOs of seconds), wide diversity (100s of data streams), and dynamism (due to the nature of the services producing these logs). Based on this, we can derive the following list of requirements for the control plane:

1. Efficient and extensible feedback-loop controls: Because of the diversity of our workloads, it is important to allow users or applications to make flexible late-binding decisions on their data-processing logic to optimize performance. From an extensibility standpoint, it should be possible to integrate with dedicated components (e.g., policy controllers such as Dhalion [24]).

2. Easy control interface: Since we intend the control plane to be used by application developers, having an easy programming interface is critical to its adoption.

3. Minimal impact on the data plane: The control plane should have limited or no impact on the latency and throughput of the data plane, and it should be able to seamlessly cope with high control frequency and large dataflow graphs.

3. BACKGROUND

Chi is primarily designed for streaming systems that are based on a streaming dataflow computation model. In this section we provide the reader with the necessary background on this model.

Many existing streaming systems, such as Naiad [30], StreamScope [28] and Apache Flink [10], adopt the streaming dataflow computation model. In this model, a computation job is represented as a directed acyclic graph (DAG) of stateful operators, where each operator sends and receives logically timestamped events along directed edges. Each operator maintains mutable local state. Upon receiving events, an operator updates its local state, generates new events, and sends them to downstream operators. Operators without incoming channels are source operators; those without outgoing channels are sink operators.

For instance, consider an example where the user is interested in tokenizing sentences into words and counting the windowed accumulated count (e.g., per hour) of each word across all sentences in a data-parallel way. This query can be expressed in a LINQ-style language as below:

EXAMPLE 1 (WORD COUNT EXAMPLE).
  stream.SelectMany(line => Tokenize(line))
        .GroupByKey(word => word)
        .TumblingWindow(OneHour).Count()

An instance of a dataflow graph for Example 1 is shown in Stage I of Fig. 2. Note that operators {R1, R2} maintain the accumulated counts for the words as their mutable local states. These states cover a disjoint set of keys, and jointly represent the entire key subspace; e.g., R1 maintains the accumulated count for all words with the starting letter in the range ['a'-'l'], while R2 maintains those for the range ['m'-'z'].

Dataflow Computation Model: Formally, a dataflow computation is represented as a DAG G(V, E), where V represents the set of operators and E the set of edges connecting the operators; u → v represents a directed edge from operator u to v. We use {· → v} to denote the input edges to v, and {v → ·} the output edges from v. An operator v ∈ V is described by a triple (s_v, f_v, p_v), where s_v is the state of v, and f_v defines a function that captures the computation run on v, i.e., f : (s_v, m_{e_i ∈ {· → v}}) → (s'_v, {m'_{e_o ∈ {v → ·}}}), meaning that the function takes a single input message m from an input edge e_i and, based on the current state s_v, modifies the state of v to a new state s'_v and generates one or more messages {m'_{e_o ∈ {v → ·}}} on a set of output edges. Operators without input edges are called sources, and operators without output edges are called sinks. For generality, we represent the properties associated with v that are not part of the state as p_v; e.g., p_v can define the maximum memory used by v. An edge e does not have any state but can hold properties p_e, e.g., the token size of windows before triggering back-pressure.

4. DESIGN

The intuition behind embedding the control plane into the data plane is that this enables re-using the existing, efficient data-plane infrastructure and offers developers a familiar API to write control operations, i.e., the same API used to handle data events. Further, having control messages directly trailing data events provides a natural way to create custom boundaries between sequences of events. This makes it easy to implement asynchronous control operations, because control messages can be used to capture causal dependencies without requiring expensive global synchronization operations.
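To make this idea concrete, the following schematic Python sketch combines the (s_v, f_v, p_v) operator model with control messages that ride the same FIFO channel as data. This is an illustrative sketch, not Chi's actual API; all class and field names are hypothetical:

```python
# Schematic sketch of a dataflow operator, extended with control messages
# embedded in the data channel. Names are hypothetical, not Chi's real API.

class ControlMsg:
    def __init__(self, op_id, payload=None):
        self.op_id = op_id          # unique id of the control operation
        self.payload = payload      # e.g., new topology, repartitioned state

class Operator:
    def __init__(self, state, fn, props=None):
        self.state = state          # s_v: mutable local state
        self.fn = fn                # f_v: (state, msg) -> (state, out_msgs)
        self.props = props or {}    # p_v: non-state properties (e.g., memory cap)
        self.outputs = []           # downstream channels (lists stand in for FIFO channels)

    def on_message(self, msg):
        """Process one message at a time, in FIFO arrival order."""
        if isinstance(msg, ControlMsg):
            # Control path: react locally, then broadcast downstream so the
            # message sweeps the dataflow like a punctuation, trailing all
            # data messages that preceded it on this channel.
            self.on_control(msg)
            for ch in self.outputs:
                ch.append(msg)
        else:
            # Data path: f_v updates the state and emits new messages.
            self.state, out_msgs = self.fn(self.state, msg)
            for ch in self.outputs:
                ch.extend(out_msgs)

    def on_control(self, msg):
        pass  # default: no-op; concrete control operations override this

# Example: a word-count reducer expressed as (s_v, f_v, p_v).
def count_fn(state, word):
    state[word] = state.get(word, 0) + 1
    return state, [(word, state[word])]

sink = []
reducer = Operator(state={}, fn=count_fn)
reducer.outputs.append(sink)
for m in ["chi", "flare", "chi", ControlMsg(op_id=1)]:
    reducer.on_message(m)
# The control message reaches the sink only after every data message that
# preceded it, which is what lets it mark a consistent boundary in the stream.
```

Because the channel is FIFO and messages are processed one at a time, the control message implicitly partitions the stream into "before" and "after" segments at each operator, with no global barrier needed.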

In this section, we show how we incorporate these principles into our control-plane design and provide several examples to describe how we can easily build different control operations on top. We conclude the section by describing some of the more advanced features of our design and discussing how Chi can be adapted to support BSP-style computations.

4.1 Overview

Chi relies on the following three functionalities. First, channels between operators support exactly-once and FIFO delivery of messages; back-pressure is used to stop the message propagation when the buffer of the downstream operator fills up. Second, operators process messages one at a time and in the order in which they have been received. Finally, the underlying engine provides basic operator lifecycle-management capabilities; specifically, it allows us to start, stop, and kill an operator. These functionalities are already supported by Flare, our internal streaming system, but they can also be found in other existing systems [10, 27, 39, 41].

Our system design uses dataflow controllers that are responsible for monitoring dataflow behavior and external environmental changes, and for triggering control operations whenever needed. Users can define control operations and submit them to the controllers.

A control operation is carried out through a control loop that comprises a dataflow controller and the dataflow topology itself. In general, a control loop consists of three phases. (Phase-I) The controller makes a control decision and instantiates a control operation with a unique identifier. A control operation is defined by implementing a reactive API (Section §4.2.2), and has control configurations for each operator (e.g., the new topology to scale out the dataflow stream). The control operation is serialized into a control message (Section §6). (Phase-II) The control message is broadcast by the controller to all source operators of the dataflow. The control messages then propagate through the dataflow, interleaved with normal data messages. During the propagation, upon receiving a control message, each operator triggers the corresponding control actions—which can optionally attach additional data (e.g., the repartitioned state) to the control message—and broadcasts the control message to all downstream operators. See Section §4.2.2 and Section §6 for more implementation details. (Phase-III) In the end, the sink operators propagate the control messages back to the controller, and the controller carries out post-processing.

We use Fig. 2 to illustrate this process. We consider a case where the controller wants to increase the throughput by modifying the topology and adding a new reducer:

(I-II) In the new topology, R1 maintains the accumulated word counts for the range ['a'-'h'], R2 for ['i'-'p'], and R3 for ['q'-'z']. This reconfiguration needs to maintain the consistency of the states. It starts by broadcasting a control message with the new topology configuration to all the source nodes (→ {M1, M2}).

(III) When a source (mapper M1 or M2) receives this control message, the mapper immediately blocks the input channel while processing the message, updates its routing table with the new topology, and broadcasts the message downstream (→ {R1, R2, R3}).

(IV) When the reducer R1 (or R2) receives the control message, it blocks the input channel on which the message has been received. When the control messages from all input channels have been received, it updates its routing table and checkpoints its state. Next, it splits the state into two parts: the accumulated word counts for the range ['a'-'h'] and the range ['i'-'l'] (or for the ranges ['m'-'p'] and ['q'-'z']), attaches the state that needs to be handled by R3, i.e., the word counts for ['i'-'l'] (or for ['m'-'p']), to the control message, and broadcasts along all output channels (→ {R3, C}).

(V) When R3 receives a control message, it blocks that input channel. If the control message originates from R1 (or R2), it records the state carried by the control message. When it has received control messages from all input channels, it proceeds to merge all the states received, generate the new state of the accumulated word counts for the range ['i'-'p'], and install a new function (from the control message) using the new state. Finally, it broadcasts on the output channel (→ C).

(VI) When C receives control messages from all the expected sink nodes {R1, R2, R3}, the scale-out operation is completed. The controller then keeps monitoring the memory usage of these operators in the new topology and can decide to scale out/in again if needed. This forms a feedback-loop control.

4.2 Control Mechanism

Next, we describe the core mechanisms underpinning Chi. We start by formally defining the dataflow computation model and explain how graph transformations occur. Then, we discuss the controller and operator APIs and provide an example control-operation implementation. We provide a proof of correctness in §4.2.3.

4.2.1 Graph Transitions through Meta Topology

Formally, a user control operation C can be modeled as a transformation that converts a dataflow execution graph G(V, E) to a new graph G*(V*, E*). For an operator v, such a transformation can change one or more entries in the triple (s_v, f_v, p_v). For an edge e, such a transformation can optionally change p_e. In particular,
   (I) At the beginning, there are two map operators {M1 , M2 } and         since the operator state S can capture state accumulated over a long
       two reduce operators {R1 , R2 } that compute word counting           time period, special care is needed to capture the transformation of
       for ingested sentences. These operators are stateful. For ex-        states during reconfiguration (i.e., G → G∗ ). That is, for v∗ ∈ V ∗ ,
       ample, {R1 , R2 } hold the accumulated count for all words,          Sv∗ is defined by a transformation function T on one or more nodes
       where R1 maintains the counts for all words starting with            {v} ⊆ V , i.e., T ({Sv }) = Sv∗ . In cases without ambiguity, we relax
       [‘a’-‘l’], and R2 maintains those for [‘m’-‘z’]. The con-            the notation and use T −1 (v∗ ) to represent the set {v} ⊆ V whose
       troller C is responsible for monitoring the memory usage             states Sv∗ depends on.
       of all operators and reconfiguring parallelism if needed. To           Most existing systems (e.g., [10, 39]) adopt a freeze-the-world
       simplify we omit the control messages that collect memory            approach to perform the transformation i.e., stop G by appropriate
       usage and focus on the parallelism reconfiguration process           checkpointing mechanisms, start G∗ , migrate the old checkpointed
       in the following discussions.                                        state on G to G∗ and resume the dataflow. However, this would
  (II) Once the controller C detects that the aggregated memory             likely trigger back-pressure, causing increased latency and loss of
       usage of the reducers goes beyond a threshold, it makes              throughput, and in turn limits the frequency of execution and ex-
       a reconfiguration decision to start a new reducer R3 to in-          pressivity of dataflow reconfigurations. Therefore, in the design of
       crease memory provisioning. In the new topology, the states          Chi we opted for an asynchronous alternative: instead of affecting
       of {R1 , R2 , R3 } should be repartitioned so that R1 holds the      the transformation directly (i.e., G → G∗ ), we introduce an interme-

[Figure 2 diagram: dataflow snapshots for stages (I)-(VI), showing the
controller C, mappers M1/M2, reducers R1/R2, and the new reducer R3,
together with a message timeline annotating which control messages carry
(Configuration) and which carry (Configuration, State).]
                      Figure 2: Scaling-out control in action where the user is interested in changing the number of reducers in Example 1.

diate meta topology G′, which the control operation can temporar-
ily utilize in order to complete the transformation asynchronously.
That is, during propagation of the control messages, each oper-
ator broadcasts messages to its downstream operators according to
G′. The operators in G′ − G∗, after processing the control messages,
will shut down, while the operators in G′ − G will only start processing
data messages after finishing the processing of control messages.
When the control message propagation finishes, the resulting topol-
ogy will be equivalent to G∗.
   We derive the meta-topology G′ for a control operation C as fol-
lows. In the most general case, we set G′ = G ∪ G∗ ∪ EV,V∗, where
EV,V∗ = {(v, v∗) | ∀v∗ ∈ V∗, ∀v ∈ T−1(v∗)}. In other words, the prop-
agation graph for C consists of all the operators and channels from
both the old and new execution graphs, plus channels that capture
the dependency relationships between the states of the old and new op-
erators. While this approach can lead to doubling the size of the
dataflow execution graph during reconfiguration, in practice we
can significantly reduce the propagation topology through appro-
priate pruning. For instance:
• State invariance. If a control operation does not change a node
   v's state Sv, we can collapse the corresponding new node v∗ ∈ G∗
   with v ∈ G, and merge the input and output channels adjacent to
   v. For example, in Fig. 3(a), M1∗ (M2∗) can be merged with M1
   (M2) respectively.
• Acyclic invariance. Aggressively merge the old and new topol-
   ogy as long as we can guarantee graph acyclicity. For in-
   stance, in Fig. 3(a), we can further collapse R1∗ (R2∗) with R1 (R2)
   without breaking acyclicity. This is guaranteed by (i) the
   functional query interface, which ensures that the initial dataflow
   topology is acyclic, and (ii) the pruning algorithm, which ensures
   that no cycles are introduced while optimizing a meta topology.
   For example, for scale-out/in reconfiguration, the pruning algo-
   rithm uses consistent hashing as the state allocation scheme to
   avoid introducing cycles when re-partitioning states.
By applying the above pruning rules repeatedly in Fig. 3(a), we ob-
tain the graph shown in Stage (IV) in Fig. 2.

4.2.2    Control API
   We next describe the features of our control API that enable devel-
opers to implement complex control operations. Chi's API allows
expressing different behavior across the following dimensions: (1)
spatial, e.g., the behavior of {M1, M2} being different from that of
{R1, R2, R3}, and (2) temporal, e.g., the behavior of R3 when receiving
the first control message vs. the last in Fig. 2.
   To enable such flexibility, we abstract control operations and
provide the following capabilities to users:
• Configuration injection: We allow the same control operation
to carry different configurations for different operators. The config-
urations instruct operators to take different control actions. Config-
urations are injected into a control message (see Fig. 6 for imple-
mentation). The runtime transparently instantiates the control oper-
ation appropriately with the correct configuration at each operator.
In the scaling-out example shown in Fig. 2, the injected configura-
tions need to instruct (1) mappers to reconnect output channels, (2)
the reducers R1 and R2 to migrate states, and (3) the new reducer
R3 to accept migrated states. As shown in Algorithm 1 (L1-11), R1 is
injected with SplitState (L6) and LoadFunc (L9) instructions, and
R3 with MergeState (L8) and LoadFunc instructions.
• Reactive execution: Chi exposes a reactive (event-driven) pro-
gramming interface that users can leverage to define control oper-
ations. A control operation comprises two sets of event handlers:
those executed at the controller {OnInitAtController, OnBeginAt-
Controller, OnNextAtController, OnCompleteAtController, OnDis-
poseAtController}, and those executed at the operators {OnBegi-
nAtOperator, OnNextAtOperator, OnCompleteAtOperator, OnDis-
poseAtOperator}. These event handlers offer users great flexibility
to collect metrics or modify configurations when the controller
and operators receive the first, next, and last control messages.
OnInitAtController is called when initializing the control operation
and allows users to inject configurations into the control message.
OnDisposeAtController and OnDisposeAtOperator are called
when the control operations are disposed; they are usually used
for releasing resources. The runtime handles the correct invocation
of these handlers and the state transitions shown in Fig. 3(b), thus
supporting the expression of complex control logic in a safe manner.
• Blocking behavior: In the example in Fig. 2, operators always
block an input channel upon receiving a control message from it.
We find that this is a fairly common pattern in many control scenar-
ios, e.g., checkpointing and scale-in/out operations. To simplify the im-
plementation of complex control, we provide a layer of abstraction
that allows users to implement their control operations in both
blocking and non-blocking ways. We do this by classifying con-
trol messages into two categories: (1) blocking, where the operator
blocks the corresponding channel on which the control message is
received and subsequently unblocks it only when all the control ac-
tions are finished on that operator; and (2) non-blocking, where the
operator does not block the input channel and continues to receive
other data/control messages on that channel. We believe such an ab-
straction is useful for users to express more advanced control oper-
ations. For instance, blocking control messages are usually useful
for control that affects states, while non-blocking control messages
are useful for the other cases, e.g., monitoring.
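The blocking semantics described above can be sketched as follows; the channel bookkeeping and all function names are illustrative assumptions, not Chi's actual implementation.

```python
# Sketch of how an operator might treat blocking control messages (CMs):
# a blocking CM freezes the channel it arrived on until the operator's
# control actions finish. Purely illustrative, not Chi's real code.

class Channel:
    def __init__(self):
        self.blocked = False

def on_control_message(channels, ch_id, cm_blocking, pending):
    """Record a CM from channel ch_id; block that channel if the CM is
    blocking. Returns True once CMs arrived on all input channels."""
    if cm_blocking:
        channels[ch_id].blocked = True     # buffer data messages meanwhile
    pending.discard(ch_id)
    if not pending:                        # all input channels reported
        run_control_actions(channels)      # e.g., merge state, load function
        for ch in channels.values():
            ch.blocked = False             # unblock and resume data flow
        return True
    return False

def run_control_actions(channels):
    pass                                   # operator-specific work

chans = {"from_M1": Channel(), "from_M2": Channel()}
pending = set(chans)
done = on_control_message(chans, "from_M1", True, pending)
print(done, chans["from_M1"].blocked)      # -> False True
done = on_control_message(chans, "from_M2", True, pending)
print(done, chans["from_M1"].blocked)      # -> True False
```

A non-blocking message would simply skip the `blocked = True` step, letting data and control messages continue to interleave on that channel.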

[Figure 3 diagram. Panel (a) shows the meta topology G → G′ → G∗ in three
steps: (1) the initial meta topology, (2) state invariance pruning, and
(3) acyclic invariance pruning. Panel (b) shows the state machines
(Initialized → Preparing → Disposing) for the controller and the
operators, annotated with the handlers invoked on making a control
decision (OnInitAtController, broadcast CMs to source nodes), on
receiving the first CM (OnBeginAtController/OnBeginAtOperator and
OnNextAtController/OnNextAtOperator, optionally blocking the input
channel), on each next CM (OnNextAtController/OnNextAtOperator), on
receiving all CMs (OnCompleteAtController/OnCompleteAtOperator), and on
completing global control (OnDisposeAtController; operators broadcast
CMs to output channels and invoke OnDisposeAtOperator). CM stands for
control message.]

Figure 3: (a) Pruning the meta topology using state invariance and
acyclic invariance. The final stage here is equivalent to Stage (IV) in
Fig. 2. (b) State machine transitions for controller and operator.
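The meta-topology construction G′ = G ∪ G∗ ∪ EV,V∗ and the state-invariance pruning rule from §4.2.1 can be sketched on the scale-out example; the edge-set encoding and helper names below are illustrative assumptions, not part of Chi.

```python
# Sketch of meta-topology derivation with state-invariance pruning for
# the scale-out example (mappers keep their state, so M1*/M2* collapse
# into M1/M2). Illustrative encoding only.

def meta_topology(G, G_star, state_deps):
    """G, G_star: sets of directed edges (u, v); state_deps maps each
    new operator v* to the old operators its state derives from."""
    E_vv = {(v, v_star) for v_star, olds in state_deps.items() for v in olds}
    return G | G_star | E_vv

def prune_state_invariant(edges, unchanged):
    """Collapse each new node whose state is unchanged into its old
    counterpart, given a name mapping such as 'M1*' -> 'M1'."""
    rename = lambda x: unchanged.get(x, x)
    return {(rename(u), rename(v)) for u, v in edges if rename(u) != rename(v)}

G      = {("M1", "R1"), ("M1", "R2"), ("M2", "R1"), ("M2", "R2")}
G_star = {("M1*", "R1*"), ("M1*", "R2*"), ("M1*", "R3"),
          ("M2*", "R1*"), ("M2*", "R2*"), ("M2*", "R3")}
deps   = {"R1*": ["R1"], "R2*": ["R2"], "R3": ["R1", "R2"]}  # R3 merges split state

Gp = meta_topology(G, G_star, deps)
Gp = prune_state_invariant(Gp, {"M1*": "M1", "M2*": "M2"})   # mappers unchanged
print(("R1", "R3") in Gp)  # -> True: channel carrying R1's split state to R3
```

Acyclic-invariance pruning would go further and collapse R1*/R2* into R1/R2 whenever doing so introduces no cycle, yielding the compact topology of Stage (IV) in Fig. 2.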

   Example: We demonstrate usage of the control API using the
example shown in Fig. 2, where we want to scale out from G into
G∗ through G′ (shown in Fig. 3(a)). Algorithm 1 shows a pseudo-
implementation of the dataflow reconfiguration control operation.
Once the reconfiguration decision is made, the developer creates a
blocking control message. In OnInitAtController, a blocking con-
trol message is created and injected with configurations for trigger-
ing control actions at different operators. Specifically, as explained
in §4.2.1, an edge (v, v∗) ∈ EV,V∗ describes a dependency between
the states of v∗ and v: operator v needs to be configured to split its
states and ship the corresponding part of the state to v∗, while oper-
ator v∗ needs to be configured to load and merge the received states
(L5-8). For example, as in Fig. 2, R1 needs to split the state and
ship the accumulated counts for the range [‘i’-‘l’] to R3. Once the
topology or state is changed, the operators need to reset the associated
computation functions (L9). Such a control message is broadcast
to the source operators. In the OnBeginAtOperator function, an
operator first initializes session variables that hold migrated states
and control actions (L15-16). The migrated state is gradually accumulated
until all parts of the state are received by OnNextAtOperator (L18-19).
Once all messages are received, the operator (shown in the OnCom-
pleteAtOperator function) performs control actions, including moving
away states that it no longer holds according to the new state
key range (L23-25), merging states given by others (L26-27), and
resetting the function (L28-29). Once the controller receives the con-
trol messages from the sink operators, the control operation is marked
completed in OnCompleteAtController (L12).

4.2.3    Correctness Properties
   Chi provides correctness properties that can help users prove the
correctness of their control operations.

   THEOREM 1. Consider a control operation that changes a graph
from G to G∗ using a control message and a state transformation
function T. The control operation has the following properties:
   1. The control operation will terminate in finite time.
   2. If a pair of operators v, v′ satisfies (a) v → v′ is an edge in
      G or G∗, or (b) v ∈ T−1(Sv′), then v will always invoke On-
      CompleteAtOperator before v′.

   Furthermore, we introduce safe blocking control operations—a
special type of blocking control operations whose control actions
at each operator only read/write the corresponding operator state
in OnCompleteAtOperator. Safe blocking control operations have
stronger properties—the semantics of safe blocking control oper-
ations is equivalent to the freeze-the-world approach—which can
facilitate users' understanding of the semantics and correctness of their
customized control operations. For a detailed explanation and proof of
Theorem 1 and the properties of safe blocking control operations,
please refer to the technical report1.

4.3    Advanced Functionalities
   We discuss Chi's advanced functionalities that are necessary to
cope with production workloads.
Multiple Controllers. Thus far our description assumes a single
dataflow controller that enforces a control operation. Our design is
able to naturally scale out the controller by allowing multiple con-
current controllers for a single dataflow. For instance, users can
have one controller per desired functionality, e.g., one controller
takes care of monitoring the health of the dataflow execution, an-
other one periodically checkpoints operator states, while yet an-
other takes charge of reconfiguring dataflow execution graphs in
case of workload spikes. Multiple controllers can function at the
same time as long as the serializability of control messages for dif-
ferent control operations is guaranteed. In order to perform cross-
dataflow control operations, e.g., coordinating resource allocation
across multiple dataflows, we can introduce a global controller that
can interact with each dataflow's controller.
Broadcast/aggregation trees. In practice, a dataflow graph usu-
ally has a large number of source operators (and sometimes, sink
operators). In such a topology, the controller can quickly become a
bottleneck due to the large fan-in/out. To mitigate this, we leverage
a simple technique such as inserting a spanning broadcast (aggre-
gation) tree before (after) the source (sink) operators.
Dealing with congestion/deadlock. When congestion arises, e.g.,
due to network or CPU bottlenecks, our back-pressure mechanism
is triggered and all messages, including control messages, are de-
layed. This could be particularly critical if these messages are part
of a control operation meant to alleviate congestion. One option might
be to have two separate queues and give control messages higher prior-
ity, so that in case of congestion they are delivered first. This, how-
ever, would break the ordering of control and data messages, thus
making it hard to maintain consistency. Therefore, we simply wait
for the delayed messages to be processed. This is similar to the
approach taken by other systems such as Flink [10] and Spark
Streaming [41].
Fault tolerance. One of the main benefits of integrating the control
and data planes is that failures in the control plane are handled in the
same way as failures in the data plane. More specifically, if con-
trol messages are lost, the underlying data channel is responsible

1 http://aka.ms/project-continuum
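The SplitState/MergeState control actions walked through in the example might look roughly as follows for keyed word counts; the function names and key predicates are illustrative assumptions, not Algorithm 1's actual code.

```python
# Illustrative sketch of the SplitState/MergeState actions from the
# scale-out example: R1 keeps keys below its new split point and ships
# the rest to R3, which merges the shipped fragments into its new state.

def split_state(state, keep_pred):
    """Partition an operator's keyed state; the shipped part travels in
    the control message payload."""
    kept    = {k: v for k, v in state.items() if keep_pred(k)}
    shipped = {k: v for k, v in state.items() if not keep_pred(k)}
    return kept, shipped

def merge_state(fragments):
    """R3 accumulates fragments from all input channels, summing counts
    for keys that appear in more than one fragment."""
    merged = {}
    for frag in fragments:
        for k, v in frag.items():
            merged[k] = merged.get(k, 0) + v
    return merged

r1 = {"apple": 3, "ink": 2, "lemon": 1}           # R1 held ['a'-'l']
r1_new, to_r3_from_r1 = split_state(r1, lambda w: w[0] <= "h")
r2 = {"melon": 4, "pear": 2, "zebra": 5}          # R2 held ['m'-'z']
r2_new, to_r3_from_r2 = split_state(r2, lambda w: w[0] >= "q")
r3 = merge_state([to_r3_from_r1, to_r3_from_r2])  # R3 now holds ['i'-'p']
print(sorted(r3))  # -> ['ink', 'lemon', 'melon', 'pear']
```

In a safe blocking control operation, both the split and the merge would run inside OnCompleteAtOperator, touching only the local operator state.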

Table 1: Comparing Chi with the Synchronous Global Control (SGC) and
Asynchronous Local Control (ALC) models.

              SGC Models                ALC Models                Chi
Consistency   Barrier                   None                      Barrier/None
Semantic      Simple                    Hard                      Simple
Latency       High                      Low                       Low
Overhead      High                      Implementation-dependent  Low
Scalability   Implementation-dependent  Implementation-dependent  High

Algorithm 1 Dataflow Reconfiguration Operation
Assumption: Each operator has a context (ctx). In the controller, a
control operation can access G and G∗ as well as G′ and EV,V∗ (see
§4.2.1). Users can create session variables (names start with the $
mark) that live through function scopes. A graph vertex has properties
including stage (stg), state key-range, and function (func). A control
message has properties including source (src), destination (dest),
configuration dictionary (confs), and control payload. A configuration
is assigned a list of instructions.

 1: function OnInitAtController
 2:    msg := new BlockingControlMessage()
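Under the assumptions stated in Algorithm 1, the structures it manipulates could be rendered as plain classes; only the field names (src, dest, confs, payload, instructions) come from the text, while the class shapes and defaults below are illustrative assumptions.

```python
# Sketch of the data structures named in Algorithm 1's assumption block.
# The field names follow the text; the classes themselves are an
# illustrative rendering, not Chi's implementation.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Configuration:
    instructions: list = field(default_factory=list)  # e.g. ["SplitState", "LoadFunc"]

@dataclass
class ControlMessage:
    src: str = "controller"
    dest: str = "*"                      # broadcast to all source operators
    confs: dict = field(default_factory=dict)   # per-operator configurations
    payload: Any = None                  # e.g. migrated state fragments

msg = ControlMessage()
msg.confs["R1"] = Configuration(["SplitState", "LoadFunc"])
msg.confs["R3"] = Configuration(["MergeState", "LoadFunc"])
print(len(msg.confs))  # -> 2
```

This mirrors the configuration-injection step of OnInitAtController: the controller fills `confs` once, and the runtime hands each operator only the entry addressed to it.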
on a runtime that automatically manages the states of control oper-
ations, handles failures, and performs operations asynchronously.
In contrast, existing control models lack programmability and run-
time support. For example, Flink implements a distributed check-
point algorithm which is unable to support general control oper-
ations that, for example, require changing state. More recently,
Dhalion studied the high-level representation of a control policy,
with a particular focus on identifying symptoms and therapies of
detected anomalies. It relies on the underlying system, i.e., Heron,
to provide the actual reconfiguration capability. Hence, Dhalion does
not have a control-plane runtime like Chi's, which can be applied onto
general record-at-a-time streaming systems.

Figure 5: (a) Handling Skew (b) A Chi-enabled Flare operator architecture.
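The skew-handling reconfiguration of Fig. 5(a), described in the application examples, can be sketched as a routing rule: the overloaded reducer's probe-side input is partitioned across helper reducers while the build side is broadcast. The helper set and hashing choice below are illustrative assumptions.

```python
# Sketch of the skew-mitigation rerouting of Fig. 5(a): stream A's tuples
# that previously went to overloaded reducer R1 are partitioned across
# {R1, R3, R4}, while stream B's tuples for R1 are broadcast to all three
# so every split still sees B's full build side. Illustrative only.

HELPERS = ["R1", "R3", "R4"]   # R1 plus the reducers added to share its load

def route(stream, key, old_target):
    """Return the reducer(s) a tuple is sent to after reconfiguration."""
    if old_target != "R1":
        return [old_target]                    # unaffected keys keep their route
    if stream == "A":                          # heavier side: partition the load
        return [HELPERS[hash(key) % len(HELPERS)]]
    return list(HELPERS)                       # lighter side B: broadcast

print(route("B", "hot_key", "R1"))  # -> ['R1', 'R3', 'R4']
print(route("A", "cold", "R2"))     # -> ['R2']
```

The accompanying state migration then replicates R1's hash table of B's data to R3/R4 and repartitions its hash table of A's data, matching the text's description.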
Overhead. When reconfiguring a BSP system, the entire dataflow
is halted. This is particularly detrimental for online streaming sys-         Notice that the controller no longer needs to ping each individual
tems and affects both latency and throughput. Furthermore, the              operator separately to collect statistics. Instead, the metrics are col-
BSP barriers severely constrain the frequency and timing of control         lected and aggregated along in a scalable tree fashion. §7 shows the
operations. To amortize the scheduling and communication cost, a            evaluation of using the monitoring control for collecting per-join-
BSP barrier interval is often set to be no smaller than seconds [39].       key cardinality that helps in identifying skewness in the join space
  As mentioned above, reconfiguring a record-at-a-time system re-           (which can then be used for straggler-mitigation).
quires freeze-the-world as in Flink (Section §7), or re-computation         Dataflow Reconfiguration. Dataflow reconfiguration is important
through data replays as in SEEP. On the one hand, freeze-the-world          for several interesting use-cases such as adding/removing operator
incurs a similar overhead than the synchronous global barrier. On           from a given query, increasing degree of parallelism of an operator
the other hand, replaying data is expensive in production, despite          and exploiting computational reuse.
reconfigured operators being often already short of resources. Our            Besides scale-in/out, one can also carry out more complex re-
traces show that buffering all intermediate outputs in preparation          configurations including changing the query plan. For instance,
for replays require a significant amount of memory, several orders          Fig. 5(a) demonstrates how we can change the query plan to alle-
of magnitude larger than the one consumed by the operator compu-            viate stragglers when we find skew in a streaming join query. As-
tation state. Replaying such a large buffer state not only consumes         sume originally streams A and B are joined using a shuffle join [26],
significant bandwidth, but also blocks the operator for a long time.        where mappers read data from A and B respectively, and partition
  In Chi, changes are applied on the asynchronous global barriers.          and route the data to the corresponding reducer based on the join
There is no need for freeze-the-world or data replay. Blocking be-          key; reducers on receiving the data, join the tuples with the same
haviors are local to each operator. There is no global blocking that        key together. Due to skewed key space, reducer R1 receives much
freezes the entire dataflow, thus reducing data plane overheads.            more data than R2 . At this point, we can change the query plan by
Scalability. Existing streaming systems usually has a separate con-         adding reducers {R3 , R4 } to share the load for R1 . The idea is to let
trol plane. This duplicates resources for the control plane to deal         A (assume A has a significantly higher workload than B) partition
with typical distributed system issues such as fault tolerance and scalability. Chi, however, embeds the control plane into the data plane. As the data plane is optimized for handling high volumes of data, Chi benefits from the same optimizations, e.g., zero-copy data movement and broadcast/aggregation trees for large fan-in/out in dataflows. Furthermore, existing systems mostly adopt centralized controllers, and reconfiguration requires a large number of control messages, thus incurring a single point of failure and a performance bottleneck. In contrast, Chi naturally partitions workloads across controllers: dataflows are managed by parallel controllers, and a dataflow controller can be further split to execute control operations in parallel. All these controllers run on multiple servers and keep their state in a distributed store, yielding a highly scalable architecture.

5.    APPLICATION EXAMPLES

  To demonstrate the flexibility of our approach, we illustrate three control applications that we implemented using Chi.
Continuous Monitoring. Due to the unpredictability of our workloads, unlike traditional batch processing systems where jobs can be tuned offline to achieve optimal performance, streaming pipelines have to be continuously monitored and optimized on demand in order to achieve high performance and robustness. We show an example of using Chi to continually collect measurement data from all operators, a foundational block for detecting and acting upon interesting characteristics of the system such as overload/underload, skew/drift in data distribution, intermittent environment-related bottlenecks, and stragglers. Due to the page limit, the monitoring control operation implementation is shown in the technical report.
R1's load into {R1, R3, R4}; B broadcasts R1's load to {R1, R3, R4}. This reconfiguration requires R1 to replicate its internally maintained hash table of data from B to R3 and R4, while partitioning and redistributing the hash table of data from A to R3 and R4.
Auto Parameter Tuning. Big data systems have many parameters that are very hard to tune, even for very experienced engineers. Chi can be used for automatic parameter tuning by leveraging both monitoring and dataflow reconfiguration in a tight control loop to simulate A/B testing of multiple instances of a single query. For instance, many existing streaming systems use micro-batching to trade off latency against throughput. Whereas a large batch size provides good throughput, it does so at an increased latency. Tuning the right batch size is a tricky problem. One solution through Chi would be to continuously monitor the latency of the data plane and adjust the batch size when we see considerable fluctuations in the observed latency, until we obtain the maximum throughput with the desired latency.

6.    IMPLEMENTATION & DISCUSSION

Distributed runtime. To showcase the performance and flexibility of Chi, we implemented it on top of Flare, a streaming system used internally by our team. Flare is built on top of Orleans [7], a virtual actor framework, as a runtime and Trill [16] as the operator-level stream processing engine. By leveraging Orleans, Flare achieves decentralized scalability; specifically: (1) nodes can join/leave the cluster without notifying master nodes, and (2) the lifecycle of an actor is automatically managed by the platform and transcends the lifetime of its in-memory instantiation and of particular servers.
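For illustration, the batch-size auto-tuning loop sketched in §5 can be written as a small feedback controller. This is a sketch only: `observe_latency_ms` and `apply_batch_size` are hypothetical stand-ins for Chi's monitoring and dataflow-reconfiguration control operations, and the greedy grow/back-off policy is one plausible heuristic, not the policy used in our deployment.

```python
import statistics

def tune_batch_size(observe_latency_ms, apply_batch_size, target_ms,
                    batch=1000, step=2.0, rounds=20):
    """Greedy feedback loop: grow the batch size while the observed
    latency stays under the SLO, back off (with damping) once the SLO
    is violated. Both callables are illustrative control-plane hooks."""
    apply_batch_size(batch)
    for _ in range(rounds):
        lat = statistics.median(observe_latency_ms())
        if lat <= target_ms:
            batch = int(batch * step)          # headroom: favor throughput
        else:
            batch = max(1, int(batch / step))  # SLO violated: back off
            step = max(1.1, step / 2)          # damp the step to converge
        apply_batch_size(batch)                # issue a dataflow reconfiguration
    return batch
```

Each iteration reads the current latency estimate, moves the batch size in the direction that respects the SLO, and issues a reconfiguration; halving the step factor after a violation keeps the loop from oscillating indefinitely.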

Figure 6: Control message structure in Chi. [Figure: a message comprises a Metadata field (32-bit timestamp, 32-bit type (data or control), 64-bit sender, 32-bit version, 64-bit previous sequence ID, 64-bit sequence ID); a Control Payload created by the controller (control ID, control instruction logic, operator IDs, and per-operator configurations with their offsets, lengths, and count); and a Data Payload (processing logic and a customizable message zone modified by operators).]

  An operator in Chi has a stacked architecture embedded into the Flare operator, which is a single-threaded environment, as shown in Fig. 5(b). (i) The communication layer provides FIFO exactly-once data communication channels with back-pressure that mimic TCP. (ii) The message dispatcher/multiplexer invokes the corresponding processing module based on the type of each message, and multiplexes their outputs down to the communication layer. (iii) The data processor applies a Trill pipeline to data messages. (iv) The control processor invokes a series of local control actions based on received control messages: it loads the corresponding control configuration, manages state machines, and invokes user-defined control actions accordingly.
  Flare further provides the following functionalities to simplify control operations: (i) a user-friendly state management functionality that models operator states as a key-value map and can automatically split and merge states using a user-defined partition scheme on the key, (ii) a query compiler (extensible with custom optimization rules), similar to Spark's Catalyst [4], that converts LINQ queries into a distributed dataflow, and (iii) a file server that allows runtime extensibility, where users can upload new control operation dependencies and have them loaded at runtime.
Custom serialization. Control messages may carry/propagate large payloads, including configurations and states/metrics from each operator. Since serialization and deserialization at each operator may introduce unnecessary overhead, we implemented a zero-copy operation that allows us to extract the necessary pieces (e.g., configuration, payload of interest) from the byte buffers without deserializing the entire message.
  Fig. 6 shows the structure of a control message. Each message includes three basic components: 1. a Metadata field that is used to ensure FIFO exactly-once delivery; 2. a Configuration payload field to store configurations for different operators; and 3. a Data payload for an operator to insert any control-specific data, e.g., re-partitioned state while scaling out. Control messages are generated either by a controller (before control instructions are applied to any operators) or by an operator (after a control operation has been triggered but before the control message has been propagated to succeeding operators).
Portability. While we implemented our approach on top of Flare for ease of implementation/deployment in our internal clusters, our design is not tied to a specific platform. Chi can be applied to other systems as long as the system provides FIFO at-least-once or exactly-once delivery semantics and in-order message processing. These modest requirements are offered by most contemporary streaming systems such as Flink [10] and SEEP [11]. A typical porting plan includes (1) porting the communication layer if the underlying system does not provide FIFO exactly-once message delivery, (2) porting the message dispatcher/multiplexer and the control processor, and (3) reusing the existing data processor.

Figure 7: Flare's throughput and latency against Flink and Drizzle for the YSB and IPQ1 workload. [Figure: bar charts of (a) throughput (million events/s) and (b) latency (ms) for Flare, Flink, and Drizzle on YSB and IPQ1.]

7.    EVALUATION

  In this section, we evaluate the performance of Chi using a number of micro-benchmarks and two real-world benchmarks. The first real-world benchmark focuses on dynamically scaling resources in/out, while the second assesses Chi's ability to handle control and data failures. These results demonstrate that Chi incurs negligible overhead, even under high data/control load and large dataflow graphs, and is able to quickly react to changes in the workload or to failures. To show the flexibility of our approach, we also report on experiments with two more advanced case studies, i.e., handling a skewed key distribution and auto-tuning for meeting SLOs.

7.1    Experimental Setup

  Our experimental cluster comprises 32 DS12v2 instances in Azure. Each virtual machine has 4 vCPUs, 28 GB of RAM, and a 10 Gbps network connection. We consider one public workload, the Yahoo! Streaming Benchmark (YSB) [42], and one private workload based on production traces, IPQ1, which consists of multi-stage queries with complex window aggregations and streaming joins. YSB is rather lightweight in terms of data handling (bytes/event), while IPQ1 is computationally heavier (KB/event). As explained in §6, we implemented Chi on top of Flare, a streaming engine built on the .NET CLR and used internally by our team. When required, we compare against Drizzle [39], a fork of Apache Spark v2.0.0 (a BSP-style engine) with an optimized scheduler for streaming scenarios, and Apache Flink v1.3.2 [10] (a continuous dataflow engine). For all our experiments, we warm up the JVM and CLR before taking measurements to discount bootstrapping effects.
Flare performance. To help understand whether the underlying engine that we used for Chi is competitive with existing systems, we compare the base performance of Flare against Flink and Drizzle. In Fig. 7, we show the results in terms of throughput and latency for the three systems when using YSB and IPQ1. For the throughput experiments (Fig. 7(a)), we set the latency SLO to 350 ms and maximize the throughput, while for the latency experiments we fix the ingestion rate at 20 million tuples/s and minimize the latency. Following common practice [39], we define latency as the time it takes all events in a window to be processed after the window has ended. For instance, if a window ends at time a and the last event from the window is processed at time b, the processing latency for this window is calculated as b − a. The results obtained for Flink and Drizzle are consistent with those previously reported in the literature [39, 25] and confirm that Flare's performance is comparable with these systems.
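The zero-copy extraction described under "Custom serialization" can be illustrated with a short sketch. Flare itself runs on the .NET CLR; Python is used here only for illustration. The field widths follow Fig. 6, but the byte order and the exact on-wire field sequence are assumptions.

```python
import struct

# Metadata layout following the field widths in Fig. 6; byte order and
# field sequence are illustrative assumptions, not Flare's wire format.
HEADER = struct.Struct("<iiqiqq")  # timestamp, type, sender, version,
                                   # previous sequence ID, sequence ID
TYPE_DATA, TYPE_CONTROL = 0, 1

def peek_header(buf):
    """Decode only the fixed-size metadata of a message; the configuration
    and data payloads are returned as an unparsed, uncopied slice."""
    view = memoryview(buf)
    ts, mtype, sender, version, prev_seq, seq = HEADER.unpack_from(view, 0)
    meta = {"timestamp": ts, "type": mtype, "sender": sender,
            "version": version, "prev_seq": prev_seq, "seq": seq}
    return meta, view[HEADER.size:]  # zero-copy view of the payload

# Example: inspect the type field without deserializing the payload.
msg = HEADER.pack(1234, TYPE_CONTROL, 42, 1, 6, 7) + b"<configs...>"
meta, payload = peek_header(msg)
```

Only the 36 header bytes are decoded; an operator that merely forwards a control message never needs to touch the payload bytes.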

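The message dispatcher/multiplexer layer of §6 admits an equally small sketch (Python for illustration; the class and hook names are invented, not Flare's API): each operator routes an incoming message to its data or control processor based on the type field and multiplexes the outputs back down to the communication layer.

```python
TYPE_DATA, TYPE_CONTROL = 0, 1

class Dispatcher:
    """Illustrative dispatcher/multiplexer for a single-threaded operator:
    route by message type, forward processor outputs downstream."""

    def __init__(self, data_processor, control_processor, send_downstream):
        # Each processor takes a message and returns an iterable of outputs.
        self._handlers = {TYPE_DATA: data_processor,
                          TYPE_CONTROL: control_processor}
        self._send = send_downstream  # communication-layer hook

    def on_message(self, msg):
        handler = self._handlers[msg["type"]]
        for out in handler(msg):      # multiplex outputs downstream
            self._send(out)
```

In Flare's terms, the data processor would wrap the Trill pipeline, while the control processor would load the message's configuration and advance the state machine of the corresponding control operation.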
7.2    Micro-benchmarks

  In this subsection, we study the interplay between the control and data planes. Specifically, we are interested in the operational overhead and scalability of Chi. To this end, we vary the data-plane load under three different CPU regimes (resp. 50%, 75%, and 90%) and set the ingestion rates for the two workloads, YSB and IPQ1, accordingly. For the computation-heavy IPQ1 this results in 20, 40, and 60 million events/s (corresponding to 56%, 73%, and 94% average CPU in our 32-server cluster), while for the communication-heavy YSB this leads to 140, 180, and 220 million events/s (resp. 51%, 74%, and 89% average CPU). For the control load, we consider two instances of the control message, using blocking (B-Control) and non-blocking (NB-Control) messages. To accurately isolate the impact of Chi and avoid biasing the results with custom control logic (e.g., CPU-intensive user code), we use only NoOp control messages in our experiments.
Does the control plane affect the data plane? We now assess the overhead introduced by Chi. In Fig. 8(a) and 8(b), we show the relative increase in latency for the two workloads when varying the control-plane load from one control message/s (representative of dataflow management tasks, e.g., checkpointing, and reconfiguration ones, e.g., scale in/out) to 100 control messages/s (representative of monitoring tasks). This range covers all control scenarios that we have observed in production, and we believe that the same should hold for the vast majority of use cases.
  These results show that both non-blocking and blocking control have low overhead. This is because neither of these controls requires global synchronization: control events are propagated and processed asynchronously, just like data events. For example, at the highest load of the computation-intensive IPQ1 workload (60 million events/s), Chi incurs less than a 20% latency penalty on the data plane, even for a high control load (100 messages/s). This is important because the average CPU utilization was already at 94% (the latency could be further reduced by scaling out). As expected, blocking control messages are more expensive than non-blocking ones. This is because blocking control messages require local synchronization, which blocks the input channel temporarily.
Does a busy data plane limit the control plane? Next, we study the impact of the data-plane load on the completion time of control messages. The concern is that, by merging control and data events, a high data-plane load might negatively affect the performance of control messages. To verify this, we repeat the experiment and measure the completion time of the blocking and non-blocking messages (see Fig. 8(c) and 8(d)). As a reference point, we also show the latency of the data plane (the "data latency" bar in the graph). The message latency is defined as the difference between the timestamp of the message M entering the system and the timestamp of when the last message triggered by M leaves the system. In addition to queuing delays, the completion time of a data (resp. control) message also depends on the complexity of the actual data processing (resp. control logic). Hence, a control message can complete faster than data messages if there is no backlog and messages are processed immediately upon receipt.
  In most cases, control messages complete relatively quickly in comparison to data message delivery. Also, when the control load is low (one or ten control messages/s), the completion time is relatively unaffected by the increasing data-plane load. We observe, however, that at the highest load (resp. 220 million events/s for YSB and 60 million events/s for IPQ1), the completion time of a control message is similar to or higher than the data latency. This is particularly evident for the blocking messages. The reason is that the latter introduce a local synchronization at each operator and, at high load, there is a higher variability in latencies along different paths in the dataflow (e.g., due to work imbalance). As already discussed in §4.2.2, however, blocking and non-blocking control messages are semantically equivalent, although they differ in both their implementation and execution model. Thus, to reduce completion time, developers can always convert blocking control messages to a non-blocking version.
Is the control plane scalable? As discussed in §6, Chi can scale to large dataflows with a large number of sources (sinks) by using a broadcast (aggregation) tree to exchange control events between the controller and the operators. To evaluate its effect, in Fig. 9 we show the completion time of a control message as we increase the number of sources. The results show that the completion time increases logarithmically and remains well below 100 ms even for very large dataflow graphs (8,192 sources).

7.3    Adaptivity and Fault-Tolerance

  The previous sections have shown that Chi incurs low overhead and completion time, and can scale to large dataflow graphs. Hereafter, we study how Chi leverages these properties to improve the data plane's adaptivity to workload changes (dynamic elasticity) and to failures (fault recovery), using the YSB workload.
Dynamic Elasticity. We set the ingestion rate of our 32-server cluster to 30M tuples/sec. At time t = 40 s, we double the ingestion rate to emulate a workload spike. At the same time, we start the scale-out process on all three streaming engines, using their respective strategies. For instance, in Apache Flink, when the scale-out process begins, we immediately invoke a Savepoint [23] operation, which externally stores a self-contained checkpoint that is useful for stop-and-resume of Flink programs, and restarts the dataflow with a larger topology. We plot the results in Fig. 10(a):
  Chi. At scale-out time (t = 40 s), there is no noticeable drop in throughput, while latency temporarily spikes to 5.8 s. However, the system quickly recovers within 6 s to a stable state.
  Apache Flink. At t = 40 s, throughput drops to zero for five seconds (due to the freeze-the-world initiated by Savepoints). Processing latency, instead, spikes up to 35.6 s, and it takes 31 s after restart to return to a steady state.
  Drizzle. At t = 40 s, there is no visible drop in throughput because, since Drizzle uses a BSP model, it was not feasible to measure throughput continuously. Similar to Flink, the processing latency spikes to 6.1 s, and it takes an additional 10 s before stabilizing. Notably, Drizzle starts reacting to the workload spike only after 5 s. This is because the workload spike happened in the middle of a scheduling group [39], where the system cannot adjust the dataflow topology: the larger the scheduling group size, the slower it reacts to workload changes.
Fault Tolerance. Next, we examine the default failure recovery operation (§4.3) implemented on top of Chi. As in the previous experiments, we set the ingestion rate to 30M tuples/sec. We also enable checkpointing every 10 s for all three streaming systems. At time t = 40 s, 5 s after the last checkpoint, we kill a virtual machine to simulate a failure in the dataflow, and we start the recovery process on all three streaming engines (see Fig. 10(b)):
  Chi. At failure time (t = 40 s), there is no noticeable drop in throughput, while latency temporarily spikes to 4.3 s. However, the system quickly recovers to a stable state within 5 s.
  Apache Flink. At t = 40 s, throughput drops to zero for five seconds (system downtime), as Flink's recovery mechanism redeploys the entire distributed dataflow with the last completed checkpoint [22]. The processing latency spikes up to 17.7 s. It then takes 13 s after restart to return to a steady state.
  Drizzle. At t = 40 s, we observe no drop in throughput, for the reason described in the scale-out experiment. The processing
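The logarithmic scaling observed in Fig. 9 is what one would expect from a balanced broadcast/aggregation tree: the number of hops grows with the tree depth, not with the number of sources. A back-of-the-envelope sketch (the fanout and per-hop cost below are made-up parameters for illustration, not measured values):

```python
def tree_depth(leaves: int, fanout: int) -> int:
    """Depth of a balanced broadcast/aggregation tree over `leaves` leaves."""
    depth, reach = 1, fanout
    while reach < leaves:   # each extra level multiplies coverage by fanout
        depth += 1
        reach *= fanout
    return depth

def control_completion_ms(n_sources: int, fanout: int = 8,
                          hop_ms: float = 5.0) -> float:
    # One traversal down the broadcast tree to deliver the control message
    # plus one up the aggregation tree to collect completions: the cost
    # grows with tree depth, i.e., logarithmically in the number of sources.
    return 2 * tree_depth(n_sources, fanout) * hop_ms
```

With these illustrative parameters, going from 8 to 8,192 sources raises the depth only from 1 to 5, consistent with the sub-100 ms completion times reported in Fig. 9.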
