Exploiting Efficient and Scalable Shuffle Transfers in Future Data Center Networks

Deke Guo, Member, IEEE, Junjie Xie, Xiaomin Zhu, Member, IEEE, Wei Wei, Member, IEEE

Abstract—Distributed computing systems like MapReduce in data centers transfer massive amounts of data across successive processing stages.

Such shuffle transfers contribute most of the network traffic and make network bandwidth a bottleneck. In many commonly used workloads, data flows in such a transfer are highly correlated and aggregated at the receiver side. To reduce the network traffic and efficiently use the available network bandwidth, we propose to push the aggregation computation into the network and parallelize the shuffle and reduce phases. In this paper, we first examine the gain and feasibility of in-network aggregation with BCube, a novel server-centric networking structure for future data centers.

To exploit such a gain, we model the in-network aggregation problem, which is NP-hard in BCube. We propose two approximate methods for building the efficient IRS-based incast aggregation tree and the SRS-based shuffle aggregation subgraph, based solely on the labels of their members and the data center topology. We further design scalable forwarding schemes based on Bloom filters to implement in-network aggregation over massive concurrent shuffle transfers. Based on a prototype and large-scale simulations, we demonstrate that our approaches can significantly decrease the amount of network traffic and save data center resources.

Our approaches for BCube can be adapted to other server-centric network structures for future data centers with minimal modifications.

Index Terms—Data center, shuffle transfer, data aggregation

1 INTRODUCTION

The data center is the key infrastructure for not only online cloud services but also systems of massively distributed computing, such as MapReduce [1], Dryad [2], CIEL [3], and Pregel [4]. To date, such large-scale systems manage a large number of data processing jobs, each of which may utilize hundreds or even thousands of servers in a data center. These systems follow a data flow computation paradigm, where massive data are transferred across successive processing stages in each job. Inside a data center, a large number of servers and network devices are interconnected using a specific data center network structure.

As the network bandwidth of a data center has become a bottleneck for these systems, many advanced network structures have recently been proposed to improve the network capacity of future data centers, such as DCell [5], BCube [6], Fat-Tree [7], VL2 [8], and BCN [9]. We believe that it is more important to efficiently use the available network bandwidth of data centers than to simply increase the network capacity. Although prior work [10] focuses on scheduling network resources so as to improve data center utilization, there has been relatively little work on directly reducing the network traffic.

• D. Guo, J. Xie, and X. Zhu are with the Key Laboratory for Information System Engineering, College of Information System and Management, National University of Defense Technology, Changsha 410073, P.R. China. E-mail: guodeke@gmail.com.
• W. Wei is with the College of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, P.R. China.

In this paper, we focus on managing the network activity at the level of transfers so as to significantly reduce the network traffic and efficiently use the available network bandwidth in future data centers. A transfer refers to the set of all data flows across successive stages of a data processing job. The many-to-many shuffle and many-to-one incast transfers are commonly needed to support these distributed computing systems. Such data transfers contribute most of the network traffic (about 80%, as shown in [10]) and impose severe impacts on application performance. In a shuffle transfer, however, the data flows from all senders to each receiver are typically highly correlated.

Many state-of-the-practice systems thus already apply aggregation functions at the receiver side of a shuffle transfer to reduce the output data size. For example, each reducer of a MapReduce job is assigned a unique partition of the key range and performs aggregation operations on the content of its partition retrieved from every mapper's output. Such aggregation operations can be the sum, maximum, minimum, count, top-k, KNN, etc. As prior work [11] has shown, the reduction in size between the input data and the output data of the receiver after aggregation is 81.7% for MapReduce jobs at Facebook.

Such insights motivate us to push the aggregation computation into the network, i.e., aggregating correlated flows of each shuffle transfer during the transmission process as early as possible rather than just at the receiver side. Such an in-network aggregation can significantly reduce the resulting network traffic of a shuffle transfer and speed up the job processing, since the final input data to each reducer will be considerably decreased. It, however, requires the involved flows to intersect at some rendezvous nodes that can cache and aggregate packets for supporting the in-network aggregation.

In existing tree-based switch-centric structures [7], [8] of data centers, it is difficult for a traditional switch to do so due to the small buffer space shared by too many flows and its limited packet processing capability. Server-centric network structures, in contrast, possess high link density, i.e., they provide multiple disjoint routing paths for any pair of servers.

Consequently, they bring vast challenges and opportunities in cooperatively scheduling all flows in each shuffle transfer so as to form as many rendezvous servers as possible for performing the in-network aggregation. Currently, a shuffle transfer is just treated as a set of independent unicast transmissions that do not account for the collective behaviors of flows. Such a method brings little opportunity for an efficient in-network aggregation. In this paper, we examine the gain and feasibility of the in-network aggregation on a shuffle transfer in a server-centric data center. To exploit the gain of in-network aggregation, the efficient incast and shuffle transfers are formalized as the minimal incast aggregation tree and minimal shuffle aggregation subgraph problems, which are NP-hard in BCube.

We then propose two approximate methods, called the IRS-based and SRS-based methods, which build the efficient incast aggregation tree and shuffle aggregation subgraph, respectively. We further design scalable forwarding schemes based on Bloom filters to implement in-network aggregation on massive concurrent shuffle transfers. We evaluate our approaches with a prototype implementation and large-scale simulations. The results indicate that our approaches can significantly reduce the resultant network traffic and save data center resources. More precisely, our SRS-based approach reduces the network traffic by 32.87% on average for a small-scale shuffle transfer with 120 members in data centers BCube(6, k) for 2≤k≤8.

It reduces the network traffic by 55.33% on average for shuffle transfers with 100 to 3000 members in a large-scale data center BCube(8, 5) with 262144 servers. For the in-packet Bloom filter based forwarding scheme in large data centers, we find that a packet will not incur false positive forwarding at the cost of less than 10 bytes of Bloom filter in each packet. The rest of this paper is organized as follows. We briefly present the preliminaries in Section 2. We formulate the shuffle aggregation problem and propose the shuffle aggregation subgraph building method in Section 3. Section 4 presents scalable forwarding schemes to efficiently perform the in-network aggregation.

We evaluate our methods and related work using a prototype and large-scale simulations in Section 5 and conclude this work in Section 6.

2 PRELIMINARIES

2.1 Data center networking

Many efforts have been made towards designing network structures for future data centers, and they can be roughly divided into two categories. One is switch-centric, which organizes switches into structures other than a tree and puts the interconnection intelligence on switches. Fat-Tree [7], VL2 [8] and PortLand [12] fall into such a category, where servers only connect to the edge-level switches. Different from such structures, FBFLY [13] and HyperX [14] organize homogeneous switches, each connecting several servers, into the generalized hypercube [15].

For such switch-centric structures, there has been an interest in adding in-network processing functionality to network devices. For example, new Cisco ASICs and Arista application switches have provided a programmable data plane. Additionally, existing switches have been extended with dedicated devices like SideCar [16], [17]. SideCar is connected to a switch via a direct link and receives all traffic. Such trends provide a software-defined approach to customize the networking functionality and to implement arbitrary in-network processing within such data centers. The second category is server-centric, which puts the interconnection intelligence on servers and uses switches only as cross-bars (or does not use any switch).

BCube [6], DCell [5], CamCube [18], BCN [9], SWDC [19], and Scafida [20] fall into such a category. In such settings, a commodity server with a ServerSwitch [21] card acts as not only an end host but also a mini-switch. Each server thus uses a gigabit programmable switching chip for customized packet forwarding and leverages its resources for enabling in-network caching and processing, as prior work [21], [22] has shown.

In this paper, we focus on BCube, an outstanding server-centric structure for data centers. BCube0 consists of n servers connecting to an n-port switch. BCubek (k ≥ 1) is constructed from n BCubek−1's and n^k n-port switches. Each server in BCubek has k+1 ports. Servers with multiple NIC ports are connected to multiple layers of mini-switches, but those switches are not directly connected. Fig. 1 plots a BCube(4, 1) structure that consists of n^(k+1)=16 servers and two levels of switches, n^k×(k+1)=8 switches in total.

Essentially, BCube(n, k) is an emulation of a (k+1)-dimensional n-ary generalized hypercube [15].

In BCube(n, k), two servers, labeled xk xk−1...x1 x0 and yk yk−1...y1 y0, are mutual one-hop neighbors in dimension j if their labels differ only in dimension j, where xi, yi∈{0, 1, ..., n−1} for 0≤i≤k. Such servers connect to a level-j switch with label yk...yj+1 yj−1...y1 y0 in BCube(n, k). Therefore, a server and its n−1 one-hop neighbors in each dimension are connected indirectly via a common switch. Additionally, two servers are j-hop neighbors if their labels differ in j dimensions for 1≤j≤k+1. As shown in Fig. 1, two servers v0 and v15, with labels 00 and 33, are 2-hop neighbors since their labels differ in 2 dimensions.
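To make the label rules concrete, the following sketch computes the dimension-j one-hop neighbors of a server and the label of the level-j switch that connects them. It is only an illustration of the labeling above; the helper names and the digit-tuple representation are our own, not part of BCube's specification.

def one_hop_neighbors(label, j, n):
    """One-hop neighbors of a BCube(n, k) server in dimension j.
    label is a digit tuple (x_k, ..., x_1, x_0); a dimension-j neighbor differs only in digit j."""
    pos = len(label) - 1 - j                 # digit j sits at this tuple index
    return [label[:pos] + (d,) + label[pos + 1:]
            for d in range(n) if d != label[pos]]

def connecting_switch(label, j):
    """Level and label of the switch shared by a server and its dimension-j neighbors:
    the server label with digit j removed, i.e. y_k...y_{j+1} y_{j-1}...y_0."""
    pos = len(label) - 1 - j
    return j, label[:pos] + label[pos + 1:]

# Server v5 in BCube(4,1) has label 11, written here as the tuple (x1, x0) = (1, 1).
print(one_hop_neighbors((1, 1), 1, 4))   # dimension-1 neighbors 01, 21, 31, i.e. v1, v9, v13
print(connecting_switch((1, 1), 1))      # the level-1 switch with label <1>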

Fig. 1. A BCube(4,1) structure.

2.2 In-network aggregation

In a sensor network, sensor readings related to the same event or area can be jointly fused together while being forwarded towards the sink. In-network aggregation is defined as the global process of gathering and routing information through a multihop network, processing data at intermediate nodes with the objective of reducing energy consumption, thereby increasing network lifetime. In-network aggregation is a very important aspect of wireless sensor networks and has been widely studied, for example in [23] and [24]. Costa et al. introduced the basic idea of such in-network aggregation to the field of data centers. They built Camdoop [25], a MapReduce-like system running on CamCube data centers. Camdoop exploits the property that CamCube servers forward traffic to perform in-network aggregation of data during the shuffle phase, which reduces the network traffic.

CamCube differentiates itself from other server-centric structures by utilizing a 3-dimensional torus topology, i.e., each server is directly connected to six neighboring servers.

In Camdoop, servers use the function getParent to compute the tree topology for an incast transfer. Given the locations of the receiver and a local server (a sender or an inner server), the function returns the one-hop neighbor that is nearest to the receiver in the 3D space among the six neighbors. In this way, each flow of an incast transfer identifies its routing path to the same receiver independently. The unicast-based tree approach considered in this paper shares a similar idea with the tree-building method in Camdoop. In this paper, we cooperatively manage the network activities of all flows in a transfer for exploiting the benefits of in-network aggregation.

That is, the routing paths of all flows in any transfer should be identified at the level of the transfer, not of individual flows. We then build efficient tree and subgraph topologies for incast and shuffle transfers via the IRS-based and SRS-based methods, respectively. Such two methods achieve more gains of in-network aggregation than the unicast-based method, as shown in Fig. 2 and Section 5.

2.3 Steiner tree problem in general graphs

The Steiner tree problem has been proved to be NP-hard, even in the special cases of the hypercube, the generalized hypercube, and BCube. Numerous heuristics for the Steiner tree problem in general graphs have been proposed [26], [27].

A simple and natural idea is using a minimum spanning tree (MST) to approximate the Steiner minimum tree. Such methods achieve a 2-approximation. The Steiner tree problem in general graphs cannot be approximated within a factor of 1+ε for sufficiently small ε [28]. Several methods, based on greedy strategies originally due to Zelikovsky [29], achieve better approximation ratios than MST-based methods, such as 1.746 [30], 1.693 [31], and 1.55 [27]. Such methods share a similar idea. Indeed, they start from an MST and improve it step by step by using a greedy principle to choose a Steiner vertex to connect a triple of vertices.

3 AGGREGATING SHUFFLE TRANSFERS

We first formulate an efficient shuffle transfer as building the minimal shuffle aggregation subgraph, an NP-hard problem that applies to any family of data center topologies. We then propose efficient aggregation methods for incast and shuffle transfers, respectively. Finally, we present the extension of our methodologies to other network structures.

3.1 Problem statement

We model a data center as a graph G=(V, E) with a vertex set V and an edge set E. A vertex of the graph refers to a switch or a data center server. An edge (u, v) denotes a link through which u connects with v, where u, v∈V.

Definition 1: A shuffle transfer has m senders and n receivers, where a data flow is established for any pair of sender i and receiver j for 1≤i≤m and 1≤j≤n. An incast transfer consists of the m senders and one of the n receivers. A shuffle transfer thus consists of n incast transfers that share the same set of senders but differ in the receiver. In many commonly used workloads, the data flows from all m senders to all n receivers in a shuffle transfer are highly correlated. More precisely, for each of its n incast transfers, the lists of key-value pairs in the m flows toward the same receiver share an identical key partition.

For this reason, the receiver typically applies an aggregation function, e.g., the sum, maximum, minimum, count, top-k, or KNN, to the received data flows in an incast transfer. As prior work [11] has shown, the reduction in size between the input data and the output data of the receiver after aggregation is 81.7% for Facebook jobs.

In this paper, we aim to push aggregation into the network and parallelize the shuffle and reduce phases so as to minimize the resultant traffic of a shuffle transfer. We start with a simplified shuffle transfer with n=1, an incast transfer, and then discuss a general shuffle transfer. Given an incast transfer with a receiver R and a set of senders {s1, s2, ..., sm}, the message routes from the senders to the same receiver essentially form an aggregation tree. In principle, there exist many such trees for an incast transfer in densely connected data center networks, e.g., BCube. Although any tree topology could be used, tree topologies differ in their aggregation gains.

A challenge is how to produce an aggregation tree that minimizes the amount of network traffic of a shuffle transfer after applying the in-network aggregation.

Given an aggregation tree in BCube, we define the total edge weight as its cost metric, i.e., the sum of the amount of outgoing traffic of all vertices in the tree.

As discussed in Section 2.1, BCube utilizes traditional switches, and only data center servers enable in-network caching and processing.

Thus, a vertex in the aggregation tree is an aggregating vertex only if it represents a server and at least two flows converge at it. An aggregating vertex then aggregates its incoming flows and forwards a single resultant flow instead of multiple individual flows along the tree. At a non-aggregating vertex, the size of its outgoing flow is the cumulative size of its incoming flows. The vertices representing switches are always non-aggregating ones. We assume that the traffic introduced by each of the m senders is unity (1 MB) so as to normalize the cost of an aggregation tree. In such a way, the weight of the outgoing link is one at an aggregating vertex and equals the number of incoming flows at a non-aggregating vertex.

Definition 2: For an incast transfer, the minimal aggregation tree problem is to find a connected subgraph in G=(V, E) that spans all incast members with minimal cost for completing the incast transfer.

Theorem 1: The minimal aggregation tree problem is NP-hard in BCube.

Proof: In BCube, a pair of one-hop neighboring servers is indirectly connected by a network device. Note that the network devices in BCube do not conduct in-network aggregation; they just forward the received flows. Accordingly, we can update the topology of BCube by removing all network devices and replacing the two-hop path between two neighboring servers with a direct link. Consequently, BCube is transformed into the generalized hypercube structure. Note that, without loss of generality, the size of the traffic resulting from each sender of an incast transfer is assumed to be unity (1 MB) so as to normalize the cost of an aggregation tree.

If a link in the related generalized hypercube carries one flow of that incast transfer, it causes 2 MB of network traffic in BCube. That is, the weight of each link in the generalized hypercube is 2 MB. After such a mapping, for any incast transfer in BCube, the problem is translated into finding a connected subgraph in the generalized hypercube that spans all incast members with minimal cost. Thus, the minimal aggregation tree problem is equivalent to the minimum Steiner tree problem, which spans all incast members with minimal cost. The Steiner tree problem is NP-hard in the hypercube and the generalized hypercube [32].

Thus, the minimal aggregation tree problem is NP-hard in BCube.

As mentioned in Section 2.3, there are many approximate algorithms for the Steiner tree problem. The time complexity of such algorithms for general graphs is O(m×N^2), where m and N are the numbers of incast members and of all servers in a data center, respectively. This time complexity is too high to meet the requirement of online tree building for incast transfers in production data centers, which host large numbers of servers. For example, Google had more than 450,000 servers in its 13 data centers in 2006, and Microsoft and Yahoo! have hundreds of thousands of servers. On the other hand, such algorithms cannot efficiently exploit the topological feature of BCube, a densely connected data center network.
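For a rough, hypothetical sense of scale (m=1000 and the constant factors are our own illustrative choices; the O(m×log^3 N) bound is established in Theorem 2 below):

import math

m, N = 1000, 8 ** 6                      # BCube(8, 5): N = 262144 servers
print(m * N ** 2)                        # general-graph heuristic, ~ m*N^2 ≈ 6.9e13 operations
print(m * math.log(N, 8) ** 3)           # IRS-based method, ~ m*(log_n N)^3 = m*(k+1)^3 ≈ 216000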

For such reasons, we develop an efficient incast tree building method by exploiting the topological feature of BCube in Section 3.2. The resulting time complexity is O(m×log^3 N), as shown in Theorem 2.

Definition 3: For a shuffle transfer, the minimal aggregation subgraph problem is to find a connected subgraph in G=(V, E) that spans all shuffle members with minimal cost for completing the shuffle transfer.

Note that the minimal aggregation tree problem of an incast transfer in BCube is NP-hard and a shuffle transfer can be normalized as the combination of a set of incast transfers. Therefore, the minimal aggregation subgraph problem of a shuffle transfer in BCube is NP-hard.

For this reason, we develop an efficient shuffle subgraph construction method by exploiting the topological feature of BCube in Section 3.3.

After deriving the shuffle subgraph for a given shuffle, each sender greedily delivers its data flow toward a receiver along the shuffle subgraph once its output for the receiver is ready. All packets are cached when a data flow meets an aggregating server. An aggregating server can perform the aggregation operation each time a new data flow arrives, which introduces an additional delay. Such a scheme amortizes the delay of waiting for and simultaneously aggregating all incoming flows. Caching the faster data flows indeed brings a little additional delay to such flows. However, the in-network aggregation can reduce the job completion time because it significantly reduces the network traffic and improves the utilization of the scarce bandwidth resource, as shown in Fig. 7.

3.2 Incast aggregation tree building method

As aforementioned, building a minimal aggregation tree for any incast transfer in BCube is NP-hard. In this paper, we aim to design an approximate method by exploiting the topological feature of data center networks, e.g., BCube(n, k).

BCube(n, k) is a densely connected network structure, since there are k+1 equal-cost disjoint unicast paths between any server pair whose labels differ in k+1 dimensions. For an incast transfer, each of its senders independently delivers its data to the receiver along a unicast path that is randomly selected from the k+1 ones. The combination of such unicast paths forms a unicast-based aggregation tree, as shown in Fig. 2(a). Such a method, however, has less chance of achieving the gain of the in-network aggregation.

For an incast transfer, we aim to build an efficient aggregation tree in a managed way instead of the random way, given the labels of all incast members and the data center topology.

Fig. 2. Different aggregation trees for an incast transfer with the sender set {v2, v5, v9, v10, v11, v14} and the receiver v0 in BCube(4,1). (a) A unicast-based tree of cost 22 with 18 links. (b) An aggregation tree of cost 16 with 12 links. (c) An aggregation tree of cost 14 with 11 links.

Consider an incast transfer that consists of a receiver and m senders in BCube(n, k). Let d denote the maximum Hamming distance between the receiver and each sender in the incast transfer, where d≤k+1. Without loss of generality, we consider the general case that d=k+1 in this paper. An aggregation tree of the incast transfer can be expanded as a directed multistage graph with d+1 stages. Stage 0 only has the receiver. Each sender appears at stage j if it is a j-hop neighbor of the receiver. Such senders and the receiver alone, however, do not necessarily form a connected subgraph. The problem is then translated into identifying a minimal set of servers for each stage and the switches between successive stages so as to constitute a minimal aggregation tree in BCube(n, k). The level and label of the switch between a pair of neighboring servers across two stages can be inferred from the labels of the two servers.

We thus only focus on identifying additional servers for each stage.

Identifying the minimal set of servers for each stage yields an efficient approximate method for the minimal aggregation tree problem. For any stage, fewer servers incur fewer flows towards the receiver, since all incoming flows at each server will be aggregated into a single flow. Given the server set at stage j for 1≤j≤d, we can infer the required servers at stage j−1 by leveraging the topological features of BCube(n, k). Note that each server at stage j has a one-hop neighbor at stage j−1 in each of the j dimensions in which the labels of the server and the receiver differ.

If each server at stage j randomly selects one of its j one-hop neighbors at stage j−1, the result is just a unicast-based aggregation tree.

The insight of our method is to derive a common neighbor at stage j−1 for as many servers at stage j as possible. In such a way, the number of servers at stage j−1 can be significantly reduced, and each of them can merge all its incoming flows into a single one for further forwarding. We identify the minimal set of servers for each stage, from stage d down to stage 1, as follows. We start with stage j=d, where all servers are just those senders that are d-hop neighbors of the receiver in the incast transfer. After defining a routing symbol ej∈{0, 1, ..., k}, we partition all such servers into groups such that all servers in each group are mutual one-hop neighbors in dimension ej.

That is, the labels of all servers in each group differ only in dimension ej. In each group, all servers establish paths to a common neighboring server at stage j−1 that aggregates the flows from all servers in the group, which we call an inter-stage aggregation. All servers in the group and their common neighboring server at stage j−1 have the same label except in the ej dimension. Actually, the ej dimension of the common neighboring server is just equal to the receiver's label in dimension ej. Thus, the number of partitioned groups at stage j is just equal to the number of appended servers at stage j−1.

The union of such appended servers and the possible senders at stage j−1 constitutes the set of all servers at stage j−1. For example, all servers at stage 2 in Fig. 2(b) are partitioned into three groups, {v5, v9}, {v10, v14}, and {v11}, using the routing symbol e2=1. The common neighboring servers of such groups at stage 1 have the labels v1, v2, and v3, respectively. The number of servers at stage j−1 can still be reduced further, due to the following observation. There may exist groups that contain only one element at a stage, such as the group {v11} at stage 2 in Fig. 2(b).

The flow from server v11 to the receiver thus has no opportunity for in-network aggregation.
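The inter-stage grouping just described admits a compact sketch. This is only an illustration with our own helper names; labels are digit tuples, and the digit in dimension ej determines both the group and the common parent at stage j−1.

from collections import defaultdict

def partition_stage(servers, receiver, ej):
    """Group the stage-j servers by every digit except dimension ej and derive the
    common parent of each group at stage j-1 (the inter-stage aggregation)."""
    pos = len(receiver) - 1 - ej
    groups = defaultdict(list)
    for s in servers:
        groups[s[:pos] + s[pos + 1:]].append(s)          # same label outside dimension ej
    # the common parent copies the receiver's digit in dimension ej
    return {key[:pos] + (receiver[pos],) + key[pos:]: members
            for key, members in groups.items()}

# Stage-2 servers of the incast transfer in Fig. 2(b): v5=11, v9=21, v10=22, v11=23, v14=32.
stage2 = [(1, 1), (2, 1), (2, 2), (2, 3), (3, 2)]
for parent, members in partition_stage(stage2, (0, 0), ej=1).items():
    print(parent, members)
# parent 01 gets {v5, v9}, parent 02 gets {v10, v14}, and parent 03 gets the singleton {v11}.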

To address such an issue, we propose an intra-stage aggregation scheme. After partitioning all servers at any stage j according to ej, we focus on groups that have a single server whose neighboring server at stage j−1 is not a sender of the incast transfer. That is, the neighboring server at stage j−1 of each of such groups fails to perform the inter-stage aggregation. At stage j, the only server in such a group has no one-hop neighbor in dimension ej, but may have one-hop neighbors in other dimensions, {0, 1, ..., k}−{ej}. In such a case, the only server in such a group no longer delivers its data to its neighboring server at stage j−1 but to a one-hop neighbor in another dimension at stage j.

The selected one-hop neighbor thus aggregates all incoming data flows and its own data flow (if any) at stage j, which we call an intra-stage aggregation.

Such an intra-stage aggregation can further reduce the cost of the aggregation tree, since such a one-hop intra-stage forwarding increases the tree cost by two but saves the tree cost by at least four.

The root cause is that, for a sole server in a group at stage j, the nearest aggregating server along the original path to the receiver appears at stage ≤j−2. Note that the cost of an inter-stage or an intra-stage transmission is two due to the relay of an intermediate switch. For example, the server v11 has no neighbor in dimension e2=1 at stage 2 but has two neighbors, the servers v9 and v10, in dimension 0 at stage 2, as shown in Fig. 2(b). If the server v11 uses the server v9 or v10 as a relay for its data flow towards the receiver, the prior aggregation tree in Fig. 2(b) can be optimized into the new one in Fig. 2(c).

Thus, the cost of the aggregation tree is decreased by two while the number of active links is decremented by 1.

So far, the set of servers at stage j−1 can be derived under a partition of all servers at stage j according to dimension ej. The number of outgoing data flows from stage j−1 to its next stage j−2 is just equal to the cardinality of the server set at stage j−1. Inspired by such a fact, we apply the above method to the other k settings of ej and accordingly obtain another k server sets that could appear at stage j−1. We then simply select the smallest server set among all k+1 ones. The corresponding setting of ej is marked as the best routing symbol for stage j among all k+1 candidates. Thus, the data flows from stage j can achieve the largest gain of the in-network aggregation at the next stage j−1.
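A small sketch of this selection step, reusing the label conventions of the previous sketch (again, the helper names and the explicit sender list are our own illustration):

def stage_set_size(servers, receiver, ej, lower_senders):
    """Size of the server set at stage j-1 when the stage-j servers are partitioned by
    dimension ej: the common parents of the groups plus the senders already at stage j-1."""
    pos = len(receiver) - 1 - ej
    parents = {s[:pos] + (receiver[pos],) + s[pos + 1:] for s in servers}
    return len(parents | set(lower_senders))

stage2 = [(1, 1), (2, 1), (2, 2), (2, 3), (3, 2)]     # v5, v9, v10, v11, v14
print({ej: stage_set_size(stage2, (0, 0), ej, [(0, 2)]) for ej in (0, 1)})
# {0: 4, 1: 3}: with the stage-1 sender v2 counted, e2=1 yields the smaller set, as chosen in Fig. 2(b).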

Similarly, given the server set at stage j−1, we can further derive the minimal set of servers at the next stage j−2 and the best choice of ej−1, and so on. Here, the possible settings of ej−1 come from {0, 1, ..., k}−{ej}. Finally, the server sets at all k+2 stages and the directed paths between successive stages constitute an aggregation tree. The routing symbol behind each stage is found simultaneously. Such a method is called the IRS-based aggregation tree building method.

Theorem 2: Given an incast transfer consisting of m senders in BCube(n, k), the complexity of the IRS-based aggregation tree building method is O(m×log^3 N), where N=n^(k+1) is the number of servers in BCube(n, k).

Proof: Before obtaining an aggregation tree, our method performs at most k stages, from stage k+1 down to stage 2, for a given incast transfer. The process of partitioning all servers at any stage j into groups according to ej can be simplified as follows. For each server at stage j, we extract its label except the ej dimension. The resultant label denotes a group that involves such a server. We then add the server into such a group. The computational overhead of such a partition is proportional to the number of servers at stage j. Thus, the resultant time complexity at each stage is O(m), since the number of servers at each stage cannot exceed m.

Additionally, the intra-stage aggregation after a partition of servers at each stage incurs additional O(k×m) computational overhead.

In the IRS-based building method, we conduct the same operations under all k+1 settings of ek+1. This produces k+1 server sets for stage k at the cost of O((k+1)^2×m). In summary, the total computation cost of generating the set of servers at stage k is O((k+1)^2×m). At stage k, our approach can identify the server set for stage k−1 from k candidates resulting from k settings of ek, since ek+1 has exclusively selected one element from the set {0, 1, 2, ..., k}. The resultant computation cost is thus O(k^2×m). In summary, the total computation cost of the IRS-based building method is O(k^3×m), since it involves at most k stages.

Thus, Theorem 2 is proved. Our method for the minimal aggregation tree in BCube and the method for the minimal Steiner tree in the hypercube proposed in [32] share a similar process. These two methods differ from the MST-based and improved approximate methods for general graphs in Section 2.3 and are in effect a (1+ε)-approximation, as shown in [32]. The evaluation results indicate that our method outperforms the approximate algorithm for general graphs, whose approximation ratio is less than 2 [26].

3.3 Shuffle aggregation subgraph building method

For a shuffle transfer in BCube(n, k), let S={s1, s2, ..., sm} and R={r1, r2, ..., rn} denote the sender set and the receiver set, respectively.

An intrinsic way of scheduling all flows in the shuffle transfer is to derive an aggregation tree from all senders to each receiver via the IRS-based building method. The combination of such aggregation trees produces an incast-based shuffle subgraph in G=(V, E) that spans all senders and receivers. The shuffle subgraph consists of m×n paths along which all flows in the shuffle transfer can be successfully delivered. If we produce a unicast-based aggregation tree from all senders to each receiver, the combination of such trees results in a unicast-based shuffle subgraph. We find that the shuffle subgraph can still be optimized, due to the following observations.

Any server in BCube(n, k) has (k+1)×(n−1) one-hop neighbors. For any receiver r∈R, there may exist some receivers in R that are one-hop neighbors of the receiver r. Such a fact motivates us to consider whether the aggregation tree rooted at the receiver r can be reused to carry flows for its one-hop neighbors in R. Fortunately, this issue can be addressed by a practical strategy. The flows from all senders to a one-hop neighbor r1∈R of r can be delivered to the receiver r along its aggregation tree and then be forwarded to the receiver r1 in one hop. In such a way, a shuffle transfer significantly reuses some aggregation trees and thus utilizes fewer data center resources, including physical links, servers, and switches, compared to the incast-based method.

We formalize such a basic idea as an NP-hard problem in Definition 4.

Definition 4: For a shuffle transfer in BCube(n, k), the minimal clustering problem of all receivers is to partition all receivers into a minimal number of groups under two constraints: the intersection of any two groups is empty, and in each group all other receivers are one-hop neighbors of one receiver in that group.

Such a problem can be relaxed to finding a minimal dominating set (MDS) in a graph G′=(V′, E′), where V′ denotes the set of all receivers of the shuffle transfer, and for any u, v∈V′, there is (u, v)∈E′ if u and v are one-hop neighbors in BCube(n, k). In such a way, each element of the MDS and its neighbors form a group. Any pair of such groups, however, may have common elements. This is the only difference between the minimal dominating set problem and the minimal clustering problem. Finding a minimum dominating set is NP-hard in general. Therefore, the minimal clustering problem of all receivers for a shuffle transfer is NP-hard. We thus propose an efficient algorithm to approximate the optimal solution, as shown in Algorithm 1.

Algorithm 1 Clustering all nodes in a graph G′=(V′, E′)
Require: A graph G′=(V′, E′) with |V′|=n vertices.
1: Let groups denote an empty set;
2: while G′ is not empty do
3:   Calculate the degree of each vertex in V′;
4:   Find the vertex with the largest degree in G′. Such a vertex and its neighbor vertices form a group that is added into the set groups;
5:   Remove all elements in the resultant group and their related edges from G′.

The basic idea behind our algorithm is to calculate the degree of each vertex in the graph G′ and find the vertex with the largest degree. Such a vertex (a head vertex) and its neighbors form the largest group. We then remove all vertices in the group and their related edges from G′. This affects the degree of each remaining vertex in G′. Therefore, we repeat the above process while the graph G′ is not empty. In such a way, Algorithm 1 partitions all receivers of any shuffle transfer into a set of disjoint groups. Let α denote the number of such groups.
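A direct rendering of Algorithm 1 in Python (a sketch; the adjacency representation and names are ours):

def cluster_receivers(adj):
    """Greedy clustering of Algorithm 1.
    adj: dict mapping each receiver to the set of its one-hop neighboring receivers (the graph G').
    Returns a list of disjoint groups; the first element of each group is its head."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy of G'
    groups = []
    while adj:
        head = max(adj, key=lambda v: len(adj[v]))    # vertex with the largest remaining degree
        group = [head] + sorted(adj[head])
        groups.append(group)
        for v in group:                               # remove the group and its incident edges
            adj.pop(v, None)
        for v in adj:
            adj[v] -= set(group)
    return groups

# Receivers {v0, v3, v8} in BCube(4,1): v3 and v8 are one-hop neighbors of v0 only.
print(cluster_receivers({'v0': {'v3', 'v8'}, 'v3': {'v0'}, 'v8': {'v0'}}))
# [['v0', 'v3', 'v8']]: a single group with head v0, matching the example below.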

We define the cost of a shuffle transfer from m senders to a group of receivers Ri={r1, r2, ..., rg} as Ci, where ∑_{i=1}^{α} |Ri| = n.

Without loss of generality, we assume that the group head is r1 in Ri. Let cj denote the cost of the aggregation tree, produced by our IRS-based building method, from all m senders to a receiver rj in the group Ri, where 1≤j≤|Ri|. The cost of such a shuffle transfer depends on the entry point selected for each receiver group Ri:

1) C_i^1 = |Ri|×c1 + 2(|Ri|−1) if the group head r1 is selected as the entry point. In such a case, the flows from all senders to all group members are first delivered to r1 along its aggregation tree and then forwarded to each of the other group members in one hop. Such a one-hop forwarding operation increases the cost by two due to the relay of a switch between any two servers.

2) C_i^j = |Ri|×cj + 4(|Ri|−β−1) + 2β if a receiver rj is selected as the entry point. In such a case, the flows from all senders to all group members are first delivered to rj along its aggregation tree. The flow towards each of the β one-hop neighbors of rj in Ri, including the head r1, reaches its destination in one hop from rj at a cost of 2. The receiver rj finally forwards the flows towards the other |Ri|−β−1 receivers, each of which is two hops away from rj due to the relay of the group head r1, at a cost of 4.

Fig. 3. Examples of aggregation trees for two incast transfers, with the same sender set {v2, v5, v9, v10, v11, v14} and the receiver v3 or v8. (a) A tree of cost 14 with 11 active links. (b) A tree of cost 12 with 9 active links.

Given C_i^j for 1≤j≤|Ri|, the cost of a shuffle transfer from m senders to a receiver group Ri={r1, r2, ..., rg} is given by

Ci = min{C_i^1, C_i^2, ..., C_i^{|Ri|}}.    (1)

As a result, the best entry point for the receiver group Ri is found simultaneously. The best entry point for the receiver group Ri is not necessarily the group head r1. Accordingly, a shuffle subgraph from the m senders to all receivers in Ri is established. Such a method is called the SRS-based shuffle subgraph building method. We use an example to reveal the benefit of our SRS-based building method. Consider a shuffle transfer from all senders {v2, v5, v9, v10, v11, v14} to a receiver group {v0, v3, v8} with the group head v0.

Fig. 2(c), Fig. 3(a), and Fig. 3(b) illustrate the IRS-based aggregation trees rooted at v0, v3, and v8, respectively. If one selects the group head v0 as the entry point for the receiver group, the resultant shuffle subgraph is of cost 46. When the entry point is v3 or v8, the resultant shuffle subgraph is of cost 48 or 42, respectively. Thus, the best entry point for the receiver group {v0, v3, v8} should be v8 rather than the group head v0. In such a way, the best entry point for each of the α receiver groups and the associated shuffle subgraph can be produced. The combination of such α shuffle subgraphs generates the final shuffle subgraph from the m senders to the n receivers.
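The entry-point selection of Eq. (1) for this example can be sketched as follows. It is a minimal illustration assuming the per-receiver tree costs cj are already known from the IRS-based method; the helper names are ours.

def one_hop(a, b):
    """Two BCube servers are one-hop neighbors iff their labels differ in exactly one dimension."""
    return sum(x != y for x, y in zip(a, b)) == 1

def best_entry_point(group, tree_cost):
    """Return (entry point, C_i) for a receiver group per Eq. (1); group[0] is the head r1."""
    head, size = group[0], len(group)
    best = None
    for rj in group:
        if rj == head:
            cost = size * tree_cost[rj] + 2 * (size - 1)
        else:
            beta = sum(one_hop(rj, r) for r in group if r != rj)   # one-hop neighbors of rj in Ri
            cost = size * tree_cost[rj] + 4 * (size - beta - 1) + 2 * beta
        if best is None or cost < best[1]:
            best = (rj, cost)
    return best

# Receiver group {v0, v3, v8} with head v0; tree costs taken from Fig. 2(c), 3(a), and 3(b).
group = [(0, 0), (0, 3), (2, 0)]                        # v0, v3, v8 as (x1, x0) label tuples
tree_cost = {(0, 0): 14, (0, 3): 14, (2, 0): 12}
print(best_entry_point(group, tree_cost))               # ((2, 0), 42): v8 is the best entry point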

Therefore, the cost of the resultant shuffle subgraph, denoted as C, is given by C = ∑_{i=1}^{α} Ci.

3.4 Robustness against failures

The failures of links, servers, and switches in a data center may have a negative impact on the resulting shuffle subgraph for a given shuffle transfer. Hereby, we describe how to make the resultant shuffle subgraph robust against failures. The fault-tolerant routing protocol of BCube already provides several disjoint paths between any server pair. A server at stage j may fail to send packets to its parent server at stage j−1 due to link failures or switch failures.

In such a case, it forwards packets toward its parent along another path that routes around the failures on the default one.

When a server cannot send packets to its failed parent server, it is useless to utilize another backup disjoint path to the failed server.

If the failed server is not an aggregating server on the path from the current server to the receiver, the current server can reroute packets along another disjoint path toward the next aggregating server (if any) or the receiver. If the failed server is an aggregating server, it may have received and stored intermediate data from some senders. Thus, each sender whose data flow passes through such a failed aggregating server needs to redeliver its data flow to the next aggregating server (if any) or the receiver along a disjoint backup path.

3.5 Impact of job characteristics

The MapReduce workload at Facebook in a 600-node cluster shows that 83% of jobs have fewer than 271 map tasks and 30 reduce tasks on average [33].

More precisely, the authors in [34], [35] demonstrate that jobs in two production clusters, each consisting of thousands of machines (Facebook's Hadoop cluster and Microsoft Bing's Dryad cluster), exhibit a heavy-tailed distribution of input sizes. The job sizes (input sizes and numbers of tasks) indeed follow a power-law distribution, i.e., the workloads consist of many small jobs and relatively few large jobs.

Our data aggregation method is well suited to such small jobs, since each aggregating server has sufficient memory space to cache the required packets with high probability. Additionally, for most MapReduce-like jobs, the functions running on each aggregating server are associative and commutative. The authors in [36] further proposed methods to translate noncommutative/nonassociative functions and user-defined functions into commutative/associative ones. Consequently, each aggregating server can perform partial aggregation when receiving partial packets of flows. Thus, the memory space allocated for aggregation functions is usually sufficient to cache the required packets.

3.6 Extension to other network structures

Although we use BCube as a vehicle to study the incast tree building methods for server-centric structures, the methodologies proposed in this paper can be applied to other server-centric structures. The only difference lies in the incast-tree building methods, which exploit different topological features of data center network structures, e.g., DCell [5], BCN [9], SWDC [19], and Scafida [20]. We leave the details of aggregating data transfers in such data centers as future work. The methodologies proposed in this paper can also be applied to FBFLY and HyperX, two switch-centric network structures.

The main reason is that the topology behind BCube is an emulation of the generalized hypercube, while the topologies of FBFLY and HyperX at the switch level are just the generalized hypercube. For FBFLY and HyperX, our building methods can be applied with minimal modifications if such structures utilize novel switches with a programmable data plane. We leave the minimal incast tree problem for other switch-centric structures as future work. Note that software-defined networking and programmable data planes only bring the possibility of implementing arbitrary in-network processing functionality.

Much effort is still needed to improve the capability of novel network devices so as to support the aggregation of data from large-scale MapReduce jobs.

4 SCALABLE FORWARDING SCHEMES FOR PERFORMING IN-NETWORK AGGREGATION

We start with a general forwarding scheme to perform the in-network aggregation based on a shuffle subgraph. Accordingly, we propose two more scalable and practical forwarding schemes based on in-switch Bloom filters and in-packet Bloom filters, respectively.

4.1 General forwarding scheme

As MapReduce-like systems grow in popularity, there is a strong need to share data centers among users that submit MapReduce jobs.

Let β be the number of such jobs running in a data center concurrently. For a job, there exists a JobTracker that is aware of the locations of m map tasks and n reduce tasks of a related shuffle transfer. The JobTracker, as a shuffle manager, then calculates an efficient shuffle subgraph using our SRS-based method. The shuffle subgraph contains n aggregation trees since such a shuffle transfer can be normalized as n incast transfers. The concatenation of the job id and the receiver id is defined as the identifier of an incast transfer and its aggregation tree. In such a way, we can differentiate all incast transfers in a job or across jobs in a data center.

Conceptually, given an aggregation tree, all involved servers and switches can conduct the in-network aggregation for the incast transfer via the following operations. Each sender (a leaf vertex) greedily delivers a data flow along the tree if its output for the receiver has been generated and it is not an aggregating server. If a data flow meets an aggregating server, all packets will be cached. Once the data from all its child servers and itself (if any) have been received, an aggregating server performs the in-network aggregation on multiple flows as follows. It groups all key-value pairs in such flows according to their keys and applies the aggregation function to each group.

Such a process finally replaces all participant flows with a new one that is continuously forwarded along the aggregation tree.
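The per-key aggregation step at an aggregating server can be sketched as follows (an illustration only; the flow representation and function names are ours, and sum stands in for any associative aggregation function named above):

from collections import defaultdict

def aggregate_flows(flows, func=sum):
    """Merge the cached flows of one incast transfer into a single outgoing flow.
    Each flow is a list of (key, value) pairs over the same key partition;
    func is the per-key aggregation function (sum, max, min, ...)."""
    groups = defaultdict(list)
    for flow in flows:
        for key, value in flow:
            groups[key].append(value)
    return [(key, func(values)) for key, values in sorted(groups.items())]

# Two incoming flows are replaced by one outgoing flow.
print(aggregate_flows([[('a', 3), ('b', 1)], [('a', 4), ('c', 2)]]))   # [('a', 7), ('b', 1), ('c', 2)]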

To put such a design into practice, we need to make some efforts as follows. For each incast transfer in a shuffle transfer, the shuffle manager makes all servers and switches in the corresponding aggregation tree aware that they join the incast transfer. Each device in the aggregation tree thus knows its parent device and adds the incast transfer identifier into the routing entry for the interface connecting to its parent device. Note that the routing entry for each interface is a list of incast transfer identifiers. When a switch or server receives a flow with an incast transfer identifier, all routing entries on its interfaces are checked to determine whether and where to forward it.

Although such a method ensures that all flows in a shuffle transfer can be successfully routed to their destinations, it is insufficient to achieve the inherent gain of in-network aggregation. Actually, each flow of an incast transfer is just forwarded even when it reaches an aggregating server in the related aggregation tree. The root cause is that a server cannot infer from its routing entries that it is an aggregating server, nor does it know its child servers in the aggregation tree. To tackle such an issue, we let each server in an aggregation tree maintain an id-value pair that records the incast transfer identifier and the number of its child servers.

Note that a switch is not an aggregating server and thus just forwards all received flows.

In such a way, when a server receives a flow with an incast transfer identifier, all id-value pairs are checked to determine whether to cache the flow for future aggregation. That is, a flow needs to be cached if the related value exceeds one. If so, and once the flows from all its child servers and itself (if any) have been received, the server aggregates all cached flows of the same incast transfer into a new flow. All routing entries on its interfaces are then checked to determine on which interface the new flow should be sent out. If the response is null, the server is the destination of the new flow.

If a flow reaches a non-aggregating server, the server just checks all routing entries to identify the interface on which the flow should be forwarded.
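The per-server decision logic of this general scheme can be summarized in a short sketch (the data layout and names are our own; aggregate_flows refers to the earlier sketch, and a real implementation would forward on the chosen interface instead of returning it):

def on_flow_arrival(server, incast_id, flow):
    """server['children'][id]: number of child servers in the aggregation tree (the id-value pair).
    server['routes'][iface]: set of incast identifiers routed out of interface iface.
    server['cache'][id]: flows waiting to be aggregated."""
    expected = server['children'].get(incast_id, 0)
    if expected > 1:                                       # aggregating server: cache first
        cached = server['cache'].setdefault(incast_id, [])
        cached.append(flow)
        if len(cached) < expected:
            return None                                    # wait for the remaining child flows
        flow = aggregate_flows(server['cache'].pop(incast_id))
    for iface, ids in server['routes'].items():
        if incast_id in ids:
            return iface, flow                             # forward the (possibly merged) flow
    return 'deliver', flow                                 # no matching entry: this server is the receiver

srv = {'children': {'incast-7': 2}, 'routes': {'eth1': {'incast-7'}}, 'cache': {}}
print(on_flow_arrival(srv, 'incast-7', [('a', 1)]))        # None: cached, waiting for the second child
print(on_flow_arrival(srv, 'incast-7', [('a', 2)]))        # ('eth1', [('a', 3)]): merged and forwarded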

4.2 In-switch Bloom filter based forwarding scheme

One trend of modern data center designs is to utilize a large number of low-end switches for interconnection, for reasons of economy and scalability. The fast memory in such switches is relatively small, and it is thus quite challenging to support massive incast transfers by keeping per-incast routing entries. To address such a challenge, we propose two types of Bloom filter based incast forwarding schemes, namely in-switch Bloom filters and in-packet Bloom filters. A Bloom filter consists of a vector of m bits, initially all set to 0.

It encodes each item of a set X by mapping it to h random bits in the bit vector uniformly via h hash functions. To judge whether an element x belongs to the set X with n0 items, one just needs to check whether all hashed bits of x in the Bloom filter are set to 1. If not, x is definitely not a member of X. Otherwise, we infer that x is a member of X. A Bloom filter may yield a false positive due to hash collisions, for which it suggests that an element x is in X even though it is not. The false positive probability is f_p ≈ (1 − e^(−h×n0/m))^h. From [37], f_p is minimized as 0.6185^(m/n0) when h = (m/n0) ln 2.
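A minimal Bloom filter sketch makes the formula concrete. The parameter names follow the text (m bits, h hash functions, n0 encoded items); the double-hashing construction and the example identifier are our own choices.

import hashlib, math

class BloomFilter:
    """Minimal Bloom filter: m bits and h hash positions per item via double hashing."""
    def __init__(self, m, h):
        self.m, self.h, self.bits = m, h, 0
    def _positions(self, item):
        d = hashlib.sha1(item.encode()).digest()
        a = int.from_bytes(d[:8], 'big')
        b = int.from_bytes(d[8:16], 'big') | 1
        return [(a + i * b) % self.m for i in range(self.h)]
    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p
    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

m, n0 = 80, 8                                    # 10 bits per encoded incast identifier
h = round(m / n0 * math.log(2))                  # h = (m/n0) ln 2 ≈ 7
print((1 - math.exp(-h * n0 / m)) ** h)          # f_p ≈ 0.008, close to 0.6185^(m/n0)
bf = BloomFilter(m, h)
bf.add('job42:receiver-v0')                      # an incast transfer identifier on this interface
print('job42:receiver-v0' in bf, 'job42:receiver-v3' in bf)   # True, and very likely False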

For the in-switch Bloom filter, each interface on a server or a switch maintains a Bloom filter that encodes the identifiers of all incast transfers on the interface. When the switch or server receives a flow with an incast transfer identifier, all the Bloom filters on its interfaces are checked to determine which interface the flow should be sent out. If the response is null at a server, it is the destination of the packet. Consider that a server is a potential aggregating server in each incast transfer. When a server receives a flow with an incast identifier, it first checks all id−value pairs to determine whether to cache the flow for the future aggregation.

Only when the related value is 1, or when a new flow has been generated by aggregating the cached flows, does the server check all Bloom filters on its interfaces to identify the correct forwarding direction.

The introduction of Bloom filters can considerably compress the forwarding table at each server and switch. Moreover, checking a Bloom filter on each interface only incurs a constant delay of h hash queries, irrespective of the number of items represented by the Bloom filter. In contrast, checking the routing entry for each interface usually incurs O(log γ) delay in the general forwarding scheme, where γ denotes the number of incast transfers on the interface. Thus, Bloom filters can significantly reduce the delay of making a forwarding decision on each server and switch. Such two benefits help realize scalable incast and shuffle forwarding in data center networks.

4.3 In-packet Bloom filter based forwarding scheme

Both the general and the in-switch Bloom filter based forwarding schemes suffer from non-trivial management overhead due to the dynamic behaviors of shuffle transfers. More precisely, a newly scheduled MapReduce job brings a shuffle transfer and an associated shuffle subgraph. All servers and switches in the shuffle subgraph thus have to update their routing entries or Bloom filters on the related interfaces. Similarly, all servers and switches in a shuffle subgraph have to update their routing tables once the related MapReduce job is completed. To avoid such constraints, we propose the in-packet Bloom filter based forwarding scheme.

Given a shuffle transfer, we first calculate a shuffle subgraph by invoking our SRS-based building method. For each flow in the shuffle transfer, the basic idea of our new method is to encode the flow path information into a Bloom filter field in the header of each packet of the flow. For example, a flow path from the sender v14 to the receiver v0 in Fig. 2(c) is a sequence of links, denoted as v14→w6, w6→v2, v2→w0, and w0→v0. Such a set of links is then encoded as a Bloom filter via the standard input operation [37], [38]. Such a method eliminates the necessity of routing entries or Bloom filters on the interfaces of each server and switch. We use the link-based Bloom filter to encode all directed physical links in the flow path. To differentiate packets of different incast transfers, all packets of a flow are associated with the identifier of the incast transfer the flow joins.
