Hermes: Dynamic Partitioning for Distributed Social Network Graph Databases


Daniel Nicoara (University of Waterloo, daniel.nicoara@gmail.com), Shahin Kamali (University of Waterloo, s3kamali@uwaterloo.ca), Khuzaima Daudjee (University of Waterloo, kdaudjee@uwaterloo.ca), Lei Chen (HKUST, leichen@cse.ust.hk)

© 2015, Copyright is with the authors. Published in Proc. 18th International Conference on Extending Database Technology (EDBT), March 23-27, 2015, Brussels, Belgium: ISBN 978-3-89318-067-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT

Social networks are large graphs that require multiple graph database servers to store and manage them. Each database server hosts a graph partition with the objectives of balancing server loads, reducing remote traversals (edge-cuts), and adapting the partitioning to changes in the structure of the graph in the face of changing workloads. To achieve these objectives, a dynamic repartitioning algorithm is required to modify an existing partitioning to maintain good quality partitions while not imposing a significant overhead on the system. In this paper, we introduce a lightweight repartitioner, which dynamically modifies a partitioning using a small amount of resources. In contrast to existing repartitioning algorithms, our lightweight repartitioner is efficient, making it suitable for use in a real system. We integrated our lightweight repartitioner into Hermes, which we designed as an extension of the open source Neo4j graph database system, to support workloads over partitioned graph data distributed over multiple servers. Using real-world social network data, we show that Hermes leverages the lightweight repartitioner to maintain high quality partitions and provides a 2 to 3 times performance improvement over the de-facto standard random hash-based partitioning.
1. INTRODUCTION

Large scale graphs, in particular social networks, permeate our lives. The scale of these networks, often in millions of vertices or more, means that it is often infeasible to store, query and manage them on a single graph database server. Thus, there is a need to partition, or shard, the graph across multiple database servers, allowing the load and concurrent processing to be distributed over these servers to provide good performance and increase availability. Social networks exhibit a high degree of correlation for accesses of certain groups of records, for example through frictionless sharing [15]. Also, these networks have a heavy-tailed distribution for popularity of vertices. To achieve a good partitioning which improves the overall performance, the following objectives need to be met:

• The partitioning should be balanced. Each vertex of the graph has a weight that indicates the popularity of the vertex (e.g., in terms of the frequency of queries to that vertex). In social networks, a small number of users (e.g., celebrities, politicians) are extremely popular while a large number of users are much less popular. This discrepancy reveals the importance of achieving a balanced partitioning in which all partitions have almost equal aggregate weight, defined as the total weight of vertices in the partition.

• The partitioning should minimize the number of edge-cuts. An edge-cut is an edge connecting vertices in two different partitions, and it involves queries that need to transition from a partition on one server to a partition on another server. This shifts local traversals to remote traversals, thereby incurring significant network latency. In social networks, it is critical to minimize edge-cuts since most operations are done on the node that represents a user and its immediate neighbors. Since these 1-hop traversal operations are so prevalent in these networks, minimizing edge-cuts is analogous to keeping communities intact. This leads to highly local queries similar to those in SPAR [27] and minimizes the network load, allowing for better scalability by reducing network IO.

• The partitioning should be incremental. Social networks are dynamic in the sense that users and their relations are always changing, e.g., a new user might be added, two users might get connected, or an ordinary user might become popular. Although the changes in the social graph can be much slower when compared to the read traffic [8], a good partitioning solution should dynamically adapt its partitioning to these changes. Considering the size of the graph, it is infeasible to create a partitioning from scratch; hence, a repartitioning solution, a repartitioner, is needed to improve on an existing partitioning. This usually involves migrating some vertices from one partition to another.

• The repartitioning algorithm should perform well in terms of time and memory requirements. To achieve this efficiency, it is desirable to perform repartitioning locally by accessing a small amount of information about the structure of the graph. From a practical point of view, this requirement is critical and prevents us from applying existing approaches, e.g., [18, 30, 31, 6], to the repartitioning problem.
The focus of this paper is on the design and provision of a practical partitioned social graph data management system that can support remote traversals while providing an effective method to dynamically repartition the graph using only local views. The distributed partitioning aims to co-locate vertices of the graph on-the-fly so as to satisfy the above requirements. The fundamental contribution of this paper is a dynamic partitioning algorithm, referred to as a lightweight repartitioner, that can identify which parts of graph data can benefit from co-location. The algorithm aims to incrementally improve an existing partitioning by decreasing edge-cuts while maintaining almost balanced partitions. The main advantage of the algorithm is that it relies on only a small amount of knowledge of the graph structure, referred to as auxiliary data. Since the auxiliary data is small and easy to update, our repartitioning algorithm is performant in terms of time and memory while maintaining high-quality partitionings in terms of edge-cut and load balance.

We built Hermes as an extension of the open source Neo4j graph database system (Neo4j is used by customers such as Adobe and HP [3]) by incorporating into it our algorithm to provide the functionality to move data on-the-fly to achieve data locality and reduce the cost of remote traversals for graph data. Our experimental evaluation of Hermes using real-world social network graphs shows that our techniques are effective in producing performance gains and work almost as well as the popular Metis partitioning algorithms [18, 30, 6] that perform static offline partitioning by relying on a global view of the graph.

The rest of the paper is structured as follows. Section 2 describes the problem addressed in the paper and reviews classical approaches and their shortcomings. Section 3 introduces and analyzes the lightweight repartitioner. Section 4 presents an overview of the Hermes system. Section 5 presents performance evaluation of the system. Section 6 covers related work, and Section 7 concludes the paper.

2. PROBLEM DEFINITION

In this section we formally define the partitioning problem and review some of the related results. In what follows, the term 'graph' refers to an undirected graph with weights on vertices.

2.1 Graph Partitioning

In the classical (α, γ)-graph partitioning problem [20], the goal is to partition a given graph into α vertex-disjoint subgraphs. The weight of a partition is the total weight of vertices in that partition. In a valid solution, the weight of each partition is at most a factor γ ≥ 1 away from the average weight of partitions. More precisely, for a partition P of a graph G, we need to have ω(P) ≤ γ × Σ_{v∈V(G)} ω(v)/α, where ω(P) and ω(v) denote the weight of partition P and vertex v, respectively. Parameter γ is called the imbalance load factor and defines how imbalanced the partitions are allowed to be. Practically, γ is in the range [1, 2]. Here, γ = 1 implies that partitions are required to be completely balanced (all have the same aggregate weights), while γ = 2 allows the weight of one partition to be up to twice the average weight of all partitions. The goal of the minimization problem is to achieve a valid solution in which the number of edge-cuts is minimized.
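To make the balance condition concrete, the following is a minimal Python sketch of the (α, γ) validity check (our illustration, not part of the paper's implementation; names are ours):

def is_valid_partitioning(partition_weights, gamma):
    # The (alpha, gamma) condition: each partition's aggregate weight
    # must be at most gamma times the average partition weight.
    avg = sum(partition_weights) / len(partition_weights)
    return all(w <= gamma * avg for w in partition_weights)

# E.g., weights [15, 11] with gamma = 1.1: the average is 13, and
# 15 > 1.1 * 13 = 14.3, so this partitioning is invalid and
# repartitioning would be triggered.
print(is_valid_partitioning([15, 11], 1.1))  # False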
The partitioning problem is NP-hard [13]. Moreover, there is no approximation algorithm with a constant approximation ratio unless P=NP [7]. Hence, it is not possible to introduce algorithms which provide worst-case guarantees on the quality of solutions, and it makes more sense to study the typical behavior of algorithms. Consequently, the problem is mostly approached through heuristics [20, 12] which aim to improve the average-case performance. Regardless, the time complexity of these heuristics is Ω(n³), which makes them unsuitable in practice.

To improve the time complexity, a class of multi-level algorithms was introduced. In each level of these algorithms, the input graph is coarsened to a representative graph of smaller size; when the representative graph is small enough, a partitioning algorithm like that of Kernighan-Lin [20] is applied to it, and the resulting partitions are mapped back (uncoarsened) to the original graph. Many algorithms fit in this general framework of multi-level algorithms; a widely used example is the family of Metis algorithms [19, 30, 6]. The multi-level algorithms are global in the sense that they need to know the whole structure of the graph in the coarsening phase, and the coarsened graph in each stage should be stored for the uncoarsening stage. This problem is partially solved by introducing distributed versions of these algorithms in which the partitioning algorithm is performed in parallel for each partition [4]. In these algorithms, in addition to the local information (structure of the partition), for each vertex, the list of the adjacent vertices in other partitions is required in the coarsening phase. The following theorem establishes that in the worst case, acquiring this amount of data is close to having global knowledge of the graph (the proof can be found in [25]).

Theorem 1. Consider the (α, γ)-graph partitioning problem where γ < 2. There are instances of the problem for which the number of edge-cuts in any valid solution is asymptotically equal to the number of edges in the input graph.

Hence, the average amount of data required in the coarsening phase of multi-level algorithms can be a constant fraction of all edges. The graphs used in the proof of the above theorem belong to the family of power-law graphs which are often used to model social networks. Consequently, even the distributed versions of multi-level algorithms in the worst case require almost global information on the structure of the graph (particularly when used for partitioning social networks). This reveals the importance of providing practical partitioning algorithms which need only a small amount of knowledge about the structure of the graph that can be easily maintained in memory. The lightweight repartitioner introduced in this paper has this property, i.e., it maintains only a small amount of data, referred to as auxiliary data, to perform repartitioning.

2.2 Repartitioning

A variety of partitioning methods can be used to create an initial, static, partitioning. This should be followed by a repartitioning strategy that maintains good partitioning and can adapt to changes in the graph. One solution is to periodically run an algorithm on the whole graph to get new partitions. However, running an algorithm to get new partitions from scratch is costly in terms of time and space. Hence, an incremental partitioning algorithm is needed to adapt the existing partitions to changes in the graph structure.

It is desirable to have a lightweight repartitioner that maintains only a small amount of auxiliary data to perform repartitioning.
Since such an algorithm refers only to this auxiliary data, which is significantly smaller than the actual data required for storing the graph, the repartitioning algorithm is not a system performance bottleneck. The auxiliary data maintained at each machine (partition) consists of the list of accumulated weights of vertices in each partition, as well as the number of neighbors of each hosted vertex in each partition. Note that maintaining the number of neighbors is far cheaper than maintaining the list of neighbors in other partitions. In what follows, the main ideas behind our lightweight repartitioner are introduced through an example.
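As an illustration, this auxiliary data could be represented as plain per-partition and per-vertex counters; the following Python sketch uses hypothetical names and an in-memory layout that the paper does not prescribe:

# Aggregate vertex weight of each of the alpha partitions.
partition_weights = [15.0, 11.0]

# For each locally hosted vertex: alpha integers giving the number of
# its neighbors in each partition (not the neighbor lists themselves).
neighbor_counts = {
    "v": [3, 1],  # illustrative: 3 neighbors in partition 1, 1 in partition 2
}

# Weight (e.g., read-request count) of each locally hosted vertex.
vertex_weights = {"v": 2.0}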
Example: Consider the partitioning problem on the graph shown in Figure 1. Assume there are α = 2 partitions in the system and the imbalance factor is γ = 1.1, i.e., in a valid solution, the aggregate weight of a partition is at most 1.1 times the average weight of partitions. Assume the numbers on vertices denote their weights. During normal operation in social networks, users will request different pieces of information. In this sense, the weight of a vertex is the number of read requests to that vertex. Figure 1a shows a partitioning of the graph into two partitions, where there is only one edge-cut and the partitions are well balanced, i.e., the weight of both partitions is equal to the average weight. Assuming user b is a popular weblogger who posts a post, the request traffic for vertex b will increase as its neighbors poll for updates, leading to an imbalance in load on the first partition (see Figure 1b). Here, the ratio between the aggregate weight of partition 1 (i.e., 15) and the average weight of partitions (i.e., 13) is more than γ. This means that the response time and request rates increase by more than the acceptable skew limit, and repartitioning needs to be triggered to rebalance the load across partitions (while keeping the number of edge-cuts as small as possible).

The auxiliary data of the lightweight repartitioner available to each partition includes the weight of each of the two partitions, as well as the number of neighbors of each vertex v hosted in the partition. Provided with this auxiliary data, a partition can determine whether load imbalances exist and the extent of the imbalance in the system (to compare it with γ). If there is a load imbalance, a repartitioner needs to indicate where to migrate data to restore load balance. Migration is an iterative process which will identify vertices that, when moved, will balance loads (aggregate weights) while keeping the number of edge-cuts as small as possible.
For example, when the repartitioner starts from the state in Figure 1b, on partition 1, vertices a through d are poor candidates for migration because their neighbors are in the same partition. Vertex e, however, has a split access pattern between partitions 1 and 2. Since vertex e has the fewest neighbors in partition 1, it will be migrated to partition 2. On partition 2, the same process is performed in parallel; however, vertex f will not be migrated since partition 1 has a higher aggregate weight. Once vertex e is migrated, the load (aggregate weights) becomes balanced, thus any remaining iterations will not result in any migrations (see Figure 1c).

The above example is a simple case to illustrate how the lightweight repartitioner works. Several issues are left out of the example, e.g., two highly connected clusters of vertices may repeatedly exchange their partitions to decrease edge-cut. This results in an oscillation which is discussed in detail in Section 3.

Figure 1: Graph evolution and effects of repartitioning in response to imbalances. (a) Balanced partitioned graph. (b) Skewed graph. (c) Repartitioned graph.

3. PARTITIONING ALGORITHM

Unlike Neo4j, which is centralized, Hermes can apply a hash-based or Metis algorithm to partition a graph and distribute the partitions to multiple servers. Thus, the system starts with an initial partitioning and incrementally applies the lightweight repartitioner to maintain partitioning with good performance in the dynamic environment. In this section, we introduce the lightweight repartitioner algorithm behind Hermes. Embedding the initial partitioning algorithm and the lightweight repartitioner into Neo4j required modification of Neo4j components.

To increase query locality and decrease query response times, the initial partitioning needs to be optimized in terms of having almost balanced distributions (valid solutions) with a small number of edge-cuts. We use Metis to obtain the initial data partitioning, which is a static, offline, process that is orthogonal to the dynamic, on-the-fly, partitioning that Hermes performs.

3.1 Lightweight Repartitioner

When new nodes join the network or the traffic patterns (weights) of nodes change, the lightweight repartitioner is triggered to rebalance vertex weights while decreasing edge-cut through an iterative process. The algorithm makes use of aggregate vertex weight information as its auxiliary data. Assuming there are α partitions, for each vertex v, the auxiliary data includes α integers indicating the number of neighbors of v in each of the α partitions. This auxiliary data is insignificant compared to the physical data associated with the vertex, which includes its adjacency list and other information referred to as properties of the vertex (e.g., pictures posted by a user in a social network). The repartitioning auxiliary data is collected and updated based on execution of user requests, e.g., when a new edge is added, the auxiliary data of the partition(s) hosting the endpoints of the edge gets updated (two integers are incremented). Hence, the cost involved in maintenance of auxiliary data is proportional to the rate of changes in the graph. As mentioned earlier, social networks change quite slowly (when compared to the read traffic); hence, the maintenance of auxiliary data is not a system bottleneck.
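The constant-time bookkeeping described above (two counter increments per new edge) might look like the following sketch, reusing the hypothetical neighbor-count table from Section 2.2 (in the real system each endpoint's partition updates its own entry):

def on_edge_added(u, v, partition_of, neighbor_counts):
    # A new edge (u, v) gives u one more neighbor in v's partition
    # and v one more neighbor in u's partition.
    neighbor_counts[u][partition_of[v]] += 1
    neighbor_counts[v][partition_of[u]] += 1

# Example with alpha = 2: u lives in partition 0, v in partition 1.
counts = {"u": [0, 0], "v": [0, 0]}
on_edge_added("u", "v", {"u": 0, "v": 1}, counts)
# counts is now {"u": [0, 1], "v": [1, 0]}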
Each partition collects and stores aggregate vertex information relevant to only the local vertices. Moreover, the auxiliary data includes the total weight of all partitions, i.e., in doing repartitioning, each server knows the total weight of all other partitions.
The repartitioning process has two phases. In each iteration of the first phase, each server runs the repartitioner algorithm using the auxiliary data to indicate some vertices in its partition that should be migrated to other partitions. Before the next iteration, these vertices are logically moved to their target partitions. Logical movement of a vertex means that only the auxiliary data associated with the vertex is sent to the other partition. This process continues up to a point (iteration) in which no further vertices are chosen for migration. At this point the second phase is performed, in which the physical data is moved based on the result of the first phase. The algorithm is split into two phases because border vertices are likely to change partitions more than once (this will be discussed later) and auxiliary data records are lightweight compared to the physical data records, allowing the algorithm to finish faster. In what follows, we describe how vertices are selected for migration in an iteration of the repartitioner.

Consider a partition Ps (source partition) that is running the repartitioner algorithm. Let v be a vertex in partition Ps. The gain of moving v from Ps to another partition Pt (target partition) is defined as the difference between the number of neighbors of v in Pt and Ps, respectively, i.e., gain(v) = dv(t) − dv(s), where dv(k) denotes the number of neighbors of v in partition k. Intuitively, the gain represents the decrease in the number of edge-cuts when migrating v from Ps to Pt (assuming that no other vertex migrates). Note that the gain can be negative, meaning that it is better, in terms of edge-cuts, to keep v in Ps rather than moving it to Pt.
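In code, the gain is a single subtraction over the auxiliary data; a sketch, with dv represented as the per-vertex neighbor-count array used in the earlier sketches (names are ours):

def gain(v, s, t, neighbor_counts):
    # gain(v) = d_v(t) - d_v(s): the net decrease in edge-cuts if v
    # alone migrates from partition s to partition t.
    return neighbor_counts[v][t] - neighbor_counts[v][s]

# A vertex with 1 neighbor in its own partition and 3 in the target
# has gain 3 - 1 = 2; with the counts reversed the gain is -2 and the
# move would worsen the edge-cut.
print(gain("v", 0, 1, {"v": [1, 3]}))  # 2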
In each iteration and on each partition, the repartitioner selects for migration candidate vertices that will give the maximum gain when moved from the partition. However, to avoid oscillation and ensure a valid packing in terms of load balance, the algorithm enforces a set of rules in migrating vertices. First, it defines two stages in each iteration. In the first stage, the migration of vertices is allowed only from partitions with lower ID to higher ID, while the second stage allows the migration only in the opposite direction, i.e., from partitions with higher ID to those with lower ID. Here, partition ID defines a fixed ordering of partitions (and can be replaced by any other fixed ordering). Migrating vertices in one direction in two stages prevents the algorithm from oscillating. Oscillation happens when there is a large number of edges between two groups of vertices hosted in two different partitions (see Figure 2). If the algorithm allows two-way migration of vertices, the vertices in each group migrate to the partition of the other group, while the edge-cut does not improve (Figure 2b). In one-way migration, however, the vertices in one group remain in their partition while the other group joins them in that partition (Figure 2d).

In addition to preventing oscillation, the repartitioner algorithm minimizes load imbalance as follows. A vertex v on a partition Ps is a candidate for migration to partition Pt if the following conditions hold:

• Ps and Pt fulfill the above one-way migration rule.

• Moving v from Ps to Pt does not cause Pt to be overloaded nor Ps to be underloaded. Recall from Section 2.1 that the imbalance ratio of a partition is the ratio between the weight of the partition (the total weight of vertices it is hosting) and the average weight of all the partitions. A partition is overloaded if its imbalance ratio is more than γ and underloaded if its weight is less than 2 − γ times the average partition weight. Here, γ is the maximum allowed imbalance factor (1 < γ < 2); the default value of γ in Hermes is set to 1.1, i.e., a partition's load is required to be in the range (0.9, 1.1) of the average partition weight. This is so that imbalances do not get too high before repartitioning triggers.

• Either Ps is overloaded OR there is a positive gain in moving v from Ps to Pt. When a partition is overloaded, it is good to consider all vertices as candidates for migration to any other partition as long as they do not cause an overload on the target partition. When the partition is not overloaded, it is good to move only vertices which have positive gain so as to improve the edge-cut.

When a vertex v is a candidate for migration to more than one partition, the partition with maximum gain is selected as the target partition of the vertex. This is illustrated in Algorithm 1. Note that detecting whether a vertex v is a candidate for migration and selecting its target partition is performed using only the auxiliary data. Precisely, for detecting underloaded and overloaded partitions (Lines 2, 5 and 11), the algorithm uses the weight of the vertex and the accumulated weights of all partitions; these are included in the auxiliary data. Similarly, for calculating the gain of moving v from partition Ps to partition Pt (Line 10), it uses the number of neighbors of v in any of the partitions, which is also included in the auxiliary data.

Figure 2: An unsupervised repartitioning might result in oscillation. Consider the partitioning depicted in (a), the initial graph before the first iteration. The repartitioner on partition 1 detects that migrating d, e, f to partition 2 improves edge-cut; similarly, the repartitioner on partition 2 tends to migrate g, h, i to partition 1. When the vertices move accordingly, as depicted in (b), the graph that results if vertices migrate in the same stage (i.e., in a two-way manner), the edge-cut does not improve and the repartitioner needs to move d, e, f and h, i again. To resolve this issue, in the first stage of repartitioning of (a), the vertices d, e, f are migrated from partition 1 (lower ID) to partition 2 (higher ID). After this, as depicted in (c), the graph after the first stage, the only vertex to migrate in the second stage is vertex g, which moves from partition 2 (higher ID) to partition 1, yielding the final graph (d).
Algorithm 1 Choosing target partition for migration
 1: procedure get_target_part(vertex v currently hosted in partition Ps, the current stage of the iteration)
 2:   if imbalance_factor(Ps − {v}) < 2 − γ then
 3:     return (null, 0)
 4:   target ← null; maxGain ← 0
 5:   if imbalance_factor(Ps) > γ then
 6:     maxGain ← −∞
 7:   for partition Pt ∈ partitionSet do
 8:     if (stage = 1 and Pt.ID > Ps.ID) or
 9:        (stage = 2 and Pt.ID < Ps.ID) then
10:       gain ← Gain(v, Ps, Pt)
11:       if imbalance_factor(Pt ∪ {v}) < γ and
12:          gain > maxGain then
13:         target ← Pt; maxGain ← gain
14:   return (target, maxGain)

Algorithm 2 Lightweight Repartitioner
 1: procedure repartitioning_iteration(partition Ps)
 2:   for stage ∈ {1, 2} do
 3:     candidates ← {}
 4:     for vertex v ∈ VertexSet(Ps) do
 5:       target(v) ← get_target_part(v, stage)
 6:         ▷ setting target(v) and gain(v)
 7:       if target(v) ≠ null then
 8:         candidates.add(v)
 9:     top-k ← k candidates with maximum gains
10:     for vertex v ∈ top-k do
11:       migrate(v, Ps, target(v))
12:     Ps.update_auxiliary_data
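For concreteness, the following compact Python sketch mirrors Algorithms 1 and 2 over the in-memory auxiliary structures used in the earlier sketches (our rendering, not the system's code; the real implementation operates on distributed state, and update_auxiliary_data is only hinted at in a comment):

import math

def imbalance_factor(weight, avg):
    # Ratio of a partition's aggregate weight to the average weight.
    return weight / avg

def get_target_part(v, s, stage, weights, vertex_weights, neighbor_counts, gamma):
    # Algorithm 1: choose the best target partition for vertex v hosted on s.
    avg = sum(weights) / len(weights)
    if imbalance_factor(weights[s] - vertex_weights[v], avg) < 2 - gamma:
        return None, 0                   # Line 2: source would become underloaded
    target, max_gain = None, 0
    if imbalance_factor(weights[s], avg) > gamma:
        max_gain = -math.inf             # Lines 5-6: overloaded source accepts negative gain
    for t in range(len(weights)):
        if (stage == 1 and t > s) or (stage == 2 and t < s):   # one-way rule
            g = neighbor_counts[v][t] - neighbor_counts[v][s]  # Line 10: gain
            if imbalance_factor(weights[t] + vertex_weights[v], avg) < gamma \
                    and g > max_gain:    # Line 11: target must not overload
                target, max_gain = t, g
    return target, max_gain

def repartitioning_iteration(s, vertices, weights, vertex_weights,
                             neighbor_counts, gamma, k):
    # Algorithm 2: one iteration of the lightweight repartitioner on partition s.
    for stage in (1, 2):
        candidates = []
        for v in list(vertices[s]):
            t, g = get_target_part(v, s, stage, weights, vertex_weights,
                                   neighbor_counts, gamma)
            if t is not None:
                candidates.append((g, v, t))
        candidates.sort(key=lambda c: c[0], reverse=True)
        for g, v, t in candidates[:k]:   # top-k logical migrations only
            vertices[s].remove(v)
            vertices[t].add(v)
            weights[s] -= vertex_weights[v]
            weights[t] += vertex_weights[v]
        # update_auxiliary_data: the neighbor counts of v's neighbors
        # would be refreshed here before the next stage/iteration.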
is also included in the auxiliary data.
   Recall that the repartitioning algorithm runs on each partition independently, a property that supports scalability. For each partition Ps, after selecting the candidate vertices for migration and their target partitions, the algorithm selects the k candidate vertices with the highest gains and proceeds by (logically) migrating these top-k vertices to their target partitions. Here, migrating a vertex means sending (and updating) the auxiliary data associated with the vertex to its target destination and updating the auxiliary data associated with partition weights accordingly. The algorithm restricts the number of migrated vertices in each iteration (to k) to avoid imbalanced partitionings. Note that when selecting the target partition for a migrating vertex, the algorithm does not know the target partitions of other vertices; hence, a large number of vertices could otherwise migrate to the same partition to improve edge-cut. Selecting only k vertices enables the algorithm to control the cumulative weight of partitions by restricting the number of migrating vertices. We discuss later how the value of k is selected; in general, taking k as a small, fixed fraction of n (the size of the graph) gives satisfactory results.
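A logical migration of this kind touches only auxiliary state: the partition weights and the per-partition neighbor counts kept for each of the moving vertex's neighbors. A minimal sketch, using the same illustrative structures as the sketch above:

    def migrate(v, ps, pt, weights, vertex_weight, adjacency, neighbor_count, host):
        # Logically move v from partition ps to pt: no stored graph data is
        # touched, only the auxiliary data described in the text.
        weights[ps] -= vertex_weight[v]
        weights[pt] += vertex_weight[v]
        host[v] = pt                      # v's new (logical) home partition
        for u in adjacency[v]:            # each neighbor sees v change partitions
            counts = neighbor_count[u]
            counts[ps] = counts.get(ps, 0) - 1
            counts[pt] = counts.get(pt, 0) + 1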
   Algorithm 2 shows the details of one iteration of the repartitioner algorithm performed on a partition Ps. The algorithm detects the candidate vertices (Lines 4-8), selects the top-k candidates (Line 9), and moves them to their respective target partitions. Note that the migration in Line 11 is logical. After each stage of each iteration, the auxiliary data associated with each migrated vertex v is updated. This is necessary because the neighbors of v may also have been migrated, which would mean that the degree of v in each partition, i.e., the auxiliary data associated with v, has changed. The algorithm continues moving vertices until there is no candidate vertex left for migration, i.e., until further movement of vertices does not improve edge-cut.
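Putting the pieces together, one iteration of Algorithm 2 can be sketched as follows, reusing get_target_part and migrate from the sketches above; heapq.nlargest stands in for Line 9's selection of the k highest-gain candidates.

    import heapq

    def repartitioning_iteration(ps, k, weights, vertex_weight, adjacency,
                                 neighbor_count, host):
        # One iteration of the lightweight repartitioner on partition ps.
        for stage in (1, 2):
            candidates = []
            for v in [u for u in host if host[u] == ps]:      # VertexSet(Ps)
                target, g = get_target_part(v, ps, stage, weights,
                                            neighbor_count, vertex_weight)
                if target is not None:                        # Lines 7-8
                    candidates.append((g, v, target))
            # Line 9: only the k best candidates migrate, bounding how much
            # weight can shift toward any one partition in a single stage.
            for g, v, target in heapq.nlargest(k, candidates,
                                               key=lambda c: c[0]):
                migrate(v, ps, target, weights, vertex_weight,
                        adjacency, neighbor_count, host)      # Line 11 (logical)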
   Example: To demonstrate the workings of the lightweight repartitioner, we show two iterations of the repartitioning algorithm on the graph of Figure 3, in which there are α = 3 partitions and the average weight of partitions is 10/3. Assume the value of γ is 1.3̄; hence, the aggregate weight of a partition needs to be in the range [2.2̄, 4.4̄], otherwise the partitioning is overloaded or underloaded. Figure 3a shows the initial state of the graph. The partitions are sub-optimal, as 6 of the 11 edges shown are edge-cuts. Consider the first stage of the first iteration of the lightweight repartitioner. Since the first stage restricts vertex migrations to moves from lower ID partitions to higher ID partitions only, vertices a and e are the migration candidates, since they are the only ones that can improve edge-cut. Note that if the algorithm were performed in a single stage, vertices h and d would be migrated to partition 1, causing the oscillating behavior discussed previously. At the end of the first stage of the first iteration, the state of the graph is as presented in Figure 3b. In the second stage, the algorithm migrates only vertex g. While vertex c could be migrated to improve edge-cut, the migration direction does not allow this (Figure 3c). In addition, such a migration would cause partition 1 to become underloaded (its load would be 2, which is less than 2.2̄). In the second iteration, vertex c is migrated to partition 2. The result of the first stage of iteration 2 is presented in Figure 3d. At this point, the graph has reached an optimal grouping, so the second stage of the second iteration will not perform any migrations. In fact, further iterations would not migrate anything, since the graph has an optimal partitioning.

Figure 3: Two iterations of the lightweight repartitioner. Two metrics are attached to every partition: ω, representing the weight of the partition, and ec, representing the edge-cut. (a) Initial graph, before the first iteration; (b) after the first stage of the first iteration; (c) after the second stage of the first iteration; (d) after the first stage of the second iteration.
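Although the exact edge list of Figure 3 is not reproduced here, a toy run of the sketches above, on a hypothetical 10-vertex graph with α = 3 and unit vertex weights, looks like this:

    # Hypothetical toy input (not the graph of Figure 3).
    adjacency = {
        "a": ["b", "e", "f"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "h"],
        "e": ["a", "f"], "f": ["a", "e", "g"], "g": ["f"],
        "h": ["d", "i"], "i": ["h", "j"], "j": ["i"],
    }
    host = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 2, "g": 2,
            "h": 3, "i": 3, "j": 3}
    vertex_weight = {v: 1.0 for v in adjacency}
    weights = {1: 4.0, 2: 3.0, 3: 3.0}
    neighbor_count = {v: {p: sum(1 for u in adjacency[v] if host[u] == p)
                          for p in weights} for v in adjacency}

    for ps in sorted(weights):        # each partition iterates independently
        repartitioning_iteration(ps, 1, weights, vertex_weight, adjacency,
                                 neighbor_count, host)
    print(host)                       # vertex a ends up in partition 2

In this run, vertex a (one neighbor in partition 1, two in partition 2) is the only positive-gain candidate that passes both balance checks, so the first stage moves it to partition 2 and the edge-cut drops from 3 to 2.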
3.2    Physical Data Migration
   Physical data migration is the final step of the repartitioner. Vertices and relationships that were marked for migration by the repartitioner are moved to the target partitions using a two-step process: (1) copy the marked vertices and relationships (copy step); (2) remove the marked vertices and relationships from the host partitions (remove step).
   In the first step, a list of all vertices selected for migration to a partition is received by that partition, which then requests these vertices and adds them to its own local database. At the end of the first step, all moved vertices are replicated. Because the operations are insertion-only, their complexity is lower, as all of them can be performed locally in each partition, meaning less network contention and locks held for shorter periods.
   Between the two steps there is a synchronization process between all partitions to ensure that every partition has completed the copy process before marked vertices are removed from their original partitions. The synchronization itself is not expensive, as no locks or system resources are held, though partitions may need to wait until an occasional straggler finishes copying. In the remove step, all marked vertices enter an unavailable state in which all queries referencing a vertex are executed as if the vertex is not part of the local vertex set. This allows the transactional operations to be performed much faster, as locks on unavailable vertices cannot be acquired by any standard queries.
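A minimal sketch of this copy-then-remove protocol, assuming one worker thread per partition and a barrier as the synchronization point (the data structures and names are illustrative, not Hermes' internals):

    import threading

    def run_physical_migration(num_partitions, local_db, incoming, outgoing):
        # local_db[p]: partition p's local store; incoming[p]: {vertex: data}
        # to copy into p; outgoing[p]: vertices to remove from p afterwards.
        barrier = threading.Barrier(num_partitions)

        def worker(pid):
            # Copy step: insert-only, so each partition works locally with
            # short-lived locks and no cross-partition contention.
            for v, data in incoming[pid].items():
                local_db[pid][v] = data
            # Synchronization: no partition removes originals until every
            # partition (including stragglers) has finished copying.
            barrier.wait()
            # Remove step: marked vertices become unavailable to queries,
            # then are dropped from the host partition.
            for v in outgoing[pid]:
                del local_db[pid][v]

        threads = [threading.Thread(target=worker, args=(p,))
                   for p in range(num_partitions)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()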
3.3    Lightweight Repartitioner Analysis

3.3.1    Memory and Time Analysis
   Recall that the main advantage of the lightweight repartitioner over multilevel algorithms is that it makes use of only auxiliary data to perform repartitioning. Auxiliary data has a small size compared to the size of the graph. This is formalized in the following two theorems, the proofs of which can be found in the extended version of the paper [25].

   Theorem 2. The amortized size of auxiliary data stored on each partition to perform repartitioning is n + Θ(α) on average. Here, n denotes the number of vertices in the input graph and α is the number of partitions.

When compared to multilevel algorithms, the memory requirement of the lightweight repartitioner is far smaller and can be maintained with hardly any impact on the performance of the system. This is experimentally verified in Section 5.3.

   Theorem 3. Each iteration of the repartitioning algorithm takes O(αns) time to complete. Here, α denotes the number of partitions and ns is the number of vertices in the partition which runs the repartitioning algorithm.

The above theorem implies that each iteration of the algorithm runs in linear time. Moreover, the algorithm converges to a stable partitioning after a small number of iterations relative to the number of vertices; e.g., in our experiments, it converges after less than 50 iterations, while there are millions of vertices in the graph data sets.
   The lightweight repartitioner is designed for scalability and imposes little overhead on the database engine. The simplicity of the algorithm supports parallelization of operations and maximizes scalability. In the first phase, each iteration is performed in parallel on each server. The auxiliary data is fully local to each server; thus, lines 4 through 9 of Algorithm 2 are executed independently on each server. In the second phase of the repartitioning algorithm, physical data migration is performed. As mentioned in Section 3.2, this part has been decomposed into two steps for simplicity and performance. Because information is only copied in the first step (in which vertices are replicated), it allows for maximum parallelization with little need to synchronize between servers.
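As an illustration of this first-phase parallelism (run_local_iteration stands in for one server's pass over Algorithm 2 and is not an actual Hermes call):

    from concurrent.futures import ThreadPoolExecutor

    def run_phase_one(partition_ids, run_local_iteration):
        # Lines 4-9 of Algorithm 2 read only server-local auxiliary data,
        # so every partition's iteration can be dispatched concurrently.
        with ThreadPoolExecutor(max_workers=len(partition_ids)) as pool:
            list(pool.map(run_local_iteration, partition_ids))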
3.3.2    Algorithm Convergence
   When the lightweight repartitioner is triggered, the algorithm starts by migrating vertices from overloaded partitions. Note that no vertex is a candidate for migration to an overloaded partition. Hence, after a bounded number of iterations, the partitioning becomes valid in terms of load balance. When there is no overloaded partition, the algorithm moves a vertex only if there is a positive gain in moving it from the source to the target partition. This is the main idea behind the following proof of the convergence of the algorithm.

   Theorem 4. After a bounded number of iterations, the lightweight repartitioner algorithm converges to a stable partitioning in which further migration of vertices (as done by the algorithm) does not result in better partitionings.

Figure 4: The number of edge-cuts might increase in the first stage (in the worst case), but it decreases after the second stage. In this example, the number of edge-cuts is initially 18 (a); it increases to 21 after the first stage (b), in which {a, b, c} move to partition 2 and {d, e, f} move to partition 3, and decreases to 15 at the end of the second stage (c), in which {d, e, f} return to partition 2.

   Proof. We show that the algorithm continually decreases the number of edge-cuts. For each vertex v, let d_ex(v) denote the number of external neighbors of v, i.e., the number of neighbors of v in partitions other than that of v. With this definition, the number of edge-cuts in a partitioning is χ/2, where χ = Σ_{v=1}^{n} d_ex(v). Recall that the algorithm works in stages, so that if in one stage migration of vertices is allowed from one partition to another, in the subsequent stage migration is allowed in the opposite direction. We show that the value of χ decreases in every two subsequent stages; more precisely, we show that when a vertex v migrates in a stage t, the value of d_ex(v) either decreases at the end of stage t or at the end of the subsequent stage t+1 (compared to when v does not migrate). Let d_k^t(v) denote the number of neighbors of vertex v in partition k before stage t. Assume that vertex v is migrated from partition i to partition j at stage t (see Figure 4). This implies that the number of neighbors of v in partition j is larger than in partition i. Hence, when v moves to partition j, the value of d_ex(v) is expected to decrease. However, in a worst-case scenario, some neighbors of v in partition j also move to other partitions at the same stage (Figure 4b). Let x(v) denote the number of neighbors of v in the target partition j which migrate at stage t; then, at the end of the stage, the value of d_ex(v) decreases by at least d_j^t(v) − x(v) units. Moreover, d_ex(v) increases by at most d_i^t(v); this is because the previously internal neighbors (those which remain in partition i) become external after the migration of v. If d_j^t(v) − x(v) > d_i^t(v), the value of d_ex(v) decreases at the end of the stage and we are done. Otherwise, we say a bad migration occurred. In these cases,
assuming
   Partition 1 k is sufficiently  Partition 2              large, Partition
                                                                          in the3 subsequent stage t+1,                                              considering a few parameters which include the number of
v migrates back to partition i since there isPartition                                                      1
a positive gain in such a migration (Figure 4c), and this results in a decrease of d_i^{t+2}(v) and an increase of at most d_j^t(v) − x(v) in dex(v). Consequently, the net increase in dex after two stages is (d_i^t(v) − (d_j^t(v) − x(v))) + ((d_j^t(v) − x(v)) − d_i^{t+2}(v)) = d_i^t(v) − d_i^{t+2}(v). Note that if v does not move at all, dex increases by d_i^t(v) − d_i^{t+2}(v) units after two stages. Hence, in the worst case, the net decrease in dex(v) is at least 0 for all migrated vertices (compared to when they do not move). Indeed, we show that there are vertices for which the decrease in dex is strictly more than 0 after two consecutive stages. Assuming there are α partitions, these are the vertices which migrate to partition α [in stages where vertices move from lower ID to higher ID partitions] or partition 1 [in stages where vertices move from higher ID to lower ID partitions]. In these cases, no vertex can move from the target partition to another partition; so the actual decrease in dex(v) is the same as the calculated gain when moving the vertex and is more than 0. To summarize, for all vertices, the value of dex(v) does not increase after every two stages, and for some vertices, it decreases. For smaller values of k, after a bad migration, vertex v might not return from partition j to its initial partition i in the subsequent stage (since there might be more gain in moving other vertices); however, since there is a positive gain in moving v back to partition i, in subsequent stages the algorithm moves v from partition j to another partition (i, or another partition which results in more gain). The only exception is when many neighbors of v move to partition j so that there is no positive gain in moving v. In both cases, the value of dex(v) decreases by the same argument as above. To conclude, as the algorithm runs, the accumulated value of dex(v) (i.e., χ), and consequently the number of edge-cuts, constantly decreases.

[Figure 4: Substage migrations of vertices a-i across Partitions 1-3. In the first substage, {a,b,c} move to partition 2, while {d,e,f} move to partition 3; this increases the total edge-cut from 18 to 21, so the algorithm worsens the partitioning after the first substage. In the second substage, {d,e,f} return to partition 2. This kind of behaviour makes proving things very hard; in particular, we cannot even prove that the edge-cut improves.]

The graph structure in social networks does not evolve quickly, and its evolution is towards community formation. Hence, as our experiments confirm, after a small number of iterations the lightweight repartitioner converges to a stable partitioning. The speed of convergence depends on the value of k (the number of vertices migrated from a partition in each iteration). Larger values of k result in faster improvement in the number of edge-cuts and subsequently achieve a partitioning with an almost optimal number of edge-cuts. However, as mentioned earlier, large values of k can degrade the balance factor of the partitioning. Finding the right value of k requires knowledge of the number of partitions, the structure of the graph (e.g., the average size of the clusters formed by vertices), and the nature of the changing workload (whether the changes are mostly on the weight or even on the degree of vertices). In practice, we observed that a sub-optimal value of k does not degrade the convergence rate by more than a few iterations; consequently, the algorithm does not require fine tuning to find the best value of k. In our experiments, we set k to a small fraction of the number of vertices.
4. HERMES SYSTEM OVERVIEW

In this section, we provide an overview of Hermes, which we designed as an extension of Neo4j Version 1.7.3 to handle distribution of graph data and dynamic repartitioning. Neo4j is an open source centralized graph database system which provides a disk-based, transactional persistence engine (ACID compliant). The main querying interface to Neo4j is traversal based: traversals use the graph structure and relationships between records to answer user queries.

To enable distribution, changes to several components of Neo4j were required, as well as the addition of new functionality. The modifications and extensions were done such that existing Neo4j features are preserved. Figure 5 shows the components of Hermes, with the components of Neo4j that were modified to enable distribution in light blue shading, while the components in dark blue shading are newly added. Detailed descriptions of the remaining changes are omitted, as the technical challenges they posed were overcome using existing techniques. For example, as the centralized loop detection algorithm used by Neo4j for deadlock detection does not scale well, it was replaced with a timeout-based detection scheme as described in [10].
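To illustrate that replacement (a minimal sketch of ours, not Hermes code; the timeout value is arbitrary): a lock request that is not granted within a deadline is treated as a suspected deadlock, and the waiting transaction is aborted and retried, so no global wait-for-graph cycle detection is needed.

    import threading

    class DeadlockSuspected(Exception):
        """Raised when a lock cannot be acquired within the timeout."""

    def acquire_or_abort(lock, timeout_s=0.05):
        # Instead of running centralized wait-for-graph cycle detection,
        # assume a deadlock if the lock is not granted within timeout_s.
        if not lock.acquire(timeout=timeout_s):
            raise DeadlockSuspected("possible deadlock; abort and retry")

    lock_a, lock_b = threading.Lock(), threading.Lock()

    def transaction(retries=3):
        for _ in range(retries):
            try:
                acquire_or_abort(lock_a)
                try:
                    acquire_or_abort(lock_b)
                    try:
                        pass          # ... transactional work under both locks ...
                    finally:
                        lock_b.release()
                finally:
                    lock_a.release()
                return True           # committed
            except DeadlockSuspected:
                continue              # abort; back off and retry the transaction
        return False                  # give up after repeated suspected deadlocks

    print(transaction())              # True: no contention in this demo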
                                                                                                                                                          Internally, Neo4j stores information in three main stores:
Internally, Neo4j stores information in three main stores: the node store, the relationship store, and the property store. Splitting data into three stores allows Neo4j to keep only basic information on nodes and relationships in the first two stores. Further, this allows Neo4j to have fixed-size node and relationship records. Neo4j combines this feature with a monotonically increasing ID generator such that a) record offsets are computed in O(1) time using their ID, and b) contiguous ID allocation allows records to be packed as tightly as possible. The property store allows for dynamic-length records. To store the offsets, Neo4j uses a two-layer architecture where a fixed-size record store is used to store the offsets and a dynamic-size record store is used to hold the properties.

[Figure 5: Hermes system layers together with modified and new components designed to make it run in a distributed environment. Components shown include Metadata Storage, Graph Storage, Message Dispatcher, Repartitioning Manager, Lightweight Repartitioner, Deadlock Detector, Lock Manager, Node Manager, Transaction Manager, the Traversal API, and the Neo4j API (Get, Create, Delete, Add Properties, Traversal).]
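The fixed-size records are what make the offset arithmetic constant time. A small sketch illustrates this (the record layout and field sizes below are ours for illustration, not Neo4j's actual on-disk format):

    import struct

    RECORD = struct.Struct("<?II")   # in-use flag, first-relationship ID, first-property ID
    RECORD_SIZE = RECORD.size        # fixed size, so offset = ID * size

    def write_node(store, node_id, rel_id, prop_id):
        off = node_id * RECORD_SIZE                 # O(1) offset from the ID
        store[off:off + RECORD_SIZE] = RECORD.pack(True, rel_id, prop_id)

    def read_node(store, node_id):
        return RECORD.unpack_from(store, node_id * RECORD_SIZE)

    store = bytearray(RECORD_SIZE * 100)            # 100 contiguous node records
    write_node(store, 42, rel_id=7, prop_id=3)
    print(read_node(store, 42))                     # (True, 7, 3)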
To shard data across multiple instances of Hermes, changes were made to allow local nodes and relationships to connect with remote ones. Hermes uses a doubly-linked list record model to keep track of relationships, so a node in the graph needs to know only the first relationship in the list; the rest can be retrieved by following the links from the first. Due to the tight coupling between relationship records, referencing a remote node means that each partition would need to hold a copy of the relationship. Since replicating and maintaining all information related to a relationship would incur significant overhead, the relationship in one partition has a ghost flag attached to it to connect it with its remote counterpart. Relationships tagged with the ghost flag do not hold any information related to the properties of the relationship but are maintained to keep the graph structure valid. One advantage of this is complete locality in finding the adjacency list of a graph node, which is important since traversal operations build on top of adjacency lists.
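The following sketch (our simplification; the field names are hypothetical) captures the idea: both partitions keep a relationship record so that local relationship chains stay intact, but the copy on the remote side is marked as a ghost and never carries properties.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RelRecord:
        src: int                           # node on this partition
        dst: int                           # node that may live remotely
        ghost: bool = False                # True: structural placeholder only
        properties: Optional[dict] = None  # ghost records carry no properties

    def records_for_edge(src, dst, props, part_of):
        """Records each partition stores for the relationship (src, dst)."""
        if part_of[src] == part_of[dst]:               # purely local edge
            return {part_of[src]: RelRecord(src, dst, properties=props)}
        return {
            part_of[src]: RelRecord(src, dst, properties=props),  # full record
            part_of[dst]: RelRecord(dst, src, ghost=True),        # ghost copy
        }

    print(records_for_edge(1, 2, {"since": 2009}, part_of={1: "A", 2: "B"}))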
The storage layer was also modified to use a tree-based indexing scheme (B+Tree) rather than an offset-based indexing scheme, since record IDs can no longer be allocated in small increments. In addition, data migration would make offset-based indexing impossible, as records would need to be compacted while still keeping an offset based on their ID.
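A minimal stand-in for such an index (a sorted array with binary search in place of a real B+Tree; the values are invented) shows why the change matters: the index maps a record ID to its current physical offset, so a record can be compacted or migrated by re-pointing its index entry rather than being pinned at offset = ID * record size.

    import bisect

    class IdIndex:
        """Sorted ID -> offset map; stands in for the B+Tree index."""
        def __init__(self):
            self.ids, self.offsets = [], []

        def put(self, rec_id, offset):
            i = bisect.bisect_left(self.ids, rec_id)
            if i < len(self.ids) and self.ids[i] == rec_id:
                self.offsets[i] = offset      # record moved: just re-point it
            else:
                self.ids.insert(i, rec_id)
                self.offsets.insert(i, offset)

        def get(self, rec_id):
            i = bisect.bisect_left(self.ids, rec_id)
            if i == len(self.ids) or self.ids[i] != rec_id:
                raise KeyError(rec_id)
            return self.offsets[i]

    idx = IdIndex()
    idx.put(7_000_123, 0)        # sparse, migrated IDs are fine
    idx.put(7_000_123, 4096)     # after compaction: same ID, new offset
    print(idx.get(7_000_123))    # 4096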
In Hermes, servers are connected in a peer-to-peer fashion, similar to the one presented in Figure 6. A client can connect to any server and perform a query. Generally, user queries are in the form of a traversal. To submit a query, the client first looks up the vertex that is the starting point of the query and then sends the traversal query to the server hosting the initial vertex. The query is forwarded to the server containing the vertex so that data locality is maximized. On the server side, the traversal query is processed by traversing the vertex's relationships. If the information is not local to the server, remote traversals are executed using the links between servers. When the traversal completes, the query results are returned to the client.

[Figure 6: Overview of how Hermes servers interact with clients and with each other. Clients submit queries to any of the Hermes (Neo4j) instances on Servers 1-4; servers perform remote traversals among themselves and return results to the clients.]
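A sketch of this routing flow (ours; a dictionary stands in for Hermes's metadata lookup and RPC layer): the client finds the home server of the start vertex and submits the traversal there, and the server traverses locally, hopping to a peer whenever a neighbor lives elsewhere.

    # Each "server" holds a fragment of the graph; part_of records ownership.
    part_of = {"alice": 0, "bob": 0, "carol": 1}
    servers = [
        {"alice": ["bob", "carol"], "bob": ["alice"]},   # server 0
        {"carol": ["alice"]},                            # server 1
    ]

    def one_hop(server_id, vertex):
        """1-hop traversal on the vertex's home server; remote neighbors
        would be fetched with a remote traversal to their home server."""
        results = []
        for nbr in servers[server_id].get(vertex, []):
            home = part_of[nbr]
            results.append((nbr, "local" if home == server_id
                            else f"remote via server {home}"))
        return results

    def submit(start_vertex):
        home = part_of[start_vertex]       # client-side lookup of the host
        return one_hop(home, start_vertex)

    print(submit("alice"))  # [('bob', 'local'), ('carol', 'remote via server 1')]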
5. PERFORMANCE EVALUATION

In this section, we present the evaluation of the lightweight repartitioner implemented in Hermes.

5.1 Experimental Setup

All experiments were executed on a cluster with 16 server machines. Each server has the following hardware configuration: two AMD Opteron 252 processors (2 cores), 8 GB RAM, and a 160 GB SATA HDD. The servers are connected using 1 Gb Ethernet. In each experiment, one Hermes instance runs on its own server.

The experiments are focused on typical social network traffic patterns, which, based on previous work [8, 21], are 1-hop traversals and single-record queries. We also consider 2-hop queries, which are used for analytical queries such as ads and recommendations. Given the small diameters of social graphs (Table 1), queries with more than 2 hops are more typical of batch processing frameworks, where querying most or all of the graph data is required. The submission of traversal queries was described in Section 4.

5.2 Datasets

Three real-world datasets, namely Orkut, DBLP, and Twitter, are used to evaluate the performance of the lightweight repartitioner. We consider the average path length, clustering coefficient, and power law coefficient of these graphs to characterize the datasets (Table 1). Average path length is the average length of the shortest path between all pairs of vertices. The clustering coefficient (a value between 0 and 1) measures how tightly clustered vertices are in the graph; a high coefficient means strong (well-connected) communities exist within the network. Finally, the power law coefficient shows how the number of relationships increases as user popularity increases.
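To make the metric concrete, a small sketch (ours) computes the clustering coefficient as the average of the local coefficients (one common definition), where a vertex's local coefficient is the fraction of its neighbor pairs that are themselves connected:

    def clustering_coefficient(adj):
        """Average local clustering coefficient of an undirected graph."""
        total = 0.0
        for v, nbrs in adj.items():
            k = len(nbrs)
            if k < 2:
                continue                  # local coefficient is 0 here
            links = sum(1 for i, a in enumerate(nbrs)
                        for b in nbrs[i + 1:] if b in adj[a])
            total += 2.0 * links / (k * (k - 1))
        return total / len(adj)

    # A triangle with one pendant vertex attached.
    adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
    print(round(clustering_coefficient(adj), 3))   # 0.583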

                              Twitter        Orkut          DBLP
    Number of nodes           11.3 million   3 million      317 thousand
    Number of edges           85.3 million   223.5 million  1 million
    Number of symmetric links 22.1%          100%           100%
    Average path length       4.12           4.25           9.2
    Clustering coefficient    unpublished    0.167          0.6324
    Power law coefficient     2.276          1.18           3.64

Table 1: Summary description of datasets

5.3 Experimental Results

The lightweight repartitioner is compared with two different partitioning algorithms. For an upper bound, we use a member of the Metis family of repartitioners that is specifically designed for partitioning graphs whose degree distribution follows a power-law curve [6]. These graphs include social networks, which are the focus of this paper.

Several previous partitioning approaches (e.g., [26, 28]) are
compared against Metis as it is considered the "gold standard" for the quality of partitionings. It is also flexible enough to allow custom weights to be specified and used as a secondary goal for partitioning. We also compare the lightweight repartitioner against random hash-based partitioning, which is a de facto standard in many data stores due to its decentralized nature and good load-balancing properties. Note that Metis is an offline, static partitioning algorithm that requires a very large amount of memory to execute. This means that either additional resources need to be allocated to partition and reload the graph every time the partitioner is executed, or the system has to be taken offline to load data onto the servers. When the servers were taken offline, it took 2 hours to load each of the Orkut and Twitter graphs separately; such a long period is unacceptable for production systems. Alternatively, if Hermes were augmented to run Metis on graphs, the resource overhead of running Metis would be much higher than that of the lightweight repartitioner. Metis' memory requirements scale with the number of relationships and coarsening stages, while the lightweight repartitioner's scale with the number of vertices and partitions. Since the number of relationships dominates by orders of magnitude, Metis will require significantly more resources. For example, we found that Metis requires around 23 GB and 17 GB of memory to partition the Orkut and Twitter datasets, respectively, whereas the lightweight repartitioner requires only 2 GB and 3 GB for these datasets. While Metis has been extended to support distributed computation (ParMetis [4]), the memory requirements for each server would still be higher than those of the lightweight repartitioner.

5.3.1 Lightweight Repartitioner Performance

Our experiments are derived from real-world workloads [21, 8] and are similar to the ones in related papers [27, 24]. We first study 1-hop traversals on partitions with a randomly selected starting vertex. At the start of the experiments, the workload shifts such that the repartitioner is triggered, showing the performance impact of the repartitioner and the associated improvements. This shift in workload is caused by a skewed traffic trace in which the users on one partition are randomly selected as starting points for traversals twice as many times as before, creating multiple hotspots on that partition. This workload skew is applied for the full duration of the experiments that follow.
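For concreteness, here is one way such a skewed trace can be generated (a sketch of ours; the paper does not give a generator): start vertices are drawn with weight 2 for users on the hot partition and weight 1 for everyone else.

    import random

    def skewed_trace(vertices, part_of, hot_partition, n, seed=1):
        """Draw n traversal start vertices; vertices on hot_partition are
        selected twice as often as the rest (weight 2 vs. 1)."""
        rng = random.Random(seed)
        weights = [2 if part_of[v] == hot_partition else 1 for v in vertices]
        return rng.choices(vertices, weights=weights, k=n)

    vertices = list(range(1000))
    part_of = {v: v % 16 for v in vertices}    # 16 partitions, round-robin
    trace = skewed_trace(vertices, part_of, hot_partition=0, n=100_000)
    share = sum(part_of[v] == 0 for v in trace) / len(trace)
    print(f"{share:.3f}")   # roughly 0.118 (= 2/17), about twice the 1/16 share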
Figure 7 presents the percentage of edge-cuts among all edges for both the lightweight repartitioner and Metis on the skewed data. As the figure shows, the difference in edge-cut is too small (1% or less) to be significant, and we expect that this very small difference could shift in the other direction depending on factors such as query patterns and the number of partitions. However, Figure 7 demonstrates that the lightweight repartitioner generates partitionings that are almost as good as those of Metis.

[Figure 7 (bar chart; y-axis: percent edge-cut, 0-60%; groups: Orkut, Twitter, DBLP; bars: Metis, Hermes): The number of edge-cuts in partitionings of the lightweight repartitioner (as a component of Hermes) versus Metis. Results are presented as a percentage of edge-cuts among the total number of edges.]

A repartitioner's performance is affected by the amount of data that it needs to migrate. To quantify the impact of migration on performance, the partitions resulting from the lightweight repartitioner and Metis are compared with the initial partitioning. Figure 8a shows the number of vertices migrated due to the skew under the two partitioning algorithms; the results show a much lower count for the lightweight repartitioner. Figure 8b shows that the lightweight repartitioner requires, on average, significantly fewer changes to relationships compared to Metis. This difference is most pronounced for DBLP: the lightweight repartitioner is able to rebalance the workload by moving 2% of the vertices and about 5% of the relationships, while Metis migrates an order of magnitude more data.

[Figure 8 (two bar charts; y-axes: percent vertices and percent relationships, 0-100%; groups: Orkut, Twitter, DBLP; bars: Metis, Hermes; panels: (a) migrated vertices, (b) changed or migrated relationships): The number of vertices (a) and relationships (b) changed or migrated as a result of the lightweight repartitioner (Hermes) versus running Metis.]

Overall, the numbers of both vertices and relationships migrated are important, as they directly relate to the performance of the system. We note, however, that the relationship count has a higher impact on performance, as this number will generally be much higher, and relationship records are larger and thus more expensive to migrate.

Figure 9 presents the aggregate throughput performance (i.e., the number of visited vertices) of 16 machines (partitions) using the three datasets. In these experiments, 32 clients concurrently submit 1-hop traversal requests using the previously described skew. Before the experiments start, Metis is applied to form an initial partitioning from a trace with no skew, so as to remove partitioning bias by starting out with a good partitioning. Once the experiment starts, the aforementioned skew is applied; this skew triggers the repartitioning algorithm, whose performance is compared with running Metis after the skew. For Orkut, the results show that by introducing the skew and triggering the lightweight repartitioner, a 1.7 times improvement in performance can