TurboKV: Scaling Up the Performance of Distributed Key-value Stores with In-Switch Coordination

Hebatalla Eldakiky, David Hung-Chang Du, Eman Ramadan
Department of Computer Science and Engineering
University of Minnesota - Twin Cities, USA
Emails: {eldak002, du}@umn.edu, eman@cs.umn.edu
arXiv:2010.14931v1 [cs.DC] 26 Oct 2020

Abstract

The power and flexibility of software-defined networks lead to a programmable network infrastructure in which in-network computation can help accelerate the performance of applications. This can be achieved by offloading some computational tasks to the network. However, what kind of computational tasks should be delegated to the network to accelerate application performance? In this paper, we propose a way to exploit programmable switches to scale up the performance of distributed key-value stores. Moreover, as a proof of concept, we propose TurboKV, an efficient distributed key-value store architecture that utilizes programmable switches as: 1) partition management nodes to store the key-value store partition and replica information; and 2) monitoring stations to measure the load of storage nodes, where this monitoring information is used to balance the load among storage nodes. We also propose a key-based routing protocol to route clients' search queries, based on the requested keys, to the target storage nodes. Experimental results from an initial prototype show that our proposed architecture improves the throughput and reduces the latency of distributed key-value stores when compared to existing architectures.

1 Introduction

Programmable switches in software-defined networks promise flexibility and high throughput [4, 38]. Recently, the Programming Protocol-Independent Packet Processor (P4) [6] unleashed capabilities that give developers the freedom to create intelligent network nodes performing various functions. Thus, applications can boost their performance by offloading part of their computational tasks to these programmable switches to be executed in the network. Nowadays, programmable networks are gaining a stronger foothold in data centers. Google Cloud has started to use programmable switches with P4Runtime [31] to build and control its smart networks [14]. Some Internet Service Providers (ISPs), such as AT&T, have already integrated programmable switches into their networks [2]. These switches can be controlled by network operators to adapt to the current network state and application requirements.

Recently, there has been an uptake in leveraging programmable switches to improve distributed systems, e.g., NetPaxos [9, 10], NetCache [18], NetChain [17], DistCache [25] and iSwitch [24]. This uptake is due to the massive evolution in the capabilities of these network switches, e.g., the Tofino ASIC from Barefoot [27], which provides sub-microsecond per-packet processing delay with bandwidth up to 6.5 Tbps and a throughput of a few billion packets processed per second, and Tofino2 with bandwidth up to 12.8 Tbps. These systems use the switches to provide orders of magnitude higher throughput than traditional server-based solutions.

On the other hand, through the massive use of mobile devices, data clouds, and the rise of the Internet of Things [3], enormous amounts of data have been generated and analyzed for the benefit of society at a large scale. This data can be text, images, audio, video, etc., and is generated by different sources without a unified structure. Hence, this data is often maintained in key-value storage, which is widely used due to its efficiency in handling data in key-value format and its flexibility to scale out without significant database redesign. Examples of popular key-value stores include Amazon's Dynamo [11], Redis [34], RAMCloud [30], LevelDB [21] and RocksDB [35].

Such a huge amount of data cannot be stored on a single storage server. Thus, this data has to be partitioned across different storage instances inside the data center. The data partitions and their storage node mapping (directory information) are either stored on a single coordinator node, e.g., the master coordinator in the distributed Google file system [15], or replicated over all storage instances [11, 20, 34]. In the first approach, the master coordinator represents a single point of failure and introduces a bottleneck in the path between clients and storage nodes, as all queries are directed to it to learn the data location. Moreover, the query response time increases, and hence, the storage system performance decreases. In the second approach, where the directory information is replicated on all storage instances, there are two strategies a client can use to select the node to send a request to: server-driven coordination and client-driven coordination.

In server-driven coordination, the client routes its request through a generic load balancer that selects a node based on load information. The selected node acts as the coordinator for the client request: it answers the query if it holds the data partition, or forwards the query to the right instance where the data partition resides. In this strategy, the client neither has knowledge about the storage nodes nor needs to link any code specific to the key-value store it contacts. Unfortunately, this strategy has a higher latency because it introduces an additional forwarding step when the request coordinator is different from the node holding the target data.

In client-driven coordination, the client uses a partition-aware client library that routes requests directly to the appropriate storage node that holds the data. This approach achieves lower latency than server-driven coordination, as shown in [11]. In [11], the client-driven coordination approach reduces latencies by more than 50% for both the 99.9th percentile and the average case. This latency improvement is because client-driven coordination eliminates the overhead of the load balancer and skips the potential forwarding step introduced in server-driven coordination when a request is assigned to a random node. However, it places an additional load on the client, which has to periodically pick a random node from the key-value store cluster to download the updated directory information and perform the coordination locally on its side. It also requires the client to equip its application with code specific to the key-value store used.

Another challenge in maintaining distributed key-value stores is how to handle dynamic workloads and cope with changes in data popularity [1, 8]. Hot data receives more queries than cold data, which leads to load imbalance between the storage nodes; some nodes are heavily congested while others become under-utilized. This results in performance degradation of the whole system and a high tail latency.

In this paper, we propose TurboKV: a novel distributed key-value store architecture that leverages the power and flexibility of the new generation of programmable switches. TurboKV scales up the performance of distributed key-value storage by offloading partition management and query routing to be carried out in network switches. TurboKV uses switch-driven coordination, which utilizes the programmable switches as: 1) partition management nodes to store and manage the directory information of the key-value store; and 2) monitoring stations to measure the load of storage nodes, where this monitoring information is used to balance the load among storage nodes.

TurboKV adopts a hierarchical indexing scheme to distribute the directory information records across the data center network switches. It uses a key-based routing protocol to map the requested key in the client's query packet to its target storage node by injecting some information about the requested data into packet headers. The programmable switches use this information to decide where to send the packet so that it reaches the target storage node directly, based on the directory information records stored in the switches' data plane. This in-switch coordination approach removes the request-routing burden that client-driven coordination places on the client, without introducing the additional forwarding step added by the coordination node in server-driven coordination.

To achieve both reliability and high availability, TurboKV replicates key-value pair partitions on different storage nodes. For each data partition, TurboKV maintains a list of nodes that are responsible for storing the data of this partition. TurboKV uses the chain replication model to guarantee strong data consistency among all partition replicas. In case of a failing node, requests will be served by the other available nodes in the partition's replica list.

TurboKV also handles load balancing by adopting a dynamic allocation scheme that utilizes the architecture of software-defined networks [13, 26]. In our architecture, a logically centralized controller, which has a global view of the whole system [16], makes decisions to migrate/replicate some of the popular data items to other under-utilized storage nodes using monitoring reports from the programmable switches. It then updates the switches' data plane with the new indexing records. To the best of our knowledge, this is the first work to use programmable switches as the request coordinator to manage a distributed key-value store's partition information along with a key-based routing protocol. Overall, our contributions in this paper are four-fold:

• We propose the in-switch coordination paradigm, and design an indexing scheme to manage the directory information records inside the programmable switch, along with protocols and algorithms to ensure strong consistency of data among replicas and to achieve reliability and availability in case of node failures.

• We introduce a data migration mechanism to provide load balancing between the storage nodes based on the query statistics collected from the network switches.

• We propose a hierarchical indexing scheme based on our rack-scale switch coordinator design to scale up TurboKV to multiple racks inside the existing data center network architecture.

• We implemented a prototype of TurboKV using P4 on top of the simple software switch architecture BMV2 [5]. Our experimental results show that our proposed architecture improves the throughput and reduces the latency of distributed key-value stores.

The remaining sections of this paper are organized as follows. Background and motivation are discussed in Section 2. Section 3 provides an overview of the TurboKV architecture, while the detailed design of TurboKV is presented in Section 4 and Section 5. Section 6 discusses how to scale up TurboKV inside data center networks. TurboKV's implementation is discussed in Section 7, while Section 8 gives experimental evidence and analysis of TurboKV. Section 9 provides a short survey of related work, and finally, the paper is concluded in Section 10.

Figure 1: Preliminaries on Programmable Switches. (a) Switch Data Plane Architecture; (b) Multiple Stages of Pipeline; (c) Packet Processing inside one Stage; (d) Example of L3 (IPv4) Table.

2 Background and Motivation

2.1 Preliminaries on Programmable Switches

Software-Defined Networking (SDN) simplifies network devices by introducing a logically centralized controller (control plane) to manage simple programmable switches (data plane). SDN controllers set up forwarding rules at the programmable switches and collect their statistics using OpenFlow APIs [26]. As a result, SDN enables efficient and fine-grained network management and monitoring, in addition to allowing independent evolution of the controller and the programmable switches.

Recently, P4 [6] has been introduced to enrich the capabilities of network devices by allowing developers to define their own packet formats and build the processing graphs of these customized packets. P4 is a programming language designed to program the parsing and processing of user-defined packets using a set of match/action tables. It is a target-independent language; thus, a P4 compiler is required to translate P4 programs into target-dependent switch configurations.

Figure 1(a) shows the five main data plane components of most modern switch ASICs: a programmable parser, ingress pipeline, traffic manager, egress pipeline, and programmable deparser. When a packet is first received by one of the ingress ports, it goes through the programmable parser. The programmable parser, as shown in Figure 1(a), is modeled as a simple deterministic state machine that consumes packet data and identifies the headers that will be recognized by the data plane program. It makes transitions between states typically by looking at specific fields in the previously identified headers. For example, after parsing the Ethernet header of a packet, the next state is determined based on the EtherType field, i.e., whether to read a VLAN tag, an IPv4 header, or an IPv6 header.

After the packet is processed by the parser and decomposed into different headers, it goes through one of the ingress pipes to enter a set of match-action processing stages, as shown in Figure 1(b). In each stage, a key is formed from the extracted data and matched against a table that contains some entries. These entries can be added and removed through the control plane. If there is a match with one of the entries, the corresponding action is executed; otherwise, a default action is executed. After the action execution, the packet with all the headers updated by this stage enters the next stages in the pipeline. Figure 1(c) illustrates the processing inside a pipeline stage, while Figure 1(d) shows an example IPv4 table that picks an egress port based on the destination IP address. The last rule drops all packets that do not match any of the IP prefixes; this rule corresponds to the default action.

After the packet finishes all the stages in the ingress pipeline, it is queued and switched by the traffic manager to an egress pipe for further processing. At the end, a programmable deparser performs the reverse operation of the parser: it reassembles the packet with the updated headers so that it can go back onto the wire to the next hop on the path to its destination. Developers should take several constraints into consideration when designing a P4 program. These constraints include: (i) the number of pipes, (ii) the number of stages and ports in each pipe, and (iii) the amount of memory available for prefix matching and data storage in each stage.
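To make the match-action abstraction concrete, the following Python sketch mimics how a single stage could perform the longest-prefix match of the example table in Figure 1(d) and fall back to the default drop action. It is only an illustration of the concept (the actual data plane is programmed in P4); the table entries, action names, and header field names are assumptions.

```python
import ipaddress

# Illustrative entries mirroring the example L3 table in Figure 1(d):
# (IP prefix, action name, action data).
L3_TABLE = [
    (ipaddress.ip_network("10.0.1.1/32"), "l3_forward",
     {"dst_mac": "00:00:00:00:01:01", "port": 1}),
    (ipaddress.ip_network("10.0.1.2/32"), "l3_forward",
     {"dst_mac": "00:00:00:00:01:02", "port": 2}),
]
DEFAULT_ACTION = ("drop", {})  # the "*** drop" rule

def lookup(dst_ip: str):
    """Longest-prefix match of dst_ip against the table; a miss returns the default action."""
    addr = ipaddress.ip_address(dst_ip)
    hits = [(net.prefixlen, action, data) for net, action, data in L3_TABLE if addr in net]
    if not hits:
        return DEFAULT_ACTION
    _, action, data = max(hits, key=lambda h: h[0])  # longest prefix wins
    return action, data

def process_stage(packet: dict) -> dict:
    """Apply the matched action to the packet headers, as one pipeline stage would."""
    action, data = lookup(packet["ipv4_dst"])
    if action == "drop":
        packet["dropped"] = True
    else:  # l3_forward: rewrite the destination MAC and pick an egress port
        packet["eth_dst"] = data["dst_mac"]
        packet["egress_port"] = data["port"]
    return packet

print(process_stage({"ipv4_dst": "10.0.1.2", "eth_dst": "ff:ff:ff:ff:ff:ff"}))
```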
2.2 Why In-Switch Coordination?

We utilize programmable switches to scale up the performance of distributed key-value stores with in-switch coordination for three reasons. First, programmable switches have become a backbone technology in modern data centers and rack-scale clusters; they allow developers to define their own packet-processing functions and provide the flexibility to program the hardware. For example, Google uses programmable switches to build and control its data center networks [14].

Second, request latency is one of the most important factors affecting the performance of all key-value stores. This latency comes from the multiple hops that a request traverses before arriving at its final destination, in addition to the processing time to fetch the desired key-value pair(s) at that destination. When the key-value store is distributed among several storage nodes, the request latency increases further because of the time needed to figure out the location of the desired key-value pair(s) before fetching them from that location. This process is referred to as partition management and request coordination, which can be organized by the storage nodes (server-based coordination) or by the client (client-based coordination).
Figure 2: Request Coordination Models. (a) In-Switch (2 hops); (b) Server-based (4 hops).

Figure 3: TurboKV Architecture.
Because client requests already pass through network switches to arrive at their target, offloading partition management and query routing to be carried out in the network switches with the in-switch coordination approach reduces the latency introduced by the server-based coordination approach: the number of hops that a request travels from the client to the target storage node is reduced, as shown in Figure 2. In-switch coordination also removes the load placed on the client by the client-based coordination approach, by making the programmable switch manage all the information needed for request routing.

Third, partition management and request coordination are communication-bound rather than computation-bound. Thus, the massive evolution in the capabilities of network switches, which provide orders of magnitude higher throughput than highly optimized servers, makes them the best option to be used as partition management and request coordination nodes. For example, the Tofino ASIC from Barefoot [27] processes a few billion packets per second with 6.5 Tbps bandwidth. Such performance is orders of magnitude higher than NetBricks [32], which processes millions of packets per second and has 10-100 Gbps bandwidth [17].

3 TurboKV Architecture Overview

TurboKV is a new architecture for future distributed key-value stores that leverages the capabilities of programmable switches. TurboKV uses an in-switch coordination approach to maintain the partition management information (directory information) of distributed key-value stores, and to route clients' search queries, based on their requested keys, to the target storage nodes. Figure 3 shows the architecture of TurboKV, which consists of the programmable switches, the controller, the storage nodes, and the system clients.

Programmable Switches. Programmable switches are the essential component in our proposed architecture. We augment the programmable switches with a key-based routing approach to deliver TurboKV query packets to the target key-value storage node. We leverage match-action tables and the switch's registers to design the in-switch coordination, where the partition management information is stored on the path from the client to the storage node. This directory information represents the routing information to reach one of the storage nodes (Section 4.1). Following this approach, the programmable switches act as request coordinator nodes that manage the data partitions and route requests to the target storage nodes.

In addition to the key-based routing module, other packets are processed and routed using the standard L2/L3 protocols, which makes TurboKV compatible with other network functions and protocols (Section 4.2). Each programmable switch has a query statistics module to collect information about each partition's popularity in order to estimate the load of the storage nodes (Section 5.1). This is vital for making data migration decisions to balance the load among storage nodes, especially for handling dynamic workloads where key-value pair popularity changes over time.

Controller. The controller is primarily responsible for system reconfigurations, including (a) achieving load balancing between the distributed storage nodes (Section 5.1), (b) handling failures in the storage network (Section 5.2), and (c) updating each switch's match-action tables with the new location of data. The controller receives periodic reports from the switches about the popularity of each data partition. Based on these reports, it decides to migrate/replicate part of the popular data to another storage node to achieve load balancing. Through the control plane, the controller updates the match-action tables in the switches with the new data locations. The TurboKV controller is an application controller that is different from the network controller in SDN, and it does not interfere with other network protocols or functions managed by the SDN controller. Our controller only manages the key-range based routing and the data migrations and failures associated with it. Both controllers can be co-located on the same server, or run on different servers.
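As a rough illustration of this decision process, the sketch below consumes per-partition popularity reports (as they might be gathered from the switches' query statistics modules) and proposes moving the hottest partition from the most loaded node to the least loaded one. This is our own simplification, not the actual TurboKV controller: the report format, the imbalance threshold, and the single-owner assumption are all illustrative.

```python
from collections import defaultdict

def plan_migration(reports, imbalance_threshold=2.0):
    """reports: iterable of (partition_id, node_id, query_count) tuples.
    Returns a (partition_id, src_node, dst_node) migration proposal,
    or None if the load already looks balanced."""
    node_load = defaultdict(int)        # total queries served per storage node
    partition_load = defaultdict(int)   # total queries per partition
    owner = {}                          # partition -> node that reported it (simplified)
    for part, node, count in reports:
        node_load[node] += count
        partition_load[part] += count
        owner[part] = node

    busiest = max(node_load, key=node_load.get)
    idlest = min(node_load, key=node_load.get)
    if node_load[busiest] < imbalance_threshold * max(node_load[idlest], 1):
        return None                     # imbalance not large enough to justify a migration
    # Move the hottest partition currently served by the busiest node.
    candidates = [p for p, n in owner.items() if n == busiest]
    hottest = max(candidates, key=partition_load.get)
    return hottest, busiest, idlest

# The controller would then install the new location in the switches'
# match-action tables through the control plane.
print(plan_migration([("KR1", "S1", 900), ("KR2", "S2", 120), ("KR3", "S3", 80)]))
```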
Storage Nodes. The storage nodes are where the key-value pairs reside in the system. The key-value pairs are partitioned among these storage nodes (Section 4.1.1). Each storage node runs a simple shim layer that is responsible for reforming TurboKV query packets into API calls for the key-value store, and for handling the TurboKV controller's data migration requests between the storage nodes. This layer makes it easy to integrate our design with existing key-value stores without any modifications to the storage layer.
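The shim can be pictured as a thin translation loop. The sketch below is a simplification under assumed opcode values, with a plain Python dict standing in for the LevelDB/RocksDB instance; it only illustrates how a parsed TurboKV query might be reformed into ordinary key-value API calls.

```python
# Assumed opcodes for illustration; the real TurboKV header layout is defined in P4.
GET, PUT, DELETE = 0, 1, 2

class StorageShim:
    """Runs on each storage node: turns TurboKV query packets into key-value API calls."""
    def __init__(self, store=None):
        self.store = store if store is not None else {}  # stand-in for LevelDB/RocksDB

    def handle_packet(self, pkt: dict) -> dict:
        """pkt carries the parsed TurboKV header fields: op, key, and (for writes) value."""
        op, key = pkt["op"], pkt["key"]
        if op == GET:
            return {"key": key, "value": self.store.get(key), "ok": key in self.store}
        if op == PUT:
            self.store[key] = pkt["value"]
            return {"key": key, "ok": True}
        if op == DELETE:
            return {"key": key, "ok": self.store.pop(key, None) is not None}
        return {"key": key, "ok": False}  # unknown opcode

shim = StorageShim()
shim.handle_packet({"op": PUT, "key": "K2", "value": "v2"})
print(shim.handle_packet({"op": GET, "key": "K2"}))  # {'key': 'K2', 'value': 'v2', 'ok': True}
```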

System Clients. TurboKV provides a client library which can be integrated with client applications to send TurboKV packets through the network and access the key-value store without any modifications to the application. Like other key-value stores such as LevelDB [21] and RocksDB [35], the library provides an interface for all key-value pair operations; it is responsible for constructing the TurboKV packets and translating the replies back to the application.
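A hypothetical shape of that interface is sketched below; the opcodes, field names, and the loopback send_and_receive stub are assumptions made for illustration, since the actual TurboKV packet format is defined in the data plane design.

```python
GET, PUT = 0, 1  # assumed opcodes, mirroring the server-side shim sketch above

def send_and_receive(packet: dict) -> dict:
    """Stand-in for the network transport. A real client would serialize the TurboKV
    header into the packet and let the switches route it to a storage node by key."""
    # Loopback stub so the sketch is self-contained.
    return {"ok": True, "value": "demo" if packet["op"] == GET else None}

class TurboKVClient:
    """Client-side interface: builds TurboKV packets and translates replies back."""
    def get(self, key: str):
        return send_and_receive({"op": GET, "key": key}).get("value")

    def put(self, key: str, value: str) -> bool:
        return send_and_receive({"op": PUT, "key": key, "value": value}).get("ok", False)

client = TurboKVClient()
client.put("K2", "v2")
print(client.get("K2"))  # "demo" (from the loopback stub)
```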
Figure 4: Logical View of TurboKV Data Plane Pipeline.

4 TurboKV Data Plane Design

The data plane provides the in-switch coordination model for the key-value store. In this model, all partition management information and query routing are managed by the switches. Figure 4 shows the whole pipeline that a packet traverses inside the switch before being forwarded to the right storage node. In this section, we discuss how the switch data plane supports these functions.

4.1 On-Switch Partition Management

4.1.1 Data Partitioning

In large-scale key-value stores, data is partitioned among storage nodes. TurboKV supports two different partitioning techniques, and applications can choose either of them according to their needs.

Range Partitioning. In range partitioning, the whole key space is partitioned into small disjoint sub-ranges. Each sub-range is assigned to a node (or to multiple nodes, depending on the replication factor). Data whose keys fall into a certain sub-range is stored on the storage node that is responsible for this sub-range. The key itself is used to map the user request to one of the storage nodes. Each storage node can be responsible for one or more sub-ranges.

Each storage node has LevelDB [21], or one of its enhanced versions, installed, where keys are stored in lexicographic order in SSTs (Sorted String Tables). This key-value library is responsible for managing the sub-ranges associated with the storage node and handling the key-value pair operations directed to it. The advantage of this partitioning scheme is that range queries on keys can be supported; unfortunately, it suffers from the load imbalance problem discussed in Section 5.1, since some sub-ranges contain popular keys which receive more requests than others. The mapping table for this type of partitioning holds each key range and the associated nodes where the data resides, as shown in Figure 5(a). Support of range partitioning in the switch data plane is discussed in Section 4.1.3.

Hash Partitioning. In hash partitioning, each key is hashed into a 20-byte fixed-length digest using RIPEMD160 [12], an extremely random hash function that ensures records are distributed uniformly across the entire set of possible hash values. We developed a variation of consistent hashing [19] to distribute the data over multiple storage nodes. The whole output range of the hash function is treated as a fixed space. This space is partitioned into sub-ranges, where each sub-range represents a consecutive set of hash values. These sub-ranges are distributed evenly over the storage nodes. Each storage node can be responsible for one or more sub-ranges based on its load, and each sub-range is assigned to one node (or to multiple nodes, depending on the replication factor).

As in consistent hashing, partitioning one of the sub-ranges or merging two sub-ranges due to the addition or removal of nodes affects only the nodes that hold these sub-ranges; other nodes remain unaffected. Each data item identified by a key is assigned to a storage node if the hashed value of its key falls in the sub-range of hash values that the storage node is responsible for. This method requires a mapping table, shown in Figure 5(b), where each record represents a hashed sub-range and its associated nodes where the data resides. Support of hash partitioning in the switch data plane is discussed in Section 4.1.3. The disadvantage of this technique, like all hash partitioning techniques, is that range queries cannot be supported. On the storage node side, data is managed in hash tables, and collisions are handled using separate chaining in the form of a binary search tree.

For both partitioning techniques, we assume that there is no fixed space assigned to each sub-range (partition) on the storage node (i.e., the space that each partition consumes on a storage node can grow as long as there is available space on the storage node). Unfortunately, multiple insertions into the same partition may exceed the capacity of the storage node. In this case, the sub-range of this partition will be divided into two smaller sub-ranges, and one of them will be migrated to another storage node with available space. Other storage nodes that hold the same divided sub-range and have available space keep the data of the whole sub-range and manipulate it as is. The mapping table is updated with the new boundaries of the divided sub-ranges.
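To illustrate how a key could be mapped to its replica list under either scheme, here is a small Python sketch of the two mapping tables of Figure 5. The sub-range boundaries and node names are made up, and SHA-1 stands in for RIPEMD160 (both produce 20-byte digests; RIPEMD160 is not always available in Python's hashlib).

```python
import hashlib

# Range partitioning: (start_key, end_key) -> replica list, cf. Figure 5(a).
RANGE_TABLE = [
    (("A", "G"), ["S2", "S3"]),
    (("H", "P"), ["S1", "S4"]),
    (("Q", "Z"), ["S4", "S3"]),
]

# Hash partitioning: sub-ranges over the 160-bit hash space, cf. Figure 5(b).
HASH_SPACE_END = 2 ** 160 - 1
HASH_TABLE = [
    ((0, HASH_SPACE_END // 2), ["S1", "S4"]),
    ((HASH_SPACE_END // 2 + 1, HASH_SPACE_END), ["S2", "S3"]),
]

def replicas_by_range(key: str):
    """Range partitioning: the key itself is matched against the key sub-ranges."""
    for (lo, hi), nodes in RANGE_TABLE:
        if lo <= key[:1].upper() <= hi:
            return nodes
    return None

def replicas_by_hash(key: str):
    """Hash partitioning: the 20-byte digest of the key is matched against hash sub-ranges."""
    digest = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
    for (lo, hi), nodes in HASH_TABLE:
        if lo <= digest <= hi:
            return nodes
    return None

print(replicas_by_range("Kitten"))  # ['S1', 'S4']: 'K' falls in the H--P sub-range
print(replicas_by_hash("Kitten"))   # replica list of whichever hash sub-range the digest hits
```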
4.1.2 Data Replication

To achieve high availability and durability, data is replicated on multiple storage nodes. The data of each partition (sub-range) is replicated as one unit on different storage nodes.
The replication factor (r) can be adjusted according to application needs. The list of nodes responsible for storing a particular partition is called the replica list. This replica list is stored in the mapping table, as shown in Figure 5.

Figure 5: TurboKV Data Partitioning and Replication. (a) Range Partitioning; (b) Hash Partitioning.

Figure 6: Replication Models. (a) Primary Backup; (b) Chain Replication.

Figure 7: TurboKV Partition Management inside Switch. (a) Network Physical Topology; (b) Switch Match-Action Table; (c) Register Arrays; (d) Key-based Routing Process.

TurboKV follows the chain replication (CR) model [39]. Chain replication [39] is a form of primary-backup protocol for controlling clusters of fail-stop storage servers. This approach is intended to support large-scale storage services that exhibit high throughput and availability without sacrificing strong consistency guarantees. In this model, as shown in Figure 6(b), the storage nodes holding replicas of the data are organized in a sequential chain structure. Read queries are handled by the tail of the chain, while write queries are sent to the head of the chain; the head processes the request and forwards it to its successor in the chain. This process continues until the request reaches the tail of the chain. Finally, the tail processes the request and replies back to the client.
                                                                                                                                       updating packet during the action execution.
   In CR, Each node in the chain needs to know only about its                                                                             In TurboKV, each node holds the data of one or more sub-
successor, where it forwards the request. This makes the CR                                                                            ranges. This makes each node’s forwarding information ap-
simpler than the classical primary backup protocol [7], shown                                                                          pears more than once in the match-action table records. Tur-
in Figure 6(a) which requires the primary node to know about                                                                           boKV uses two arrays of registers in the switch’s data plane
all replicas and keep track of all acknowledgement received                                                                            to save the forwarding information: node IP array, and node
from all replicas. Moreover, In chain replication, write queries                                                                       port array. For each storage node, the forwarding information
also use fewer messages than the classical primary backup                                                                              is stored at the same index in the two arrays as shown in
protocol, (n+1) instead of (2n) where n is number of nodes.                                                                            Figure 7(c). For example, the information of storage node S1
With r replicas, TurboKV can sustain up to (r-1) node failures                                                                         is stored at index 1 in the two arrays and the same for the
as requests will be served with other replicas on the chain                                                                            other storage nodes. The index of the storage nodes in the reg-
structure.                                                                                                                             ister arrays is stored as action data in the match-action table

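To make the table layout concrete, the following Python sketch models the lookup path described above: a list of sub-range records whose action data holds indexes into two register arrays storing each node's IP address and switch port. This is only a software model under assumed example values; the actual implementation is a P4 match-action table with register arrays in the data plane, and the record layout and helper names here are illustrative.

# Software model of the partition management table (illustrative only; the real
# structures are a P4 match-action table and register arrays on the switch).

# Register arrays: index i holds the forwarding information of storage node i.
node_ip = ["10.10.9.1", "10.10.9.2", "10.10.9.3", "10.10.9.4"]   # node IP array (example addresses)
node_port = [1, 2, 3, 4]                                          # node port array (example ports)

# One record per sub-range: (start, end) is the match; the action data holds the
# register indexes of the chain nodes, ordered head -> ... -> tail.
table = [
    {"start": 0x00, "end": 0x3F, "chain_idx": [0, 1, 2]},
    {"start": 0x40, "end": 0x7F, "chain_idx": [1, 2, 3]},
]

def lookup_chain(matching_value):
    """Range-match the matching value (key, or hashed key) and resolve the chain."""
    for record in table:
        if record["start"] <= matching_value <= record["end"]:             # hit
            return [(node_ip[i], node_port[i]) for i in record["chain_idx"]]
    return None                                                            # miss: normal IPv4 routing applies

print(lookup_chain(0x42))   # forwarding info of the chain serving sub-range [0x40, 0x7F]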
[Figure 8: TurboKV Packet Format. (a) Client Packet; (b) Storage Node Packet; (c) Switch Packet.]

[Figure 9: TurboKV KV Storage Operations. (a) PUT Operation Processing; (b) Chain for the requested sub-range; (c) GET Operation Processing.]
4.2 Network Protocol Design

Packet Format. Figure 8(a) shows the format of the TurboKV request packet sent from clients. The programmable switches use the Ethernet Type in the Ethernet header to identify TurboKV packets and execute the key-based routing. Other switches in the network do not need to understand the format of the TurboKV header and treat all packets as normal IP packets. The ToS (Type of Service) field in the IP header is used to distinguish between three types of TurboKV packets: range-partitioned data packets, hash-partitioned data packets, and TurboKV packets previously processed by the switch. The TurboKV header consists of three main fields: OpCode, Key, and endKey/hashedKey. The OpCode is a one-byte field that encodes the key-value operation (Get, Put, Del, and Range). Key stores the key of a key-value pair. In a Range operation, Key and endKey/hashedKey represent the start and end of the key range, respectively. In case of hash partitioning, endKey/hashedKey is set to the hashed value of the key so that routing is performed on it instead of the key itself.

After the client packet is processed by the programmable switch, the switch adds the chain header shown in Figure 8(c). This header is used by the storage nodes for the chain replication model. It includes two fields. The first field, CLength, is the number of nodes that the packet passes through in the chain, including the client. The second field holds these nodes' IP addresses ordered according to their position in the chain, followed by the client IP at the end. The reply from a storage node to a client is a standard IP packet, shown in Figure 8(b); the result is added to the packet payload.
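Because the client and server libraries use Scapy for packet manipulation (Section 7), the headers above can be sketched as custom Scapy layers. This is a hedged illustration: the field widths follow the description in the text (a one-byte OpCode, 16-byte keys, and CLength followed by a list of IP addresses), but the field names, the opcode encoding, the ToS values, and the EtherType constant are assumptions rather than TurboKV's actual values.

from scapy.all import (Packet, ByteField, StrFixedLenField, FieldLenField,
                       FieldListField, IPField, Ether, IP)

TURBOKV_ETHERTYPE = 0x88B5   # assumed value; the paper only says a dedicated Ethernet Type is used

class TurboKV(Packet):
    """TurboKV header: OpCode, Key, endKey/hashedKey (field names assumed)."""
    name = "TurboKV"
    fields_desc = [
        ByteField("opcode", 1),                         # e.g., 1=Get, 2=Put, 3=Del, 4=Range (assumed encoding)
        StrFixedLenField("key", b"\x00" * 16, 16),      # 16-byte key
        StrFixedLenField("end_key", b"\x00" * 16, 16),  # range end, or hashed key for hash partitioning
    ]

class ChainHeader(Packet):
    """Chain header added by the switch: CLength, then chain node IPs followed by the client IP."""
    name = "ChainHeader"
    fields_desc = [
        FieldLenField("clength", None, count_of="chain", fmt="B"),
        FieldListField("chain", [], IPField("hop", "0.0.0.0"),
                       count_from=lambda pkt: pkt.clength),
    ]

# A client Put request (Figure 8(a)): Ethernet / IP / TurboKV header / value payload.
put_req = (Ether(type=TURBOKV_ETHERTYPE) /
           IP(dst="10.10.9.1", tos=1) /                 # ToS marks the partitioning type (assumed value)
           TurboKV(opcode=2, key=b"user4312".ljust(16, b"\x00")) /
           b"value-bytes")
put_req.show()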
Network Routing. TurboKV uses a key-based routing protocol to route packets from clients to storage nodes. The client sends the packet to the network. Once it reaches the programmable switch, the switch extracts the Key (in case of range partitioning) or the endKey/hashedKey (in case of hash partitioning) from the TurboKV header and looks up the corresponding key-based match-action table using the value of the extracted field (the matching value). If there is a hit, the switch fetches the chain nodes' information from the registers based on the indexes specified in the match-action table. Then, the switch processes this information based on the query type (OpCode). After that, the switch updates the packet with the target node's information and forwards it to the next hop on the path to this target node.

The switch uses range matching for the table lookup, in which it matches the matching value against the sub-ranges in the corresponding key-based match-action table. If the matching value falls within one of the sub-ranges (a hit), the key-based routing action is processed using the action data. The programmable switches route previously processed TurboKV packets and storage-node-to-client packets using the standard L2/L3 protocol, bypassing the key-based routing and performing the match-action based on the destination IP in the IP header.

4.3 Key-value Storage Operations

PUT and DELETE Queries. In chain replication, PUT and DELETE queries are processed by each node along the chain, starting from the head, and replied to by the tail. On the switch side, after processing one of the key-based match-action tables with the matching value and fetching the corresponding chain information from the register arrays, the query processing module sets the destination IP address in the IP header to the IP address of the chain head node and changes the ToS value to mark the packet as previously processed. The egress port is set to the forwarding port of the chain head node; then the chain header is added to the packet, with CLength equal to the length of the chain and the chain nodes' IP addresses ordered according to their position in the chain, followed by the client IP at the end, as shown in Figure 9(a). Finally, the packet is forwarded to the head node of the chain.

As shown in Figure 9(a), when the packet arrives at a storage node, it is processed by the TurboKV storage library. The node updates its local copy of the data. Then, it reads the chain header, sets the destination address in the IP header to the IP of its successor, reduces the CLength by 1, and forwards the packet to its successor. Packets received by the tail node, with CLength = 1, have their chain header and TurboKV header removed, and the result is sent back using the client IP as the destination address in the IP header.
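The per-hop processing in the storage library can be summarized by the sketch below. It is a simplified illustration rather than TurboKV's server code: the packet accessors mirror the hypothetical Scapy layers defined earlier, and local_store and send are assumed interfaces.

from scapy.all import Ether, IP

def handle_chain_write(pkt, local_store, send):
    """Process a PUT/DELETE at one chain node: apply locally, then forward or reply.

    pkt is assumed to carry the TurboKV and ChainHeader layers sketched earlier;
    local_store offers put()/delete(); send(ip, frame) transmits a frame.
    """
    tkv, chain = pkt["TurboKV"], pkt["ChainHeader"]

    # 1. Apply the update to the local replica (the value bytes follow the chain header).
    if tkv.opcode == 2:                                  # Put (assumed encoding)
        local_store.put(tkv.key, bytes(chain.payload))
    elif tkv.opcode == 3:                                # Del
        local_store.delete(tkv.key)

    # 2. Tail node (CLength == 1): only the client IP is left, so strip the TurboKV
    #    and chain headers and reply directly to the client with a plain IP packet.
    if chain.clength == 1:
        client_ip = chain.chain[0]
        send(client_ip, Ether() / IP(dst=client_ip) / b"OK")
        return

    # 3. Intermediate node: the first listed IP is the successor. Pop it, decrement
    #    CLength, rewrite the IP destination, and forward down the chain.
    successor_ip = chain.chain[0]
    chain.chain = chain.chain[1:]
    chain.clength -= 1
    pkt["IP"].dst = successor_ip
    send(successor_ip, pkt)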
GET Queries. Following chain replication, GET queries are handled by the tail of the chain. After performing the matching on the matching value and fetching the corresponding chain information from the register arrays, the query processing module sets the destination IP address in the IP header to the IP address of the chain tail node and changes the ToS value to mark the packet as previously processed. The egress port is set to the forwarding port of the chain tail node; then the chain header is added to the packet with CLength equal to 1 and a single node IP, which is the client IP, as shown in Figure 9(c). Finally, the packet is forwarded to the storage node. When the packet arrives at the storage node, it is processed by the TurboKV storage library. The query result is added to the payload of the packet, and the client IP is popped from the chain header and placed in the destination address of the IP header. The chain and TurboKV headers are removed from the packet, and the result is sent back to the client.

Range Queries. If the data is range partitioned, TurboKV can handle range queries. The range requested in the TurboKV header may span multiple storage nodes. Thus, the switch divides the range into several sub-ranges, where each sub-range corresponds to a packet. Each of these packets contains the start and end keys of the corresponding sub-range. Each packet is handled by the switch like a separate read query and forwarded to the tail node of its partition's chain. Unfortunately, the switch cannot construct new packets from scratch on its own. Therefore, in order to achieve the previous scenario, we placed the range operation check in the egress pipeline, as shown in Figure 4. We use the clone and circulate operations, supported by the architecture of the programmable switches and P4, to solve the packet construction problem. When a range operation is detected in the egress pipeline, the packet is processed as shown in Algorithm 1.

Algorithm 1 Range Query Handling
1: Input pkt: packet entering the egress pipeline
2: Input matched_subrange: the sub-range where the start key of the requested range falls
3: Output pkt_out: packet to be forwarded to the next hop
4: Output pkt_cir: packet to be circulated and sent to the ingress pipeline as a new packet
5: Begin:
6: pkt_out = pkt // clone the packet
7: if pkt.OpCode == range then
8:     // check if the range spans multiple nodes
9:     if pkt.request.endKey > matched_subrange.endKey then
10:        pkt_cir = pkt // clone the packet
11:        pkt_out.request.endKey = matched_subrange.endKey
12:        pkt_cir.request.Key = Next(matched_subrange.endKey)
13:    end if
14: end if
15: if pkt_cir.exist() then
16:    circulate(pkt_cir) // send the packet to the ingress pipeline again
17: end if
18: send_to_output_port(pkt_out)
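The net effect of repeated cloning and recirculation is that one range request fans out into one read per sub-range. The host-side Python sketch below reproduces that splitting over the software index model from Section 4.1.3; it illustrates the logic of Algorithm 1 but is not the P4 egress code.

def split_range_request(start_key, end_key, table):
    """Fan a [start_key, end_key] range query out into one request per sub-range.

    'table' is the sub-range list from the earlier sketch. Each loop iteration
    mirrors one clone-and-circulate step: emit a request capped at the current
    sub-range's end key, then restart from the next key.
    """
    requests = []
    key = start_key
    while key <= end_key:
        record = next(r for r in table if r["start"] <= key <= r["end"])
        sub_end = min(end_key, record["end"])
        requests.append({"Key": key, "endKey": sub_end, "chain_idx": record["chain_idx"]})
        key = sub_end + 1               # Next(matched_subrange.endKey)
    return requests

# A request spanning two sub-ranges becomes two single-partition read queries.
print(split_range_request(0x30, 0x50, table))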
[Figure 10: TurboKV Control Plane. (a) Logical View of the Controller; (b) Chain Reconfiguration.]

5 TurboKV Control Plane Design

The control plane provides a load balancing module based on the query statistics collected in the data plane. It also provides a failure handling module to guarantee availability and fault tolerance. In this section, we discuss how the control plane supports these functions.

5.1 Query Statistics and Load Balancing

In TurboKV, the data plane has a query statistics module that provides query statistics reports to the TurboKV controller. Thus, the controller can estimate the load of each storage node and make decisions to migrate part of the popular data to one of the under-utilized storage nodes. As shown in Figure 10(a), the switch's data plane maintains a per-key-range counter for each key range in the match-action table. Upon each hit for a key range, its corresponding counter is incremented by one. The controller receives periodic reports from the data plane including these statistics, and resets the counters at the beginning of each time period. Then, the controller compares the received statistics with the specifications of the storage nodes. If a storage node is over-utilized, the controller migrates a subset of the hot data in a sub-range to another storage node and reconfigures the chain of the updated sub-range. Then, the controller updates the records in the match-action tables of the switches with the new chain configurations. After the sub-range's data is migrated to the other storage nodes, the old copy is removed from the over-utilized one. Currently, the controller follows a greedy selection algorithm to select the least utilized node to which the data will be migrated.

TurboKV uses physical migration of the data to achieve load balancing between the storage nodes. This approach adapts to all kinds of workloads, unlike the caching approach. In caching [18], the cache absorbs the read requests for the very popular key-value pairs, which makes it perform well in highly skewed read-only workloads, but the effect of caching on load balancing decreases as the workload's update ratio increases. This behavior results from invalidated cached key-value pairs, which force requests to be directed to the target storage node to retrieve the valid pairs.
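A minimal sketch of this control loop is shown below. It assumes the controller can read and reset the per-key-range counters and rewrite table entries through the switch driver; read_counters, node_capacity, subrange_owner, and migrate are hypothetical stand-ins for the Thrift-generated interface mentioned in Section 7, not TurboKV's actual API.

def rebalance_once(read_counters, node_capacity, subrange_owner, migrate):
    """One controller period: find the hottest sub-range on an over-utilized node
    and move it, greedily, to the least utilized node.

    read_counters() -> {subrange_id: hits}; node_capacity -> {node: max hits/period};
    subrange_owner -> {subrange_id: serving node}; migrate(subrange_id, dst) performs
    the data copy and the match-action table update. All four are assumed interfaces.
    """
    counters = read_counters()                                   # also resets the data-plane counters

    # Estimate per-node load by summing the counters of the sub-ranges it serves.
    load = {node: 0 for node in node_capacity}
    for subrange_id, hits in counters.items():
        load[subrange_owner[subrange_id]] += hits

    overloaded = [n for n in load if load[n] > node_capacity[n]]
    for node in overloaded:
        # Hottest sub-range currently on the overloaded node.
        victim = max((s for s in counters if subrange_owner[s] == node),
                     key=lambda s: counters[s])
        target = min(load, key=load.get)                         # greedy: least utilized node
        if target != node:
            migrate(victim, target)                              # copy data, update chain, then delete old copy
            load[target] += counters[victim]
            load[node] -= counters[victim]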

5.2 Failure Handling

We assume that the controller process is a reliable process distinct from the SDN controller. We also assume that the links between storage nodes and switches in data centers are reliable and not prone to failures.

Storage Node Failure. When the controller detects a storage node failure, it reconfigures the chains of the sub-ranges on the failed storage node and updates their corresponding records in the key-based match-action table through the control plane. The controller removes the failed storage node from its position in all chains. In each chain, the predecessor of the failed node will be followed by the successor of the failed node, reducing the chain length by 1, as shown in Figure 10(b). If the failed node was the head of the chain, then the new head will be its successor. If the failed node was the tail of the chain, then the new tail will be its predecessor. Reducing the chain length by 1 makes the system able to sustain fewer failures. That is why the controller distributes the data of the failed node, in sub-range units, among other functional nodes and adds these new nodes at the end of the corresponding sub-ranges' chains in their records in the key-based match-action table. This process of sub-range redistribution restores the chain to its original length.
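The reconfiguration described above amounts to a couple of list operations on each affected replica list. For example, when S2 fails, the chain S1, S2, S3, S4 in Figure 10(b) first becomes S1, S3, S4 and is then restored to full length as S1, S3, S4, S5. The sketch below is a simplified illustration; pick_replacement and update_table are assumed interfaces, not TurboKV's actual API.

def handle_node_failure(failed, chains, pick_replacement, update_table):
    """Remove a failed node from every chain it serves, then restore chain length.

    chains: {subrange_id: [head, ..., tail]}.
    pick_replacement(subrange_id) returns a functional node to append.
    update_table(subrange_id, chain) pushes the new chain to the switches.
    """
    for subrange_id, chain in chains.items():
        if failed not in chain:
            continue
        # Step 1: splice the failed node out; its predecessor now feeds its successor.
        chain.remove(failed)
        update_table(subrange_id, chain)        # the shorter chain keeps serving requests

        # Step 2: after the sub-range's data is copied to a replacement node, append
        # it at the tail to restore the original chain length.
        replacement = pick_replacement(subrange_id)
        chain.append(replacement)
        update_table(subrange_id, chain)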

Switch Failure. The storage servers in the rack of the failed switch would lose access to the network. The controller will detect that these storage servers are unreachable. The controller treats these storage servers as failed storage nodes and distributes their load among other reachable servers, as described before. Then, the failed switch needs to be rebooted or replaced. The new switch starts with an index table that contains all the key ranges handled by its connected storage servers.

6 Scaling Up to Multiple Racks

We have discussed the in-switch coordination for distributed key-value stores within a rack of storage nodes, with the Top-of-Rack (ToR) switch as the coordinator. We now discuss how to scale out the in-switch coordination in the data center network. Figure 11 shows the network architecture of data centers. All the servers in the same rack are connected by a ToR switch. At the highest level, there are Aggregate switches (AGG) and Core switches (Core).

[Figure 11: Scaling Up inside the Data Center Network]

To scale out distributed key-value stores with in-switch coordination, we develop a "hierarchical indexing" scheme. Each ToR switch has the directory information of all sub-ranges located on its connected storage nodes, as described before in Figure 7. In addition to the IPv4 routing table, each AGG switch has two range match-action tables (range and hash), where each table contains the sub-ranges of its connected ToR switches. The Core switches have the range match-action tables of the sub-ranges in their connected AGG switches. For each sub-range in either the AGG or Core switches, the action data represents only the forwarding port towards the head or the tail of the sub-range's chain; no chains are stored in these switches. When a packet is received by an AGG or Core switch, it is processed by the key-based routing protocol without adding any chain header and is forwarded to one of the ports towards the head or the tail of the sub-range's chain, based on the query type (write or read, respectively). When the packet arrives at the ToR switch, the switch processes the packet as discussed in Section 4.3. Replicas of a specific sub-range may be located on different racks. This design leverages the existing data center network architecture and does not introduce additional network topology changes.
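At the AGG and Core levels the lookup therefore reduces to mapping a sub-range to an output port, as in the sketch below (a hedged illustration; the port numbers and helper names are not taken from the paper).

# Upper-level (AGG/Core) index: each entry maps a sub-range to the egress port
# towards the head of its chain (for writes) and towards the tail (for reads).
# Port numbers here are illustrative.
agg_table = [
    {"start": 0x00, "end": 0x3F, "port_to_head": 1, "port_to_tail": 2},
    {"start": 0x40, "end": 0x7F, "port_to_head": 3, "port_to_tail": 3},
]

def upper_level_forward(matching_value, is_write, index=agg_table):
    """Pick the output port at an AGG/Core switch; no chain header is added here."""
    for record in index:
        if record["start"] <= matching_value <= record["end"]:
            return record["port_to_head"] if is_write else record["port_to_tail"]
    return None   # miss: handled by the normal IPv4 routing table

print(upper_level_forward(0x42, is_write=False))   # -> port towards the tail's rack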
7 Implementation Details

We have implemented a prototype of TurboKV, including all the switch data plane and control plane features described in Section 4 and Section 5. We have also implemented the client and server libraries that interface with applications and storage nodes, respectively, as described in Section 3. The switch data plane is written in P4 and is compiled to the simple software switch BMV2 [5] running on Mininet [28]. The key size of the key-value pairs is 16 bytes, and the total key range spans from 0 to 2^128. This range is divided into index records which are saved in the switch data plane. We used 4 register arrays: one for saving the storage nodes' IP addresses, one for saving the forwarding ports of the storage nodes, one for counting the read access requests of the indexing records, and the last one for counting the update access requests of the indexing records. The controller is able to update/read the values of these registers through the control plane. It can also add or remove table entries to balance the load of the storage nodes. The switch data plane does not store any key-value pairs, as these pairs are saved on the storage nodes. This approach makes TurboKV consume a small amount of the on-chip memory, leaving enough space for processing other network operations. The controller is written in Python and can update the switch data plane through the switch driver by using the Thrift API generated by the P4 compiler.

The client and server libraries are written in Python using Scapy [36] for packet manipulation. The client can translate a generated YCSB [8] workload with different distributions and mixed key-value operations into the TurboKV packet format and send the packets through the network. We used Plyvel [33], a Python interface to LevelDB [21], as the storage agent. The server library translates TurboKV packets into Plyvel calls and connects to LevelDB to perform the key-value pair operations. We used chain replication with a chain length equal to 3 for data reliability.
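As a concrete illustration of the server library's storage path, the snippet below maps TurboKV operations onto Plyvel calls. The opcode constants and the database path are assumptions for the example; only the Plyvel API itself (plyvel.DB, put, get, delete, iterator) is the library's real interface.

import plyvel

# Assumed opcode encoding for the example; the paper does not specify the values.
OP_GET, OP_PUT, OP_DEL, OP_RANGE = 1, 2, 3, 4

db = plyvel.DB("/tmp/turbokv-shard", create_if_missing=True)   # hypothetical path

def execute(opcode, key, end_key=None, value=None):
    """Translate one TurboKV request into LevelDB operations via Plyvel."""
    if opcode == OP_PUT:
        db.put(key, value)
        return b"OK"
    if opcode == OP_DEL:
        db.delete(key)
        return b"OK"
    if opcode == OP_GET:
        return db.get(key, default=b"")
    if opcode == OP_RANGE:
        # Inclusive scan over [key, end_key]; each switch-generated sub-range
        # request arrives as a separate packet and is answered independently.
        return list(db.iterator(start=key, stop=end_key, include_stop=True))
    raise ValueError("unknown opcode")

print(execute(OP_PUT, b"user4312".ljust(16, b"\x00"), value=b"payload"))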
8 Performance Evaluation

In this section, we provide the experimental evaluation results of TurboKV running on programmable switches. These results show the performance improvement of TurboKV in key-value operation latency and system throughput.

Experimental Setup. Our experiments consist of eight simple software switches (BMV2 [5]) connecting 16 storage nodes and 4 clients, as described in Figure 12. Each of the clients (h17, h18, h19, h20) runs the client library and generates the key-value queries. These clients represent the request aggregation servers that are the direct clients of the storage nodes in data centers. Each storage node (h1, h2, ..., h15, and h16) runs the server library and uses LevelDB as the storage agent. The whole topology runs on Mininet. The data is distributed over the storage nodes using the range partitioning described in Section 4.1.1 with a 128-record index table. Each storage node is responsible for 24 sub-ranges (head of the chain for 8 sub-ranges, replica for 8 sub-ranges, and tail of the chain for 8 sub-ranges), and stores the data whose keys fall in these sub-ranges.

[Figure 12: Experiment Topology. Clients h17-h20 (10.10.9.17-10.10.9.20) and storage nodes h1-h16 (10.10.9.1-10.10.9.16) under ToR switches S1-S4, aggregate switches S5-S6, and core switches S7-S8.]

Comparison. We compared our in-switch coordination (TurboKV) with the server-driven coordination and client-driven coordination described in Section 1. In TurboKV, the directory information is stored in the switch data plane and updated by the TurboKV controller, and the key-based routing is used to route queries from clients to the target storage nodes. In server-driven coordination, which is implemented in most existing key-value stores [11, 20], all the storage nodes store the directory information and can act as the request coordinator. In client-driven coordination, the client acts as the request coordinator. The client has to download the directory information periodically from a random storage node because, when much of the directory information is outdated, client-driven coordination tends to act as server-driven coordination: the wrong storage node that the client contacts forwards the request again to the target storage node. Both the client-driven and server-driven approaches route queries using the standard L2/L3 routing protocols.

Note that in our experiments, we compare with the ideal case of client-driven coordination, where the client has up-to-date directory information and sends the query directly to the target storage node. We ignore the latency introduced by periodically pulling this information from a random storage node, because this latency depends on the client's location and on the load of the storage node that the client contacts to pull this information. The ideal case of client-driven coordination represents the lowest latency that a client's request can achieve, because it takes the direct path from the client to the target storage node and avoids any latency resulting from outdated directory information, which may cause extra forwarding steps on the path from the client to the target storage node. With frequent updates to the directory information of the key-value store, the ideal client-driven coordination cannot be achieved in real-life systems.

Workloads. We use both uniform and skewed workloads to measure the performance of TurboKV. The skewed workloads follow a Zipf distribution with different skewness parameters (0.9, 0.95, 1.2). These workloads are generated using the YCSB [8] basic database with a 16-byte key size and a 128-byte value size. The generated data is stored into records' files and queries' files, which are then parsed by the client library to convert them into the TurboKV packet format. We generate different types of workloads: read-only, scan-only, write-only, and mixed workloads with multiple write ratios.

8.1 Effect on System Throughput

Impact of Read-only Workloads. Figure 13(a) shows the system throughput under different skewness parameters with read-only queries. We compare TurboKV's throughput against the ideal client-driven throughput and the server-driven throughput. As shown in Figure 13(a), TurboKV performs nearly the same as the ideal client-driven coordination in the highly skewed workloads (zipf-0.99 and zipf-1.2), and less than the ideal client-driven coordination by a maximum of 5% in the uniform and zipf-0.95 workloads. This result is because the in-switch coordination manages all directory information in the switch data plane and uses the key-based routing to deliver requests to the storage nodes directly. Moreover, the in-switch coordination eliminates the load of downloading the updated
directory information periodically from the storage nodes, as this part is managed by the controller, which updates the directory information in the switch data plane through the control plane. In addition, the in-switch coordination outperforms the server-driven coordination and improves the system throughput by a minimum of 26% and a maximum of 39%. This result is because the in-switch coordination eliminates the overhead of the load balancer and skips the potential forwarding step introduced in the server-driven coordination when a request is assigned to a random storage node.

[Figure 13: Effect on System Throughput. (a) Throughput vs. Skewness (read-only); (b) Throughput vs. Write Ratio (uniform); (c) Throughput vs. Write Ratio (zipf-0.95).]

Impact of Write Ratio. Figure 13(b) and Figure 13(c) show the system throughput under uniform and skewed workloads, respectively, while varying the workload write ratio. As shown in Figure 13(b) and Figure 13(c), the throughput decreases as the write ratio increases for all three approaches, because each write query has to update all copies of the key-value pair following the chain replication approach before returning the reply to the client. TurboKV performs roughly the same as the ideal client-driven coordination in the workloads with low write ratios, but outperforms the client-driven coordination as the write ratio increases. This behavior is because of the chain replication implemented in the system for availability and fault tolerance. In in-switch coordination, the switch inserts all the chain nodes in the packet. When the packet arrives at a storage node, the node updates its local copy and forwards the request to the next storage node directly, without any further mapping to know its chain successor. But in the client-driven coordination, the client sends the write query to the head node of the chain. When the query arrives at the storage node, the node updates its local copy, accesses its saved directory information to find its chain successor, and then forwards the packet to it. So, when the write ratio increases, this scenario is performed for a larger portion of queries, which affects the system throughput. Also, TurboKV outperforms the server-driven coordination by a minimum of 26% and a maximum of 44% in case of the uniform workload, and by a minimum of 37% and a maximum of 47% in case of the skewed workload. This improvement is because of the elimination of the forwarding step when the request is assigned to a random storage node, and also the elimination of further mapping steps on each storage node for knowing the chain successor.

8.2 Effect on Key-value Operations Latency

Figures 14 and 15 show the CDFs of all key-value operation latencies under uniform and Zipf-1.2 workloads, respectively, for TurboKV, the ideal client-driven coordination, and the server-driven coordination. The analysis of these two figures is shown in Table 1 and Table 2. Figures 14(a), 15(a), 14(b), and 15(b) show that the read and write latencies in TurboKV are very close to those of the ideal client-driven coordination for the uniform and skewed workloads, because TurboKV skips the potential forwarding step, like the client-driven coordination, by managing the directory information in the switch itself. Compared to server-driven coordination, TurboKV reduces the read latency by 16.3% on average and 19.2% at the 99th percentile for the uniform workload, and by 30% on average and 49% at the 99th percentile for the skewed workload. TurboKV also reduces the write latency by 11% on average and 12.3% at the 99th percentile for the uniform workload, and by 29% on average and 48% at the 99th percentile for the skewed workload. The reduction for the skewed workload is larger than that for the uniform workload, because TurboKV does not only skip the excess forwarding step but also relieves the storage nodes of the request-coordinator role, making them focus only on answering the queries themselves, which reduces tail latencies at the storage nodes.

Figures 14(c) and 15(c) show that TurboKV reduces the scan latency by 23% on average and 13% at the 99th percentile for uniform workloads, and by 22% on average and 16% at the 99th percentile for skewed workloads, when compared with the server-driven coordination. But when compared with the ideal client-driven coordination, TurboKV increases the latency of the scan operation by 2-15% for uniform and skewed workloads, because of the latency introduced inside the switch by the packet circulation and cloning used to divide the requested range when it spans multiple storage nodes.