MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

Hyeontaek Lim, Carnegie Mellon University; Dongsu Han, Korea Advanced Institute of Science and Technology (KAIST); David G. Andersen, Carnegie Mellon University; Michael Kaminsky, Intel Labs

https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/lim

This paper is included in the Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’14).
April 2–4, 2014 • Seattle, WA, USA
ISBN 978-1-931971-09-6

Open access to the Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’14) is sponsored by USENIX.
Hyeontaek Lim,1 Dongsu Han,2 David G. Andersen,1 Michael Kaminsky3
1 Carnegie Mellon University, 2 KAIST, 3 Intel Labs

Abstract

MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4–13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.

MICA takes a holistic approach that encompasses all aspects of request handling, including parallel data access, network request handling, and data structure design, but makes unconventional choices in each of the three domains. First, MICA optimizes for multi-core architectures by enabling parallel access to partitioned data. Second, for efficient parallel data access, MICA maps client requests directly to specific CPU cores at the server NIC level by using client-supplied information and adopts a light-weight networking stack that bypasses the kernel. Finally, MICA’s new data structures—circular logs, lossy concurrent hash indexes, and bulk chaining—handle both read- and write-intensive workloads at low overhead.

1    Introduction

In-memory key-value storage is a crucial building block for many systems, including popular social networking sites (e.g., Facebook) [36]. These storage systems must provide high performance when serving many small objects, whose total volume can grow to TBs and more [5].

While much prior work focuses on high performance for read-mostly workloads [15, 30, 32, 37], in-memory key-value storage today must also handle write-intensive workloads, e.g., to store frequently-changing objects [2, 5, 36]. Systems optimized only for reads often waste resources when faced with significant write traffic; their inefficiencies include lock contention [32], expensive updates to data structures [15, 30], and complex memory management [15, 32, 36].

In-memory key-value storage also requires low-overhead network communication between clients and servers. Key-value workloads often include a large number of small key-value items [5] that require key-value storage to handle short messages efficiently. Systems using standard socket I/O, optimized for bulk communication, incur high network stack overhead at both kernel- and user-level. Current systems attempt to batch requests at the client to amortize this overhead, but batching increases latency, and large batches are unrealistic in large cluster key-value stores because it is more difficult to accumulate multiple requests being sent to the same server from a single client [36].

MICA (Memory-store with Intelligent Concurrent Access) is an in-memory key-value store that achieves high throughput across a wide range of workloads. MICA can provide either store semantics (no existing items can be removed without an explicit client request) or cache semantics (existing items may be removed to reclaim space for new items). Under write-intensive workloads with a skewed key popularity, a single MICA node serves 70.4 million small key-value items per second (Mops), which is 10.8x faster than the next fastest system. For skewed, read-intensive workloads, MICA’s 65.6 Mops is at least 4x faster than other systems even after modifying them to use our kernel bypass. MICA achieves 75.5–76.9 Mops under workloads with a uniform key popularity. MICA achieves this through the following techniques:

Fast and scalable parallel data access: MICA’s data access is fast and scalable, using data partitioning and exploiting CPU parallelism within and between cores. Its EREW mode (Exclusive Read Exclusive Write) minimizes costly inter-core communication, and its CREW mode (Concurrent Read Exclusive Write) allows multiple cores to serve popular data. MICA’s techniques achieve consistently high throughput even under skewed workloads, one weakness of prior partitioned stores.

Network stack for efficient request processing: MICA interfaces with NICs directly, bypassing the kernel, and uses client software and server hardware to direct remote key-value requests to appropriate cores where the requests can be processed most efficiently. The network stack achieves zero-copy packet I/O and request processing.

New data structures for key-value storage: New memory allocation and indexing in MICA, optimized for store and cache separately, exploit properties of key-value workloads to accelerate write performance with simplified memory management.

2    System Goals

In this section, we first clarify the non-goals and then discuss the goals of MICA.

Non-Goals: We do not change the cluster architecture. It can still shard data and balance load across nodes, and perform replication and failure recovery.

We do not aim to handle large items that span multiple packets. Most key-value items will fit comfortably in a single packet [5]. Clients can store a large item in a traditional key-value system and put a pointer to that system in MICA. This only marginally increases total latency; one extra round-trip time for indirection is smaller than the transfer time of a large item sending multiple packets.

We do not strive for durability: All data is stored in DRAM. If needed, log-based mechanisms such as those from RAMCloud [37] would be needed to allow data to persist across power failures or reboots.

MICA instead strives to achieve the following goals:

High single-node throughput: Sites such as Facebook replicate some key-value nodes purely to handle load [36]. Faster nodes may reduce cost by requiring fewer of them overall, reducing the cost and overhead of replication and invalidation. High-speed nodes are also more able to handle load spikes and popularity hot spots. Importantly, using fewer nodes can also reduce job latency by reducing the number of servers touched by client requests. A single user request can create more than 500 key-value requests [36], and when these requests go to many nodes, the time until all replies arrive increases, delaying completion of the user request [10]. Having fewer nodes reduces fan-out, and thus, can improve job completion time.

Low end-to-end latency: The end-to-end latency of a remote key-value request greatly affects performance when a client must send back-to-back requests (e.g., when subsequent requests are dependent). The system should minimize both local key-value processing latency and the number of round-trips between the client and server.

Consistent performance across workloads: Real workloads often have a Zipf-distributed key popularity [5], and it is crucial to provide fast key-value operations regardless of skew. Recent uses of in-memory key-value storage also demand fast processing for write-intensive workloads [2, 36].

Handle small, variable-length key-value items: Most key-value items are small [5]. Thus, it is important to process requests for them efficiently. Ideally, key-value request processing over the network should be as fast as packet processing in software routers—40 to 80 Gbps [12, 19]. Variable-length items require careful memory management to reduce fragmentation that can waste substantial space [5].

Key-value storage interface and semantics: The system must support standard single-key requests (e.g., GET(key), PUT(key,value), DELETE(key)) that are common in systems such as Memcached. In cache mode, the system performs automatic cache management that may evict stored items at its discretion (e.g., LRU); in store mode, the system must not remove any stored items without clients’ permission while striving to achieve good memory utilization.

Commodity hardware: Using general-purpose hardware reduces the cost of development, equipment, and operation. Today’s server hardware can provide high-speed I/O [12, 22], comparable to that of specialized hardware such as FPGAs and RDMA-enabled NICs.

Although recent studies tried to achieve some of these goals, none of their solutions comprehensively address them. Some systems achieve high throughput by supporting only small fixed-length keys [33]. Many rely on client-based request batching [15, 30, 33, 36] to amortize high network I/O overhead, which is less effective in a large installation of key-value stores [14]; use specialized hardware, often with multiple client-server round-trips and/or no support for item eviction (e.g., FPGAs [7, 29], RDMA-enabled NICs [35]); or do not specifically address remote request processing [45]. Many focus on uniform and/or read-intensive workloads; several systems lack evaluation for skewed workloads [7, 33, 35], and some systems have lower throughput for write-intensive workloads than read-intensive workloads [30]. Several systems attempt to handle memory fragmentation explicitly [36], but there are scenarios where the system never reclaims fragmented free memory, as we describe in the next section. The fast packet processing achieved by software routers and low-overhead network stacks [12, 19, 20, 41, 43] sets a bar for how fast a key-value system might operate on general-purpose hardware, but does not teach how their techniques apply to the higher-level processing of key-value requests.

3    Key Design Choices

Achieving our goals requires rethinking how we design parallel data access, the network stack, and key-value data structures. We make an unconventional choice for each; we discuss how we overcome its potential drawbacks to achieve our goals. Figure 1 depicts how these components fit together.

3.1    Parallel Data Access

Exploiting the parallelism of modern multi-core systems is crucial for high performance. The most common access models are concurrent access and exclusive access:

Concurrent access is used by most key-value systems [15, 30, 36]. As in Figure 2 (a), multiple CPU cores can access the shared data. The integrity of the data structure must be maintained using mutexes [36], optimistic locking [15, 30], or lock-free data structures [34].

Unfortunately, concurrent writes scale poorly: they incur frequent cache line transfer between cores, because only one core can hold the cache line of the same memory location for writing at the same time.

Figure 1: Components of in-memory key-value stores. (Clients send requests through the server’s NIC and network stack (§3.2/§4.2) to CPU cores, which perform parallel data access (§3.1/§4.1) on key-value data structures (§3.3/§4.3) in memory.) MICA’s key design choices are in §3 and their details in §4.

Figure 2: Parallel data access models. ((a) Concurrent access: all cores share the data. (b) Exclusive access: each core owns its own partition.)

Exclusive access has been explored less often for key-value storage [6, 25, 33]. Only one core can access part of the data, as in Figure 2 (b). By partitioning the data (“sharding”), each core exclusively accesses its own partition in parallel without inter-core communication.

Prior work observed that partitioning can have the best throughput and scalability [30, 45], but cautions that it lowers performance when the load between partitions is imbalanced, as happens under skewed key popularity [15, 30, 45]. Furthermore, because each core can access only data within its own partition, request direction is needed to forward requests to the appropriate CPU core.

MICA’s parallel data access: MICA partitions data and mainly uses exclusive access to the partitions. MICA exploits CPU caches and packet burst I/O to disproportionately speed more loaded partitions, nearly eliminating the penalty from skewed workloads. MICA can fall back to concurrent reads if the load is extremely skewed, but avoids concurrent writes, which are always slower than exclusive writes. Section 4.1 describes our data access models and partitioning scheme.

3.2    Network Stack

This section discusses how MICA avoids network stack overhead and directs packets to individual cores.

3.2.1    Network I/O

Network I/O is one of the most expensive processing steps for in-memory key-value storage. TCP processing alone may consume 70% of CPU time on a many-core optimized key-value store [33].

The socket I/O used by most in-memory key-value stores [15, 30, 33, 45] provides portability and ease of development. However, it underperforms in packets per second because it has high per-read() overhead. Many systems therefore often have clients include a batch of requests in a single larger packet to amortize I/O overhead.

Direct NIC access is common in software routers to achieve line-rate packet processing [12, 19]. This raw access to NIC hardware bypasses the kernel to minimize the packet I/O overhead. It delivers packets in bursts to efficiently use CPU cycles and the PCIe bus connecting NICs and CPUs. Direct access, however, precludes useful TCP features such as retransmission, flow control, and congestion control.

MICA’s network I/O uses direct NIC access. By targeting only small key-value items, it needs fewer transport-layer features. Clients are responsible for retransmitting packets if needed. Section 4.2 describes such issues and our design in more detail.

3.2.2    Request Direction

Request direction delivers client requests to CPU cores for processing. (Footnote 1: Because we target small key-value requests, we will use requests and packets interchangeably.) Modern NICs can deliver packets to specific cores for load balancing or core affinity using hardware-based packet classification and multi-queue support.

Figure 3: Request direction mechanisms. ((a) Flow-level core affinity: a flow’s requests for different keys land on one core. (b) Object-level core affinity: requests for the same key land on the same core.)

Flow-level core affinity is available using two methods: Receive-Side Scaling (RSS) [12, 19] sends packets to cores by hashing the packet header 5-tuple to identify which RX queue to target. Flow Director (FDir) [41] can more flexibly use different parts of the packet header plus a user-supplied table to map header values to RX queues. Efficient network stacks use affinity to reduce inter-core contention for TCP control blocks [20, 41].

Flow affinity reduces only transport layer contention, not application-level contention [20], because a single transport flow can contain requests for any objects (Figure 3 (a)). Even for datagrams, the benefit of flow affinity is small due to a lack of locality across datagrams [36].

Object-level core affinity distributes requests to cores based upon the application’s partitioning. For example, requests sharing the same key would all go to the core handling that key’s partition (Figure 3 (b)).

Systems using exclusive access require object-level core affinity, but commodity NIC hardware cannot directly parse and understand application-level semantics. Software request redirection (e.g., message passing [33]) incurs inter-core communication, which the exclusive access model is designed to avoid.

MICA’s request direction uses Flow Director [23, 31]. Its clients then encode object-level affinity information in a way Flow Director can understand. Servers, in turn, inform clients about the object-to-partition mapping. Section 4.2 describes how this mechanism works.

3.3    Key-Value Data Structures

This section describes MICA’s choice for two main data structures: allocators that manage memory space for storing key-value items and indexes to find items quickly.

3.3.1    Memory Allocator

Figure 4: Memory allocators. ((a) A dynamic object allocator with per-class pools. (b) An append-only log structure: new items are placed at the end, with free space following the tail.)

A dynamic object allocator is a common choice for storing variable-length key-value items (Figure 4 (a)). Systems such as Memcached typically use a slab approach: they divide object sizes into classes (e.g., 48-byte, 56-byte, ..., 1-MiB) and maintain separate (“segregated”) memory pools for these classes [15, 36]. (Footnote 2: Binary prefixes (powers of 2) end with an “i” suffix, whereas SI prefixes (powers of 10) have no “i” suffix.) Because the amount of space that each class uses typically varies over time, the systems use a global memory manager that allocates large memory blocks (e.g., 1 MiB) to the pools and dynamically rebalances allocations between classes.

The major challenge for dynamic allocation is the memory fragmentation caused when blocks are not fully filled. There may be no free blocks or free objects for some size classes while blocks from other classes are partly empty after deletions. Defragmentation packs objects of each size class tightly to make free blocks, which involves expensive memory copy. This process is even more complex if the memory manager performs rebalancing concurrently with threads accessing the memory for other reads and writes.

Append-only log structures are write-friendly, placing new data items at the end of a linear data structure called a “log” (Figure 4 (b)). To update an item, it simply inserts a new item to the log that overrides the previous value. Inserts and updates thus access memory sequentially, incurring fewer cache and TLB misses, making logs particularly suited for bulk data writes. This approach is common in flash memory stores due to the high cost of random flash writes [3, 4, 28], but has been used in only a few in-memory key-value systems [37].

Garbage collection is crucial to space efficiency. It reclaims space occupied by overwritten and deleted objects by moving live objects to a new log and removing the old log. Unfortunately, garbage collection is costly and often reduces performance because of the large amount of data it must copy, trading memory efficiency against request processing speed.

MICA’s memory allocator: MICA uses separate memory allocators for cache and store semantics. Its cache mode uses a log structure with inexpensive garbage collection and in-place update support (Section 4.3.1). MICA’s allocator provides fast inserts and updates, and exploits cache semantics to eliminate log garbage collection and drastically simplify free space defragmentation. Its store mode uses segregated fits [42, 47] that share the unified memory space to avoid rebalancing size classes (Section 4.3.3).

3.3.2    Indexing: Read-oriented vs. Write-friendly

Read-oriented index: Common choices for indexing are hash tables [15, 33, 36] or tree-like structures [30]. However, conventional data structures are much slower for writes compared to reads; hash tables examine many slots to find a space for the new item [15], and trees may require multiple operations to maintain structural invariants [30].

Write-friendly index: Hash tables using chaining [33, 36] can insert new items without accessing many memory locations, but they suffer a time-space tradeoff: with long chains (few hash buckets), an item lookup must follow a long chain of items, thus requiring multiple random dependent memory accesses; when chains are short (many hash buckets), memory overhead to store chaining pointers increases. Lossy data structures are rather unusual in in-memory key-value storage and have been studied only in limited contexts [7], but they are the standard design in hardware indexes such as CPU caches [21].

MICA’s index: MICA uses new index data structures to offer both high-speed read and write. In cache mode, MICA’s lossy index also leverages the cache semantics to achieve high insertion speed; it evicts an old item in the hash table when a hash collision occurs instead of spending system resources to resolve the collision. By using the memory allocator’s eviction support, the MICA lossy index can avoid evicting recently-used items (Section 4.3.2). The MICA lossless index uses bulk chaining, which allocates cache line-aligned space to a bucket for each chain segment. This keeps the chain length short and space efficiency high (Section 4.3.3).

4    MICA Design

This section describes each component in MICA and discusses how they operate together to achieve its goals.

4.1    Parallel Data Access

This section explains how CPU cores access data in MICA, but assumes that cores process only the requests for which they are responsible. Later in Section 4.2, we discuss how MICA assigns remote requests to CPU cores.

4.1.1    Keyhash-Based Partitioning

MICA creates one or more partitions per CPU core and stores key-value items in a partition determined by their key. Such horizontal partitioning is often used to shard across nodes [4, 11], but some key-value storage systems also use it across cores within a node [6, 25, 33].

MICA uses a keyhash to determine each item’s partition. A keyhash is the 64-bit hash of an item’s key calculated by the client and used throughout key-value processing in MICA. MICA uses the first few high order bits of the keyhash to obtain the partition index for the item.
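To make this concrete, the following is a minimal C sketch (not MICA’s actual code) of deriving a partition index from the high-order keyhash bits, assuming a power-of-two partition count so that the computation reduces to a shift:

    #include <stdint.h>

    /* Sketch: use the first log2_num_partitions high-order bits of the
     * 64-bit keyhash as the partition index (assumes a power-of-two
     * partition count; the real scheme need not be restricted to this). */
    static inline uint16_t
    partition_index(uint64_t keyhash, unsigned log2_num_partitions)
    {
        return (uint16_t)(keyhash >> (64 - log2_num_partitions));
    }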
Keyhash partitioning uniformly maps keys to partitions, reducing the request distribution imbalance. For example, in a Zipf-distributed population of size 192 × 2^20 (192 Mi) with skewness 0.99 as used by YCSB [9], the most popular key is 9.3 × 10^6 times more frequently accessed than the average; after partitioning keys into 16 partitions, however, the most popular partition is only 53% more frequently requested than the average. (Footnote 3: The i-th key constitutes 1/(i^0.99 · H_{n,0.99}) of total requests, where H_{n,0.99} = Σ_{i=1}^{n} (1/i^0.99) and n is the total number of keys.)

MICA retains high throughput under this remaining partition-level skew because it can process requests in “hot” partitions more efficiently, for two reasons. First, a partition is popular because it contains “hot” items; these hot items naturally create locality in data access. With high locality, MICA experiences fewer CPU cache misses when accessing items. Second, the skew causes packet I/O to be more efficient for popular partitions (described in Section 4.2.1). As a result, throughput for the Zipf-distributed workload is 86% of the uniformly-distributed workload, making MICA’s partitioned design practical even under skewed workloads.

4.1.2    Operation Modes

MICA can operate in EREW (Exclusive Read Exclusive Write) or CREW (Concurrent Read Exclusive Write). EREW assigns a single CPU core to each partition for all operations. No concurrent access to partitions eliminates synchronization and inter-core communication, making MICA scale linearly with CPU cores. CREW allows any core to read partitions, but only a single core can write. This combines the benefit of concurrent read and exclusive write; the former allows all cores to process read requests, while the latter still reduces expensive cache line transfer. CREW handles reads efficiently under highly skewed load, at the cost of managing read-write conflicts. MICA minimizes the synchronization cost with efficient optimistic locking [48] (Section 4.3.2).

Supporting cache semantics in CREW, however, raises a challenge for read (GET) requests: During a GET, the cache may need to update cache management information. For example, policies such as LRU use bookkeeping to remember recently used items, which can cause conflicts and cache-line bouncing among cores. This, in turn, defeats the purpose of using exclusive writes.

To address this problem, we choose an approximate approach: MICA counts reads only from the exclusive-write core. Clients round-robin CREW reads across all cores in a NUMA domain, so this is effectively a sampling-based approximation to, e.g., LRU replacement as used in MICA’s item eviction support (Section 4.3.1).

To show performance benefits of EREW and CREW, our MICA prototype also provides the CRCW (Concurrent Read Concurrent Write) mode, in which MICA allows multiple cores to read and write any partition. This effectively models concurrent access to the shared data in non-partitioned key-value systems.

4.2    Network Stack

The network stack in MICA provides network I/O to transfer packet data between NICs and the server software, and request direction to route requests to an appropriate CPU core to make subsequent key-value processing efficient.

Exploiting the small key-value items that MICA targets, request and response packets use UDP. Despite clients not benefiting from TCP’s packet loss recovery and flow/congestion control, UDP has been used widely for read requests (e.g., GET) in large-scale deployments of in-memory key-value storage systems [36] for low latency and low overhead. Our protocol includes sequence numbers in packets, and our application relies on the idempotency of GET and PUT operations for simple and stateless application-driven loss recovery, if needed: some queries may not be useful past a deadline, and in many cases, the network is provisioned well, making retransmission rare and congestion control less crucial [36].

4.2.1    Direct NIC Access

MICA uses Intel’s DPDK [22] instead of standard socket I/O. This allows our user-level server software to control NICs and transfer packet data with minimal overhead. MICA differs from general network processing [12, 19, 41] that has used direct NIC access in that MICA is an application that processes high-level key-value requests.

In NUMA (non-uniform memory access) systems with multiple CPUs, NICs may have different affinities to CPUs. For example, our evaluation hardware has two
CPUs, each connected to two NICs via a direct PCIe bus. MICA uses NUMA-aware memory allocation so that each CPU and NIC only accesses packet buffers stored in their respective NUMA domains.

MICA uses NIC multi-queue support to allocate a dedicated RX and TX queue to each core. Cores exclusively access their own queues without synchronization in a similar way to EREW data access. By directing a packet to an RX queue, the packet can be processed by a specific core, as we discuss in Section 4.2.2.

Burst packet I/O: MICA uses the DPDK’s burst packet I/O to transfer multiple packets (up to 32 in our implementation) each time it requests packets from RX queues or transmits them to TX queues. Burst I/O reduces the per-packet cost of accessing and modifying the queue, while adding only trivial delay to request processing because the burst size is small compared to the packet processing rate.

Importantly, burst I/O helps handle skewed workloads. A core processing popular partitions spends more time processing requests, and therefore performs packet I/O less frequently. The lower I/O frequency increases the burst size, reducing the per-packet I/O cost (Section 5.2). Therefore, popular partitions have more CPU available for key-value processing. An unpopular partition’s core has higher per-packet I/O cost, but handles fewer requests.
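A per-core loop in this style can be sketched with DPDK’s burst API as follows; handle_request is a hypothetical placeholder for the key-value processing, and queue setup is omitted:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32  /* up to 32 packets per burst, as in the paper */

    extern void handle_request(struct rte_mbuf *pkt);  /* hypothetical */

    /* Each core polls only its own RX/TX queue pair, mirroring the
     * EREW-style exclusive queue ownership described above. */
    void rx_tx_loop(uint16_t port, uint16_t queue)
    {
        struct rte_mbuf *pkts[BURST_SIZE];

        for (;;) {
            /* One queue access is amortized over the whole burst. */
            uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);

            for (uint16_t i = 0; i < n; i++)
                handle_request(pkts[i]);  /* responses reuse the same mbufs */

            uint16_t sent = 0;
            while (sent < n)
                sent += rte_eth_tx_burst(port, queue, pkts + sent,
                                         (uint16_t)(n - sent));
        }
    }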
Zero-copy processing: MICA avoids packet data copy throughout RX/TX and request processing. MICA uses MTU-sized packet buffers for RX even if incoming requests are small. Upon receiving a request, MICA avoids memory allocation and copying by reusing the request packet to construct a response: it flips the source and destination addresses and ports in the header and updates only the part of the packet payload that differs between the request and response.
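The header flip can be sketched as below; the struct layouts are generic Ethernet/IPv4/UDP headers rather than MICA’s definitions, and checksum updates are omitted:

    #include <stdint.h>
    #include <string.h>

    struct eth_hdr  { uint8_t dst[6], src[6]; uint16_t type; } __attribute__((packed));
    struct ipv4_hdr { uint8_t vihl, tos; uint16_t len, id, frag;
                      uint8_t ttl, proto; uint16_t csum;
                      uint32_t src, dst; } __attribute__((packed));
    struct udp_hdr  { uint16_t sport, dport, len, csum; } __attribute__((packed));

    /* Turn a request packet into a response in place: swap the L2/L3/L4
     * source and destination fields, then rewrite only the payload bytes
     * that differ (e.g., the value of a GET response). */
    static void make_response_in_place(uint8_t *pkt)
    {
        struct eth_hdr *eth = (struct eth_hdr *)pkt;
        struct ipv4_hdr *ip = (struct ipv4_hdr *)(eth + 1);
        struct udp_hdr *udp = (struct udp_hdr *)(ip + 1);

        uint8_t mac[6];
        memcpy(mac, eth->dst, 6);
        memcpy(eth->dst, eth->src, 6);
        memcpy(eth->src, mac, 6);

        uint32_t a = ip->src;    ip->src = ip->dst;      ip->dst = a;
        uint16_t p = udp->sport; udp->sport = udp->dport; udp->dport = p;
        /* ...then update the differing payload bytes and transmit. */
    }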
4.2.2    Client-Assisted Hardware Request Direction

Modern NICs help scale packet processing by directing packets to different RX queues using hardware features such as Receive-Side Scaling (RSS) and Flow Director (FDir) [12, 19, 41] based on the packet header.

Because each MICA key-value request is an individual packet, we wish to use hardware packet direction to directly send packets to the appropriate queue based upon the key. Doing so is much more efficient than redirecting packets in software. Unfortunately, the NIC alone cannot provide key-based request direction: RSS and FDir cannot classify based on the packet payload, and cannot examine variable length fields such as request keys.

Client assistance: We instead take advantage of the opportunity to co-design the client and server. The client caches information from a server directory about the operation mode (EREW or CREW), number of cores, NUMA domains, and NICs, and number of partitions.

The client then embeds the request direction information in the packet header: If the request uses exclusive data access (read/write on EREW and write on CREW), the client calculates the partition index from the keyhash of the request. If the request can be handled by any core (a CREW read), it picks a server core index in a round-robin way (across requests, but in the same NUMA domain (Section 4.2.1)). Finally, the client encodes the partition or core index as the UDP destination port. (Footnote 4: To avoid confusion between partition indices and core indices, we use different ranges of UDP ports; a partition may be mapped to a core whose index differs from the partition index.) The server programs FDir to use the UDP destination port, without hashing (“perfect match filter” [23]), as an index into a table mapping UDP port numbers to a destination RX queue. Key hashing only slightly burdens clients. Using fast string hash functions such as CityHash [8], a single client machine equipped with dual 6-core CPUs on our testbed can generate over 40 M requests/second with client-side key hashing. Clients include the keyhash in requests, and servers reuse the embedded keyhash when they need a keyhash during the request processing to benefit from offloaded hash computation.
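A client-side sketch of this encoding follows; the port bases, the server_info fields, and the helper name are illustrative, not MICA’s wire format:

    #include <stdint.h>

    /* Hypothetical disjoint UDP port ranges (see Footnote 4). */
    #define PARTITION_PORT_BASE 10000
    #define CORE_PORT_BASE      20000

    struct server_info {              /* cached from the server directory */
        unsigned log2_num_partitions;
        uint16_t num_cores;
        uint16_t next_core;           /* round-robin cursor */
    };

    /* Pick the UDP destination port that the server's FDir perfect match
     * filter maps to the intended RX queue. */
    static uint16_t request_udp_port(struct server_info *s, uint64_t keyhash,
                                     int exclusive /* EREW op or CREW write */)
    {
        if (exclusive) {
            uint16_t partition =
                (uint16_t)(keyhash >> (64 - s->log2_num_partitions));
            return (uint16_t)(PARTITION_PORT_BASE + partition);
        }
        /* CREW read: any core (in the right NUMA domain) may serve it. */
        uint16_t core = s->next_core;
        s->next_core = (uint16_t)((core + 1) % s->num_cores);
        return (uint16_t)(CORE_PORT_BASE + core);
    }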
Client-assisted request direction using NIC hardware allows efficient request processing. Our results in Section 5.5 show that an optimized software-based request direction that receives packets from any core and distributes them to appropriate cores is significantly slower than MICA’s hardware-based approach.

4.3    Data Structure

MICA, in cache mode, uses circular logs to manage memory for key-value items and lossy concurrent hash indexes to index the stored items. Both data structures exploit cache semantics to provide fast writes and simple memory management. Each MICA partition consists of a single circular log and lossy concurrent hash index.

MICA provides a store mode with straightforward extensions using segregated fits to allocate memory for key-value items and bulk chaining to convert the lossy concurrent hash indexes into lossless ones.

4.3.1    Circular Log

Figure 5: Design of a circular log. (Old items are evicted at the head; new items are inserted at the tail. A log entry stores the initial size, keyhash, key/value length, expire time, key, and value.)

MICA stores items in its circular log by appending them to the tail of the log (Figure 5). This results in a space-efficient packing. It updates items in-place as long as the
new size of the key+value does not exceed the size of the item when it was first inserted. The size of the circular log is bounded and does not change, so to add a new item to a full log, MICA evicts the oldest item(s) at the head of the log to make space.

Each entry includes the key and value length, key, and value. To locate the next item in the log and support item resizing, the entry contains the initial item size, and for fast lookup, it stores the keyhash of the item. The entry has an expire time set by the client to ignore stale data.
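A plausible C layout for such an entry is sketched below; the field widths are illustrative, not MICA’s actual packing:

    #include <stdint.h>

    /* Sketch of a circular log entry with the fields described above. */
    struct log_entry {
        uint32_t initial_size;  /* size at first insert: bounds in-place
                                   updates and locates the next entry */
        uint64_t keyhash;       /* enables fast lookup without key compare */
        uint16_t key_length;
        uint32_t value_length;
        uint32_t expire_time;   /* client-set; stale entries are ignored */
        uint8_t  data[];        /* key bytes followed by value bytes */
    };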
Garbage collection and defragmentation: The circular log eliminates the expensive garbage collection and free space defragmentation that are required in conventional log structures and dynamic memory allocators. Previously deleted items in the log are automatically collected and removed when new items enter the log. Almost all free space remains contiguously between the tail and head.

Exploiting the eviction of live items: Items evicted at the head are not reinserted to the log even if they have not yet expired. In other words, the log may delete items without clients knowing it. This behavior is valid in cache workloads; a key-value store must evict items when it becomes full. For example, Memcached [32] uses LRU to remove items and reserve space for new items.

MICA uses this item eviction to implement common eviction schemes at low cost. Its “natural” eviction is FIFO. MICA can provide LRU by reinserting any requested items at the tail because only the least recently used items are evicted at the head. MICA can approximate LRU by reinserting requested items selectively—by ignoring items recently (re)inserted and close to the tail; this approximation offers eviction similar to LRU without frequent reinserts, because recently accessed items remain close to the tail and far from the head.

A second challenge for conventional logs is that any reference to an evicted item becomes dangling. MICA does not store back pointers in the log entry to discover all references to the entry; instead, it provides detection, and removes dangling pointers incrementally (Section 4.3.2).

Low-level memory management: MICA uses hugepages and NUMA-aware allocation. Hugepages (2 MiB in x86-64) use fewer TLB entries for the same amount of memory, which significantly reduces TLB misses during request processing. Like the network stack, MICA allocates memory for circular logs such that cores access only local memory.

Without explicit range checking, accessing an entry near the end of the log (e.g., at 2^34 − 8 in the example below) could cause an invalid read or segmentation fault by reading off the end of the range. To avoid such errors without range checking, MICA manually maps the virtual memory addresses right after the end of the log to the same physical page as the first page of the log, making the entire log appear locally contiguous: the virtual address space from 0 to 2^34 is backed by the log’s physical pages, with the first physical page mapped twice—once at the start and once just past the end of the log.

Our MICA prototype implements this scheme in userspace by mapping a pool of hugepages to virtual addresses using the mmap() system call.
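The same wrap-around mapping can be sketched with plain POSIX shared memory standing in for the hugepage pool; the object name and error handling are simplified, and this is not MICA’s actual code:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a log of log_size bytes so that its first page appears again
     * immediately after its end, letting reads run past the boundary
     * without range checks. */
    void *map_wrapped_log(size_t log_size, size_t page_size)
    {
        int fd = shm_open("/log_pool", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, (off_t)log_size) != 0)
            return NULL;

        /* Reserve log_size + one extra page of contiguous addresses. */
        uint8_t *base = mmap(NULL, log_size + page_size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;

        /* Map the log, then map its first page again right after the end. */
        mmap(base, log_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + log_size, page_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
        return base;
    }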
4.3.2    Lossy Concurrent Hash Index

Figure 6: Design of a lossy concurrent hash index. (Each bucket holds a version field and several index entries; each entry stores a tag and an item offset.)

MICA’s hash index locates key-value items in the log using a set-associative cache similar to that used in CPU caches. As shown in Figure 6, a hash index consists of multiple buckets (configurable for the workload), and each bucket has a fixed number of index entries (configurable in the source code; 15 in our prototype to occupy exactly two cache lines). MICA uses a portion of the keyhashes to determine an item’s bucket; the item can occupy any index entry of the bucket unless there is a duplicate.

Each index entry contains partial information for the item: a tag and the item offset within the log. A tag is another portion of the indexed item’s keyhash used for filtering lookup keys that do not match: it can tell whether the indexed item will never match against the lookup key by comparing the stored tag and the tag from the lookup keyhash. We avoid using a zero tag value by making it one because we use the zero value to indicate an empty index entry. Items are deleted by writing zero values to the index entry; the entry in the log will be automatically garbage collected.

Note that the parts of keyhashes used for the partition index, the bucket number, and the tag do not overlap. Our prototype uses 64-bit keyhashes to provide sufficient bits.
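The layout might be sketched as follows; the 16-bit tag/48-bit offset split is an assumption chosen so that a version field plus 15 entries fill exactly two 64-byte cache lines, and the tag slice shown is illustrative:

    #include <stdint.h>

    #define ENTRIES_PER_BUCKET 15

    struct index_entry {
        uint64_t tag : 16;          /* keyhash bits; 0 marks an empty entry,
                                       so a real tag of 0 is stored as 1 */
        uint64_t item_offset : 48;  /* wide offset into the circular log */
    };

    struct bucket {
        uint32_t version;           /* even = quiescent, odd = write underway */
        uint32_t padding;
        struct index_entry entries[ENTRIES_PER_BUCKET];
    } __attribute__((aligned(64))); /* 8 + 15*8 = 128 B = two cache lines */

    /* Tags only rule out mismatches; equal tags still require a full key
     * comparison against the log entry. */
    static inline int tag_may_match(const struct index_entry *e,
                                    uint64_t lookup_keyhash)
    {
        uint16_t tag = (uint16_t)(lookup_keyhash & 0xffff); /* illustrative */
        if (tag == 0)
            tag = 1;
        return e->tag == tag;
    }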
Lossiness: The hash index is lossy. When indexing a new key-value item into a full bucket of the hash index, the index evicts an index entry to accommodate the new item. The item evicted is determined by its age; if the item offset is most behind the tail of the log, the item is the oldest (or least recently used if the log is using LRU), and the associated index entry of the item is reclaimed.

This lossy property allows fast insertion. It avoids expensive resolution of hash collisions that lossless indexes of other key-value stores require [15, 33]. As a result,
MICA’s insert speed is comparable to lookup speed.

Figure 7: Offset space for dangling pointer detection. (Item offsets live in a 48-bit offset space much larger than the circular log; offsets within the last LogSize bytes behind the tail are valid, while offsets farther behind are dangling.)

Handling dangling pointers: When an item is evicted from the log, MICA does not delete its index entry. Although it is possible to store back pointers in the log entry, updating the hash index requires a random memory write and is complicated due to locking if the index is being accessed concurrently, so MICA does not. As a result, index pointers can “dangle,” pointing to invalid entries.

To address this problem, MICA uses large pointers for head/tail and item offsets. As depicted in Figure 7, MICA’s index stores log offsets that are wider than needed to address the full size of the log (e.g., 48-bit offsets vs 34 bits for a 16 GiB log). MICA detects a dangling pointer before using it by checking if the difference between the log tail and the item offset is larger than the actual log size. (Footnote 5: (Tail − ItemOffset + 2^48) mod 2^48 > LogSize.) If the tail wraps around the 48-bit size, however, a dangling pointer may appear valid again, so MICA scans the index incrementally to remove stale pointers.

This scanning must merely complete a full cycle before the tail wraps around in its wide offset space. The speed at which it wraps is determined by the increment rate of the tail and the width of the item offset. In practice, full scanning is infrequent even if writes occur very frequently. For example, with 48-bit offsets and writes occurring at 2^30 bytes/second (millions of operations/second), the tail wraps every 2^(48−30) seconds. If the index has 2^24 buckets, MICA must scan only 2^6 buckets per second, which adds negligible overhead.
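Footnote 5’s test is cheap because the 2^48 modulus reduces to a mask in unsigned arithmetic; a sketch:

    #include <stdint.h>

    #define OFFSET_BITS 48
    #define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

    /* Implements (Tail - ItemOffset + 2^48) mod 2^48 > LogSize: the offset
     * is dangling unless it lies within the last log_size bytes written. */
    static inline int is_dangling(uint64_t tail, uint64_t item_offset,
                                  uint64_t log_size)
    {
        return ((tail - item_offset) & OFFSET_MASK) > log_size;
    }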
Supporting concurrent access: MICA’s hash index must behave correctly if the system permits concurrent operations (e.g., CREW). For this, each bucket contains a 32-bit version number. It performs reads optimistically using this version counter to avoid generating memory writes while satisfying GET requests [15, 30, 48]. When accessing an item, MICA checks if the initial state of the version number of the bucket is even-numbered, and upon completion of data fetch from the index and log, it reads the version number again to check if the final version number is equal to the initial version number. If either check fails, it repeats the read request processing from the beginning. For writes, MICA increments the version number by one before beginning, and increments the version number by one again after finishing all writes. In CRCW mode, which allows multiple writers to access the same bucket, a writer also spins until the initial version number is even (i.e., no other writers to this bucket) using a compare-and-swap instruction.

Our MICA prototype uses different code to optimize locking. It uses conventional instructions to manipulate version numbers to exploit memory access ordering on the x86 architecture [48] in CREW mode where there is only one writer. EREW mode does not require synchronization between cores, so MICA ignores version numbers. Because of such a hard-coded optimization, the current prototype lacks support for runtime switching between the operation modes.
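The read side of this protocol can be sketched as follows; the bucket struct is abbreviated, lookup_and_copy is a hypothetical helper, and GCC atomic builtins stand in for the plain x86 ordering the prototype relies on:

    #include <stdint.h>

    struct bucket {
        uint32_t version;   /* even = quiescent, odd = write in progress */
        /* ... index entries omitted ... */
    };

    extern int lookup_and_copy(const struct bucket *b, uint64_t keyhash,
                               void *out);  /* hypothetical helper */

    /* Optimistic read: retry until the same even version is observed both
     * before and after fetching data from the index and log. */
    int optimistic_get(struct bucket *b, uint64_t keyhash, void *out)
    {
        for (;;) {
            uint32_t v0 = __atomic_load_n(&b->version, __ATOMIC_ACQUIRE);
            if (v0 & 1)
                continue;                 /* writer active; retry */

            int found = lookup_and_copy(b, keyhash, out);

            uint32_t v1 = __atomic_load_n(&b->version, __ATOMIC_ACQUIRE);
            if (v0 == v1)
                return found;             /* consistent snapshot */
            /* version moved; restart the read from the beginning */
        }
    }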
Multi-stage prefetching: To retrieve or update an item, MICA must perform request parsing, hash index lookup, and log entry retrieval. These stages cause random memory access that can significantly lower system performance if cores stall due to CPU cache and TLB misses.

MICA uses multi-stage prefetching to interleave computation and memory access. MICA applies memory prefetching for random memory access done at each processing stage in sequence. For example, when a burst of 8 RX packets arrives, MICA fetches packets 0 and 1 and prefetches packets 2 and 3. It decodes the requests in packets 0 and 1, and prefetches buckets of the hash index that these requests will access. MICA continues packet payload prefetching for packets 4 and 5. It then prefetches log entries that may be accessed by the requests of packets 0 and 1 while prefetching the hash index buckets for packets 2 and 3, and the payload of packets 6 and 7. MICA continues this pipeline until all requests are processed.
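The following compressed sketch illustrates the idea over a single burst; decode_request and bucket_for are hypothetical helpers, and the actual pipeline keeps each stage a fixed distance ahead rather than folding them into one loop:

    #include <rte_mbuf.h>

    struct request;
    extern struct request *decode_request(struct rte_mbuf *pkt); /* hypothetical */
    extern void *bucket_for(const struct request *req);          /* hypothetical */

    void process_burst(struct rte_mbuf **pkts, int n)
    {
        /* Warm up by prefetching the first packet payloads. */
        for (int i = 0; i < n && i < 2; i++)
            __builtin_prefetch(rte_pktmbuf_mtod(pkts[i], void *), 0, 3);

        for (int i = 0; i < n; i++) {
            /* Keep payload prefetch two packets ahead of decoding. */
            if (i + 2 < n)
                __builtin_prefetch(rte_pktmbuf_mtod(pkts[i + 2], void *), 0, 3);

            struct request *req = decode_request(pkts[i]);

            /* Hint the hash index bucket this request will touch next;
             * a later stage would likewise prefetch the log entry before
             * finally executing the operation on cache-resident data. */
            __builtin_prefetch(bucket_for(req), 0, 3);
        }
    }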
4.3.3    Store Mode

Figure 8: Segregated free lists for a unified space. (Each size class keeps a list of the free regions that can fit items of that class, all drawn from one shared memory space.)

The store mode of MICA uses segregated fits [42, 47] similar to fast malloc implementations [27], instead of the circular log. Figure 8 depicts this approach. MICA defines multiple size classes incrementing by 8 bytes covering all supported item sizes, and maintains a freelist for each size class (a linked list of pointers referencing unoccupied memory regions that are at least as large as the size class). When a new item is inserted, MICA chooses the smallest size class that is at least as large as the item size and has any free space. It stores the item in the free space, and inserts any unused region of the free space into a freelist that matches that region’s size. When an item is deleted, MICA coalesces any adjacent free regions using boundary
tags [26] to recreate a large free region.
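The allocation side of this scheme can be sketched as follows, with region splitting and boundary-tag coalescing elided; the names are illustrative:

    #include <stddef.h>

    #define CLASS_GRANULARITY 8  /* size classes increment by 8 bytes */

    struct free_region {
        struct free_region *next;  /* freelist link */
        size_t size;               /* at least the class size */
    };

    extern struct free_region *freelists[];  /* one list per size class */
    extern size_t num_classes;

    static void *alloc_item(size_t item_size)
    {
        /* Smallest class at least as large as the item... */
        size_t c = (item_size + CLASS_GRANULARITY - 1) / CLASS_GRANULARITY;

        /* ...that actually has free space, scanning upward if needed. */
        for (; c < num_classes; c++) {
            struct free_region *r = freelists[c];
            if (r != NULL) {
                freelists[c] = r->next;
                /* Any unused tail of r is re-inserted into the freelist
                 * matching its own size (splitting not shown). */
                return r;
            }
        }
        return NULL;  /* no fit: out of space */
    }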
MICA's segregated fits differ from the simple segregated storage used in Memcached [15, 32]. MICA maintains a unified space for all size classes; in contrast, Memcached's SLAB allocator dynamically assigns memory blocks to size classes, which effectively partitions the memory space according to size classes. MICA's unified space eliminates the need to rebalance size classes, unlike the simple segregated storage. Using segregated fits also makes better use of memory because MICA already partitions memory with keyhashes; a SLAB allocator introducing another partitioning would likely waste memory by allocating a whole block for only a few items, resulting in low memory occupancy.

MICA converts its lossy concurrent hash index into a lossless hash index by using bulk chaining. Bulk chaining is similar to the traditional chaining method in hash tables; it adds more memory space to the buckets that contain an excessive number of items.

Figure 9: Bulk chaining in MICA's lossless hash index. [diagram: an array of main buckets followed by a smaller array of spare buckets; each bucket holds a version field and index entries, and an overflowing bucket chains to a spare bucket]

Figure 9 shows the design of the lossless hash index. MICA uses the lossy concurrent hash index as the main buckets and allocates space for separate spare buckets that are fewer in number than the main buckets. When a bucket experiences an overflow, whether it is a main bucket or a spare bucket, MICA adds an unused spare bucket to the full bucket to form a bucket chain. If no more spare buckets are available, MICA rejects the new item and returns an out-of-space error to the client.

This data structure is friendly to memory access. The main buckets store most items (about 95%), keeping the number of random memory reads for an index lookup close to 1; as a comparison, the cuckoo hashing [39] used in improved Memcached systems [15] would require 1.5 random memory accesses per index lookup in expectation. MICA also achieves good memory efficiency: because the spare buckets store only overflow items, provisioning spare buckets at 10% of the number of main buckets allows the system to store the entire dataset of 192 Mi items in our experiments (Section 5).
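A minimal C sketch of bulk chaining follows. The seven-entry bucket, the field names, and the freelist-of-spares representation are illustrative assumptions; only the overall scheme (main buckets, a smaller spare array, chaining on overflow, and an out-of-space error when spares run out) comes from the text above.

    #include <stddef.h>
    #include <stdint.h>

    #define ENTRIES_PER_BUCKET 7
    #define NO_CHAIN UINT32_MAX

    struct bucket {
        uint32_t version;                     /* for optimistic reads (see above) */
        uint32_t chain;                       /* index of next spare bucket, or NO_CHAIN */
        uint64_t entries[ENTRIES_PER_BUCKET]; /* tag + item offset; 0 = empty (assumed) */
    };

    struct index {
        struct bucket *main_buckets;   /* selected by keyhash; count is a power of two */
        struct bucket *spare_buckets;  /* ~10% as many as main buckets */
        uint32_t free_spare;           /* head of freelist of unused spare buckets */
        size_t num_main;
    };

    /* Returns 0 on success, -1 for an out-of-space error reported to the client. */
    int index_insert(struct index *idx, uint64_t keyhash, uint64_t entry) {
        struct bucket *b = &idx->main_buckets[keyhash & (idx->num_main - 1)];
        for (;;) {
            for (int i = 0; i < ENTRIES_PER_BUCKET; i++)
                if (b->entries[i] == 0) { b->entries[i] = entry; return 0; }
            if (b->chain == NO_CHAIN) {                 /* bucket is full: extend it */
                if (idx->free_spare == NO_CHAIN) return -1;  /* no spare buckets left */
                uint32_t s = idx->free_spare;
                idx->free_spare = idx->spare_buckets[s].chain; /* pop the freelist */
                idx->spare_buckets[s].chain = NO_CHAIN;
                b->chain = s;
            }
            b = &idx->spare_buckets[b->chain];          /* continue in the chain */
        }
    }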
5 Evaluation
We answer four questions about MICA in this section:
• Does it perform well under diverse workloads?
• Does it provide good latency?
• How does it scale with more cores and NIC ports?
• How does each component affect performance?
Our results show that MICA has consistently high throughput and low latency under a variety of workloads. It scales nearly linearly, using CPU cores and NIC ports efficiently. Each component of MICA is needed. MICA achieves 65.6–76.9 million operations/second (Mops), which is over 4–13.5x faster than the next fastest system; the gap widens as the fraction of write requests increases.
MICA is written in 12 K lines of C and runs on x86-64 GNU/Linux. Packet I/O uses the Intel DPDK 1.4.1 [22].
Compared systems: We use custom versions of open-source Memcached [32], MemC3 [15], Masstree [30], and RAMCloud [37]. The revisions of the original code we used are: Memcached: 87e2f36; MemC3: an internal version; Masstree: 4ffb946; RAMCloud: a0f6889.
Note that the compared systems often offer capabilities that the others lack. For example, Masstree can handle range queries, and RAMCloud offers low-latency processing on InfiniBand; on the other hand, these key-value stores do not support automatic item eviction as Memcached systems do. Our evaluation focuses on the performance of the standard features (e.g., single-key queries) common to all the compared systems, rather than highlighting the potential performance impact of these semantic differences.
Modifications to compared systems: We modify the compared systems to use our lightweight network stack to avoid expensive socket I/O or special hardware (e.g., InfiniBand). When measuring Memcached's baseline latency, we use its original kernel-based network stack to obtain the latency distribution that typical Memcached deployments would experience. Our experiments do not use any client-side request batching. We also modified these systems to invoke memory allocation functions through our framework if they use hugepages, because the DPDK requests all hugepages from the OS at initialization and would make the unmodified systems inoperable if they requested hugepages from the OS; we kept other memory allocations that use no hugepages as-is. Finally, while running experiments, we found that statistics collection in RAMCloud caused lock contention, so we disabled it for better multi-core performance.

5.1 Evaluation Setup
Server/client configuration: The MICA server runs on a machine equipped with dual 8-core CPUs (Intel Xeon E5-2680 @2.70 GHz), 64 GiB of total system memory, and eight 10-Gb Ethernet ports (four Intel X520-T2's).

Each CPU has 20 MiB of L3 cache. We disabled logical processor support ("Hyper-Threading"). Each CPU accesses the 32 GiB of system memory that resides in its local NUMA domain over a quad-channel DDR3-1600 bus. Each CPU socket is directly connected to two NICs using PCIe gen2. Access to hardware resources in the remote NUMA domain uses the interconnect between the two CPUs (Intel QuickPath).
We reserved half of the memory (16 GiB in each NUMA domain) for hugepages regardless of how MICA and the compared systems use hugepages.
MICA allocates 16 partitions in the server, and these partitions are assigned to different cores. We configured the cache version of MICA to use approximate LRU to evict items; MICA reinserts any recently accessed item at the tail if the item is closer to the head than to the tail of the circular log.
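This reinsertion test reduces to a distance comparison on log offsets. A sketch, assuming 48-bit wrapping offsets (as in footnote 5) and hypothetical names:

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_MASK ((1ULL << 48) - 1)     /* log offsets wrap modulo 2^48 */

    /* True if the item sits closer to the head (the eviction side) than to
     * the tail; the caller then copies the item to the tail of the log. */
    bool should_reinsert(uint64_t head, uint64_t tail, uint64_t item_offset) {
        uint64_t from_head = (item_offset - head) & OFFSET_MASK;
        uint64_t to_tail   = (tail - item_offset) & OFFSET_MASK;
        return from_head < to_tail;
    }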
Two client machines with dual 6-core CPUs (Intel Xeon L5640 @2.27 GHz) and two Intel X520-T2's generate workloads. The server and clients are directly connected without a switch. Each client is connected to the NICs in both NUMA domains of the server, allowing a client to send a request to any server CPU.
Workloads: We explore different aspects of the systems by varying the item size, skew, and read-write ratio. We use three datasets as shown in the following table:

Dataset   Key Size (B)   Value Size (B)   Count
Tiny      8              8                192 Mi
Small     16             64               128 Mi
Large     128            1024             8 Mi

We use two workload types: uniform and skewed. Uniform workloads use the same key popularity for all requests; skewed workloads use a non-uniform key popularity that follows a Zipf distribution of skewness 0.99, the same as YCSB's [9].
Workloads vary the ratio between GET and PUT. 50% GET (50% PUT) workloads are write-intensive, and 95% GET (5% PUT) workloads are read-intensive. They correspond to YCSB's A and B workloads, respectively.
Workload generation: We use our custom key-value request generator, which uses techniques similar to our lightweight network stack to send more than 40 Mops of key-value requests per machine to saturate the link.⁶ It uses approximation techniques for Zipf distribution generation [17, 38] to generate skewed workloads quickly.

Footnote 6: MICA clients are still allowed to use standard socket I/O in cases where the socket overhead on the client machines is acceptable, because the MICA server and clients use the plain UDP protocol.
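For reference, the widely used O(1)-per-sample approximation due to Gray et al. (also the basis of YCSB's ZipfianGenerator) looks as follows in C; whether MICA's generator matches this exactly is not stated here, so treat it as a representative sketch.

    #include <math.h>
    #include <stdlib.h>

    typedef struct { long n; double theta, alpha, zetan, eta; } zipf_t;

    static double zeta(long n, double theta) {        /* sum of 1/i^theta, i = 1..n */
        double s = 0.0;
        for (long i = 1; i <= n; i++) s += 1.0 / pow((double)i, theta);
        return s;
    }

    void zipf_init(zipf_t *z, long n, double theta) { /* theta = 0.99 in these workloads */
        z->n = n;
        z->theta = theta;
        z->alpha = 1.0 / (1.0 - theta);
        z->zetan = zeta(n, theta);
        z->eta = (1.0 - pow(2.0 / (double)n, 1.0 - theta)) /
                 (1.0 - zeta(2, theta) / z->zetan);
    }

    long zipf_next(zipf_t *z) {                       /* returns a key rank in [1, n] */
        double u = drand48();
        double uz = u * z->zetan;
        if (uz < 1.0) return 1;
        if (uz < 1.0 + pow(0.5, z->theta)) return 2;
        return 1 + (long)((double)z->n * pow(z->eta * u - z->eta + 1.0, z->alpha));
    }

With n = 192 Mi keys, the one-time zeta(n) sum is the only expensive step; incremental approximations of that sum keep re-initialization cheap as well.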
To find the maximum meaningful throughput of a system, we adjust the workload generation rate to allow only marginal packet losses (< 1% at any NIC port). We could generate requests at the highest rate to cause best-effort request processing (which can boost measured throughput by more than 10%), as is commonly done in throughput measurement of software routers [12, 19], but we avoid this method because we expect that real deployments of in-memory key-value stores would not tolerate excessive packet losses, and such flooding can distort the intended skew in the workload by causing biased packet losses at different cores.
The workload generator does not receive every response from the server. On our client machines, receiving packets whose size is not a multiple of 64 bytes is substantially slower due to an issue in the PCIe bus [18]. The workload generator works around this slow RX by sampling responses, performing fewer packet RX operations from NIC to CPU: it uses its real source MAC address for only a fraction of requests, causing its NIC to drop the responses to the other requests. By looking at the sampled responses, the workload generator can validate that the server has correctly processed the requests. Our server is unaffected by this issue and performs full packet RX.

5.2 System Throughput
We first compare the full-system throughput. MICA uses EREW with all 16 cores. However, we use a different number of cores for the other systems to obtain their best throughput, because some of them (Memcached, MemC3, and RAMCloud) achieve higher throughput with fewer cores (Section 5.4). The throughput numbers are calculated from the actual number of responses sent to the clients after processing the requests at the server. We denote the cache version of MICA by MICA-c and the store version by MICA-s.
Figure 10 (top) plots the experiment results using tiny key-value items. MICA performs best, regardless of the skew or the GET ratio. MICA's throughput reaches 75.5–76.9 Mops for uniform workloads and 65.6–70.5 Mops for skewed ones; its parallel data access does not incur more than a 14% penalty for skewed workloads. MICA uses 54.9–66.4 Gbps of network bandwidth at this processing speed, very close to the 66.6 Gbps that our network stack can handle when doing packet I/O alone. The next best system is Masstree at 16.5 Mops, while the others are below 6.1 Mops. All systems except MICA suffer noticeably under the write-intensive 50% GET workload.
Small key-value items show similar results in Figure 10 (middle). However, the gap between MICA and the other systems shrinks because MICA becomes network-bottlenecked while the other systems never saturate the network bandwidth in our experiments.
Large key-value items, shown in Figure 10 (bottom), exacerbate the network bandwidth bottleneck, further limiting MICA's throughput. MICA achieves 12.6–14.6 Mops for 50% GET and 8.6–9.4 Mops for 95% GET; note that MICA shows higher throughput with lower GET ratios,
which require less network bandwidth as the server can omit the key and value from the responses. Unlike MICA, however, all the other systems achieve higher throughput under 95% GET than under 50% GET because these systems are bottlenecked locally, not by the network bandwidth.

Figure 10: End-to-end throughput of in-memory key-value systems (top: tiny, middle: small, bottom: large key-value items). All systems use our lightweight network stack, which does not require request batching. The bottom graph uses a different Y scale from the first two.

In those measurements, MICA's cache and store modes show only minor differences in performance. We refer to the cache version of MICA as MICA in the rest of the evaluation for simplicity.
Skew resistance: Figure 11 compares the per-core throughput under uniform and skewed workloads of 50% GET with tiny items. MICA uses EREW. Several cores process more requests under the skewed workload than under the uniform workload because they process requests more efficiently: the skew increases the RX burst size of the most loaded core from 10.2 to 17.3 packets per I/O, reducing its per-packet I/O cost, and the higher data locality caused by the skew improves the average cache hit ratio of all cores from 67.8% to 77.8%. A local benchmark in Figure 12 (without network processing) also shows that skewed workloads yield good throughput for local key-value processing due to this data locality. These results further justify the partitioned design of MICA and explain why MICA retains high throughput under skewed workloads.

Figure 11: Per-core breakdown of end-to-end throughput (uniform vs. skewed, 50% GET, tiny items).

Figure 12: Local throughput of key-value data structures (EREW and CREW; uniform and skewed; 50% and 95% GET).

Summary: MICA's throughput reaches 76.9 Mops, at least 4x faster than the next best system. MICA delivers consistent performance across different skewness, write-intensiveness, and key-value sizes.

5.3 Latency
To show that MICA achieves comparably low latency while providing high throughput, we compare MICA's latency with that of the original Memcached implementation, which uses the kernel network stack. To measure end-to-end latency, clients tag each request packet with the current timestamp. When receiving responses, clients compare the current timestamp with the timestamp echoed back in the responses. We use uniform 50% GET workloads on tiny items. MICA uses EREW. The client varies the request rate to observe the relationship between throughput and latency.
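A minimal sketch of this timestamp-echo measurement, with hypothetical packet field names; both timestamps are taken on the same client, so no clock synchronization with the server is needed:

    #include <stdint.h>
    #include <time.h>

    struct kv_pkt { uint64_t client_ts; /* ... opcode, key, value ... */ };

    static uint64_t now_us(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
    }

    /* Client side: stamp on transmit, diff on receive. The server only
     * echoes client_ts back unchanged. */
    void on_tx(struct kv_pkt *req)                 { req->client_ts = now_us(); }
    uint64_t latency_us(const struct kv_pkt *resp) { return now_us() - resp->client_ts; }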

Figure 13: End-to-end latency of the original Memcached and MICA as a function of throughput (average latency; error bars: 5th/95th percentiles).

Figure 13 plots the end-to-end latency as a function of throughput; the error bars indicate 5th- and 95th-percentile latency. The original Memcached exhibits almost flat latency up to a certain throughput, whereas MICA's latency varies with the throughput it serves. MICA's latency lies between 24–52 μs. At a similar latency level of 40 μs, MICA achieves 69 Mops, more than two orders of magnitude faster than Memcached.
Because MICA uses a single round trip per request, unlike RDMA-based systems [35], we believe that MICA provides best-in-class low-latency key-value operations.
Summary: MICA achieves both high throughput and latency near the network minimum.

5.4 Scalability
CPU scalability: We now vary the number of CPU cores and compare the end-to-end throughput. We allocate cores evenly to both NUMA domains so that cores can efficiently access the NICs connected to their CPU socket. We use skewed workloads on tiny items because it is generally more difficult for partitioned stores to handle skewed workloads. MICA uses EREW.
Figure 14 (upper) compares the core scalability of the systems with 50% GET. Only MICA and Masstree perform better with more cores; Memcached, MemC3, and RAMCloud scale poorly, achieving their best throughput at 2 cores.
The trend continues for 95% GET requests in Figure 14 (lower); MICA and Masstree scale well as before. The rest also achieve higher throughput, but still do not scale. Note that some systems scale differently than in their original papers. For example, MemC3 achieves 5.7 Mops at 4 cores, while the original paper reports 4.4 Mops at 16 cores [15]. This is because using our network stack instead of theirs reduces I/O cost, which may expose a different bottleneck (e.g., the key-value data structures) that can change the optimal number of cores for the best throughput.

Figure 14: End-to-end throughput of in-memory key-value systems using a varying number of cores (upper: 50% GET; lower: 95% GET). All systems use our lightweight network stack.

Network scalability: We also change the available network bandwidth by varying the number of NIC ports used for request processing. Figure 15 shows that MICA again scales well with higher network bandwidth, because MICA can use almost all of the available network bandwidth for request processing. The GET ratio does not significantly affect the result for MICA. This result suggests that MICA could scale further with higher network bandwidth (e.g., multiple 40 Gbps NICs). MICA and Masstree achieve similar performance under the 95% GET workload when using 2 ports, but Masstree and the other systems do not scale well with more ports.

Figure 15: End-to-end throughput of in-memory key-value systems using a varying number of NIC ports (upper: 50% GET; lower: 95% GET). All systems use our lightweight network stack.

Summary: MICA scales well with more CPU cores and more network bandwidth, even under write-intensive workloads where other systems tend to scale worse.

5.5 Necessity of the Holistic Approach
In this section, we demonstrate how each component of MICA contributes to its performance. Because MICA is a coherent system that exploits the synergy between its