MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

Page created by Floyd Warner

Style & Fashion

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

MICA: A Holistic Approach to Fast In-Memory
              Key-Value Storage
Hyeontaek Lim, Carnegie Mellon University; Dongsu Han, Korea Advanced Institute of
  Science and Technology (KAIST); David G. Andersen, Carnegie Mellon University;
                          Michael Kaminsky, Intel Labs
         https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/lim

          This paper is included in the Proceedings of the
          11th USENIX Symposium on Networked Systems
               Design and Implementation (NSDI ’14).
                           April 2–4, 2014 • Seattle, WA, USA
                                   ISBN 978-1-931971-09-6

                                                Open access to the Proceedings of the
                                                    11th USENIX Symposium on
                                                  Networked Systems Design and
                                                     Implementation (NSDI ’14)
                                                      is sponsored by USENIX

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
              Hyeontaek Lim,1 Dongsu Han,2 David G. Andersen,1 Michael Kaminsky3
                         1 Carnegie Mellon University, 2 KAIST, 3 Intel Labs

Abstract                                                           at the client to amortize this overhead, but batching in-
                                                                   creases latency, and large batches are unrealistic in large
MICA is a scalable in-memory key-value store that han-             cluster key-value stores because it is more difﬁcult to ac-
dles 65.6 to 76.9 million key-value operations per second          cumulate multiple requests being sent to the same server
using a single general-purpose multi-core system. MICA             from a single client [36].
is over 4–13.5x faster than current state-of-the-art sys-             MICA (Memory-store with Intelligent Concurrent Ac-
tems, while providing consistently high throughput over            cess) is an in-memory key-value store that achieves high
a variety of mixed read and write workloads.                       throughput across a wide range of workloads. MICA can
   MICA takes a holistic approach that encompasses all             provide either store semantics (no existing items can be
aspects of request handling, including parallel data access,       removed without an explicit client request) or cache se-
network request handling, and data structure design, but           mantics (existing items may be removed to reclaim space
makes unconventional choices in each of the three do-              for new items). Under write-intensive workloads with a
mains. First, MICA optimizes for multi-core architectures          skewed key popularity, a single MICA node serves 70.4
by enabling parallel access to partitioned data. Second,           million small key-value items per second (Mops), which
for efﬁcient parallel data access, MICA maps client re-            is 10.8x faster than the next fastest system. For skewed,
quests directly to speciﬁc CPU cores at the server NIC             read-intensive workloads, MICA’s 65.6 Mops is at least
level by using client-supplied information and adopts a            4x faster than other systems even after modifying them to
light-weight networking stack that bypasses the kernel.            use our kernel bypass. MICA achieves 75.5–76.9 Mops
Finally, MICA’s new data structures—circular logs, lossy           under workloads with a uniform key popularity. MICA
concurrent hash indexes, and bulk chaining—handle both             achieves this through the following techniques:
read- and write-intensive workloads at low overhead.
                                                                   Fast and scalable parallel data access: MICA’s data
                                                                   access is fast and scalable, using data partitioning and
1    Introduction                                                  exploiting CPU parallelism within and between cores.
In-memory key-value storage is a crucial building block            Its EREW mode (Exclusive Read Exclusive Write) min-
for many systems, including popular social networking              imizes costly inter-core communication, and its CREW
sites (e.g., Facebook) [36]. These storage systems must            mode (Concurrent Read Exclusive Write) allows multiple
provide high performance when serving many small ob-               cores to serve popular data. MICA’s techniques achieve
jects, whose total volume can grow to TBs and more [5].            consistently high throughput even under skewed work-
   While much prior work focuses on high performance               loads, one weakness of prior partitioned stores.
for read-mostly workloads [15, 30, 32, 37], in-memory              Network stack for efﬁcient request processing: MICA
key-value storage today must also handle write-intensive           interfaces with NICs directly, bypassing the kernel, and
workloads, e.g., to store frequently-changing objects [2,          uses client software and server hardware to direct remote
5, 36]. Systems optimized only for reads often waste re-           key-value requests to appropriate cores where the requests
sources when faced with signiﬁcant write trafﬁc; their             can be processed most efﬁciently. The network stack
inefﬁciencies include lock contention [32], expensive up-          achieves zero-copy packet I/O and request processing.
dates to data structures [15, 30], and complex memory              New data structures for key-value storage: New mem-
management [15, 32, 36].                                           ory allocation and indexing in MICA, optimized for store
   In-memory key-value storage also requires low-                  and cache separately, exploit properties of key-value work-
overhead network communication between clients and                 loads to accelerate write performance with simpliﬁed
servers. Key-value workloads often include a large num-            memory management.
ber of small key-value items [5] that require key-value
storage to handle short messages efﬁciently. Systems us-           2    System Goals
ing standard socket I/O, optimized for bulk communica-
tion, incur high network stack overhead at both kernel-            In this section, we ﬁrst clarify the non-goals and then
and user-level. Current systems attempt to batch requests          discuss the goals of MICA.

                                                               1
USENIX Association                       11th USENIX Symposium on Networked Systems Design and Implementation 429

Non-Goals: We do not change the cluster architecture. It store mode, the system must not remove any stored items
can still shard data and balance load across nodes, and without clients’ permission while striving to achieve good
perform replication and failure recovery. memory utilization.
We do not aim to handle large items that span multiple Commodity hardware: Using general-purpose hardware
packets. Most key-value items will fit comfortably in a reduces the cost of development, equipment, and oper-
single packet [5]. Clients can store a large item in a tradi- ation. Today’s server hardware can provide high-speed
tional key-value system and put a pointer to that system I/O [12, 22], comparable to that of specialized hardware
in MICA. This only marginally increases total latency; such as FPGAs and RDMA-enabled NICs.
one extra round-trip time for indirection is smaller than Although recent studies tried to achieve some of these
the transfer time of a large item sending multiple packets. goals, none of their solutions comprehensively address
We do not strive for durability: All data is stored in them. Some systems achieve high throughput by support-
DRAM. If needed, log-based mechanisms such as those ing only small fixed-length keys [33]. Many rely on client-
from RAMCloud [37] would be needed to allow data to based request batching [15, 30, 33, 36] to amortize high
persist across power failures or reboots. network I/O overhead, which is less effective in a large
MICA instead strives to achieve the following goals: installation of key-value stores [14]; use specialized hard-
High single-node throughput: Sites such as Facebook ware, often with multiple client-server round-trips and/or
replicate some key-value nodes purely to handle load [36]. no support for item eviction (e.g., FPGAs [7, 29], RDMA-
Faster nodes may reduce cost by requiring fewer of them enabled NICs [35]); or do not specifically address remote
overall, reducing the cost and overhead of replication request processing [45]. Many focus on uniform and/or
and invalidation. High-speed nodes are also more able to read-intensive workloads; several systems lack evalua-
handle load spikes and popularity hot spots. Importantly, tion for skewed workloads [7, 33, 35], and some systems
using fewer nodes can also reduce job latency by reduc- have lower throughput for write-intensive workloads than
ing the number of servers touched by client requests. A read-intensive workloads [30]. Several systems attempt
single user request can create more than 500 key-value to handle memory fragmentation explicitly [36], but there
requests [36], and when these requests go to many nodes, are scenarios where the system never reclaims fragmented
the time until all replies arrive increases, delaying comple- free memory, as we describe in the next section. The fast
tion of the user request [10]. Having fewer nodes reduces packet processing achieved by software routers and low-
fan-out, and thus, can improve job completion time. overhead network stacks [12, 19, 20, 41, 43] set a bar for
Low end-to-end latency: The end-to-end latency of a re- how fast a key-value system might operate on general-
mote key-value request greatly affects performance when purpose hardware, but do not teach how their techniques
a client must send back-to-back requests (e.g., when sub- apply to the higher-level processing of key-value requests.
sequent requests are dependent). The system should min-
imize both local key-value processing latency and the 3 Key Design Choices
number of round-trips between the client and server.
Consistent performance across workloads: Real work- Achieving our goals requires rethinking how we design
loads often have a Zipf-distributed key popularity [5], parallel data access, the network stack, and key-value
and it is crucial to provide fast key-value operations re- data structures. We make an unconventional choice for
gardless of skew. Recent uses of in-memory key-value each; we discuss how we overcome its potential draw-
storage also demand fast processing for write-intensive backs to achieve our goals. Figure 1 depicts how these
workloads [2, 36]. components fit together.
Handle small, variable-length key-value items: Most 3.1 Parallel Data Access
key-value items are small [5]. Thus, it is important
to process requests for them efficiently. Ideally, key- Exploiting the parallelism of modern multi-core systems
value request processing over the network should be is crucial for high performance. The most common access
as fast as packet processing in software routers—40 to models are concurrent access and exclusive access:
80 Gbps [12, 19]. Variable-length items require careful Concurrent access is used by most key-value sys-
memory management to reduce fragmentation that can tems [15, 30, 36]. As in Figure 2 (a), multiple CPU cores
waste substantial space [5]. can access the shared data. The integrity of the data struc-
Key-value storage interface and semantics: The ture must be maintained using mutexes [36], optimistic
system must support standard single-key requests (e.g., locking [15, 30], or lock-free data structures [34].
GET(key), PUT(key,value), DELETE(key)) Unfortunately, concurrent writes scale poorly: they in-
that are common in systems such as Memcached. In cache cur frequent cache line transfer between cores, because
mode, the system performs automatic cache management only one core can hold the cache line of the same memory
that may evict stored items at its discretion (e.g., LRU); in location for writing at the same time.

430 11th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Server
Memory
Client CPU core
Parallel Key-value
Client Network CPU core data data
NIC stack
Client CPU core access structures
(§3.2/§4.2)
(§3.1/§4.1) (§3.3/§4.3)
Client CPU core

Figure 1: Components of in-memory key-value stores. MICA’s key design choices in §3 and their details in §4.

CPU core CPU core Partition Direct NIC access is common in software routers to
CPU core Shared CPU core Partition achieve line-rate packet processing [12, 19]. This raw
CPU core data CPU core Partition access to NIC hardware bypasses the kernel to minimize
CPU core CPU core Partition
the packet I/O overhead. It delivers packets in bursts to
efficiently use CPU cycles and the PCIe bus connecting
(a) Concurrent access (b) Exclusive access NICs and CPUs. Direct access, however, precludes useful
Figure 2: Parallel data access models. TCP features such as retransmission, flow control, and
congestion control.
Exclusive access has been explored less often for key- MICA’s network I/O uses direct NIC access. By target-
value storage [6, 25, 33]. Only one core can access part ing only small key-value items, it needs fewer transport-
of the data, as in Figure 2 (b). By partitioning the data layer features. Clients are responsible for retransmitting
(“sharding”), each core exclusively accesses its own parti- packets if needed. Section 4.2 describes such issues and
tion in parallel without inter-core communication. our design in more detail.
Prior work observed that partitioning can have the best 3.2.2 Request Direction
throughput and scalability [30, 45], but cautions that it Request direction delivers client requests to CPU cores for
lowers performance when the load between partitions is processing.1 Modern NICs can deliver packets to specific
imbalanced, as happens under skewed key popularity [15, cores for load balancing or core affinity using hardware-
30, 45]. Furthermore, because each core can access only based packet classification and multi-queue support.
data within its own partition, request direction is needed
to forward requests to the appropriate CPU core. Server Server
Client KEY0 Client KEY0
CPU core CPU core
MICA’s parallel data access: MICA partitions data and A KEY1 A
KEY1
mainly uses exclusive access to the partitions. MICA ex- Client KEY0 Client
KEY0
CPU core CPU core
ploits CPU caches and packet burst I/O to disproportion- B KEY1 B KEY1
ately speed more loaded partitions, nearly eliminating (a) Flow-level core affinity (b) Object-level core affinity
the penalty from skewed workloads. MICA can fall back
Figure 3: Request direction mechanisms.
to concurrent reads if the load is extremely skewed, but
avoids concurrent writes, which are always slower than Flow-level core affinity is available using two methods:
exclusive writes. Section 4.1 describes our data access Receive-Side Scaling (RSS) [12, 19] sends packets to
models and partitioning scheme. cores based by hashing the packet header 5-tuple to iden-
tify which RX queue to target. Flow Director (FDir) [41]
3.2 Network Stack can more flexibly use different parts of the packet header
This section discusses how MICA avoids network stack plus a user-supplied table to map header values to RX
overhead and directs packets to individual cores. queues. Efficient network stacks use affinity to reduce
inter-core contention for TCP control blocks [20, 41].
3.2.1 Network I/O Flow affinity reduces only transport layer contention,
Network I/O is one of the most expensive processing not application-level contention [20], because a single
steps for in-memory key-value storage. TCP processing transport flow can contain requests for any objects (Fig-
alone may consume 70% of CPU time on a many-core ure 3 (a)). Even for datagrams, the benefit of flow affinity
optimized key-value store [33]. is small due to a lack of locality across datagrams [36].
The socket I/O used by most in-memory key-value Object-level core affinity distributes requests to cores
stores [15, 30, 33, 45] provides portability and ease of based upon the application’s partitioning. For example,
development. However, it underperforms in packets per requests sharing the same key would all go to the core
second because it has high per-read() overhead. Many handling that key’s partition (Figure 3 (b)).
systems therefore often have clients include a batch of re- 1 Because we target small key-value requests, we will use requests

quests in a single larger packet to amortize I/O overhead. and packets interchangeably.

USENIX Association 11th USENIX Symposium on Networked Systems Design and Implementation 431

Systems using exclusive access require object-level particularly suited for bulk data writes. This approach is
core affinity, but commodity NIC hardware cannot di- common in flash memory stores due to the high cost of
rectly parse and understand application-level semantics. random flash writes [3, 4, 28], but has been used in only
Software request redirection (e.g., message passing [33]) a few in-memory key-value systems [37].
incurs inter-core communication, which the exclusive ac- Garbage collection is crucial to space efficiency. It re-
cess model is designed to avoid. claims space occupied by overwritten and deleted objects
MICA’s request direction uses Flow Director [23, 31]. by moving live objects to a new log and removing the old
Its clients then encode object-level affinity information log. Unfortunately, garbage collection is costly and often
in a way Flow Director can understand. Servers, in turn, reduces performance because of the large amount of data
inform clients about the object-to-partition mapping. Sec- it must copy, trading memory efficiency against request
tion 4.2 describes how this mechanism works. processing speed.
3.3 Key-Value Data Structures MICA’s memory allocator: MICA uses separate mem-
This section describes MICA’s choice for two main data ory allocators for cache and store semantics. Its cache
structures: allocators that manage memory space for stor- mode uses a log structure with inexpensive garbage collec-
ing key-value items and indexes to find items quickly. tion and in-place update support (Section 4.3.1). MICA’s
allocator provides fast inserts and updates, and exploits
3.3.1 Memory Allocator cache semantics to eliminate log garbage collection and
drastically simplify free space defragmentation. Its store
New item New item mode uses segregated fits [42, 47] that share the unified
memory space to avoid rebalancing size classes (Sec-
tion 4.3.3).

free
3.3.2 Indexing: Read-oriented vs. Write-friendly
free
(a) Dynamic object allocator (b) Append-only log structure
Read-oriented index: Common choices for indexing are
Figure 4: Memory allocators. hash tables [15, 33, 36] or tree-like structures [30]. How-
A dynamic object allocator is a common choice for stor- ever, conventional data structures are much slower for
ing variable-length key-value items (Figure 4 (a)). Sys- writes compared to reads; hash tables examine many slots
tems such as Memcached typically use a slab approach: to find a space for the new item [15], and trees may require
they divide object sizes into classes (e.g., 48-byte, 56-byte, multiple operations to maintain structural invariants [30].
..., 1-MiB2 ) and maintain separate (“segregated”) memory Write-friendly index: Hash tables using chaining [33,
pools for these classes [15, 36]. Because the amount of 36] can insert new items without accessing many memory
space that each class uses typically varies over time, the locations, but they suffer a time-space tradeoff: by hav-
systems use a global memory manager that allocates large ing long chains (few hash buckets), an item lookup must
memory blocks (e.g., 1 MiB) to the pools and dynamically follow a long chain of items, this requiring multiple ran-
rebalances allocations between classes. dom dependent memory accesses; when chains are short
The major challenge for dynamic allocation is the mem- (many hash buckets), memory overhead to store chain-
ory fragmentation caused when blocks are not fully filled. ing pointers increases. Lossy data structures are rather
There may be no free blocks or free objects for some size unusual in in-memory key-value storage and studied only
classes while blocks from other classes are partly empty in limited contexts [7], but it is the standard design in
after deletions. Defragmentation packs objects of each ob- hardware indexes such as CPU caches [21].
ject tightly to make free blocks, which involves expensive MICA’s index: MICA uses new index data structures
memory copy. This process is even more complex if the to offer both high-speed read and write. In cache mode,
memory manager performs rebalancing concurrently with MICA’s lossy index also leverages the cache semantics
threads accessing the memory for other reads and writes. to achieve high insertion speed; it evicts an old item in
Append-only log structures are write-friendly, placing the hash table when a hash collision occurs instead of
new data items at the end of a linear data structure called spending system resources to resolve the collision. By
a “log” (Figure 4 (b)). To update an item, it simply in- using the memory allocator’s eviction support, the MICA
serts a new item to the log that overrides the previous lossy index can avoid evicting recently-used items (Sec-
value. Inserts and updates thus access memory sequen- tion 4.3.2). The MICA lossless index uses bulk chaining,
tially, incurring fewer cache and TLB misses, making logs which allocates cache line-aligned space to a bucket for
2 Binary prefixes (powers of 2) end with an “i” suffix, whereas SI each chain segment. This keeps the chain length short and
prefixes (powers of 10) have no “i” suffix. space efficiency high (Section 4.3.3).

432 11th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

4 MICA Design quests, while the latter still reduces expensive cache line
transfer. CREW handles reads efficiently under highly
This section describes each component in MICA and skewed load, at the cost of managing read-write conflicts.
discusses how they operate together to achieve its goals. MICA minimizes the synchronization cost with efficient
4.1 Parallel Data Access optimistic locking [48] (Section 4.3.2).
Supporting cache semantics in CREW, however, raises
This section explains how CPU cores access data in
a challenge for read (GET) requests: During a GET, the
MICA, but assumes that cores process only the requests
cache may need to update cache management informa-
for which they are responsible. Later in Section 4.2, we
tion. For example, policies such as LRU use bookkeeping
discuss how MICA assigns remote requests to CPU cores.
to remember recently used items, which can cause con-
4.1.1 Keyhash-Based Partitioning flicts and cache-line bouncing among cores. This, in turn,
MICA creates one or more partitions per CPU core and defeats the purpose of using exclusive writes.
stores key-value items in a partition determined by their To address this problem, we choose an approximate ap-
key. Such horizontal partitioning is often used to shard proach: MICA counts reads only from the exclusive-write
across nodes [4, 11], but some key-value storage systems core. Clients round-robin CREW reads across all cores
also use it across cores within a node [6, 25, 33]. in a NUMA domain, so this is effectively a sampling-
MICA uses a keyhash to determine each item’s partition. based approximation to, e.g., LRU replacement as used
A keyhash is the 64-bit hash of an item’s key calculated in MICA’s item eviction support (Section 4.3.1).
by the client and used throughout key-value processing To show performance benefits of EREW and CREW,
in MICA. MICA uses the first few high order bits of the our MICA prototype also provides the CRCW (Concur-
keyhash to obtain the partition index for the item. rent Read Concurrent Write) mode, in which MICA al-
Keyhash partitioning uniformly maps keys to partitions, lows multiple cores to read and write any partition. This
reducing the request distribution imbalance. For example, effectively models concurrent access to the shared data in
in a Zipf-distributed population of size 192 × 220 (192 non-partitioned key-value systems.
Mi) with skewness 0.99 as used by YCSB [9],3 the most
popular key is 9.3 × 106 times more frequently accessed
4.2 Network Stack
than the average; after partitioning keys into 16 partitions, The network stack in MICA provides network I/O to trans-
however, the most popular partition is only 53% more fer packet data between NICs and the server software, and
frequently requested than the average. request direction to route requests to an appropriate CPU
MICA retains high throughput under this remaining core to make subsequent key-value processing efficient.
partition-level skew because it can process requests in Exploiting the small key-value items that MICA tar-
“hot” partitions more efficiently, for two reasons. First, a gets, request and response packets use UDP. Despite
partition is popular because it contains “hot” items; these clients not benefiting from TCP’s packet loss recovery
hot items naturally create locality in data access. With and flow/congestion control, UDP has been used widely
high locality, MICA experiences fewer CPU cache misses for read requests (e.g., GET) in large-scale deployments
when accessing items. Second, the skew causes packet of in-memory key-value storage systems [36] for low la-
I/O to be more efficient for popular partitions (described tency and low overhead. Our protocol includes sequence
in Section 4.2.1). As a result, throughput for the Zipf- numbers in packets, and our application relies on the
distributed workload is 86% of the uniformly-distributed idempotency of GET and PUT operations for simple and
workload, making MICA’s partitioned design practical stateless application-driven loss recovery, if needed: some
even under skewed workloads. queries may not be useful past a deadline, and in many
4.1.2 Operation Modes cases, the network is provisioned well, making retrans-
mission rare and congestion control less crucial [36].
MICA can operate in EREW (Exclusive Read Exclusive
Write) or CREW (Concurrent Read Exclusive Write). 4.2.1 Direct NIC Access
EREW assigns a single CPU core to each partition for all MICA uses Intel’s DPDK [22] instead of standard socket
operations. No concurrent access to partitions eliminates I/O. This allows our user-level server software to control
synchronization and inter-core communication, making NICs and transfer packet data with minimal overhead.
MICA scale linearly with CPU cores. CREW allows any MICA differs from general network processing [12, 19,
core to read partitions, but only a single core can write. 41] that has used direct NIC access in that MICA is an
This combines the benefit of concurrent read and exclu- application that processes high-level key-value requests.
sive write; the former allows all cores to process read re- In NUMA (non-uniform memory access) systems with
3 i-th key constitutes 1/(i0.99 H
n,0.99 ) of total requests, where
multiple CPUs, NICs may have different affinities to
Hn,0.99 = ∑ni=1 (1/i0.99 ) and n is the total number of keys. CPUs. For example, our evaluation hardware has two

USENIX Association 11th USENIX Symposium on Networked Systems Design and Implementation 433

Head (old items are evicted) Tail (new items are inserted)
CPUs, each connected to two NICs via a direct PCIe bus.
MICA uses NUMA-aware memory allocation so that each              Circular log
CPU and NIC only accesses packet buffers stored in their          Log entry
respective NUMA domains.
                                                                        Initial size Keyhash Key/value length Expire time
   MICA uses NIC multi-queue support to allocate a ded-                                       Key
icated RX and TX queue to each core. Cores exclusively                                       Value
access their own queues without synchronization in a sim-
                                                                           Figure 5: Design of a circular log.
ilar way to EREW data access. By directing a packet to
an RX queue, the packet can be processed by a speciﬁc
core, as we discuss in Section 4.2.2.                         The client then embeds the request direction information
                                                              in the packet header: If the request uses exclusive data
Burst packet I/O: MICA uses the DPDK’s burst packet
                                                              access (read/write on EREW and write on CREW), the
I/O to transfer multiple packets (up to 32 in our imple-
                                                              client calculates the partition index from the keyhash of
mentation) each time it requests packets from RX queues
                                                              the request. If the request can be handled by any core (a
or transmits them to TX queues. Burst I/O reduces the per-
                                                              CREW read), it picks a server core index in a round-robin
packet cost of accessing and modifying the queue, while
                                                              way (across requests, but in the same NUMA domain
adding only trivial delay to request processing because
                                                              (Section 4.2.1)). Finally, the client encodes the partition
the burst size is small compared to the packet processing
                                                              or core index as the UDP destination port.4 The server
rate.
                                                              programs FDir to use the UDP destination port, without
   Importantly, burst I/O helps handle skewed workloads.
                                                              hashing, (“perfect match ﬁlter” [23]), as an index into
A core processing popular partitions spends more time
                                                              a table mapping UDP port numbers to a destination RX
processing requests, and therefore performs packet I/O
                                                              queue. Key hashing only slightly burdens clients. Using
less frequently. The lower I/O frequency increases the
                                                              fast string hash functions such as CityHash [8], a sin-
burst size, reducing the per-packet I/O cost (Section 5.2).
                                                              gle client machine equipped with dual 6-core CPUs on
Therefore, popular partitions have more CPU available for
                                                              our testbed can generate over 40 M requests/second with
key-value processing. An unpopular partition’s core has
                                                              client-side key hashing. Clients include the keyhash in
higher per-packet I/O cost, but handles fewer requests.
                                                              requests, and servers reuse the embedded keyhash when
Zero-copy processing: MICA avoids packet data copy
                                                              they need a keyhash during the request processing to ben-
throughout RX/TX and request processing. MICA uses
                                                              eﬁt from ofﬂoaded hash computation.
MTU-sized packet buffers for RX even if incoming re-
                                                                 Client-assisted request direction using NIC hardware
quests are small. Upon receiving a request, MICA avoids
                                                              allows efﬁcient request processing. Our results in Sec-
memory allocation and copying by reusing the request
                                                              tion 5.5 show that an optimized software-based request
packet to construct a response: it ﬂips the source and des-
                                                              direction that receives packets from any core and dis-
tination addresses and ports in the header and updates
                                                              tributes them to appropriate cores is signiﬁcantly slower
only the part of the packet payload that differs between
                                                              than MICA’s hardware-based approach.
the request and response.
                                                              4.3     Data Structure
4.2.2   Client-Assisted Hardware Request Direction
                                                              MICA, in cache mode, uses circular logs to manage mem-
Modern NICs help scale packet processing by directing         ory for key-value items and lossy concurrent hash indexes
packets to different RX queues using hardware features        to index the stored items. Both data structures exploit
such as Receiver-Side Scaling (RSS) and Flow Director         cache semantics to provide fast writes and simple memory
(FDir) [12, 19, 41] based on the packet header.               management. Each MICA partition consists of a single
   Because each MICA key-value request is an individual       circular log and lossy concurrent hash index.
packet, we wish to use hardware packet direction to di-          MICA provides a store mode with straightforward ex-
rectly send packets to the appropriate queue based upon       tensions using segregated ﬁts to allocate memory for key-
the key. Doing so is much more efﬁcient than redirecting      value items and bulk chaining to convert the lossy concur-
packets in software. Unfortunately, the NIC alone cannot      rent hash indexes into lossless ones.
provide key-based request direction: RSS and FDir cannot
classify based on the packet payload, and cannot examine      4.3.1    Circular Log
variable length ﬁelds such as request keys.                   MICA stores items in its circular log by appending them
Client assistance: We instead take advantage of the op-       to the tail of the log (Figure 5). This results in a space-
portunity to co-design the client and server. The client      efﬁcient packing. It updates items in-place as long as the
caches information from a server directory about the             4 To avoid confusion between partition indices and the core indices,
operation mode (EREW or CREW), number of cores,               we use different ranges of UDP ports; a partition may be mapped to a
NUMA domains, and NICs, and number of partitions.             core whose index differs from the partition index.

434 11th USENIX Symposium on Networked Systems Design and Implementation                                    USENIX Association

new size of the key+value does not exceed the size of the Lossy concurrent
Index entries
hash index
item when it was ﬁrst inserted. The size of the circular
log is bounded and does not change, so to add a new item Bucket 0 Version ...
to a full log, MICA evicts the oldest item(s) at the head Bucket 1
of the log to make space.
Bucket 2
Each entry includes the key and value length, key, and
...
value. To locate the next item in the log and support item
resizing, the entry contains the initial item size, and for Tag Item offset
fast lookup, it stores the keyhash of the item. The entry Figure 6: Design of a lossy concurrent hash index.
has an expire time set by the client to ignore stale data.
Garbage collection and defragmentation: The circular 0 234
Virtual address space
log eliminates the expensive garbage collection and free
space defragmentation that are required in conventional
log structures and dynamic memory allocators. Previously
deleted items in the log are automatically collected and Physical address space
removed when new items enter the log. Almost all free first page mapped twice

space remains contiguously between the tail and head. Our MICA prototype implements this scheme in
Exploiting the eviction of live items: Items evicted at userspace by mapping a pool of hugepages to virtual
the head are not reinserted to the log even if they have addresses using the mmap() system call.
not yet expired. In other words, the log may delete items
4.3.2 Lossy Concurrent Hash Index
without clients knowing it. This behavior is valid in cache
workloads; a key-value store must evict items when it MICA’s hash index locates key-value items in the log
becomes full. For example, Memcached [32] uses LRU using a set-associative cache similar to that used in CPU
to remove items and reserve space for new items. caches. As shown in Figure 6, a hash index consists of
MICA uses this item eviction to implement common multiple buckets (configurable for the workload), and
eviction schemes at low cost. Its “natural” eviction is each bucket has a fixed number of index entries (config-
FIFO. MICA can provide LRU by reinserting any re- urable in the source code; 15 in our prototype to occupy
quested items at the tail because only the least recently exactly two cache lines). MICA uses a portion of the key-
used items are evicted at the head. MICA can approxi- hashes to determine an item’s bucket; the item can occupy
mate LRU by reinserting requested items selectively—by any index entry of the bucket unless there is a duplicate.
ignoring items recently (re)inserted and close to the tail; Each index entry contains partial information for the
this approximation offers eviction similar to LRU with- item: a tag and the item offset within the log. A tag is
out frequent reinserts, because recently accessed items another portion of the indexed item’s keyhash used for
remain close to the tail and far from the head. filtering lookup keys that do not match: it can tell whether
A second challenge for conventional logs is that any the indexed item will never match against the lookup key
reference to an evicted item becomes dangling. MICA by comparing the stored tag and the tag from the lookup
does not store back pointers in the log entry to discover all keyhash. We avoid using a zero tag value by making it
references to the entry; instead, it provides detection, and one because we use the zero value to indicate an empty
removes dangling pointers incrementally (Section 4.3.2). index entry. Items are deleted by writing zero values to
Low-level memory management: MICA uses the index entry; the entry in the log will be automatically
hugepages and NUMA-aware allocation. Hugepages garbage collected.
(2 MiB in x86-64) use fewer TLB entries for the same Note that the parts of keyhashes used for the partition
amount of memory, which significantly reduces TLB index, the bucket number, and the tag do not overlap. Our
misses during request processing. Like the network stack, prototype uses 64-bit keyhashes to provide sufficient bits.
MICA allocates memory for circular logs such that cores Lossiness: The hash index is lossy. When indexing a new
access only local memory. key-value item into a full bucket of the hash index, the
Without explicit range checking, accessing an entry index evicts an index entry to accommodate the new item.
near the end of the log (e.g., at 234 − 8 in the example The item evicted is determined by its age; if the item
below) could cause an invalid read or segmentation fault offset is most behind the tail of the log, the item is the
by reading off the end of the range. To avoid such errors oldest (or least recently used if the log is using LRU), and
without range checking, MICA manually maps the virtual the associated index entry of the item is reclaimed.
memory addresses right after the end of the log to the This lossy property allows fast insertion. It avoids ex-
same physical page as the first page of the log, making pensive resolution of hash collisions that lossless indexes
the entire log appear locally contiguous: of other key-value stores require [15, 33]. As a result,

USENIX Association 11th USENIX Symposium on Networked Systems Design and Implementation 435

Item offset Item offset Free space that can fit

0 dangling pointer valid pointer 2 48 Free space that can fit

Offset space
Free space that can fit

Circular log Figure 8: Segregated free lists for a unified space.
Figure 7: Offset space for dangling pointer detection.
CRCW mode, which allows multiple writers to access the
MICA’s insert speed is comparable to lookup speed. same bucket, a writer also spins until the initial version
Handling dangling pointers: When an item is evicted number is even (i.e., no other writers to this bucket) using
from the log, MICA does not delete its index entry. Al- a compare-swap operation instruction.
though it is possible to store back pointers in the log entry, Our MICA prototype uses different code to optimize
updating the hash index requires a random memory write locking. It uses conventional instructions to manipulate
and is complicated due to locking if the index is being version numbers to exploit memory access ordering on the
accessed concurrently, so MICA does not. As a result, x86 architecture [48] in CREW mode where there is only
index pointers can “dangle,” pointing to invalid entries. one writer. EREW mode does not require synchroniza-
To address this problem, MICA uses large pointers tion between cores, so MICA ignores version numbers.
for head/tail and item offsets. As depicted in Figure 7, Because of such a hard-coded optimization, the current
MICA’s index stores log offsets that are wider than needed prototype lacks support for runtime switching between
to address the full size of the log (e.g., 48-bit offsets vs 34 the operation modes.
bits for a 16 GiB log). MICA detects a dangling pointer Multi-stage prefetching: To retrieve or update an item,
before using it by checking if the difference between the MICA must perform request parsing, hash index lookup,
log tail and the item offset is larger than the actual log and log entry retrieval. These stages cause random mem-
size.5 If the tail wraps around the 48-bit size, however, a ory access that can significantly lower system perfor-
dangling pointer may appear valid again, so MICA scans mance if cores stall due to CPU cache and TLB misses.
the index incrementally to remove stale pointers. MICA uses multi-stage prefetching to interleave com-
This scanning must merely complete a full cycle before putation and memory access. MICA applies memory
the tail wraps around in its wide offset space. The speed prefetching for random memory access done at each pro-
at which it wraps is determined by the increment rate of cessing stage in sequence. For example, when a burst
the tail and the width of the item offset. In practice, full of 8 RX packets arrives, MICA fetches packets 0 and 1
scanning is infrequent even if writes occur very frequently. and prefetches packets 2 and 3. It decodes the requests in
For example, with 48-bit offsets and writes occurring at packets 0 and 1, and prefetches buckets of the hash index
230 bytes/second (millions of operations/second), the tail that these requests will access. MICA continues packet
wraps every 248−30 seconds. If the index has 224 buckets, payload prefetching for packets 4 and 5. It then prefetches
MICA must scan only 26 buckets per second, which adds log entries that may be accessed by the requests of pack-
negligible overhead. ets 0 and 1 while prefetching the hash index buckets for
Supporting concurrent access: MICA’s hash index packets 2 and 3, and the payload of packet 6 and 7. MICA
must behave correctly if the system permits concurrent continues this pipeline until all requests are processed.
operations (e.g., CREW). For this, each bucket contains 4.3.3 Store Mode
a 32-bit version number. It performs reads optimistically
The store mode of MICA uses segregated fits [42, 47]
using this version counter to avoid generating memory
similar to fast malloc implementations [27], instead of the
writes while satisfying GET requests [15, 30, 48]. When
circular log. Figure 8 depicts this approach. MICA defines
accessing an item, MICA checks if the initial state of the
multiple size classes incrementing by 8 bytes covering
version number of the bucket is even-numbered, and upon
all supported item sizes, and maintains a freelist for each
completion of data fetch from the index and log, it reads
size class (a linked list of pointers referencing unoccupied
the version number again to check if the final version
memory regions that are at least as large as the size class).
number is equal to the initial version number. If either
When a new item is inserted, MICA chooses the smallest
check fails, it repeats the read request processing from
size class that is at least as large as the item size and has
the beginning. For writes, MICA increments the version
any free space. It stores the item in the free space, and
number by one before beginning, and increments the ver-
inserts any unused region of the free space into a freelist
sion number by one again after finishing all writes. In
that matches that region’s size. When an item is deleted,
5 (Tail − ItemOffset + 248 ) mod 248 > LogSize. MICA coalesces any adjacent free regions using boundary

436 11th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Lossless concurrent
hash index
Index entries 5 Evaluation
Main bucket 0 Version ... We answer four questions about MICA in this section:
Main bucket 1 • Does it perform well under diverse workloads?
Main bucket 2 • Does it provide good latency?
... • How does it scale with more cores and NIC ports?
Spare bucket i Version ... • How does each component affect performance?
Spare bucket j Our results show that MICA has consistently high
Spare bucket k throughput and low latency under a variety of workloads.
... It scales nearly linearly, using CPU cores and NIC ports
Figure 9: Bulk chaining in MICA’s lossless hash index. efficiently. Each component of MICA is needed. MICA
achieves 65.6–76.9 million operations/second (Mops),
which is over 4–13.5x faster than the next fastest system;
tags [26] to recreate a large free region. the gap widens as the fraction of write requests increases.
MICA is written in 12 K lines of C and runs on x86-64
MICA’s segregated fits differ from the simple segre- GNU/Linux. Packet I/O uses the Intel DPDK 1.4.1 [22].
gated storage used in Memcached [15, 32]. MICA main- Compared systems: We use custom versions of open-
tains a unified space for all size classes; on the contrary, source Memcached [32], MemC3 [15], Masstree [30],
Memcached’s SLAB allocator dynamically assigns mem- and RAMCloud [37]. The revisions of the original code
ory blocks to size classes, which effectively partitions the we used are: Memcached: 87e2f36; MemC3: an internal
memory space according to size classes. The unified space version; Masstree: 4ffb946; RAMCloud: a0f6889.
of MICA eliminates the need to rebalance size classes un- Note that the compared systems often offer additional
like the simple segregated storage. Using segregated fits capabilities compared to others. For example, Masstree
also makes better use of memory because MICA already can handle range queries, and RAMCloud offers low la-
has partitioning done with keyhashes; a SLAB alloca- tency processing on InfiniBand; on the other hand, these
tor introducing another partitioning would likely waste key-value stores do not support automatic item eviction
memory by allocating a whole block for only a few items, as Memcached systems do. Our evaluation focuses on
resulting in low memory occupancy. the performance of the standard features (e.g., single key
MICA converts its lossy concurrent hash index into a queries) common to all the compared systems, rather than
lossless hash index by using bulk chaining. Bulk chaining highlighting the potential performance impact from these
is similar to the traditional chaining method in hash tables; semantic differences.
it adds more memory space to the buckets that contain an Modifications to compared systems: We modify the
excessive number of items. compared systems to use our lightweight network stack to
avoid using expensive socket I/O or special hardware (e.g.,
Figure 9 shows the design of the lossless hash index.
InfiniBand). When measuring Memcached’s baseline la-
MICA uses the lossy concurrent hash index as the main
tency, we use its original network stack using the kernel to
buckets and allocates space for separate spare buckets
obtain the latency distribution that typical Memcached de-
that are fewer than the main buckets. When a bucket
ployments would experience. Our experiments do not use
experiences an overflow, whether it is a main bucket or
any client-side request batching. We also modified these
spare bucket, MICA adds an unused spare bucket to the
systems to invoke memory allocation functions though
full bucket to form a bucket chain. If there are no more
our framework if they use hugepages, because the DPDK
spare buckets available, MICA rejects the new item and
requests all hugepages from the OS at initialization and
returns an out-of-space error to the client.
would make the unmodified systems inoperable if they
This data structure is friendly to memory access. The request hugepages from the OS; we kept other memory
main buckets store most of items (about 95%), keeping allocations using no hugepages as-is. Finally, while run-
the number of random memory read for an index lookup ning experiments, we found that statistics collection in
close to 1; as a comparison, cuckoo hashing [39] used RAMCloud caused lock contention, so we disabled it for
in improved Memcached systems [15] would require 1.5 better multi-core performance.
random memory accesses per index lookup in expecta-
tion. MICA also allows good memory efficiency; because 5.1 Evaluation Setup
the spare buckets only store overflow items, making the Server/client configuration: MICA server runs on a ma-
number of spare buckets 10% of the main buckets allows chine equipped with dual 8-core CPUs (Intel Xeon E5-
the system to store the entire dataset of 192 Mi items in 2680 @2.70 GHz), 64 GiB of total system memory, and
our experiments (Section 5). eight 10-Gb Ethernet ports (four Intel X520-T2’s). Each

USENIX Association 11th USENIX Symposium on Networked Systems Design and Implementation 437

CPU has 20 MiB of L3 cache. We disabled logical pro- request processing (which can boost measured through-
cessor support (“Hyper-Threading”). Each CPU accesses put more than 10%), as is commonly done in throughput
the 32 GiB of the system memory that resides in its local measurement of software routers [12, 19], but we avoid
NUMA domain over a quad-channel DDR3-1600 bus. this method because we expect that real deployments of
Each CPU socket is directly connected to two NICs using in-memory key-value stores would not tolerate excessive
PCIe gen2. Access to hardware resources in the remote packet losses, and such flooding can distort the intended
NUMA domain uses an interconnect between two CPUs skew in the workload by causing biased packet losses at
(Intel QuickPath). different cores.
We reserved the half of the memory (16 GiB in each The workload generator does not receive every re-
NUMA domain) for hugepages regardless of how MICA sponse from the server. On our client machines, receiving
and the compared systems use hugepages. packets whose size is not a multiple of 64 bytes is sub-
MICA allocates 16 partitions in the server, and these stantially slower due to an issue in the PCIe bus [18].
partitions are assigned to different cores. We configured The workload generator works around this slow RX
the cache version of MICA to use approximate LRU to by sampling responses to perform fewer packet RX from
evict items; MICA reinserts any recently accessed item at NIC to CPU. It uses its real source MAC addresses for
the tail if the item is closer to the head than to the tail of only a fraction of requests, causing its NIC to drop the
the circular log. responses to the other requests. By looking at the sampled
Two client machines with dual 6-core CPUs (Intel Xeon responses, the workload generator can validate that the
L5640 @2.27 GHz) and two Intel X520-T2’s generate server has correctly processed the requests. Our server is
workloads. The server and clients are directly connected unaffected from this issue and performs full packet RX.
without a switch. Each client is connected to the NICs
from both NUMA domains of the server, allowing a client 5.2 System Throughput
to send a request to any server CPU. We first compare the full-system throughput. MICA uses
Workloads: We explore different aspects of the systems EREW with all 16 cores. However, we use a different
by varying the item size, skew, and read-write ratio. number of cores for the other systems to obtain their best
We use three datasets as shown in the following table: throughput because some of them (Memcached, MemC3,
and RAMCloud) achieve higher throughput with fewer
Dataset Key Size (B) Value Size (B) Count cores (Section 5.4). The throughput numbers are calcu-
Tiny 8 8 192 Mi lated from the actual number of responses sent to the
Small 16 64 128 Mi clients after processing the requests at the server. We de-
Large 128 1024 8 Mi note the cache version of MICA by MICA-c and the store
version of MICA by MICA-s.
We use two workload types: uniform and skewed. Uni- Figure 10 (top) plots the experiment result using tiny
form workloads use the same key popularity for all re- key-value items. MICA performs best, regardless of the
quests; skewed workloads use a non-uniform key popu- skew or the GET ratio. MICA’s throughput reaches 75.5–
larity that follows a Zipf distribution of skewness 0.99, 76.9 Mops for uniform workloads and 65.6–70.5 Mops
which is the same as YCSB’s [9]. for skewed ones; its parallel data access does not incur
Workloads have a varied ratio between GET and PUT. more than a 14% penalty for skewed workloads. MICA
50% GET (50% PUT) workloads are write-intensive, and uses 54.9–66.4 Gbps of network bandwidth at this pro-
95% GET (5% PUT) workloads are read-intensive. They cessing speed—this speed is very close to 66.6 Gbps that
correspond to YCSB’s A and B workloads, respectively. our network stack can handle when doing packet I/O only.
Workload generation: We use our custom key-value The next best system is Masstree at 16.5 Mops, while
request generator that uses similar techniques to our others are below 6.1 Mops. All systems except MICA
lightweight network stack to send more than 40 Mops suffer noticeably under write-intensive 50% GET.
of key-value requests per machine to saturate the link.6 It Small key-value items show similar results in Figure 10
uses approximation techniques of Zipf distribution gener- (middle). However, the gap between MICA and the other
ation [17, 38] for fast skewed workload generation. systems shrinks because MICA becomes network bottle-
To find the maximum meaningful throughput of a sys- necked while the other systems never saturate the network
tem, we adjust the workload generation rate to allow only bandwidth in our experiments.
marginal packet losses (< 1% at any NIC port). We could Large key-value items, shown in Figure 10 (bottom),
generate requests at the highest rate to cause best-effort exacerbates the network bandwidth bottleneck, further
6 MICA clients are still allowed to use standard socket I/O in cases limiting MICA’s throughput. MICA achieves 12.6–14.6
where the socket overhead on the client machines is acceptable because Mops for 50% GET and 8.6–9.4 Mops for 95% GET; note
the MICA server and clients use the plain UDP protocol. that MICA shows high throughput with lower GET ratios,

438 11th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

80                                                                                                                                                                                       8
                                               Uniform, 50% GET                                                                76.9 76.3                  76.9
                                                                                                                                                                 75.5

                                                                                                                                                                                         Throughput (Mops)
                       70
                                                                                                                                            70.4                        70.5
                                                                                                                                                                                                                7
                                               Uniform, 95% GET                                                                                    65.6                        65.6

                                               Skewed, 50% GET                                                                                                                                                  6
   Throughput (Mops)
                       60
                                               Skewed, 95% GET                                                                                                                                                  5
                       50
                                                                                                                                                                                                                4
                       40                                                                                                                                                                                       3
                       30          Tiny key-value items                                                                                                                                                         2                Skewed
                       20
                                                                                                                                                                                                                1                Uniform
                                                                                                                        16.5

                                                                                                           12.2                                                                                                 0
                       10                                              5.7         6.1         5.8   5.7          6.5
                                                                                                                                                                                                                         0   1   2     3      4   5    6       7   8    9 10 11 12 13 14 15
                                                           5.4

                             0.7   1.3   0.7   1.3   0.9         0.9         0.9         1.0
                        0
                            Memcached MemC3                                  RAMCloud Masstree                                  MICA-c                     MICA-s                     Figure 11: Per-core breakdown of end-to-end throughput.
                       80
                                               Uniform, 50% GET                                                                                                                                                                      Uniform, 50% GET                           Skewed, 50% GET
                       70
                                               Uniform, 95% GET                                                                64.1 64.4                  64.1
                                                                                                                                                                                                                                     Uniform, 95% GET                           Skewed, 95% GET
                                               Skewed, 50% GET
   Throughput (Mops)

                                                                                                                                            59.7                 59.9 59.7
                       60                                                                                                                          55.6                        55.6                             200
                                               Skewed, 95% GET
                       50                                                                                                                                                                                                                                                                       174.8

                                                                                                                                                                                         Throughput (Mops)
                                                                                                                                                                                                                             147.3    149.6   148.4    151.7                            149.1
                       40                                                                                                                                                                                       150                                                     136.5   138.4

                       30          Small key-value items
                                                                                                                                                                                                                100
                       20                                                                                               17.7

                                                                                                           10.8
                       10                                                                      5.7   5.7          6.5                                                                                               50
                                                           5.3         5.6
                                                                                   3.1
                             0.7   1.3   0.7   1.3   0.9         0.9         0.5         0.5
                        0
                            Memcached MemC3                                  RAMCloud Masstree                                  MICA-c                     MICA-s
                                                                                                                                                                                                                     0
                       16                                                                                                                                                                                                               EREW                                     CREW
                                               Uniform, 50% GET                                                                             14.6                        14.6

                       14
                                               Uniform, 95% GET                                                                12.6
                                                                                                                                                          13.2
                                                                                                                                                                                      Figure 12: Local throughput of key-value data structures.
                                               Skewed, 50% GET
   Throughput (Mops)

                       12
                                               Skewed, 95% GET
                       10
                                                                                                                                                                                      67.8% to 77.8%. A local benchmark in Figure 12 (without
                                                                                                                                      9.4                        9.4
                                                                                                                                                   8.6                         8.6

                        8          Large key-value items                                                                8.0

                        6
                                                                                                                                                                                      network processing) also shows that skewed workloads
                                                                                                                                                                                      grant good throughput for local key-value processing due
                                                                       5.5
                                                           5.2
                                                                                                           4.9

                        4
                                                                                                                                                                                      to the data locality. These results further justify the parti-
                                                                                                                  3.5
                                                                                               2.9   3.1
                                                                                   2.3
                        2
                                                                                                                                                                                      tioned design of MICA and explains why MICA retains
                                   1.2         1.4
                                         0.7         0.8         0.9
                             0.7
                                                                             0.3         0.4

                        0
                            Memcached MemC3                                  RAMCloud Masstree                                  MICA-c                     MICA-s                     high throughput under skewed workloads.
Figure 10: End-to-end throughput of in-memory key-                                                                                                                                    Summary: MICA’s throughput reaches 76.9 Mops, at
value systems. All systems use our lightweight network                                                                                                                                least 4x faster than the next best system. MICA delivers
stack that does not require request batching. The bottom                                                                                                                              consistent performance across different skewness, write-
graph (large key-value items) uses a different Y scale                                                                                                                                intensiveness, and key-value sizes.
from the ﬁrst two graphs’.
                                                                                                                                                                                      5.3                       Latency
                                                                                                                                                                                      To show that MICA achieves comparably low latency
which require less network bandwidth as the server can
                                                                                                                                                                                      while providing high throughput, we compare MICA’s
omit the key and value from the responses. Unlike MICA,
                                                                                                                                                                                      latency with that of the original Memcached implemen-
however, all other systems achieve higher throughput un-
                                                                                                                                                                                      tation that uses the kernel network stack. To measure the
der 95% GET than under 50% GET because these systems
                                                                                                                                                                                      end-to-end latency, clients tag each request packet with
are bottleneck locally, not by the network bandwidth.
                                                                                                                                                                                      the current timestamp. When receiving responses, clients
   In those measurements, MICA’s cache and store modes                                                                                                                                compare the current timestamp and the previous times-
show only minor differences in the performance. We will                                                                                                                               tamp echoed back in the responses. We use uniform 50%
refer to the cache version of MICA as MICA in the rest
of the evaluation for simplicity.
Skew resistance: Figure 11 compares the per-core                                                                                                                                                                100
                                                                                                                                                                                         Average latency (μs)

                                                                                                                                                                                                                 90                                                Original Memcached
throughput under uniform and skewed workloads of 50%                                                                                                                                                             80
                                                                                                                                                                                                                                                                   MICA
                                                                                                                                                                                                                 70
GET with tiny items. MICA uses EREW. Several cores                                                                                                                                                               60
process more requests under the skewed workload than                                                                                                                                                             50
                                                                                                                                                                                                                 40
under the uniform workload because they process requests                                                                                                                                                         30
                                                                                                                                                                                                                 20
more efﬁciently. The skew in the workload increases the                                                                                                                                                          10
                                                                                                                                                                                                                  0
RX burst size of the most loaded core from 10.2 packets                                                                                                                                                           0.0 0.1 0.2 0.3 0                   10   20      30      40     50     60     70      80
per I/O to 17.3 packets per I/O, reducing its per-packet I/O                                                                                                                                                                                                   Throughput (Mops)
cost, and the higher data locality caused by the workload                                                                                                                             Figure 13: End-to-end latency of the original Memcached
skew improves the average cache hit ratio of all cores from                                                                                                                           and MICA as a function of throughput.

USENIX Association                                                                                                               11th USENIX Symposium on Networked Systems Design and Implementation 439

MICA           MemC3         RAMCloud                                     MICA         MemC3           RAMCloud
                                  Masstree       Memcached                                                  Masstree     Memcached
                        102                                                                       102

                                                                             Throughput (Mops)
   Throughput (Mops)

                              50% GET

                        101                                                                       101

                        100                                                                       100   50% GET

                       10−1                                                                      10−1
                              2              4            8         16                                  2                   4                8
                                             Number of cores                                                       Number of NIC ports
                        102                                                                       102

                                                                             Throughput (Mops)
   Throughput (Mops)

                              95% GET

                        101                                                                       101

                        100                                                                       100   95% GET

                       10−1                                                                      10−1
                              2              4            8         16                                  2                   4                8
                                             Number of cores                                                       Number of NIC ports

Figure 14: End-to-end throughput of in-memory key-                        Figure 15: End-to-end throughput of in-memory key-
value systems using a varying number of cores. All sys-                   value systems using a varying number of NIC ports. All
tems use our lightweight network stack.                                   systems use our lightweight network stack.

GET workloads on tiny items. MICA uses EREW. The                          (lower); MICA and Masstree scale well as before. The
client varies the request rate to observe the relationship                rest also achieve higher throughput, but still do not scale.
between throughput and latency.                                           Note that some systems scale differently from their origi-
   Figure 13 plots the end-to-end latency as a function                   nal papers. For example, MemC3 achieves 5.7 Mops at
of throughput; the error bars indicate 5th- and 95th-                     4 cores, while the original paper shows 4.4 Mops at 16
percentile latency. The original Memcached exhibits al-                   cores [15]. This is because using our network stack in-
most ﬂat latency up to certain throughput, whereas MICA                   stead of their network stack reduces I/O cost, which may
shows varied latency depending on the throughput it                       expose a different bottleneck (e.g., key-value data struc-
serves. MICA’s latency lies between 24–52 μs. At the sim-                 tures) that can change the optimal number of cores for the
ilar latency level of 40 μs, MICA shows 69 Mops—more                      best throughput.
than two orders of magnitude faster than Memcached.                       Network scalability: We also change the available net-
   Because MICA uses a single round-trip per request                      work bandwidth by varying the number of NIC ports we
unlike RDMA-based systems [35], we believe that MICA                      use for request processing. Figure 15 shows that MICA
provides best-in-class low-latency key-value operations.                  again scales well with high network bandwidth, because
Summary: MICA achieves both high throughput and                           MICA can use almost all available network bandwidth
latency near the network minimum.                                         for request processing. The GET ratio does not affect
                                                                          the result for MICA signiﬁcantly. This result suggests
5.4                    Scalability                                        that MICA can possibly scale further with higher net-
CPU scalability: We vary now the number of CPU cores                      work bandwidth (e.g., multiple 40 Gbps NICs). MICA
and compare the end-to-end throughput. We allocate cores                  and Masstree achieve similar performance under the 95%
evenly to both NUMA domains so that cores can efﬁ-                        GET workload when using 2 ports, but Masstree and other
ciently access NICs connected to their CPU socket. We                     systems do not scale well with more ports.
use skewed workloads on tiny items because it is gener-                   Summary: MICA scales well with more CPU cores
ally more difﬁcult for partitioned stores to handle skewed                and more network bandwidth, even under write-intensive
workloads. MICA uses EREW.                                                workloads where other systems tend to scale worse.
   Figure 14 (upper) compares core scalability of systems
with 50% GET. Only MICA and Masstree perform better
                                                                          5.5                    Necessity of the Holistic Approach
with more cores. Memcached, MemC3, and RAMCloud                           In this section, we demonstrate how each component of
scale poorly, achieving their best throughput at 2 cores.                 MICA contributes to its performance. Because MICA is
   The trend continues for 95% GET requests in Figure 14                  a coherent system that exploits the synergy between its

440 11th USENIX Symposium on Networked Systems Design and Implementation                                                          USENIX Association

You can also read