Getting More Performance with Polymorphism from Emerging Memory Technologies

Iyswarya Narayanan (Penn State), Aishwarya Ganesan (UW-Madison), Anirudh Badam (Microsoft), Sriram Govindan (Microsoft), Bikash Sharma (Facebook), and Anand Sivasubramaniam (Penn State)
ABSTRACT

Storage-intensive systems in data centers rely heavily on DRAM and SSDs for the performance of reads and persistent writes, respectively. These applications pose a diverse set of requirements, and are limited by the fixed capacity, fixed access latency, and fixed function of these resources as either memory or storage. In contrast, emerging memory technologies like 3D XPoint, battery-backed DRAM, and ASIC-based fast memory compression offer capabilities across several dimensions. However, existing proposals to use such technologies can only improve either read or write performance, but not both, without requiring extensive changes to the application and the operating system. We present PolyEMT, a system that employs an emerging memory technology based cache to the SSD, and transparently morphs the capabilities of this cache across several dimensions (persistence, capacity, latency) to jointly improve both read and write performance. We demonstrate the benefits of PolyEMT using several large-scale storage-intensive workloads from our datacenters.

CCS CONCEPTS

· Information systems → Storage class memory; Hierarchical storage management; Data compression; · Computer systems organization → Cloud computing; · Software and its engineering → Memory management;

ACM Reference Format:
Iyswarya Narayanan, Aishwarya Ganesan, Anirudh Badam, Sriram Govindan, Bikash Sharma, and Anand Sivasubramaniam. 2019. Getting More Performance with Polymorphism from Emerging Memory Technologies. In Proceedings of The 12th ACM International Systems and Storage Conference (SYSTOR '19). ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3319647.3325826

1 INTRODUCTION

Data-intensive applications like key-value stores, cloud storage, and the back-ends of popular social media services that run in cloud datacenters are latency critical, and require fast access to store and retrieve application data. The memory and storage requirements of these data-intensive applications are fast outpacing the limits of existing hardware. For instance, the growing data volume of these applications puts increasing pressure on the capacity of the memory and storage subsystems, and adversely impacts application latency. This is further exacerbated in applications with persistent writes to storage (file writes/flushes, and msyncs), as even the fastest storage resource in contemporary servers (the SSD) is an order of magnitude slower than its volatile counterpart (DRAM). Further, cloud applications are diverse in their resource requirements (e.g., memory and storage working set sizes). In contrast, existing memory and storage resources are rigid in their characteristics in terms of persistence capability, access latency, and statically provisioned capacity. Therefore, datacenter operators are faced with the question of how to better serve data-intensive cloud applications that pose diverse resource requirements across multiple dimensions.

Emerging memory technologies offer better performance along several dimensions (e.g., higher density compared to DRAM [63, 64], higher performance than SSDs [4, 5, 9], or even both). Realizing their potential, prior works [18, 28, 70] have exploited Non-Volatile Memory (NVM) technologies to jointly improve both the read and write performance¹ of an application by relying on their byte addressability and non-volatility. NVMs that offer higher density than DRAM benefit applications bottlenecked by memory capacity, and improve their read performance. And their non-volatility benefits applications bottlenecked by writes to persistent storage. While these solutions offer attractive performance for both reads and writes, they are insufficient for widespread adoption in current datacenters for two main reasons.

¹ We refer to all persistent writes as writes.
First, many of these proposals consider file systems that reside entirely on NVMs [39, 75, 77] to benefit both reads and writes. Note that the cost per unit capacity of NVMs is still an order of magnitude higher than that of SSDs. Therefore, replacing a large fleet of SSDs with NVM would be cost prohibitive for applications requiring massive storage capacities. Instead, we require a design that needs only a limited NVM capacity.

Second, to extract high performance, many of these proposals require extensive code changes to port the application [18, 28, 70], the operating system [39, 77], or the file system [74-76], which may not be immediately feasible at scale. These limit faster and wider on-boarding of these technologies in current datacenters. Therefore, we require transparent mechanisms and policies that can effectively utilize the limited NVM capacity available within each server.

While there are several proposals that enable transparent integration of limited-capacity NVMs into the existing memory and storage stack [15, 63, 79], they are insufficient in the context of cloud datacenters. The reason is that they focus on a single aspect, whereas cloud applications pose a diverse set of requirements which can be met effectively by tapping into the potential of these technologies to morph across several dimensions: persistence, latency, and capacity.

NVMs with performance similar to that of DRAM and persistence similar to that of SSDs can morph dynamically either as a persistent block cache to SSDs or as additional byte-addressable volatile main memory; we call this functional polymorphism. Existing transparent solutions focus on one function by employing these devices either as storage caches for SSDs [15, 16, 50] or as an extension of DRAM to augment main memory capacity [27, 63, 79]. Employing NVM as a storage cache accelerates persistent accesses to the storage device. But these incur additional software overheads for read accesses [17, 36, 73], despite their ability to allow direct hardware access using byte addressability.

Transparently integrating NVM as byte-addressable memory is beneficial only for volatile memory accesses, and does not benefit any persistent writes to the storage medium despite being non-volatile. However, dynamically re-purposing all or part of the NVM as memory or storage can benefit a diverse set of applications. Unlike existing such proposals [39, 67], we target a design that neither requires high NVM provisioning costs to replace the entire SSD with NVM as the file system storage medium, nor requires changes to the existing file system. Therefore, we employ a limited NVM capacity as a storage cache to the SSD, where the challenge is to carefully apportion this resource between competing memory and storage access streams, whereas in existing proposals it is determined by only one of those.

There exist other trade-offs in emerging memory technologies that offer fast access times at lower density or slow access times at higher density depending on the data representation (e.g., compressed memory [68], SLC vs. MLC [62]). We refer to this as representational polymorphism. This poses an opportunity to improve tail performance for latency-sensitive applications that are limited by the fixed latency and capacity of existing hardware. While this has been studied previously [23, 49, 62, 68], a holistic solution that exploits polymorphism across several dimensions is missing.

Given the rigidity of existing hardware and the diversity of cloud applications, it is essential to fully benefit from the polymorphic capabilities of these technologies to effectively serve applications. Towards that, we employ NVM as a cache to the SSD, and exploit its capability to morph in its function (as a volatile vs. non-volatile cache) and its data representation (high density/slow vs. low density/fast). But to benefit from polymorphism, we need to navigate the trade-offs along several dimensions to adapt the cache characteristics to the needs of individual applications. Moreover, we need mechanisms to transparently enforce these capabilities without requiring any application, OS, or file system level changes.

Contributions: Towards these goals, we present:

• A detailed characterization of production storage traces to show that cloud operators can effectively serve heterogeneous applications by exploiting a polymorphic emerging memory technology based cache in front of the SSD.

• A functional polymorphism solution that allows transparent integration of limited-capacity NVM as a persistent, block-based write-back cache to SSDs. It dynamically configures the available cache capacity in volatile and non-volatile forms to simultaneously improve performance for reads and persistent writes.

• A representational polymorphism knob that helps tune the latency and capacity within each function to further optimize performance for applications with working set sizes exceeding the physical capacity of the resources.

• A dynamic system that traverses the design space offered by both functional and representational polymorphism, and apportions resources across the persistence, latency, and capacity dimensions to maximize application performance.

• A systematic way of reaching good configurations in the above dynamic system: starting with the full capacity in one single form that addresses the most significant bottleneck first, and then gradually morphing into other forms for as long as performance improves, enables a systematic search for the ideal configuration.

• A prototype on real hardware available in today's datacenters, built as a transparent memory and storage management runtime in C++ for Linux. Our solution does not require any application, OS, or file system changes for its usage. Our experiments with the prototype show significant improvements to the tail latencies of representative workloads, by up to 57% and 70% for reads and writes, respectively.
2 THE NEED TO EXPLOIT POLYMORPHISM

Our goal is to aid datacenter operators in effectively serving data-intensive cloud applications using emerging memory technologies without incurring significant capital investments and intrusive software changes. While there are prior proposals that meet these constraints, they often extract sub-optimal performance as they focus on one aspect, whereas these resources can morph across several dimensions: persistence, capacity, and latency. In this work, we consider battery-backed DRAM (BB-DRAM) [4, 9, 54] and fast memory compression as the Emerging Memory Technologies (EMTs), backed up by SSDs [58], as seen in today's datacenters. These present us with the following two trade-offs.²

² These trade-offs exist for other emerging memory technologies as well.

2.1 Why Functional Polymorphism?

Much like other NVMs, BB-DRAM can function both as a regular volatile medium and as a specialized non-volatile medium using an energy storage device (e.g., an ultra-capacitor). The battery capacity is used to flush data from DRAM to SSDs upon power loss. This can be dynamically configured to flush only a portion of (non-volatile) DRAM, with the remainder used as volatile memory, thereby creating a seamless transition between these modes. The battery-backed portion offers persistence at much lower latency than SSDs.

One can transparently add such NVM to existing servers as an extension of main memory or as a block-granular cache to the SSD. While the former implicitly benefits reads, the latter is expected to implicitly benefit both reads and persistent writes. However, note that all reads to the NVM would then incur a software penalty by having to go through the storage stack, despite its ability to allow direct CPU-level access for reads. So, explicitly partitioning it between the volatile memory and storage tiers would offer better read performance.

In Fig. 1, we show the benefits of such explicit partitioning using TPC-C, a read-heavy workload, as well as a write-heavy YCSB workload, on a server provisioned with 24 GB of volatile DRAM and 24 GB of non-volatile BB-DRAM. We manually varied the non-volatile capacity between an SSD block cache (managed by dm-cache [69]) and physical memory (managed by the OS). Fig. 1a shows that TPC-C suffers a loss of 30% in performance when using all of the BB-DRAM as SSD cache compared to using it as main memory. In contrast, Fig. 1b shows that using BB-DRAM as SSD cache is optimal for the write-heavy YCSB workload.

[Figure 1: Applications benefit from different capacities of NVM in memory and storage. This split is different across applications. Plots show normalized p95 latency vs. the percentage of polymorphic memory used as storage for (a) TPC-C and (b) the YCSB loading phase.]

The results demonstrate that different "static" splits not only change the performance of applications differently, but also that the best split is different across applications. Furthermore, the results show that the extremes are not always optimal. For instance, the write-intensive YCSB loading phase requires the entire BB-DRAM capacity in the storage tier, whereas read-intensive TPC-C requires a split of 75% and 25% between memory and storage to meet both read and persistent-write requirements satisfactorily.

This is especially important in cloud datacenters, as we observe that most real-world applications have a heterogeneous mix of reads and writes. We show this by analyzing the storage access characteristics of four major production applications from our datacenters: (i) a public cloud storage service (Cloud-Storage); (ii) a map-reduce framework (Map-Reduce); (iii) a search data indexing application (Search-Index); and (iv) a search data serving application (Search-Serve).

We observe that file system read and write working set sizes vary across applications. Fig. 2a shows the total number of unique pages required to serve the 90th, 95th, and 99th percentile of accesses for reads and writes respectively, normalized to the total unique pages accessed over a duration of 30 minutes. As we can see, the working set sizes vary across applications. Similarly, Fig. 2b captures the difference in access intensities of the total read and write volumes per day; the read and write volumes vary between 0.18 TB to 6 TB and 0.45 TB to 4 TB, respectively. Moreover, these access intensities vary over time; Fig. 3 shows this in Search-Serve for a 12-hour period as an example. Together, these motivate the need to exploit the ability of these technologies to morph dynamically between storage and memory functions to effectively serve a diverse set of applications.

[Figure 2: Need for application-aware memory and storage resource provisioning. (a) Heterogeneity in read vs. persistent-write capacity requirements to serve the 90th, 95th, and 99th percentile of read, persistent-write, and total accesses over a 30-minute window, for (i) Cloud-Storage, (ii) Map-Reduce, (iii) Search-Index, and (iv) Search-Serve; y-axis: fraction of unique pages accessed. (b) Heterogeneity in the volume of read and persistent-write accesses (TB/day) across the four applications.]

[Figure 3: Temporal variations in read and persistent-write access intensities in Search-Serve; access volume (GB) over a 12-hour window.]
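For concreteness, the working-set percentiles in Fig. 2a can be computed from a block-access trace roughly as in the sketch below. This is our illustration (the trace format and function names are assumed), not the paper's analysis tooling: count accesses per page, then accumulate pages from hottest to coldest until the coverage target is met.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Number of unique pages needed to cover `percentile` (e.g., 0.95) of all
// accesses in a trace of page IDs.
std::size_t pagesForPercentile(const std::vector<std::uint64_t>& trace,
                               double percentile) {
    std::unordered_map<std::uint64_t, std::uint64_t> freq;
    for (std::uint64_t page : trace) ++freq[page];

    std::vector<std::uint64_t> counts;
    counts.reserve(freq.size());
    for (const auto& kv : freq) counts.push_back(kv.second);
    std::sort(counts.begin(), counts.end(), std::greater<>());  // hottest first

    const double target = percentile * static_cast<double>(trace.size());
    double covered = 0.0;
    std::size_t pages = 0;
    for (std::uint64_t n : counts) {
        covered += static_cast<double>(n);
        ++pages;
        if (covered >= target) break;
    }
    return pages;  // divide by freq.size() to get Fig. 2a's normalized y-axis
}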
2.2 Why Representational Polymorphism?

In addition to the above trade-off, there exists a trade-off between data representation and access latency in emerging technologies. For example, fast memory compression (using ASICs [2]) enables larger capacity using a compressed, high-density data representation. However, it incurs longer latency to work with the compressed representation. Thus, the same memory can provide either higher capacity (with compression) or faster access (without compression).

This is especially important in the context of tail-sensitive applications constrained by the static capacities and latencies of today's hardware. For instance, existing servers have limited capacities of DRAM (on the order of 10s of GBs) and are backed up using SSDs (on the order of 100s of GBs) or over the network, which are 500× slower. The static capacities of DRAM, non-volatile EMT, and SSDs form strict latency tiers. Consequently, the tail latency of data-intensive applications with working sets that do not fit in DRAM is determined by the slower tier. We illustrate this using Fig. 4a, which plots latency vs. the probability of pages being accessed at this latency when the working set of an application is spread between two latency tiers. Here, the tail latency of the application is determined by the SSDs, which are orders of magnitude slower than DRAM. One way to optimize tail performance in such systems is to increase the effective capacity of the faster tier using a high-density data representation (e.g., compression) while maintaining latency well below that of the slowest tier using fast compression techniques, as illustrated in Fig. 4a.

We observe that real-world applications in our datacenter can increase their effective capacity by 2-7× (see Fig. 4b) using compression while incurring significantly lower latency compared to accessing the slowest tier. Fig. 4c shows that in contrast to SSDs, which incur more than 80µs and 100µs for reads and writes, compressed DRAM based memory incurs 4µs for reads (decompression) and 11µs for writes (compression). This can be further tuned based on application needs by exploiting parallelism.

[Figure 4: Representational polymorphism (compression) to increase the working set held in the fast tier. Latencies measured on a 2.2 GHz Azure VM. (a) Representational polymorphism: an application's working set is split between two fixed latency tiers; the faster tier morphs to hold more of the working set, reducing the 95th-percentile latency. (b) Compressibility of application contents: effective increase in capacity relative to uncompressed capacity for Map-Reduce and Search workloads. (c) Performance characteristics: write/read access latency and compression ratio vs. compressed access granularity (4096 down to 512 bytes).]

While existing works have studied such a latency vs. capacity trade-off only in isolation, the additional challenge for us here is to identify which functional layer can benefit from exploiting representational polymorphism. Towards that, we explore a holistic approach for flexible memory and storage provisioning within a server that exploits both kinds of polymorphism, not only for static application configurations but also for dynamic variations in application behavior.
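To make the tail-latency argument concrete, consider a two-tier model (our formalization of Fig. 4a; the symbols are illustrative, not from the paper). Let h be the fraction of accesses served by the fast tier at latency L_fast, with the rest served at L_slow:

\[
P(\mathrm{latency} \le \ell) \;=\; h\,\mathbf{1}[\ell \ge L_{\mathrm{fast}}] \;+\; (1-h)\,\mathbf{1}[\ell \ge L_{\mathrm{slow}}]
\]

so the 95th-percentile latency is L_fast exactly when h ≥ 0.95, and L_slow otherwise. Compression raises h by inflating the effective fast-tier capacity:

\[
h \;\approx\; \min\!\left(1,\; \frac{c \cdot C_{\mathrm{fast}}}{W}\right), \qquad c \approx 2\text{--}7 \;\;\text{(Fig. 4b)},
\]

where C_fast is the physical fast-tier capacity, W the working-set size, and c the compression factor, assuming roughly uniform popularity within the working set. For example, with W = 4·C_fast, the uncompressed h is 0.25 and the p95 sits at SSD latency; a compression factor of c = 4 pushes h to 1, moving the p95 to (de)compression latency (4-11µs in Fig. 4c) instead of 80-100µs.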

3 POLYMORPHIC EMT DESIGN

Our goal is to identify the optimal configuration of a limited-capacity EMT across its various polymorphic forms, both spatially and temporally, to extract maximum application performance. The challenge is to navigate a huge search space jointly determined by the polymorphism knobs. We use the following insight to decouple the search space: start with the full capacity in one single form that addresses the most significant bottleneck in the system; then gradually morph into other forms to further improve effectiveness.

3.1 Using Functional Polymorphism

In today's servers (Fig. 5(a)), all persistent writes (msyncs) must hit the SSD synchronously, while reads have some respite since only those that miss the DRAM buffer cache hit the SSD. Also, SSDs by nature have higher tail latencies for writes than for reads. Fig. 6 shows that reads are 2× faster than writes even in their average latencies, and up to 8× faster at the 95th percentile for the fio benchmark [14]. This can be alleviated by using BB-DRAM as a persistent cache, to avoid accessing the SSD in the critical path.

However, in a persistent storage cache (Fig. 5(b)), both reads (those that miss in the DRAM buffer cache) and writes to the SSD will be cached. But caching SSD reads in this storage layer is sub-optimal, as going through software adds a significant latency penalty [17, 36, 73]. Ideally, such reads should be served directly via CPU loads/stores, bypassing the software stack. Fig. 5(c) shows such a system, where the BB-DRAM based write cache is just large enough to absorb bursts of application writes, buffer them, and write the data back to the SSD in the background, such that applications do not experience the high write tails of SSDs in the critical path.
[Figure 5: PolyEMT design: (a) Today's servers have high read and write traffic to the SSD in the critical path. (b) We begin with EMT (BB-DRAM) as a write cache to reduce persistent writes to the SSD. (c) We then re-purpose underutilized write-cache blocks to extend DRAM capacity, reducing DRAM read misses. (d) We then employ EMT compression to further reduce SSD traffic.]

[Figure 6: Read/write asymmetry: 4K reads incur an average of 400µs with the 95th percentile at 500µs, whereas 4K writes incur averages as high as 850µs with the 95th percentile at 4200µs at high load.]
Its remaining capacity can be morphed to extend volatile main memory and accelerate reads. While the system shown in Fig. 5(c) conceptually exploits functional polymorphism, its benefits depend on the ability to match the volatile and non-volatile cache sizes to application needs, as discussed in Section 2.1 (see Fig. 2). Further, as applications themselves evolve temporally (see Fig. 3), these capacities need to be adapted to continuously meet changing application needs.

Therefore, our solution (PolyEMT) starts with the entire emerging memory capacity as a write cache, while gradually re-purposing some of its capacity to function as volatile (even though physically non-volatile) main memory. To minimize the performance consequences of such a dynamic reduction in write-cache capacity, we need to effectively utilize the available write-cache capacity. Towards that, PolyEMT uses a perfect LRU policy to identify blocks for revocation. It periodically writes them back to the SSD in the background to free those blocks, and adds them to the free list.

Simultaneously, the re-purposed EMT pages (moved out of the free list of the storage cache) serve as a DRAM extension managed by the OS. The virtual memory management component of the OS is well-equipped to effectively manage the available main memory capacity, and can even handle heterogeneous latency tiers [40, 71]. Such buffer management policies maintain only the most recently accessed pages (volatile accesses) in memory, to improve access latency.

Hence, the sweet spot for partitioning the EMT is identified by incrementally revoking least-recently-written write-cache pages and re-purposing them as volatile main memory, while probing application-level request latencies. PolyEMT stops the re-purposing when the performance of the application stops improving; this is the point where the read bottleneck is mitigated and persistent writes start to impact the application's performance again, as shown in Fig. 7. We observe a convex behavior, as in Fig. 11, for a wide range of applications. This is because LRU is a stack algorithm that does not suffer from Belady's anomaly, i.e., capacity vs. hit rate has a monotonic relationship for both the memory capacity and the write-cache capacity under steady access conditions. Therefore, a linear (incremental) search suffices to identify the sweet spot between the volatile and non-volatile capacities.

[Figure 7: To partition the EMT between memory and storage, incrementally re-purpose write-cache pages as volatile memory until performance starts to decrease. This point balances the impact of persistent writes and volatile reads on application performance. Plots: persistent-write performance and application read performance vs. % of EMT in storage/memory.]
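The linear search just described can be written down as a small control loop. The sketch below is our illustration; the RuntimeHooks interface (revocation, re-purposing, latency probe) is an assumption standing in for PolyEMT's actual internals:

#include <cstddef>

// Hypothetical hooks into the runtime; names are illustrative.
struct RuntimeHooks {
    // Write back and free `n` least-recently-written write-cache blocks.
    virtual void revokeLruWriteCacheBlocks(std::size_t n) = 0;
    // Hand `n` freed EMT blocks to the OS as volatile main memory.
    virtual void repurposeAsVolatileMemory(std::size_t n) = 0;
    // Probe application-level request latency (e.g., p95) over a window.
    virtual double probeLatency() = 0;
    virtual ~RuntimeHooks() = default;
};

// Linear (incremental) search for the memory/storage sweet spot.
std::size_t findSweetSpot(RuntimeHooks& rt, std::size_t totalBlocks,
                          std::size_t step) {
    std::size_t inMemory = 0;            // start: all EMT as write cache
    double best = rt.probeLatency();
    while (inMemory + step <= totalBlocks) {
        rt.revokeLruWriteCacheBlocks(step);   // drained to SSD in background
        rt.repurposeAsVolatileMemory(step);
        inMemory += step;
        double lat = rt.probeLatency();
        if (lat >= best) break;          // reads no longer bottlenecked;
        best = lat;                      // writes now starting to hurt
    }
    return inMemory;                     // blocks serving as volatile memory
}

Because latency is convex in the split point (Fig. 7), stopping at the first non-improving step finds the global sweet spot rather than a local one.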
3.2 Using Representational Polymorphism

PolyEMT leverages representational polymorphism to virtually expand capacities when the read and write working set sizes of an application exceed the available physical memory and write-cache capacities. The compressed data representation reduces the capacity footprint of data (refer to Fig. 4); thus more data can be held in the EMT instead of being paged to the slowest tier (i.e., the SSD). Towards that, PolyEMT uses a caching hierarchy, as shown in Fig. 5(d), where three latency tiers (EMT, compressed EMT, and SSD) exist in both volatile and non-volatile forms. Here, a capacity bottleneck at the fast tier will move least-recently-used pages to the slower representation to make room for hot pages. And a hit at the slow tier will result in its conversion to the fast representation.
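For illustration, the demotion/promotion behavior of one such fast/compressed tier pair can be sketched as follows. The PolymorphicTier class and the identity-codec placeholders are ours (standing in for the ASIC-backed compressor), not the prototype's data structures:

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

using Page = std::vector<std::uint8_t>;

// Placeholder codec; a real system would use an ASIC or fast software codec.
Page compress(const Page& p) { return p; }
Page decompress(const Page& p) { return p; }

class PolymorphicTier {
    std::list<std::uint64_t> lru_;   // front = most recently used fast page
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> pos_;
    std::unordered_map<std::uint64_t, Page> fast_, slow_;
    std::size_t fastCapacity_;       // assumed >= 1

    void touch(std::uint64_t id) {   // maintain LRU recency for fast pages
        if (auto it = pos_.find(id); it != pos_.end()) lru_.erase(it->second);
        lru_.push_front(id);
        pos_[id] = lru_.begin();
    }
    void demoteColdest() {           // fast -> compressed representation
        std::uint64_t victim = lru_.back();
        lru_.pop_back();
        pos_.erase(victim);
        slow_[victim] = compress(fast_[victim]);
        fast_.erase(victim);
    }

public:
    explicit PolymorphicTier(std::size_t cap) : fastCapacity_(cap) {}

    Page* access(std::uint64_t id) {
        if (auto s = slow_.find(id); s != slow_.end()) {   // slow-tier hit:
            if (fast_.size() >= fastCapacity_) demoteColdest();
            fast_[id] = decompress(s->second);             // promote to fast
            slow_.erase(s);
        }
        auto f = fast_.find(id);
        if (f == fast_.end()) return nullptr;              // miss: go to SSD
        touch(id);
        return &f->second;
    }
};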
Unlike existing compressed memory systems [12, 60, 68] and tiered storage systems [42, 59], we face the additional challenge of deciding where to spend the limited compute cycles available for compression/decompression: in the write cache or in the volatile memory? Exploiting representational polymorphism in either mode is viable when the benefits of holding more data in the slower representation are higher than the benefits of holding less data in the faster representation. This is conceptually similar to the capacity partitioning problem of functional polymorphism.

However, unlike functional polymorphism, asymmetry in read and write accesses is not important here. This is because the penalty of a cache miss is the same in both functions (the read or the write cache), and is primarily determined by the software latency to fetch the page/block into the faster tier. PolyEMT uses this key insight when determining the capacities of each tier in representational polymorphism: it uses a combined faster/slower representation of both the read and write caches rather than exploiting them independently (as in functional polymorphism). This has the natural side effect of using the minimum compute resources and EMT capacity to meet a target tail latency and throughput.

PolyEMT uses an LRU-based ranking of accesses across all the tiers, which is used not only to identify EMT pages to move from memory to storage, but also to identify DRAM pages in volatile memory that can benefit from compression. The best split occurs when there is sufficient fast-tier capacity to serve hot data accesses while optimizing tail performance with the capacity reclaimed from using the compressed representation.

To summarize the PolyEMT design: the three optimization steps (write-back cache, functional polymorphism, and representational polymorphism) are applied one after another sequentially, and LRU-based capacity management is used in all components to cope with the limited capacity and latency heterogeneities of the underlying resources. PolyEMT first exploits functional polymorphism, and then within each of the functions it exploits representational polymorphism. The system re-configures the memory and storage characteristics in response to dynamic changes within an application. Once a change is detected, the PolyEMT runtime begins with the EMT as a write cache and finely tunes the capacities and latencies of the memory and storage resources based on the current needs.
                                                                                                           • mmap(filename) maps a ile in persistent storage (SSD)
sentation. This is conceptually similar to the capacity parti-
                                                                                                             into the application’s virtual memory. This allows directly
tion problem of the functional polymorphism.
   However, unlike functional polymorphism, asymmetry in read and write accesses is not important. This is because the penalty of a cache miss is the same in both functions (the read cache or the write cache): it is primarily determined by the software latency to fetch the page/block into the faster tier. PolyEMT uses this key insight when determining the capacities of each tier in representational polymorphism, by using a combined faster/slower representation of both read and write caches rather than exploiting them independently (as in functional polymorphism). This has the natural side effect of using the minimum compute resources and EMT capacity needed to meet a target tail latency and throughput.
   PolyEMT uses LRU-based ranking of accesses across all the non-compressed pages in both the volatile memory and the write-back cache to decide data movement between the faster and slower tiers. Note that when using representational polymorphism in volatile main memory, PolyEMT does not have complete information on the data accessed in the fast tier: hardware loads and stores bypass the software stack, so an exact LRU cannot be tracked. Instead, PolyEMT uses the page reference information in the page table and an approximate LRU algorithm in the fast tier (i.e., Clock [21, 37]). This information is used not only to move pages from memory to storage, but also to identify DRAM pages in volatile memory that can benefit from compression. The best split occurs when there is sufficient fast-tier capacity to serve hot data accesses while optimizing tail performance with the capacity reclaimed by the compressed representation.
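To make the approximate-LRU mechanism concrete, here is a minimal C sketch of a Clock-style second-chance scan; the page descriptor, its referenced field, and the caller that demotes the victim are hypothetical stand-ins for PolyEMT's page-table sampling, not the actual implementation.

    #include <stddef.h>

    struct page {
        int referenced;   /* mirrors the page-table reference bit */
        /* frame number, dirty bit, tier, ... */
    };

    /* Advance the clock hand until a page with a clear reference bit
     * is found; pages whose bit is set get a second chance (the bit
     * is cleared, modeling the periodic reset described above). */
    size_t clock_pick_victim(struct page *pages, size_t npages, size_t *hand)
    {
        for (;;) {
            struct page *p = &pages[*hand];
            size_t slot = *hand;
            *hand = (*hand + 1) % npages;
            if (!p->referenced)
                return slot;      /* cold page: demote to the slower tier */
            p->referenced = 0;    /* hot page: clear the bit, skip it */
        }
    }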
   To summarize the PolyEMT design: the first three optimization steps (write-back cache, functional polymorphism, and representational polymorphism) are applied one after another sequentially, and LRU-based capacity management is used in all components to cope with the limited capacity and latency heterogeneity of the underlying resources. PolyEMT first exploits functional polymorphism and then, within each of the functions, exploits representational polymorphism. The system re-configures the memory and storage characteristics in response to dynamic changes within an application. Once a change is detected, the PolyEMT runtime begins with the EMT as a write-cache and finely tunes the capacities and latencies of the memory and storage resources based on the current needs.
4 PROTOTYPING POLYEMT
We next describe the key aspects of the PolyEMT prototype.
PolyEMT API: One of our goals is to retain application read/write API semantics to aid faster and seamless onboarding of EMTs. Our prototype modifies the behavior of the traditional mmap and msync APIs commonly used in many cloud storage applications [1, 6–8], as the usage sketch after this list illustrates:
• mmap(filename) maps a file in persistent storage (SSD) into the application's virtual memory. This allows direct access to the mapped region using loads/stores, avoiding the software overheads of read/write system calls.
• msync(address, length) guarantees persistence of writes to the application's virtual memory. Modified memory pages within the given range of the application's mapped address space are persisted immediately as part of this API call. Typically, such msync operations are bottlenecked by writes to the SSD. EMT used as a transparent persistent block cache below the file system layer can help improve msync performance.
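The following self-contained C sketch shows the kind of plain POSIX mmap/msync sequence these semantics preserve; the file path and sizes are illustrative, and error handling is omitted for brevity.

    #include <fcntl.h>
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1UL << 30;                   /* illustrative 1 GB file */
        int fd = open("/data/store.db", O_RDWR);  /* SSD-resident file */

        /* Subsequent loads/stores to `base` access the mapping directly;
         * with PolyEMT interposed, faults are served from its tiered caches. */
        char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

        memcpy(base + 4096, "value", 5);          /* update one record */

        /* Persist the dirty page; PolyEMT absorbs this write in the EMT
         * write-cache rather than stalling on the SSD. */
        msync(base + 4096, 5, MS_SYNC);

        munmap(base, len);
        close(fd);
        return 0;
    }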
PolyEMT Runtime: We implement the PolyEMT logic as an application-level runtime without requiring any changes to the application, file system, or OS. The initial allocation of DRAM, EMT/BB-DRAM, and SSD for an application is specified via a global configuration. The runtime tunes these allocations based on the instantaneous needs of the application, using the following modules for resource management.
Physical resource managers: To manage physical capacity, PolyEMT uses a DRAM manager and a BB-DRAM manager, which maintain the lists of free pages in DRAM and BB-DRAM, respectively. The DRAM and BB-DRAM memory regions allocated to PolyEMT are pinned to physical frames.

Figure 8: PolyEMT operation for volatile accesses. [Diagram: (1) mmap(file_name) returns mmap_address; (2) protect pages; (3) application accesses an address in the mmaped region; (4) trap to the page-fault handler; (5) get a free page from the DRAM/EMT free list; (6) if the free list is empty, evict a page; (7) populate data from Write-Cache, Compression-Store, or SSD; (8) unprotect the page.]

Figure 9: PolyEMT operation for persistent accesses. [Diagram: (1) msync(addr, len); (2) write the page from the mmaped region to the Write-Cache; (3) on a hit, update in place; (4) on a miss, get a free EMT page; (5) if the free list is empty, evict; (6) invalidate stale copies in the Compression-Store and SSD.]

Buffer-Cache: This cache is the set of DRAM and BB-DRAM pages used as volatile memory by the application, accessible by the CPU directly via loads/stores.
Write-Back Cache: This cache is a block storage device that caches persistent writes to the SSD (write-back). The application's msync() calls are served using this Write-Cache.
Compression-Store: This resides in BB-DRAM so that it can serve as a slow (but higher-capacity) tier for both the volatile buffer cache and the persistent write-back cache. It employs dictionary-based Lempel-Ziv compression [81] at 4KB granularity, as it provides higher effective capacity than pattern-based schemes [12, 60]. To handle the resulting variable-sized compressed pages, which increase fragmentation, we compress 4K pages down only to multiples of a fixed sector size, appending zeros when needed. We determine the best sector size to be 0.5 KB based on the distribution of compression benefits for the applications analyzed in Sec. 2.2.
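The sector rounding just described amounts to the following small helper; sector_pad and its calling convention are hypothetical, and the LZ compressor itself is elided.

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096u
    #define SECTOR     512u   /* 0.5 KB sector, per the Sec. 2.2 analysis */

    /* Round a compressed page up to a whole number of sectors and
     * zero-pad the tail. `buf` holds `csize` bytes of compressor output
     * and has room for at least PAGE_SIZE bytes. Returns the capacity
     * charged to the Compression-Store. */
    size_t sector_pad(unsigned char *buf, size_t csize)
    {
        size_t padded = ((csize + SECTOR - 1) / SECTOR) * SECTOR;
        if (padded >= PAGE_SIZE)
            return PAGE_SIZE;     /* incompressible: store the raw page */
        memset(buf + csize, 0, padded - csize);
        return padded;
    }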
                                                                                            call msync() with a virtual address and length, as shown in
Enabling fast access to volatile memory. PolyEMT places the most frequently read pages from the SSD in DRAM and in the EMT/BB-DRAM extension of volatile memory. The process shown in Fig. 8 works as follows. An application initially maps a file from persistent storage (SSD) into main memory using mmap (1). PolyEMT creates an anonymous region in virtual memory without allocating physical memory resources, sets its page protection bits (2), and returns a virtual address. When the application first accesses any page in this virtual address region (3), it incurs a page fault (4), which is handled by a user-level fault handler registered via the sigaction system call.
   The page fault handler obtains a free volatile page from the physical memory manager (5) and maps its physical address to the faulting virtual address using a kernel driver that manages the application page table in software [34, 41]. If the free list is empty, PolyEMT first evicts a page (6). Next, it populates the contents of the page by searching the slower tiers in the order of Write-Cache, Compression-Store (if used), and SSD (7). It then unsets the protection bits in the page table and returns control to the application (8).
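A minimal sketch of such a user-level fault path is given below; get_free_page, populate_page, and map_page are hypothetical placeholders for the physical resource managers, the tier lookup, and the kernel driver described above, so this illustrates the flow rather than the actual handler.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    extern void *get_free_page(void);                  /* step 5; may trigger eviction (6) */
    extern void  populate_page(void *frame, void *va); /* step 7: Write-Cache, Compression-Store, SSD */
    extern void  map_page(void *frame, void *va);      /* kernel-driver page-table update */

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page boundary. */
        void *va = (void *)((unsigned long)si->si_addr & ~(PAGE_SIZE - 1));

        void *frame = get_free_page();                 /* steps 5-6 */
        populate_page(frame, va);                      /* step 7 */
        map_page(frame, va);
        mprotect(va, PAGE_SIZE, PROT_READ | PROT_WRITE); /* step 8: unprotect */
    }

    void install_fault_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = fault_handler;
        sigaction(SIGSEGV, &sa, NULL);                 /* step 4: trap protected accesses */
    }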
   Pages are evicted from the Buffer-Cache as follows. A least-recently-used page is chosen for eviction based on an LRU-like Clock [21, 37] policy, implemented using the referenced bit in the page table, which is reset periodically. Eviction maintains the look-up order of Write-Cache, Compression-Store, and SSD during page faults. Towards that, dirty pages that are already present in the write-cache (from a previous msync call to the page) are updated in place; otherwise, they are evicted directly to the SSD. The process is similar when using compression, except that even clean pages are evicted to the compression store.
Handling persistent writes. To persist a page, applications call msync() with a virtual address and length, as shown in Fig. 9 (1). PolyEMT handles this call by writing (2) the pages in this region from the Buffer-Cache to the Write-Cache (3). For write misses at the write cache, it allocates a new page from the free list of non-volatile pages (4); in the rare case that there are no free pages (5), it evicts a least-recently-used page to the slower tier (6), either the Compression-Store or the SSD.
   The candidate for eviction is chosen based on LRU over the persistent writes in the Write-Cache. Note that, much like traditional msync, PolyEMT does not remove the synced pages from the Buffer-Cache unless they were retired implicitly by the higher tier. However, for correctness, the virtual addresses involved are write-protected first, written to the Write-Cache, marked as clean, and then unprotected again. Pages flushed from compressed volatile memory to the Write-Cache are kept in compressed form until a subsequent read, to save resources.
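The steps just described can be summarized in the following sketch; wc_lookup, wc_alloc, wc_evict_one, and wc_write are hypothetical helpers standing in for PolyEMT's Write-Cache machinery.

    #include <stddef.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    extern void *wc_lookup(void *va);            /* step 3: hit? update in place */
    extern void *wc_alloc(void);                 /* step 4: non-volatile free list */
    extern void  wc_evict_one(void);             /* steps 5-6: LRU page to Compression-Store/SSD */
    extern void  wc_write(void *slot, void *va); /* copy the 4 KB page into EMT */

    int poly_msync(void *addr, size_t len)
    {
        char *va  = (char *)((unsigned long)addr & ~(PAGE_SIZE - 1));
        char *end = (char *)addr + len;

        for (; va < end; va += PAGE_SIZE) {
            /* Write-protect first so a racing store re-dirties the page. */
            mprotect(va, PAGE_SIZE, PROT_READ);

            void *slot = wc_lookup(va);          /* step 3 */
            if (slot == NULL) {                  /* miss */
                slot = wc_alloc();               /* step 4 */
                if (slot == NULL) {              /* step 5: free list empty */
                    wc_evict_one();              /* step 6 */
                    slot = wc_alloc();
                }
            }
            wc_write(slot, va);                  /* persist into the EMT tier */

            /* The page is now clean; restore write access. */
            mprotect(va, PAGE_SIZE, PROT_READ | PROT_WRITE);
        }
        return 0;
    }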
Handling dynamic capacity variations: PolyEMT dynamically manages the capacity of the Buffer-Cache, Write-Cache, and Compression-Store using two knobs at runtime: one knob splits BB-DRAM between the volatile and non-volatile layers, and the other controls how much BB-DRAM is compressed across both layers. The runtime described in the previous section sweeps through different values of these two knobs and moves the system to higher-performance configurations, detected transparently by monitoring the latencies of SSD operations. It maintains a configuration until such latencies change significantly (by a configurable threshold).
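A simplified view of this two-knob search, assuming hypothetical measure_perf and apply_config hooks in place of the runtime's SSD-latency monitoring, might look as follows.

    extern double measure_perf(int pct_volatile, int pct_compressed);
    extern void   apply_config(int pct_volatile, int pct_compressed);

    void sweep_knobs(void)
    {
        int best_v = 0, best_c = 0;
        double best = measure_perf(best_v, best_c); /* start: all EMT as Write-Cache */

        /* Knob 1: re-purpose write-cache pages as volatile memory until
         * performance drops (the search space is convex; cf. Fig. 7). */
        for (int v = 10; v <= 100; v += 10) {
            double t = measure_perf(v, best_c);
            if (t < best)
                break;
            best = t;
            best_v = v;
        }

        /* Knob 2: within the chosen split, grow the compressed share. */
        for (int c = 10; c <= 50; c += 10) {
            double t = measure_perf(best_v, c);
            if (t < best)
                break;
            best = t;
            best_c = c;
        }
        apply_config(best_v, best_c);
    }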
5 EVALUATION
We run all experiments on Azure VMs running Linux 4.11 on E8-4s_v3 instances (4 CPU cores, 64GB memory, and a 128GB SSD at 16000 IOPS). We emulate part of DRAM as BB-DRAM using Linux persistent memory emulation [3]. In addition to the workloads presented in Sec. 2, we use the YCSB [20] cloud storage benchmark (see Table 1) on Redis [10] populated with 1 million 1KB records, and we report the performance for 1 million operations. We use a persistent version of Redis built using the file mmap interface, which we transparently hijack to invoke PolyEMT's mmap()/msync() APIs. We evaluate the performance of PolyEMT on the metrics of application throughput and read and write tail latencies.

  Benchmark   Properties
  A           50% reads, 50% updates
  B           95% reads, 5% updates
  C           100% reads
  D           95% reads, 5% inserts
  E           95% scans, 5% inserts
  F           50% reads, 50% read-modify-writes
Table 1: YCSB benchmark characteristics.

Figure 10: Performance of different transparent EMT integration designs, normalized to DRAM-Ext. [Panels: (a) throughput; (b) 90th-percentile read latency; (c) 90th-percentile write latency. Series: Write-Cache, Functional, Functional+Representational.]
5.1 PolyEMT convergence time
We first study the allocations of EMT between the memory and storage caches under an offline search and under PolyEMT, for the storage traces analyzed in Sec. 2. The online runtime of PolyEMT converges to the same allocation of BB-DRAM between the memory and storage functions as the optimal partitions identified offline. Table 2 shows that it takes 4 to 8 minutes to identify the partition sizes, with read-intensive applications requiring longer, as more of the write-cache is re-purposed as memory.

  Storage Trace    Time (min.)
  Cloud-Storage    4
  Map-Reduce       7.5
  Search-Index     6
  Search-Serve     8
Table 2: Convergence time.

5.2 Performance benefits of polymorphism
We study the performance benefits of polymorphism using the YCSB benchmark. We use a server provisioned with 26 GB of DRAM and 6 GB of BB-DRAM, for a total of 32 GB. The application's dataset size (≈ 38 GB) exceeds the capacity of these memories. We study the following designs to transparently utilize the limited EMT available in the server (since the dataset size exceeds the BB-DRAM capacity, transparent NVM-based file system solutions are infeasible):
1. DRAM-Ext: EMT serves as a main memory extension.
2. Write-Cache: EMT serves as a persistent write-back cache.
3. Functional-only: performs a linear search to partition EMT between the memory and storage cache functions.
4. Functional+Representational: partitions EMT between the memory and storage functions and also identifies the fast and slow tier capacities within each function.
Impact on application performance: We first study the impact of these designs on application throughput for the YCSB benchmarks in Fig. 10(a). The x-axis shows the benchmark, and the y-axis shows its throughput normalized with respect to the DRAM-Ext policy. We find that using the non-volatile BB-DRAM as a Write-Cache provides an average speed-up of around 2.5× over using it as a memory extension. This implies that even when using this resource readily, without any software modifications, adding it to the most bottlenecked function (i.e., persistent msyncs) delivers better performance. Further partitioning it between memory and storage by exploiting functional polymorphism improves performance by up to 70%. Adding representational polymorphism improves performance by 90% compared to the Write-Cache policy. This indicates the importance of exploiting the polymorphic properties of EMTs to derive higher performance from the hardware infrastructure by adapting their characteristics at runtime.
   We next present the impact on the tail performance of these benchmarks in Figures 10(b) and 10(c). The x-axis presents the YCSB benchmark and the y-axis presents the 90th-percentile tail latency for reads and writes, respectively, normalized to that of DRAM-Ext. As can be seen, the tail latencies for reads and writes when using Write-Cache reduce by 40% and 36%, respectively, compared to EMT as DRAM-Ext. In contrast, employing functional polymorphism reduces read tail latency by an average of 80%. This reduction is significant across all benchmarks, as more of their working set fits in memory and accesses to the EMT memory extension are done via loads/stores without any kernel or storage-stack overhead. Compared to read performance, the reduction in write latency at the tail is only modest, as write misses at the write-cache still have to go to the slowest tier (i.e., the SSD). However, by incorporating both functional and representational polymorphism, write performance at the tail improves significantly, with an average latency drop of 85%. Workload C, being read-only, is not included in the write latency results in Fig. 10(c).
Resulting memory and storage configurations: We next study the resulting configuration of memory/storage resources for individual applications. Fig. 11 shows the application performance when exploiting functional polymorphism. The x-axis shows the % of EMT morphed into memory and the y-axis shows the average application performance. We see that performance initially increases as more EMT is converted to memory, beyond which it starts to decrease.
Figure 11: Exploiting functional polymorphism only. PolyEMT search space when traversing different EMT capacity splits between memory and storage. [Panels (a)-(f): normalized throughput vs. % EMT as DRAM-Ext for YCSB-A through YCSB-F.]

Figure 12: Comparison of the ideal EMT split across Buffer-Cache, Write-Cache, and Compression-Store between functional polymorphism only (F-only) and functional+representational polymorphism (F+R). [Bars: EMT allocation in % for each of benchmarks A-F, under F-only and F+R.]

Figure 13: Adapting to dynamic changes as YCSB transitions from the read-only YCSB-C workload to the read-write YCSB-A workload. PolyEMT identifies that YCSB-C benefits from direct access to volatile EMT pages via load/store, whereas YCSB-A benefits from 80% of EMT as Write-Cache and 20% as Compression-Store, within twenty minutes. F-only: Functional-only; F+R: Functional+Representational. [Panels: (a) average performance, throughput over time; (b) performance at tail, latency over time; (c) resulting split, EMT allocation % per runtime adaptation step.]

As expected, the best split (indicated with an up-arrow) differs across these applications. However, as the search space is convex with a single maximal point, linear search finds the ideal split with regard to overall throughput.
   Fig. 12 presents the partitions identified across Write-Cache, Compression-Store, and Buffer-Cache. Exploiting functional polymorphism alone suffices for the read-intensive YCSB-C and YCSB-E; this is because the re-purposed capacity fits the entire working size in DRAM. In contrast, benchmarks with write operations employ representational polymorphism, as their combined working set for reads and writes exceeds the total physical capacity. For example, YCSB-A and YCSB-F use representational polymorphism with 20% of EMT in compressed form and the rest as write-cache. This shows the ability of PolyEMT to tune memory and storage resource characteristics based on application needs.

5.3 Adapting to dynamic phase changes
We next illustrate runtime re-configuration using PolyEMT in Fig. 13 for a server running YCSB-C. The PolyEMT runtime begins with a full Write-Cache and subsequently exploits functional polymorphism. As YCSB-C is read-only, direct access to volatile BB-DRAM pages via hardware increases its throughput from 4K operations per second to 11.5K operations per second (F-only for YCSB-C in Fig. 13a), resulting in 80% of BB-DRAM as volatile memory extension and 20% in the Write-Cache (refer to runtime adaptation step 5 in Fig. 13c).
   At this point, the server gets a burst of writes as its phase changes to 50% reads and 50% updates (YCSB-A). So, PolyEMT starts with a full Write-Cache and exploits functional polymorphism until it reaches a split of 40% EMT in storage and 60% in memory. This partition balances the relative impact of the reads and writes and achieves the highest throughput (F-only for YCSB-A in Fig. 13a). However, PolyEMT observes that the tail latency of update requests is much higher than that of volatile reads, as shown in Fig. 13b at around the 30-minute mark on the x-axis. At this point, PolyEMT exploits representational polymorphism within the storage cache to alleviate the high update latency. It identifies that keeping 20% of the total EMT pages in compressed form yields better tail performance without impacting average performance. PolyEMT further re-apportions EMT between the non-volatile region (80%) and the shared compressed region (20%) to bring the tail latency of both reads and writes below 180µs. This demonstrates the ability of PolyEMT to adapt to dynamic changes.

5.4 Cost benefits of polymorphism
We next study the cost benefits of PolyEMT by comparing the following provisioning strategies:
(i) Application-specific static provisioning: Each application runs on a dedicated server with the DRAM, BB-DRAM, and SSD capacity right-sized to meet the application's needs. Such hardware right-sizing is primarily suited to applications that run in private datacenters to reduce costs, but it is impractical at scale. Hence, we consider two other strategies that provision a uniform set of physical servers for all applications.
(ii) Application-oblivious static provisioning: This employs an identical set of servers to meet the needs of all applications. Here, both DRAM and BB-DRAM capacities are sized based on the maximum capacity needs across all applications.
(iii) Application-oblivious dynamic provisioning: Here, the servers are provisioned with BB-DRAM based on the peak Write-Cache needs of the applications. However, the servers are under-provisioned in their DRAM capacity, as BB-DRAM can also function as DRAM using PolyEMT. This strategy can serve diverse applications using a uniform fleet as in the second design, but with reduced upfront capital costs.
   We consider SSD to cost 25 cents per GB [65], DRAM to cost $10 per GB [32], and BB-DRAM to cost 1.2× to 2× as much as DRAM [25, 55]. Table 3 presents the datacenter-level total cost of ownership (TCO) [31] when hosting the workloads under each of these designs. As expected, right-sizing the server for each application yields the lowest TCO, which increases by only 1.19% even if the cost of BB-DRAM rises from 1.2× to 2× that of DRAM. Although attractive from a cost perspective, this strategy is difficult in practice due to the diversity of applications. Among the practical alternatives, App-Oblivious-Dynamic translates to a 1.97% reduction in datacenter TCO compared to App-Oblivious-Static.

                              Cost-ratio (BB-DRAM/DRAM)
  Provisioning Strategy        1.2×      1.5×      2×
  App-Specific-Static          0         0.435%    1.19%
  App-Oblivious-Static         2.18%     2.925%    4.15%
  App-Oblivious-Dynamic        0.205%    0.947%    2.18%
Table 3: Datacenter TCO under various provisioning strategies and cost ratios. Baseline: App-Specific-Static with a cost-ratio of 1.2×.

6 RELATED WORK
Datacenter resource provisioning: To overcome the challenges in scaling the physical capacity of datacenters [66, 72], prior efforts optimized the statically provisioned capacity [29, 56, 57] and enabled dynamic sharing of capacity using novel techniques such as re-purposing remote memory [52], memory disaggregation [47, 48], compute reconfiguration [38, 80], etc. We complement these works by exploring a knob for dynamic provisioning of memory and storage resources within a server using polymorphic emerging memory technologies.
Employing Emerging Memory Technologies: There is a rich body of work on readily exploiting emerging memory technologies [4, 5, 9, 11, 12, 24, 30, 41, 45, 51, 60, 61, 68] as a DRAM replacement [22, 43, 45] or extension [22, 53], as storage caches [16, 50], and as flash/storage replacement [39]. This includes systems and mechanisms that exploit heterogeneous latency when employing these technologies as a DRAM extension [27, 40] or as storage caches [42, 59]. Further, there is a large body of research on byte-addressable persistent programming techniques at the file system [19, 26, 33, 44, 46, 75, 76] and application levels [13, 18, 25, 28, 35, 70, 74, 78] to achieve performance close to that of the raw hardware. Complementary to such works, we exploit the polymorphic nature of these technologies at the system software level, without changes to file systems or applications, to get closer to raw hardware performance than existing transparent approaches.
Exploiting Polymorphism: Prior works [39, 49, 62, 67] have studied EMT polymorphism. Morphable memory [62] proposes new hardware to trade capacity for latency in PCM by exploiting single-level cell (SLC) and multi-level cell (MLC) representations. Memorage [39] and SAY-Go [67] propose memory and file system changes to increase the effective volatile memory capacity by using the free space of an NVM-based file system volume. In these approaches, the capacity of NVM available for volatile memory decreases over time as the file system fills, and is determined only by memory pressure in a two-tier architecture. In PolyEMT, we focus on a three-tier architecture where NVM is used as a cache to SSDs running on unmodified file systems and hardware.

7 CONCLUSION
Low tail latency for reads and writes is critical for the predictability of storage-intensive applications. The performance and capacity gaps between DRAM and SSDs unfortunately lead to high tail latency for such applications. Emerging memory technologies can help bridge these gaps transparently. However, existing proposals target solutions that benefit either read or write performance, but not both, since they exploit only one of the many capabilities of such technologies. PolyEMT, on the other hand, exploits several polymorphic capabilities of EMTs to alleviate both read and write bottlenecks. It dynamically adjusts the capacity of EMT within each capability layer to tailor it to application needs for maximal performance. Our evaluation with a popular storage benchmark shows that PolyEMT provides substantial improvements in throughput and in the tail latencies of both read and write operations. In the future, we want to include more forms of polymorphism, such as latency vs. retention.

8 ACKNOWLEDGMENTS
This work has been supported in part by NSF grants 1439021, 1526750, 1763681, 1629129, 1629915, and 1714389.

REFERENCES
 [1] [n. d.]. ArangoDB. https://www.arangodb.com/.
 [2] [n. d.]. Helion LZRW Compression cores. https://www.heliontech.com/comp_lzrw.htm.
 [3] [n. d.]. How to emulate Persistent Memory. https://pmem.io/2016/02/22/pm-emulation.html/.
 [4] [n. d.]. HPE Persistent Memory. https://www.hpe.com/us/en/servers/persistent-memory.html.
 [5] [n. d.]. Intel Optane/Micron 3D-XPoint Memory. http://www.intel.com/content/www/us/en/architecture-and-technology/non-volatile-memory.html.
 [6] [n. d.]. Lightning Memory-Mapped Database Manager (LMDB). http://www.lmdb.tech/doc/.
 [7] [n. d.]. MapDB. http://www.mapdb.org/.
 [8] [n. d.]. MonetDB. https://www.monetdb.org/Home.
 [9] [n. d.]. Netlist ExpressVault PCIe (EV3) PCI Express (PCIe) Cache Data Protection. http://www.netlist.com/products/vault-memory-storage/expressvault-pcIe-ev3/default.aspx.
[10] [n. d.]. Redis. https://redis.io/.
[11] Bulent Abali, Hubertus Franke, Xiaowei Shen, Dan E Poff, and T Basil Smith. 2001. Performance of hardware compressed main memory. In High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on. IEEE, 73–81.
[12] Alaa R Alameldeen and David A Wood. 2004. Frequent pattern compression: A significance-based compression scheme for L2 caches. (2004).
[13] Joy Arulraj, Andrew Pavlo, and Subramanya R Dulloor. 2015. Let's talk about storage & recovery methods for non-volatile memory database systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 707–722.
[14] Jens Axboe. 2014. Fio: Flexible IO tester. URL http://freecode.com/projects/io.
[15] Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, and Margo Seltzer. 1992. Non-volatile memory for fast, reliable file systems. In ACM SIGPLAN Notices. ACM, 10–22.
[16] Meenakshi Sundaram Bhaskaran, Jian Xu, and Steven Swanson. 2013. Bankshot: Caching slow storage in fast non-volatile memory. In Proceedings of the 1st Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads. ACM, 1.
[17] Adrian M Caulfield, Arup De, Joel Coburn, Todor I Mollow, Rajesh K Gupta, and Steven Swanson. 2010. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 385–395.
[18] Joel Coburn, Adrian M Caulfield, Ameen Akel, Laura M Grupp, Rajesh K Gupta, Ranjit Jhala, and Steven Swanson. 2011. NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories. ACM Sigplan Notices 46, 3 (2011), 105–118.
[19] Jeremy Condit, Edmund B Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. 2009. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 133–146.
[20] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing. ACM, 143–154.
[21] Fernando J Corbato. 1968. A paging experiment with the Multics system. Technical Report. Massachusetts Institute of Technology.
[22] Gaurav Dhiman, Raid Ayoub, and Tajana Rosing. 2009. PDRAM: A hybrid PRAM and DRAM main memory system. In Proceedings of the 46th Annual Design Automation Conference. ACM, 664–669.
[23] Xiangyu Dong and Yuan Xie. 2011. AdaMS: Adaptive MLC/SLC phase-change memory design for file storage. In Proceedings of the 16th Asia and South Pacific Design Automation Conference. IEEE Press, 31–36.
[24] Fred Douglis. 1993. The Compression Cache: Using On-line Compression to Extend Physical Memory. Usenix Winter.
[25] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No compromises: distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 54–70.
[26] Subramanya R Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. 2014. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems. ACM, 15.
[27] Subramanya R Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram, Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. 2016. Data tiering in heterogeneous memory systems. In Proceedings of the Eleventh European Conference on Computer Systems. ACM, 15.
[28] Ru Fang, Hui-I Hsiao, Bin He, C Mohan, and Yun Wang. 2011. High performance database logging using storage class memory. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering. IEEE Computer Society, 1221–1231.
[29] Inigo Goiri, Kien Le, Jordi Guitart, Jordi Torres, and Ricardo Bianchini. 2011. Intelligent placement of datacenters for internet services. In 2011 31st International Conference on Distributed Computing Systems. IEEE, 131–142.
[30] Erik G Hallnor and Steven K Reinhardt. 2005. A unified compressed memory hierarchy. In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. IEEE, 201–212.
[31] James Hamilton. [n. d.]. Overall data center costs. https://perspectives.mvdirona.com/2010/09/overall-data-center-costs/.
[32] Sarah Harris and David Harris. 2015. Digital design and computer architecture: ARM edition. Morgan Kaufmann.
[33] Y Hu, Z Zhu, I Neal, Y Kwon, T Cheng, V Chidambaram, and E Witchel. 2018. TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions. In USENIX Annual Technical Conference (ATC).
[34] Jian Huang, Anirudh Badam, Moinuddin K Qureshi, and Karsten Schwan. 2015. Unified address translation for memory-mapped SSDs with FlashMap. In ACM SIGARCH Computer Architecture News. ACM, 580–591.
[35] Jian Huang, Karsten Schwan, and Moinuddin K Qureshi. 2014. NVRAM-aware logging in transaction systems. Proceedings of the VLDB Endowment 8, 4 (2014), 389–400.
[36] Intel. 2017. Application latency comparison, Introduction to Programming with Persistent Memory from Intel. https://software.intel.com/en-us/articles/introduction-to-programming-with-persistent-memory-from-intel.
[37] Theodore Johnson and Dennis Shasha. 1994. 2Q: a low overhead high performance buffer management replacement algorithm. In Proceedings of the 20th International Conference on Very Large Data Bases. 439–450.
[38] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 1–12.
[39] Ju-Young Jung and Sangyeun Cho. 2013. Memorage: Emerging persistent RAM based malleable main memory and storage architecture. In Proceedings of the 27th international ACM conference on International conference on supercomputing. ACM, 115–126.