System Memory at a Fraction of the DRAM Cost

Abstract
Contemporary application software requires much more memory than was imaginable a decade ago, and requirements for larger memory appear to increase with time. Server applications demand the most memory, but COTS systems' memory capacity falls short of the demand, and specialized solutions come with a very high premium. Despite all efforts, these solutions remain prohibitively expensive for most organizations.

In this paper we introduce a novel alternative to expensive dynamic random-access memory (DRAM) and proprietary large servers: software-defined memory (SDM), coupled with high-performance non-volatile memory (NVM). We show how an SDM implementation from ScaleMP, vSMP MemoryONE™, and the latest innovations in NVM — namely, NVMe and NVDIMM — deliver very good performance at a reasonable cost. Relevant use cases help illustrate this point.

Background
In an effort to keep pace with Moore's Law, the number of cores on a processor has increased significantly in the last decade. At the same time, data and workloads have grown significantly, taking advantage of the increased available compute power. With the introduction of commercial NAND-based solid-state drives (SSDs) in 2009, the I/O subsystem likewise saw significant improvements: SSDs grew in density and size while power consumption and cost were reduced, delivering previously unimaginable economies of scale.

However, one key component of the computer system has not kept pace with the rest: system memory. Typically made of DRAM, its performance has only marginally increased and its capacity has not seen huge gains, while the price of DRAM remains very high. As of early 2020, COTS dual-socket server systems reach 1TB-4TB of overall DRAM capacity, and premium server models and processor SKUs reach 3TB-12TB.

Server Memory Requirements
From the perspective of system memory requirements, current-day computing workloads can be classified into two key use cases:

•   Compute-centric workloads: most workloads fit within this category, ranging from modest memory requirements (where multiple workloads can share the same COTS system) to cases where a single workload requires the full memory capacity of a COTS server.

•   Memory-demanding workloads: these require more RAM than a COTS server can provide and need specialized systems with high DRAM capacity to execute. This category includes in-memory databases, graph processing, bioinformatics, etc.

Memory-demanding workloads obviously drive a need for more memory, but so do many compute-centric workloads, for which IT wants to utilize all available cores. The core count grows with every new CPU generation (from 2005 to 2020, for example, it grew from one core to 64 cores per processor), and the memory-per-core ratio typically grows as well for a given application domain. The increase in core count enables more workloads to run concurrently on the same system using containers, virtual machines or other multitenancy paradigms, thus driving the need for large memory in order to utilize the added compute power.

The growth in the compute power of a single system, and in workload memory usage, has only been partially met by developments in the DRAM market. Server systems remain limited in the number of DRAM slots per processor, with a maximum of 12-16 slots per processor on COTS systems in 2020. DIMM capacity is also limited: in early 2020, the $/GB sweet spot is 32GB-64GB DIMMs; 128GB DIMMs come at a premium and require higher processor SKUs; and 256GB DIMMs more than triple the cost per GB, are in short supply, and have very long lead times.

Storage-Class Memory
In the past decade, with the growing popularity of NAND Flash in the consumer storage market, manufacturing volumes have grown and the technology has improved. Since 2009, NAND-based SSDs have become a commodity product, forcing the storage industry to innovate in non-volatile enterprise storage, including both controllers and the storage media itself, which is no longer limited to NAND.

These new technologies — storage-class memory, or SCM — have improved the performance of solid-state I/O devices and narrowed the performance gap between NVM and DRAM.

In 2008, IBM defined SCM as:

"...
•   Solid state, no moving parts
•   Short access times (DRAM-like, within an order of magnitude)
•   Low cost per bit (DISK-like, within an order of magnitude)
•   Non-volatile (~10 years)
..."
(IBM Almaden Research Center, Freitas, R.F., Wilcke, W.W.)

So, on one hand, SCM devices are all-silicon (or other solid-state) memory components with performance attributes close to those of DRAM. On the other hand, they are non-volatile storage with the capacity and economics of legacy storage devices such as hard drives. (For the sake of precision: as of early 2020, the latency of the highest-performing NVM technologies is O(10µs), compared to O(100ns) for DRAM, and therefore falls short of IBM's definition. However, vSMP MemoryONE enhances the actual average performance, so we shall regard those technologies as SCM here.)

Typically, when users install today's SCM into their systems, they see a storage device. The operating system can utilize such devices only as block devices, because today's SCM lacks byte addressability and cannot be used transparently as if it were DRAM. In other words, the operating system cannot access SCM at the byte level, only at the block level. (For reference, blocks are typically hundreds or thousands of bytes.)
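
To make the distinction concrete, the sketch below contrasts the two access models in Python. It is an illustration written for this paper, not ScaleMP code; the device path and the 4,096-byte block size are assumptions chosen for the example.

    import os

    BLOCK_SIZE = 4096  # assumed logical block size; real devices may differ

    def read_byte_via_block_device(path, byte_offset):
        # Block devices are read in whole blocks: to obtain a single byte,
        # the operating system must transfer the entire block containing it.
        fd = os.open(path, os.O_RDONLY)
        try:
            block_start = (byte_offset // BLOCK_SIZE) * BLOCK_SIZE
            block = os.pread(fd, BLOCK_SIZE, block_start)
            return block[byte_offset - block_start]
        finally:
            os.close(fd)

    def read_byte_from_memory(buf, byte_offset):
        # Byte-addressable memory (DRAM, or SCM presented as system memory)
        # allows a single-byte load at any offset.
        return buf[byte_offset]

    # Illustrative usage; "/dev/nvme0n1" is a hypothetical device name.
    # print(read_byte_via_block_device("/dev/nvme0n1", 123457))
    print(read_byte_from_memory(bytearray(b"byte-addressable"), 5))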

Software-Defined Memory for SCM
Assembling memory of varying performance into a single system is a three-decade-old practice. NUMA systems — that is, pretty much any multiprocessor system these days — allow a processor to access memory at varying degrees of latency or "distance" (e.g. memory attached to another processor), over a network or fabric. In some cases this fabric is purpose-built for such processor communication, like Intel's UltraPath Interconnect (UPI). In other cases, standard fabrics such as PCI Express or InfiniBand are used for the same purpose, in combination with SDM, to provide memory coherency and operate as if additional memory were installed in the system (SDM for Fabrics).

This is exactly what ScaleMP has been offering since 2003 with its vSMP Foundation line of products. Accessing memory over networks at varying (lower) performance has proven feasible and useful thanks to predictive memory-access technologies that support advanced caching and replication, effectively trading latency for bandwidth. ScaleMP's vSMP MemoryONE does exactly that, but instead of doing it over a fabric, it does so with SCM (SDM over SCM).

By using ScaleMP's advanced and time-tested memory-access prediction algorithms and OS-transparent Virtual Machine Monitor (VMM) design, ScaleMP presents the aggregated capacity of the DRAM and NVM installed in the system as one coherent, shared memory address space. No changes are required to the operating system, the applications, or any other system component. With this approach, vSMP MemoryONE brings new capabilities and flexibility to IT.
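
To give a feel for the general idea (a small, fast memory tier transparently fronting a larger, slower one), the sketch below models a toy LRU page cache in Python. It is a conceptual analogy written for this paper; it does not represent ScaleMP's actual algorithms, which additionally rely on predictive access patterns and replication.

    from collections import OrderedDict

    class TieredMemory:
        # Toy model: a small fast tier ("DRAM") caching fixed-size pages
        # of a larger slow tier ("NVM"), with LRU eviction and write-back.

        def __init__(self, dram_pages, nvm_pages, page_size=4096):
            self.page_size = page_size
            self.dram_capacity = dram_pages
            self.dram = OrderedDict()  # page number -> bytearray, in LRU order
            self.nvm = {p: bytearray(page_size) for p in range(nvm_pages)}

        def _fetch(self, page):
            if page in self.dram:
                self.dram.move_to_end(page)          # hit: DRAM-speed access
            else:
                if len(self.dram) >= self.dram_capacity:
                    victim, data = self.dram.popitem(last=False)
                    self.nvm[victim] = data          # write back the evicted page
                self.dram[page] = bytearray(self.nvm[page])  # copy the page up
            return self.dram[page]

        def read(self, addr):
            page, offset = divmod(addr, self.page_size)
            return self._fetch(page)[offset]

        def write(self, addr, value):
            page, offset = divmod(addr, self.page_size)
            self._fetch(page)[offset] = value

    # Usage: 256 fast pages fronting 2,048 slow pages (scaled down for illustration).
    mem = TieredMemory(dram_pages=256, nvm_pages=2048)
    mem.write(5000000, 42)
    assert mem.read(5000000) == 42
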
vSMP MemoryONE Use Cases
There are two key scenarios in which it is beneficial for an IT environment to use vSMP MemoryONE:

•   Where very large system memory is required – more than the DRAM capacity of a COTS server system

•   Where the cost savings of replacing DRAM with NVM outweigh the performance difference between the two technologies

For the purposes of this paper, we shall refer to the former as "Memory Expansion" and the latter as "Memory Replacement".

Examples of Memory Expansion
Consider in-memory database (IMDB) engines, which require a shared-memory (non-distributed) system. Examples include Oracle TimesTen, DB2 BLU and SAP HANA. With vSMP MemoryONE, an in-memory database larger than 10TB can easily be run on a COTS dual-socket server system. In fact, in-memory database deployments of any size larger than the COTS server DRAM limit (>1TB as of early 2020) may benefit from it.

Additionally, in scientific computing and computer-aided engineering (CAE), there are many workloads that cannot be run without sufficient system memory. Examples include de-novo genome assembly, which typically keeps a De Bruijn graph in memory, and pre-processing/meshing of a model for simulation in CAE. vSMP MemoryONE can also be used on the desktop, where the PC is typically limited to 64GB-128GB of DRAM.

Examples of Memory Replacement
Multitenancy scenarios are very common in enterprise IT, and even more popular with cloud service providers. The more workloads (e.g. multiple VMs, multiple containers, multitenant databases, etc.) one can place on a single physical server, the better the utilization or yield of the infrastructure. In most multitenant cases, system memory limits the number of workloads per COTS server, so there is reason to deploy systems with maximum memory capacity. With vSMP MemoryONE, less DRAM is used per node and is replaced by NVM. In some cases, even more NVM is deployed to further leverage the economic benefits.

Another example could be a large grid running an embarrassingly parallel workload, such as "value at risk" (VaR) for a financial institution. Many thousands of independent Monte Carlo-based computations execute concurrently, and the average memory-per-core requirement can be predicted. However, the actual processes vary in memory consumption to the point that IT must over-provision memory in the nodes to avoid execution failures. With vSMP MemoryONE, IT would provision just the amount of DRAM needed per core — even lower than the average — and augment the missing DRAM for peak use with NVM. No failures would occur, and processes that "spike" would only endure a slight increase in runtime (approximately 10 percent). Yet another example is Apache Spark in-memory processing, where increasing per-node system memory can eliminate I/O (shuffling), allow for a reduction in the number of nodes in the Apache Spark cluster, and increase overall performance.

Economic Model for Memory Expansion
For an application needing even just 1TB of RAM, organizations may need to procure expensive DRAM modules or move away from a COTS server to a higher-scale system; for example, a four-socket Intel Xeon Scalable system, which offers more sockets and thus a higher total number of DIMMs.

The cost of upgrading to a high-end server is not insignificant: while a dual-socket server with 1.5TB of DRAM could cost less than $15,000, a quad-socket system with double the memory capacity could cost more than $40,000. It would also require 3U or more of rack space, as well as much more power and cooling.

In comparison, a similar dual-socket system with 384GB DRAM and two 1,000GB NVMe SSDs, including a vSMP MemoryONE license, could cost less than $15,000. The same system with 768GB DRAM and 6,000GB of NVM could cost less than $30,000 — that's a system memory of more than 6TB for less than the cost of a 3TB DRAM-only system!
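
For a rough side-by-side view, the snippet below computes the system memory presented to the OS and the approximate cost per GB for the configurations described above, using the indicative prices quoted in this section (actual pricing varies by vendor and over time).

    # Indicative early-2020 configurations from the text above:
    # (name, DRAM GB, NVM GB, approximate price in USD)
    configs = [
        ("Quad-socket, 3TB DRAM only",          3072,    0, 40000),
        ("Dual-socket, 1.5TB DRAM only",        1536,    0, 15000),
        ("Dual-socket, 384GB DRAM + 2TB NVMe",   384, 2000, 15000),
        ("Dual-socket, 768GB DRAM + 6TB NVM",    768, 6000, 30000),
    ]

    for name, dram_gb, nvm_gb, price_usd in configs:
        # Capacity presented to the OS (DRAM plus NVM where vSMP MemoryONE is used).
        total_gb = dram_gb + nvm_gb
        print(f"{name:38s} {total_gb:5d} GB  ~${price_usd / total_gb:5.2f}/GB")
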
Economic Model for Memory Replacement
The financial benefits of memory replacement can be attributed to two key aspects of TCO:

1.  Acquisition cost: NVM costs roughly 70 percent less than DRAM. A 128GB DDR4 DIMM costs no less than $1,000 (early 2020), or at least $7.80/GB, while enterprise-class NVMe SSDs with vSMP MemoryONE licenses can be procured for $2.00/GB to $3.50/GB. Thus, for a dual-socket server, instead of buying 1,536GB of DRAM, one could buy 192GB of DRAM and two 1TB NVMe units, and possibly save more than $7,000 on hardware and vSMP MemoryONE software licensing. That's more than 50% savings on memory acquisition cost.

2.  Operational cost: early-2020 NVMe drives consume 18-25 watts, while a DDR4 DIMM consumes 4-6 watts. In the first scenario (DIMMs replaced by two NVMe SSDs), the savings could be as high as 70 watts per server, which significantly reduces the lifetime cost of power and cooling in a cloud environment. For example, if power costs $0.15 per kWh, and assuming a PUE of 2.0, that's a savings of about $550 per server over three years of non-stop operation. (A worked version of both calculations follows this list.)
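
The arithmetic behind these two points can be written out directly. The short sketch below uses the indicative figures quoted above; it is a worked illustration, not a pricing tool.

    # 1. Acquisition cost per GB (indicative early-2020 figures from the text).
    dram_usd_per_gb = 1000 / 128          # 128GB DDR4 DIMM at about $1,000 -> ~$7.8/GB
    for nvm_usd_per_gb in (2.00, 3.50):   # NVMe SSD including vSMP MemoryONE license
        saving = (1 - nvm_usd_per_gb / dram_usd_per_gb) * 100
        print(f"NVM at ${nvm_usd_per_gb:.2f}/GB costs {saving:.0f}% less per GB "
              f"than DRAM at ${dram_usd_per_gb:.2f}/GB")

    # 2. Operational cost: ~70W saved per server, amplified by a PUE of 2.0.
    watts_saved, pue, usd_per_kwh = 70, 2.0, 0.15
    kwh_over_3_years = watts_saved * pue * 3 * 365 * 24 / 1000
    print(f"Power and cooling savings over 3 years: "
          f"~${kwh_over_3_years * usd_per_kwh:,.0f} per server")
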
Performance
When comparing alternatives, one must consider the technology tradeoffs: cost and performance. Using DRAM-only configurations wherever possible would represent the highest-performing, and most expensive, alternative. We have shown that vSMP MemoryONE costs much less than DRAM, but what about performance? Is it close to DRAM? What is the tradeoff, and which workloads are a good fit?

We use a random-access key-value store benchmark for this comparison of DRAM-only vs. a 2020-retail NAND NVMe product combined with vSMP MemoryONE. The results are presented in the chart below.

As one would expect, results differ for different workloads, yet even with NAND-based NVM products vSMP MemoryONE comes close to DRAM performance. Using newer SCM products will further narrow the gap.

Chart: Redis – Key:Value, 100,000 bytes x 73k times and 1,000 bytes x 6.5M times
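
For readers who want to run a workload of this general shape, the sketch below issues random-access SET/GET operations against a Redis instance using the redis-py client, with the two value sizes from the chart. The host, key counts, and operation counts are assumptions for illustration; they do not reproduce the exact benchmark configuration used above.

    import os
    import random
    import redis  # pip install redis; assumes a Redis server on localhost:6379

    r = redis.Redis(host="localhost", port=6379)

    def random_access(value_size, n_keys, n_ops):
        payload = os.urandom(value_size)
        for i in range(n_keys):                        # populate the keyspace
            r.set(f"k:{value_size}:{i}", payload)
        for _ in range(n_ops):                         # random-access reads
            r.get(f"k:{value_size}:{random.randrange(n_keys)}")

    # Scaled-down stand-ins for the two workloads shown in the chart.
    random_access(value_size=100000, n_keys=1000, n_ops=10000)
    random_access(value_size=1000, n_keys=10000, n_ops=100000)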

Summary
Storage-class memory is here, and it allows for excellent I/O performance. Software-defined memory for SCM is available too. With vSMP MemoryONE from ScaleMP, SCM can be used not just for I/O — it can be used to replace or expand DRAM at a fraction of the cost and with minimal performance impact.

Do your applications need more memory? Will your infrastructure benefit from having more memory per
node? The solution is here. Visit http://www.scalemp.com/memoryone or register with ScaleMP to have a
technical specialist contact you: http://www.scalemp.com/register