Toward Efficient In-memory Data Analytics on NUMA Systems


Puya Memarzia, University of New Brunswick, Fredericton, Canada, pmemarzi@unb.ca
Suprio Ray, University of New Brunswick, Fredericton, Canada, sray@unb.ca
Virendra C Bhavsar, University of New Brunswick, Fredericton, Canada, bhavsar@unb.ca

arXiv:1908.01860v3 [cs.DB] 25 Jan 2020

ABSTRACT

Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on Non-Uniform Memory Access (NUMA) architectures in order to achieve scalability. A key drawback of NUMA architectures is that many existing software solutions are not aware of the underlying NUMA topology and thus do not take full advantage of the hardware. Modern operating systems are designed to provide basic support for NUMA systems. However, default system configurations are typically sub-optimal for large data analytics applications. Additionally, achieving NUMA-awareness by rewriting the application from the ground up is not always feasible.

In this work, we evaluate a variety of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. We analyze the impact of different memory allocators, memory placement strategies, thread placement, and kernel-level load balancing and memory management mechanisms. Our findings indicate that the operating system default configurations can be detrimental to query performance. With extensive experimental evaluation, we demonstrate that methodical application of these techniques can be used to obtain significant speedups in four commonplace in-memory data analytics workloads, on three different hardware architectures. Furthermore, we show that these strategies can speed up two popular database systems running a TPC-H workload.

Categories and Subject Descriptors

H.2.4 [Systems]: Query Processing

Keywords

NUMA, Memory Allocators, Memory Management, Concurrency, Database Systems, Operating Systems

© 2019 Copyright held by the owner/author(s). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.

1. INTRODUCTION

The digital world is producing large volumes of data at increasingly higher rates [76, 34, 68]. Data analytics systems are among the key technologies that power the information age. The breadth of applications that depend on efficient data processing has grown dramatically. Main memory query processing systems have been increasingly adopted, due to continuous improvements in DRAM capacity and speed, and the growing demands of the data analytics industry [36]. As the hardware landscape shifts toward greater parallelism and scalability, keeping pace with these changes and maintaining efficiency is a key challenge.

The development of commodity CPU architectures continues to be influenced by various obstacles that hinder the speed and quantity of processing cores that can be packed into a single processor die [1]. The power wall motivated the development of multi-core CPUs [22], which have become the de facto industry standard. The memory wall [50, 65] is a symptom of the growing gap between CPU and memory performance, and the bandwidth starvation of processing cores that share the same memory controller. The demand for greater processing power has pushed the adoption of various decentralized memory controller layouts, which are collectively known as non-uniform memory access (NUMA) architectures. These architectures are widely popular in the server and high performance workstation markets, where they are used for compute-intensive and data-intensive tasks. NUMA architectures are pervasive in multi-socket and in-memory rack-scale systems. Recent developments have led to On-Chip NUMA Architectures (OCNA) that partition the processor's cores into multiple NUMA regions, each with their own dedicated memory controller [52, 67]. It is clear that the future is NUMA, and that the software stack needs to evolve and keep pace with these changes. Although these advances have opened a path toward greater performance, the burden of efficiently leveraging the hardware has mostly fallen on software developers and system administrators.

Although a NUMA system's memory is shared among all its processors, the access times to different portions of the memory vary depending on the topology. NUMA systems encompass a wide variety of CPU architectures, topologies, and interconnect technologies. As such, there is no standard for what a NUMA system's topology should look like. Due to the variety of NUMA topologies and applications, fine-tuning the algorithm to a single machine configuration will not necessarily achieve optimal performance on other machines. Given sufficient time and resources, applications could be fine-tuned to the different system configurations that they are deployed on. However, in the real world, this is not always feasible. Therefore, it is desirable to pursue solutions that can improve performance across-the-board, without tuning the code.
In an effort to provide a general solution that speeds up applications on NUMA systems, some researchers have proposed using NUMA schedulers that co-exist with the operating system (OS). These schedulers operate by monitoring running applications in real-time, and managing thread and memory placement [7, 15, 47]. The schedulers make decisions based on memory access patterns, and aim to balance the system load. However, some of these approaches are not architecture or OS independent. For instance, Carrefour [13] needs an AMD CPU based on the K10 architecture, in addition to a modified OS kernel. Moreover, researchers have argued that these schedulers may not be beneficial for multi-threaded in-memory query processing [58]. A different approach involves either extensively modifying or completely replacing the operating system. This is done with the goal of providing a custom tailored environment for the application. Some researchers have pursued this direction with the goal of providing an operating system that is more suitable for large database applications [24, 26, 27]. Custom operating systems aim to reduce the burden on developers, but their adoption has been limited due to the high pace of advances in both the hardware and software stack. In the past, researchers in the systems community proposed a few new operating systems for multicore architectures, including Corey [9], Barrelfish [3] and fos [80]. However, none of them were adopted by the industry. We believe that any custom operating system designed for data analytics will follow the same trajectory. On the other hand, these efforts underscore the need to investigate the impact of system and architectural aspects on query performance.

In recent times, researchers in the database community have started to pay attention to the issues with query performance on NUMA systems. These researchers have favored a more application-oriented approach that involves algorithmic tweaks to the application's source code, particularly in the context of query processing engines. Among these works, some are static solutions that attempted to make query operators NUMA-aware [66, 78]. Others are dynamic solutions that focused on work allocation to threads using work-stealing [45], data placement [39, 56] and task scheduling with adaptive data repartitioning [60]. These approaches can be costly and time-consuming to implement, and incorporating these solutions into commercial database engines will take time. Regardless, our work is orthogonal to these efforts, as we explore application-agnostic approaches to improve query performance.

Software has been generally slow in adapting to shifts in hardware architecture, such as NUMA. Inefficiencies in the software stack are not always obvious, and the lack of efficient hardware utilization has been easy to overlook in some fields due to a greater focus on multitasking. One common approach is to run multiple tasks (or virtual machines), and give each task a slice of the hardware resources proportional to its needs. This approach is not suitable for data analytics, due to the size of the data, as well as the importance of query throughput and latency. Processing large datasets in main memory data analytics typically calls for a greater emphasis on intra-query parallelism and hardware-awareness.

Main memory data analytics achieve high throughput by leveraging data parallelism on very large sets of memory-resident data, thus diminishing the influence of disk I/O. However, applications that are not NUMA-aware do not fully utilize the hardware's potential [39]. Furthermore, rewriting the application is not always an option. Solving this problem without extensively modifying the code requires tools and tuning strategies that are application-agnostic. In this work, we evaluate the viability of several key approaches that aim to achieve this. In this context, the impact and role of memory allocators have been under-appreciated and overlooked. We demonstrate that significant performance gains can be achieved by altering policies that affect thread placement, memory allocation and placement, and load balancing. In particular, we investigate 5 different workloads that prominently feature joins and aggregations, arguably two of the most popular and computationally expensive workloads used in data analytics. Our study covers the following aspects:

1. Dynamic memory allocators (Section 3.1)
2. Thread placement and scheduling (Section 3.2)
3. Memory placement policies (Section 3.3)
4. Operating system configuration: virtual memory page size and NUMA load balancing (Section 3.4)

An important finding from our research is that the default operating system environment can be detrimental to query performance. For instance, the default Linux memory allocator ptmalloc can perform poorly compared to other alternatives. Furthermore, with extensive experimental evaluation, we demonstrate that it is possible to systematically utilize application-agnostic (or black-box) approaches to obtain speedups on a variety of in-memory data analytics workloads. We show that a hash join workload achieves a 3× speedup on Machine C (see machine topologies in Figure 1 and specifications in Table 3), just from using the tbbmalloc memory allocator. This speedup improves to 20× when we utilize the Interleave memory placement policy and modify the OS configuration. We also show that our findings can carry over to other hardware configurations, by evaluating the experiments on machines with three different hardware architectures and NUMA topologies. Lastly, we show that performance can be improved on two real database systems: MonetDB and PostgreSQL. For example, MonetDB's query latency for the TPC-H workload is reduced by up to 20% when overriding the memory allocator, and by 43% by adjusting the operating system configuration.

The main contributions of this paper are as follows:

• Categorization and analysis of the current state-of-the-art strategies to improve application performance on NUMA systems
• The first study on NUMA systems (to our knowledge) that explores the combined impact of different memory allocators, thread and memory placement policies, and OS-level configurations, on data analytics workloads
• Extensive experimental evaluation, including different workloads, machine architectures and topologies, profiling and performance counters, and microbenchmarks
• An effective application-agnostic strategic plan to help practitioners speed up memory-intensive applications with minimal code modifications

The remainder of this paper is organized as follows: we provide some background on the problem and elaborate on the workloads in Section 2. In Section 3 we discuss the strategies for improving query performance on NUMA systems. We present our setup and experiments in Section 4. We categorize and discuss some of the related work in Section 5. Finally, we conclude the paper in Section 6.
[Figure 1: Machine NUMA Topologies (machine specifications in Table 3). (a) Machine A; (b) Machine B; (c) Machine C]
Table 1: Experiment Workloads

  Workload                                   SQL Equivalent
  W1) Holistic Aggregation                   SELECT groupkey, MEDIAN(val)
      (Hash-based) [51]                      FROM records
                                             GROUP BY groupkey;
  W2) Distributive Aggregation               SELECT groupkey, COUNT(val)
      (Hash-based) [51]                      FROM records
                                             GROUP BY groupkey;
  W3) Hash Join [8]                          SELECT *
                                             FROM table1 INNER JOIN table2
                                             ON table1.jkey = table2.fkey;
  W4) Index Nested Loop Join                 CREATE INDEX idx_jkey
      (Different Indexes)                    ON table1 (jkey);
      [46, 49, 61, 77]                       SELECT COUNT(*)
                                             FROM table1 INNER JOIN table2
                                             ON table1.jkey = table2.fkey;
  W5) TPC-H [14]                             22 queries that mimic business questions on a
                                             decision support system. Combination of joins
                                             and aggregations.

2. BACKGROUND

A NUMA system is divided into several NUMA nodes. Each node consists of one or more processors and their local memory resources. Multiple NUMA nodes are linked together using an interconnect to form a NUMA topology. The topology of our machines is shown in Figure 1. A local memory access involves data that resides on the same node, whereas accessing data on any other node is considered a remote access. Remote data travels over the interconnect, and may need to hop through one or more nodes to reach its destination. Consequently, remote memory access is slower.

In addition to remote memory access, contention is another possible cause of sub-optimal performance on NUMA systems. Due to the memory wall [1], modern CPUs are capable of generating memory requests at a very high rate, which may result in pressure on either the interconnect or the memory controller [15]. Lastly, the abundance of hardware threads in NUMA systems presents a challenge in terms of scalability, particularly in scenarios with many concurrent memory allocation requests. In Section 3, we explore strategies which can be used to mitigate these issues.
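To make the local/remote distinction concrete, the following minimal sketch (ours, not part of the paper's code) queries the topology with libnuma on Linux and prints the node-to-node distance matrix reported by the kernel; it assumes libnuma is installed and the program is linked with -lnuma.

    // Sketch: inspect the NUMA topology with libnuma (Linux, link with -lnuma).
    // Node counts and distances depend on the machine (compare Figure 1).
    #include <numa.h>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {               // negative means no NUMA support
            std::printf("NUMA is not available on this system\n");
            return 1;
        }
        int max_node = numa_max_node();           // highest NUMA node id
        std::printf("NUMA nodes: %d\n", max_node + 1);
        for (int from = 0; from <= max_node; ++from) {
            for (int to = 0; to <= max_node; ++to) {
                // numa_distance() reports a relative access cost (10 = local);
                // remote nodes report larger values, growing with hop count.
                std::printf("%4d", numa_distance(from, to));
            }
            std::printf("\n");
        }
        return 0;
    }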
2.1 Experiment Workloads

Our goal is to analyze the effects of NUMA on data analytics workloads, and show effective strategies to gain speedups in these workloads. We have selected five workloads, shown in Table 1, to represent a variety of data operations that are common in data analytics and decision support systems. The implementation of these workloads is described in more detail in Section 4.2. We now provide some background on the experiment workloads.

Joins and aggregations are ubiquitous, essential data operations used in many different applications. When used for in-memory query processing, they are notable for stressing the system's memory bandwidth in addition to its capacity. Joins and aggregations are essential components in analytical queries, and are frequently used in popular database benchmarks, such as the TPC-H [14] benchmark.

A typical aggregation workload involves grouping tuples by a designated grouping column and then applying an aggregate function to each group. Aggregate functions are divided into three categories: distributive, algebraic, and holistic. Distributive functions, such as the Count function used in W2 (see Table 1), can be decomposed and processed in a distributed manner. This means that the input can be split up, processed, and recombined to produce the final result. Algebraic functions combine two or more distributive functions. For instance, Average can be broken down into two distributive functions: Count and Sum. Holistic aggregate functions, such as the Median function used in W1, cannot be decomposed into multiple functions or steps. These aggregate functions do not produce intermediate values, and each output tuple is the result of processing all of the input tuples for its corresponding group. As a result, these aggregate functions are more demanding on the memory system. W3 represents a hash join query. As described in [8], the query joins two tables with a size ratio of 1:16, which is designed to mimic common decision support systems. The join is performed by building a hash table on the smaller table, and probing the larger table for matching keys. W4 is an index nested loop join using the same dataset as W3. The main difference between W3 and W4 is that W3 builds an ad hoc hash table to perform the join, whereas W4 uses a pre-built in-memory index that accelerates lookups to one of the relations. W5 is a database system workload, using the queries and datasets from the TPC-H benchmark [14]. We evaluate W5 on two database systems: MonetDB [53] and PostgreSQL [73].
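The actual workload implementations are covered in Section 4.2; purely as an illustration of the build/probe pattern behind W3, the sketch below builds a hash table on the smaller relation and probes it with the larger one. The tuple layout and the use of std::unordered_multimap are our simplifying assumptions, not the paper's implementation.

    // Sketch of the hash join pattern used by W3 (illustrative, not the paper's code).
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Tuple { std::uint64_t key; std::uint64_t payload; };

    std::size_t hash_join_count(const std::vector<Tuple>& small_rel,
                                const std::vector<Tuple>& large_rel) {
        // Build phase: index the smaller relation by its join key.
        std::unordered_multimap<std::uint64_t, std::uint64_t> ht;
        ht.reserve(small_rel.size());
        for (const Tuple& t : small_rel) ht.emplace(t.key, t.payload);

        // Probe phase: stream the larger relation and look up matching keys.
        std::size_t matches = 0;
        for (const Tuple& t : large_rel) {
            auto range = ht.equal_range(t.key);
            for (auto it = range.first; it != range.second; ++it) ++matches;
        }
        return matches;   // W4 would probe a pre-built in-memory index instead
    }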
3. IMPROVING QUERY PERFORMANCE ON NUMA SYSTEMS

Achieving good performance on NUMA systems involves careful consideration of thread placement, memory management, and load balancing. We explore application-agnostic strategies that can be applied to the data analytics application in either a black box manner, or with minimal tweaks to the code. Some strategies are exclusive to NUMA systems, whereas others may also yield benefits on uniform memory access (UMA) systems. These strategies consist of: overriding the memory allocator, defining a thread placement and affinity scheme, using a memory placement policy, and changing the operating system configuration. In this section, we describe these strategies and outline the options used for each one.

3.1 Dynamic Memory Allocators

Dynamic memory allocators are used to track and manage dynamic memory during the lifetime of an application. The performance impact of memory allocators is often overlooked in favor of exploring ways to tweak the application's algorithms. It can be argued that this makes them one of the most under-appreciated system components. Both UMA and NUMA systems can benefit from faster or more efficient memory allocators. NUMA systems typically contain more processing cores, and are particularly sensitive to performance penalties induced by memory access and cache behavior. Key allocator attributes include allocation speed, fragmentation, and concurrency. Most developers use the default memory allocation functions to allocate or deallocate memory (malloc/new and free/delete), and trust that their library will perform these operations efficiently. In recent years, with the growing popularity of multi-threaded applications, there has been a renewed interest in memory allocators, and several alternative allocators have been proposed. Earlier iterations of malloc used a single lock which serialized access to the global memory pool. Although recent malloc implementations provide support for multi-threaded scalability, there are now several competing memory allocators that aim to reduce multi-threaded contention, and memory consumption overhead. We evaluate the following allocators: ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc, mcmalloc, and supermalloc.

3.1.1 ptmalloc

ptmalloc (pthreads malloc) is the memory allocator used in the GNU C Library [72] (glibc), which is the standard C library in most Linux distributions. It is based on dlmalloc [44] (Doug Lea's Malloc). This allocator aims to attain a balance between speed, portability, and space-efficiency. ptmalloc supports multi-threaded applications by employing multiple mutexes to synchronize and protect access to its data structures. The downside of this approach is the possibility of lock contention on the mutexes. In order to mitigate this issue, ptmalloc creates additional regions of memory (arenas) for allocation tasks, whenever contention is detected. A key limitation of ptmalloc's arena allocation is that memory can never move between arenas. As of glibc version 2.26, which was released in 2017, ptmalloc employs a per-thread cache for small allocations. This helps to reduce lock contention by skipping access to the memory arenas when possible. Due to differences in how the machines are configured, we evaluate the versions of ptmalloc that shipped with releases 2.27, 2.26, and 2.24 of the glibc library.

3.1.2 jemalloc

jemalloc [18] first appeared as a new SMP-aware memory allocator for the FreeBSD operating system, designed by Jason Evans. It was later expanded and adapted for other applications as a general purpose memory allocator. When a thread requests memory from jemalloc for the first time, it is assigned a memory allocation arena. For multi-threaded applications, jemalloc will assign threads to different arenas in a round-robin fashion. In order to further improve performance, this allocator also uses thread-specific caches, which allows some allocation operations to completely avoid arena synchronization. jemalloc divides allocations into three size categories: small (up to 14KB), large (16-3584KB), and huge (4MB+). Lock-free radix trees track allocations across all arenas. jemalloc attempts to reduce memory fragmentation by packing allocations into contiguous blocks of memory, and by re-using the first available low address. This approach improves cache locality, but also entails a risk of false sharing, which can hinder performance and must be mitigated by application developers. jemalloc provides a solution for this issue by allowing developers to specify cache alignment when allocating memory. To better support NUMA systems, jemalloc maintains allocation arenas on a per-CPU basis and associates threads with their parent CPU's arena. We use jemalloc version 5.1.0 for our experiments.

3.1.3 tcmalloc

The tcmalloc [23] allocator was developed by Google, and is included as part of the gperftools library. Its goal is to provide faster memory allocations in memory-intensive multi-threaded applications. tcmalloc divides allocations into two categories: large allocations and small allocations. Large allocations use a central heap that is organized into contiguous groups of pages called "spans". Each span is designed to fit multiple allocations (regions) of a particular size class. Since all the regions in a span are of the same size, only one metadata header is maintained for each span. However, allocations from a size class cannot be allocated inside spans for other classes. As a result, applications that use many different classes may waste memory due to inefficient utilization of the memory spans. The central heap uses fine-grained locking on a per-class basis. As a result, two threads requesting memory from the central heap can do so concurrently, as long as their requests fall in different class categories. Small allocations are served by private thread-local caches and do not require any locking. We use the version of tcmalloc included in gperftools 2.7.

3.1.4 Hoard

Hoard [5] is a standalone cross-platform allocator replacement designed specifically for multi-threaded applications. Hoard's main design goals are to provide memory efficiency, reduce allocation contention, and prevent false sharing. At its core, Hoard consists of a global heap (the "hoard") that is protected by a lock and accessible by all threads, as well as per-thread heaps that are mapped to each thread using a hash function. The allocator counts the number of times that a thread has acquired the global heap lock in order to decide if contention is occurring. Hoard also employs heuristics to detect temporal locality, and uses this information to fill cache lines with objects that were allocated by the same thread, thus avoiding false sharing. Recent updates to Hoard have increased the size of the per-thread heap, and reduced the heap layer overhead. We evaluate Hoard version 3.13 in our experiments.
[Figure 2: Memory Allocator Microbenchmark - Machine A. (a) Multi-threaded Scalability: Time (s) vs. Number of threads; (b) Memory Consumption Overhead (used/requested) vs. Number of Threads]

3.1.5 tbbmalloc

The tbbmalloc [40] allocator is included as part of the Intel Thread Building Blocks (TBB) library [38]. It is based on some of the concepts and ideas outlined in their prior work on McRT-Malloc [31]. This allocator pursues better performance and scalability for multi-threaded applications, and generally considers increased memory consumption as an acceptable tradeoff. In response to the issues with memory footprint, TBB 4.2 update 1 (released in 2014) allowed developers to set a soft limit on the allocator's memory consumption. Reaching this limit triggers the allocator's internal buffers to free their memory. Allocations in tbbmalloc are supported by per-thread memory pools. If the allocating thread is the owner of the target memory pool, no locking is required. If the target pool belongs to a different thread then the request is placed in a synchronized linked list, and the owner of the pool will allocate the object. We used version 2019 Update 4 of the TBB library for our experiments.

3.1.6 supermalloc

supermalloc [41] is a malloc replacement that synchronizes concurrent memory allocation requests using hardware transactional memory (HTM) if available, and falls back to pthread mutexes if HTM is not available. It prefetches all necessary data while waiting to acquire a lock in order to minimize the amount of time spent in the critical section. supermalloc uses homogeneous chunks of objects for allocations smaller than 1MB, and supports larger objects using operating system primitives. In order to reduce conflicts between different class sizes, each class is a prime multiple of the cache line size. Given a pointer to an object, its corresponding chunk is tracked using a look up table. The chunk table is implemented as a large 512MB array, but the allocator takes advantage of the fact that most of its virtual memory will not be committed to physical memory by the operating system. For our experiments, we use the latest publicly released source code, which was last updated in October 2017.

3.1.7 mcmalloc

mcmalloc [74] focuses on mitigating multi-threaded lock contention by reducing calls to kernel space, dynamically adjusting the memory pool structures, and using fine-grained locking. Similar to other allocators, it uses a global and local (per-thread) memory pool layout. mcmalloc monitors allocation requests, and dynamically splits its global memory pool into two categories: frequently used memory chunk sizes, and infrequently used memory chunk sizes. Dedicated homogeneous memory pools are created to support frequently used chunk sizes. Infrequent memory chunk sizes are handled using size-segregated memory pools. mcmalloc reduces system calls by batching multiple chunk allocations together, and by not returning memory to the OS when free is called. We use the latest mcmalloc source code, which was updated in March 2018.

3.1.8 Memory Allocator Microbenchmark

We now describe a multi-threaded microbenchmark that we use to gain insight on the relative performance of these memory allocators. The goal of the microbenchmark is to answer the question: how well do these allocators scale up on a NUMA machine? This experiment simulates a memory-intensive workload with multiple threads utilizing the allocator at the same time. Each thread completes 100 million memory operations, consisting of allocating memory and writing to it, or reading an existing item and then deallocating it. The distribution of allocation sizes is inversely proportional to the size class (smaller allocations are more frequent). We use two metrics to compare the allocators: execution time, and memory allocation overhead. The execution time gives an idea of how fast an allocator is, as well as its efficiency when being used in a NUMA system by concurrent threads. In Figure 2a, we vary the number of threads in order to see how each allocator behaves under contention. The results show that tcmalloc provides the fastest single-threaded performance, but falls behind as the number of threads is increased. Hoard and tbbmalloc show good scalability, and outperform the other allocators by a considerable margin. In Figure 2b, we show each allocator's overhead. This is calculated by measuring the amount of memory allocated by the operating system (as maximum resident set size), and dividing it by the amount of memory that was requested by the microbenchmark. This experiment shows considerably higher memory overhead for mcmalloc as the number of threads increases. Hoard and tbbmalloc are slightly more memory hungry than the other allocators. Based on these results, we omit supermalloc and mcmalloc from subsequent experiments, due to their poor performance in terms of scalability and memory overhead respectively.
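The benchmark code itself is not reproduced here; the sketch below is our minimal reconstruction of such a stress loop, with the operation count, size classes, and allocate/free mix chosen only for illustration. Because the loop calls plain malloc/free, the allocator under test can be swapped in from outside the program, for example by preloading the allocator's shared library on Linux.

    // Minimal allocator stress loop in the spirit of Section 3.1.8 (illustrative only).
    // The allocator is swapped in externally (e.g. by preloading its shared library),
    // so the benchmark itself only calls malloc/free.
    #include <cstdlib>
    #include <cstring>
    #include <random>
    #include <thread>
    #include <vector>

    static void worker(std::size_t ops) {
        std::mt19937_64 rng(std::random_device{}());
        std::discrete_distribution<int> size_class({60, 25, 10, 5});   // skewed toward small sizes
        const std::size_t sizes[] = {64, 512, 4096, 65536};
        std::vector<void*> live;

        for (std::size_t i = 0; i < ops; ++i) {
            if (live.empty() || (rng() & 1)) {                 // allocate and write to it
                std::size_t sz = sizes[size_class(rng)];
                void* p = std::malloc(sz);
                if (!p) continue;
                std::memset(p, 0xAB, sz);
                live.push_back(p);
            } else {                                           // read an existing item, then free it
                void* p = live.back();
                live.pop_back();
                volatile char first_byte = *static_cast<char*>(p);
                (void)first_byte;
                std::free(p);
            }
        }
        for (void* p : live) std::free(p);                     // release whatever is still live
    }

    int main() {
        unsigned n = std::thread::hardware_concurrency();      // one worker per hardware thread
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < n; ++i) threads.emplace_back(worker, 1000000);
        for (auto& t : threads) t.join();
        return 0;
    }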
3.2 Thread Placement and Scheduling

Defining an efficient thread placement strategy is a well-known and essential step toward obtaining better performance on NUMA systems. By default, the kernel thread scheduler is free to migrate threads created by the program between all available processors. The reasons for doing so include power efficiency and balancing the heat output of different processors. This behavior is not ideal for large data analytics applications, and may result in significantly reduced query throughput. The thread migrations slow down the program due to cache invalidation, as well as a likelihood of moving threads away from their local data. The combination of cache invalidation, loss of locality, and non-deterministic behavior of the OS scheduler can result in wild performance fluctuations (as depicted in Figure 3). Binding threads to processor cores can solve this issue by preventing the OS from migrating threads. However, deciding how to place the threads requires careful consideration of the topology, as well as the software environment.

A thread placement strategy details the manner in which threads are assigned to processors. We explore two strategies for assigning thread affinity: Dense and Sparse. A Dense thread placement involves packing threads in as few processors as possible. The idea behind this approach is to minimize remote access distance and maximize resource sharing. In contrast, the Sparse strategy attempts to maximize memory bandwidth utilization by spreading the threads out among the processors. There are a variety of ways to implement and manage thread placement, depending on the level of access to the source code and the library used to provide multithreading. Applications built on OpenMP can use the OMP_PROC_BIND and OMP_PLACES environment variables in order to fine-tune thread placement at runtime. If none of the above options are feasible, the numactl tool can be used to bind the application process to a specific set of processors, but does not prevent migrations within the set.
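For applications that manage threads directly, one way to realize these two strategies is to pin each thread with pthread_setaffinity_np. The sketch below is ours, and the mapping from thread index to core id assumes that core ids are grouped by NUMA node (node and core counts mirror Machine A), which is not true on every machine. OpenMP programs can get a comparable effect with OMP_PROC_BIND=close (dense) or OMP_PROC_BIND=spread (sparse).

    // Sketch: explicit Dense vs Sparse thread pinning on Linux (compile with -pthread).
    // Assumes core ids are grouped by NUMA node (cores 0..k-1 on node 0, and so on);
    // real ids are machine specific and can be read from libnuma or /sys.
    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    static void pin_to_core(std::thread& t, int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
    }

    int main() {
        const int num_nodes = 8, cores_per_node = 2;   // layout similar to Machine A
        const int num_threads = 8;
        const bool sparse = true;                      // spread across nodes vs pack densely

        std::vector<std::thread> threads;
        for (int i = 0; i < num_threads; ++i) {
            threads.emplace_back([] { /* memory-intensive work goes here */ });
            int core = sparse
                ? (i % num_nodes) * cores_per_node + (i / num_nodes)   // Sparse: one core per node first
                : i;                                                   // Dense: fill a node before the next
            pin_to_core(threads.back(), core);
        }
        for (auto& t : threads) t.join();
        return 0;
    }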
[Figure 3: Multiple runs of the holistic aggregation workload (W1) - affinitized threads versus default operating system scheduling - Machine A (16 threads)]

[Figure 4: Comparison of two thread affinitization strategies (Dense vs Sparse) - Holistic Aggregation Workload (W1) - Machine A. Execution Time (Billion CPU Cycles) per dataset (Moving Cluster, Sequential, Zipf) and number of threads (2-16)]

Table 2: Profiling holistic aggregation workload (W1) - Machine A (16 threads) - Impact of thread affinity - Default (managed by operating system) vs Modified (Sparse thread placement)

  Metric                  Default   Modified   Diff
  Thread Migrations       33196     16         -99.95%
  Cache Misses            1450M     972M       -32.95%
  Local Memory Access     367M      374M       +2.06%
  Remote Memory Access    159M      108M       -31.95%
  Local Access Ratio      0.70      0.78       +10.77%

To demonstrate the impact of affinitization, we evaluate workload W1 from Table 1, using Machine A shown in Figure 1. The workload involves building a hash table with key-value pairs taken from a moving cluster distribution. Figure 3 depicts 10 consecutive runs of this workload. The runtime of the default configuration (no affinity) is expressed in relation to the affinitized configuration. The results highlight the inconsistency of the operating system's default behavior. In the best case, the affinitized configuration is several orders of magnitude faster, and the worst case runtime is still around 27% faster. In order to gain a better understanding of how each configuration affects the workload, we use the perf tool to measure several key metrics. The results, depicted in Table 2, show that the operating system is migrating threads many times during the course of a workload. The sparse affinity configuration prevents migration-induced cache invalidation, which in turn reduces cache misses. Furthermore, the stabilized thread placement increases the ratio of memory accesses that are satisfied by local memory, resulting in more bandwidth.

In Figure 4 we evaluate the sparse and dense thread affinity strategies on workload W1, and vary the number of threads. We also vary the dataset (see Section 4.2) in order to ensure that the distribution of the data records is not the defining factor. The goal of this experiment is to determine whether threads benefit more from being packed on the same NUMA node or from utilizing a greater number of the system's memory controllers. The sparse policy achieves better performance when the workload is not using all available hardware threads. This is due to the threads having access to additional memory bandwidth, which plays a major role in memory-intensive workloads. When all hardware threads are occupied, the two policies perform almost identically. Henceforth, we use the sparse configuration (when applicable) for all our experiments.

3.3 Memory Placement Policies

Memory pages are not always accessed from the same threads that allocated them. Memory placement policies are used to control the location of memory pages in relation to the NUMA topology. As a general rule of thumb, data should be on the same node as the thread that processes it, and sharing should be minimized. However, too much consolidation can lead to congestion of the interconnects, and contention on the memory controllers. The numactl tool applies a memory placement policy to a process, which is then inherited by all its children (threads). We evaluate the following policies: First Touch, Interleave, Localalloc, and Preferred. We also use hardware counters to measure the ratio of local to total (local+remote) memory accesses.

Modern Linux systems employ a memory placement policy called First Touch. In First Touch, each memory page is allocated to the first node that performs a read or write operation on it. If the selected node does not have sufficient free memory, an adjacent node is used. This is the most popular memory placement policy, and represents the default configuration for most Linux distributions.
                                                                                                                                                6
places memory pages on all NUMA nodes in a round-robin fashion. In some prior works, memory interleaving was used to spread a shared hash table across all available NUMA nodes [2, 43, 45]. In Localalloc, the memory pages are placed on the same NUMA node as the thread performing the allocation. The Preferred policy places all newly allocated memory pages on a chosen node x. It will use other nodes for allocation only when node x has run out of free space and cannot fulfill the allocation.
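   These policies can be applied process-wide with numactl (e.g., numactl --interleave=all ./app), or an application can request equivalent placements itself through libnuma. The following is a minimal libnuma sketch (illustrative only, not our experiment code; link with -lnuma):

    // Illustrative libnuma equivalents of the placement policies described above.
    // Build with: g++ -O3 placement.cpp -lnuma
    #include <numa.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "libnuma: NUMA is not available on this system\n");
            return 1;
        }
        const size_t size = 1ull << 30;            // 1 GiB working area

        // Interleave: stripe the pages of this allocation across all nodes.
        void* interleaved = numa_alloc_interleaved(size);

        // Explicit placement: put the pages of this allocation on node 0.
        void* on_node0 = numa_alloc_onnode(size, 0);

        // Preferred: subsequent allocations go to node 0 unless it is full.
        numa_set_preferred(0);

        // Localalloc: subsequent allocations go to the node of the calling thread.
        numa_set_localalloc();

        numa_free(interleaved, size);
        numa_free(on_node0, size);
        return 0;
    }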
3.4     Operating System Configuration
   In this section, we outline two key operating system mechanisms that affect NUMA applications: the virtual memory page size (Transparent Hugepages), and load balancing schedulers (AutoNUMA). These mechanisms are enabled out-of-the-box on most Linux distributions.

3.4.1    Virtual Memory Page Size
   Operating system memory management works at the virtual page level. Pages represent chunks of memory, and their size determines the granularity at which memory is tracked and managed. Most Linux systems use a default memory page size of 4KB in order to minimize wasted space. The CPU's TLB caches can only hold a limited number of page entries. When the page size is larger, each TLB entry spans a greater memory area. Although the TLB capacity is even smaller for large entries, the total volume of cached memory space is increased. As a result, larger page sizes may reduce the occurrence of TLB misses. Transparent Hugepages (THP) is an abstraction layer that automates the process of creating large memory pages from smaller pages. Some prior works have found that larger memory pages can improve query runtimes by reducing TLB misses [45, 66]. These findings are not universal, however, as several product documentations recommend disabling THP, including the Red Hat Performance Tuning Guide [63], Oracle [55], Redis [64], and MongoDB [32]. Other database systems, such as VoltDB [70], will refuse to start until THP has been disabled. Reasons cited include incompatibilities with the existing memory management framework, increased memory consumption, and additional swapping latency. The hardware architecture also plays an important role, as the size of the TLB cache varies between CPU architectures. On Linux machines, control over the page size is provided by the Transparent Hugepages (THP) mechanism. We evaluate the effect of using 4KB (default) and 2MB memory pages.
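   THP can also be steered per memory range from within the application: with the system-wide THP mode set to madvise (or always), a program can opt individual regions in or out of huge pages. The sketch below is illustrative only and assumes a Linux kernel with THP support:

    // Opting a single region in or out of Transparent Hugepages with madvise.
    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const size_t size = 1ull << 30;   // 1 GiB, a multiple of the 2MB huge page size
        void* buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { std::perror("mmap"); return 1; }

        // Ask the kernel to back this range with 2MB pages where possible ...
        madvise(buf, size, MADV_HUGEPAGE);
        // ... or keep it on 4KB pages (useful when an allocator mishandles THP):
        // madvise(buf, size, MADV_NOHUGEPAGE);

        // Touch the region so pages are actually faulted in.
        for (size_t i = 0; i < size; i += 4096) static_cast<char*>(buf)[i] = 1;

        munmap(buf, size);
        return 0;
    }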
3.4.2    Automatic NUMA Load Balancing
   There have been several projects to develop NUMA-aware schedulers that facilitate automatic load balancing. Among these projects, Dino and AsymSched do not provide any source code, and Numad is designed for multi-process load balancing. Carrefour [15] provides public source code, but requires an AMD CPU based on the K10 architecture (with instruction-based sampling), as well as a modified operating system kernel. Consequently, we opted to evaluate the AutoNUMA scheduler, which is open-source and supports all hardware architectures. AutoNUMA was initially developed by Red Hat and later merged into the Linux kernel. It attempts to maximize data and thread co-location by migrating memory pages and threads. AutoNUMA has two key limitations: 1) workloads that utilize data sharing can be mishandled, as memory pages may be continuously and unnecessarily migrated between nodes; 2) it does not factor in the cost of migration or contention, and thus aims to improve locality at any cost. AutoNUMA has received continuous updates, and is considered to be one of the most well-rounded kernel-based NUMA schedulers. We use the numa_balancing kernel parameter to enable or disable this NUMA scheduler.
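   Both switches can be toggled at runtime: the numa_balancing kernel parameter lives under /proc/sys, and the THP mode under /sys. The sketch below shows one way to read and set them (illustrative only; writing requires root privileges, and the paths assume a reasonably recent Linux kernel):

    // Reading and toggling the AutoNUMA and THP switches evaluated in this paper.
    #include <fstream>
    #include <iostream>
    #include <string>

    static std::string read_knob(const std::string& path) {
        std::ifstream in(path);
        std::string value;
        std::getline(in, value);
        return value;
    }

    static bool write_knob(const std::string& path, const std::string& value) {
        std::ofstream out(path);
        return static_cast<bool>(out << value);
    }

    int main() {
        const std::string autonuma = "/proc/sys/kernel/numa_balancing";
        const std::string thp = "/sys/kernel/mm/transparent_hugepage/enabled";

        std::cout << "numa_balancing: " << read_knob(autonuma) << "\n"
                  << "THP mode:       " << read_knob(thp) << "\n";

        // Disable AutoNUMA load balancing and THP (e.g., before a benchmark run).
        if (!write_knob(autonuma, "0") || !write_knob(thp, "never"))
            std::cerr << "failed to update a knob (missing permissions?)\n";
        return 0;
    }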
Table 3: Machine Specifications

  System                   Machine A         Machine B         Machine C
  CPUs / Model             8×Opteron 8220    4×Xeon E7520      4×Xeon E7-4850 v4
  CPU Frequency            2.8GHz            2.1GHz            2.1GHz
  Architecture             AMD Santa Rosa    Intel Nehalem     Intel Broadwell
  Physical/Logical Cores   16/16             16/32             32/64
  Last Level Cache         2MB               18MB              40MB
  4KB TLB Capacity         L1: 32×4KB        L1: 64×4KB        L1: 64×4KB
                           L2: 512×4KB       L2: 512×4KB       L2: 1536×4KB
  2MB TLB Capacity         L1: 8×2MB         L1: 32×2MB        L1: 32×2MB
                                                               L2: 1536×2MB
  NUMA Nodes               8                 4                 4
  NUMA Topology            Twisted Ladder    Fully Connected   Fully Connected
  Relative NUMA Node       Local: 1.0        Local: 1.0        Local: 1.0
  Memory Latency           1 hop: 1.2        1 hop: 1.1        1 hop: 2.1
                           2 hop: 1.4
                           3 hop: 1.6
  Interconnect Bandwidth   2GT/s             4.8GT/s           8GT/s
  Memory Capacity          16GB/node         16GB/node         768GB/node
                           128GB Total       64GB Total        3TB Total
  Memory Clock             800MHz            1600MHz           2400MHz
  Operating System         Ubuntu 16.04      Ubuntu 18.04      CentOS 7.5
  Linux Kernel             4.4 x86_64        4.15 x86_64       3.10 x86_64
  C++ library (glibc)      2.26              2.27              2.24

4.     EVALUATION
   In this section, we describe our setup, and evaluate the effectiveness of our techniques. In Section 4.1 we outline the specifications of our machines, as well as the software configuration. We begin by analyzing the impact of the operating system configuration in Section 4.3. In Section 4.5 we evaluate these techniques on database engines running TPC-H queries. We explore the effects of overriding the default system memory allocator in Section 4.4. Finally, we summarize our findings in Section 4.6.
Table 4: Experiment Parameters (bolded values are used as defaults)

  Parameter                  Values
  Experiment Workload        W1) Holistic Aggregation [51]
                             W2) Distributive Aggregation [51]
                             W3) Hash Join [8]
                             W4) Index Nested Loop Join [46]
                             W5) TPC-H Query [14]
  Thread Placement Policy    None (operating system is free to migrate threads),
                             Sparse, Dense
  Memory Placement Policy    First Touch, Interleaved, Localalloc, Preferred (node x)
  Memory Allocator           ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc
  Dataset Distribution       Moving Cluster, Sequential, Zipf, TPC-H Dataset
  Operating System           AutoNUMA on/off,
  Configuration              Transparent Hugepages (THP) on/off
  Hardware System            Machine A, Machine B, Machine C

4.1    Experimental Setup
   We run our experiments on three machines based on completely different architectures. This is done to ensure that our findings are not biased by a particular system's characteristics. The NUMA topologies of these machines are depicted in Figure 1, and their specifications are outlined in Table 3. We used LIKWID [30] to measure each system's relative memory access latencies, and the remainder of the specifications were obtained from product pages, spec sheets, and Linux system queries. We now outline some of the key hardware specifications of each machine. Machine A is an eight-socket AMD-based server with a total of 128GB of memory. As the only machine with eight NUMA nodes, Machine A provides us with an opportunity to study NUMA effects on a larger scale. The twisted ladder topology shown in Figure 1a is designed to minimize inter-node latency with three HyperTransport interconnect links per node. As a result, Machine A has three categories of memory access latencies, depending on the number of hops required to get from the origin to the destination of the memory access. Each node contains an AMD Opteron 8220 CPU running at 2.8GHz and 16GB of memory. Each of the Opteron 8220's cores features a 128KB L1 cache and a 2MB L2 cache. Machine B is a quad-socket Intel server with four NUMA nodes and a total memory capacity of 64GB. The NUMA nodes are fully connected, and each node consists of an Intel Xeon E7520 CPU running at 1.87GHz, and 16GB of memory. Each core in the Xeon E7520 features a 256KB L1 and 1MB L2 cache, and an 18MB L3 cache that is shared between all cores. Lastly, Machine C contains four sockets populated with Intel Xeon E7-4850 v4 processors. Each processor constitutes a NUMA node with 768GB of memory, providing a total system memory capacity of 3TB. The NUMA nodes of this machine are fully connected. Each processor is equipped with 40MB of L3 cache that is shared between all cores, and each core features 256KB of L2 cache and 64KB of L1 cache.
   The code for all our experiments is written in C++ and compiled using GCC 7.3.0 with the -O3 and -march=native flags. Likewise, all dynamic memory allocators are synchronized to the same versions, and compiled from source on each machine. Machines B and C are owned and maintained by external parties, and are based on different Linux distributions. Unless otherwise noted, all experiments are configured to utilize all available hardware threads.

4.2    Datasets and Implementation Details
   In this section, we outline the datasets and codebases used for the experiments. We use well-known synthetic datasets outlined in prior work as the basis for all of our experiments [12, 8, 14]. Unless otherwise noted, all workloads operate on datasets that are stored in memory-resident data structures, and any impact from disk I/O is not measured in our results.
   The aggregation workloads (W1 and W2) evaluate a typical hash-based aggregation query, based on a state-of-the-art concurrent hash table [48], which is implemented as a shared global hash table [51]. The datasets used for the aggregation workloads are based on three different data distributions: Moving Cluster, Sequential, and Zipfian. In the Moving Cluster dataset, the keys are chosen from a window that gradually slides. The Moving Cluster dataset provides a gradual shift in data locality that is similar to workloads encountered in streaming or spatial applications. In the Sequential dataset, we generate a series of segments that contain multiple number sequences. The number of segments is equal to the group-by cardinality, and the number of records in each segment is equal to the dataset size divided by the cardinality. This dataset mimics transactional data where the key incrementally increases. In the Zipfian dataset, the distribution of the keys is skewed using Zipf's law [57]. We first generate a Zipfian sequence with the desired cardinality c and Zipf exponent e = 0.5. Then we take n random samples from this sequence to build n records. The Zipfian distribution is used to model many big data phenomena, such as word frequency, website traffic, and city population. For all aggregation datasets, the number of records is 100 million, and the group-by cardinality is one million.
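   The sketch below illustrates one way to generate keys with these three distributions. It is not our generator; in particular, the Moving Cluster window width w and the exact segment layout of the Sequential dataset are simplifying assumptions.

    // Illustrative key generators for the three synthetic distributions above.
    // n = number of records, c = group-by cardinality (W1/W2 use n = 100M, c = 1M).
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Sequential: c segments of n/c records each, with the key increasing by one
    // from segment to segment (one plausible reading of the description above).
    std::vector<uint64_t> make_sequential(size_t n, size_t c) {
        std::vector<uint64_t> keys(n);
        const size_t per_segment = n / c;
        for (size_t i = 0; i < n; ++i) keys[i] = i / per_segment;
        return keys;
    }

    // Moving Cluster: each key is drawn from a window of width w that slides
    // gradually from 0 to c - w over the course of the dataset.
    std::vector<uint64_t> make_moving_cluster(size_t n, size_t c, size_t w,
                                              std::mt19937_64& rng) {
        std::vector<uint64_t> keys(n);
        for (size_t i = 0; i < n; ++i) {
            const size_t start = (i * (c - w)) / n;   // window start slides with i
            keys[i] = start + rng() % w;
        }
        return keys;
    }

    // Zipfian: build a Zipf CDF of cardinality c with exponent e (0.5 here),
    // then draw n samples from it.
    std::vector<uint64_t> make_zipf(size_t n, size_t c, double e,
                                    std::mt19937_64& rng) {
        std::vector<double> cdf(c);
        double sum = 0.0;
        for (size_t k = 1; k <= c; ++k) {
            sum += 1.0 / std::pow(static_cast<double>(k), e);
            cdf[k - 1] = sum;
        }
        std::uniform_real_distribution<double> uni(0.0, sum);
        std::vector<uint64_t> keys(n);
        for (size_t i = 0; i < n; ++i) {
            const double u = uni(rng);
            keys[i] = std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin();
        }
        return keys;
    }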
   The join workloads (W3 and W4) evaluate a typical join query involving two tables. W3 is a non-partitioning hash join, using the code and dataset from [8]. The dataset contains two tables sized at 16 million and 256 million tuples, and is designed to simulate a decision support system. W4 is an index nested loop join, and uses the same dataset as W3. We evaluated several in-memory indexes for this workload: ART [46], which is based on the concept of a Radix tree and is used in the HyPer [36] database; MassTree [49], a key-value store whose indexing technique is a hybrid of a B+Tree and a trie; and an in-memory Skip List implementation [61, 77].
   We evaluate a TPC-H workload (W5) on the MonetDB [53] (version 11.33.3) and PostgreSQL [73] (version 11.4) databases. MonetDB is an open-source columnar store that uses memory-mapped files with demand paging and multiple worker threads for its query processing. PostgreSQL is an open-source row store that uses a volcano-style query processing model. We configured PostgreSQL with a 42GB buffer pool. This workload uses version 2.18 of the industry-standard TPC-H dataset specifications. The dataset is designed to mimic a decision support system with eight tables, and is paired with a set of queries which answer
[Figure 5 consists of four charts: (a) AutoNUMA effect on execution time - Machine A; (b) AutoNUMA effect on Local Access Ratio - Machine A; (c) Impact of THP on memory allocators - Machine A; (d) Combined effect of AutoNUMA and THP on different memory placement policies - variable machine. Y-axes: Execution Time (Billion CPU Cycles) and Local Access Ratio.]
Figure 5: Impact of operating system configuration (AutoNUMA and THP) on memory placement policies and memory allocators - Holistic Aggregation Workload (W1)

typical business questions. Our experiment involves running all 22 queries using a dataset scale factor of 20. We then modify the operating system configuration and run all 22 queries again. Finally, we use Query 5 as the basis for our memory allocator experiment, as it provides a good combination of both joins and aggregation.
   The experimental parameters are shown in Table 4. Unless otherwise noted, we use the maximum number of threads supported by each machine. In the synthetic workloads (W1-W4), we measure workload execution time using the timer from [8]. In the TPC-H workload (W5), we use each database system's built-in query timing feature.
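   For reference, execution times for W1-W4 are reported in CPU cycles; the sketch below shows a minimal cycle counter in the same spirit. It is not the timer from [8], and it assumes an x86 CPU with an invariant TSC.

    // Minimal cycle counter, for illustration only.
    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>

    template <typename F>
    uint64_t measure_cycles(F&& workload) {
        unsigned aux;
        const uint64_t start = __rdtscp(&aux);
        workload();
        return __rdtscp(&aux) - start;
    }

    int main() {
        const uint64_t cycles = measure_cycles([] {
            volatile uint64_t sum = 0;
            for (uint64_t i = 0; i < 100000000ULL; ++i) sum += i;
        });
        std::printf("%.3f billion cycles\n", cycles / 1e9);
        return 0;
    }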
4.3     Operating System Configuration Experiments
   In this section, we evaluate three key operating system mechanisms that affect NUMA behavior: NUMA load balancing (AutoNUMA), Transparent Hugepages (THP), and the system's memory placement policy. To determine whether these variables are affected by other experiment parameters, we also examine the impact of the hardware architecture, and the interaction between THP and memory allocators.

4.3.1    AutoNUMA Load Balancing Experiments
   In Figures 5a and 5b, we evaluate W1 and toggle the state of AutoNUMA load balancing between On (the system default) and Off. The results in Figure 5a show that AutoNUMA worsens the runtime for the First Touch, Interleave, and Localalloc memory placement policies. Only the Preferred (node 0) memory placement policy shows an improvement in runtime with AutoNUMA enabled. The Preferred (node 0) policy tries to allocate memory from NUMA node 0, which is why it benefits the most from AutoNUMA load balancing. These results were obtained using W1 on Machine A, but we observed very similar results on the other workloads and machines: AutoNUMA had a significantly detrimental effect on runtime. The best overall approach is to use memory interleaving and disable AutoNUMA. The Local Access Ratio (LAR) shown in Figure 5b is the ratio of memory accesses that were satisfied from local memory [15] to all memory accesses. For example, on our eight-node machine, we expect interleaving to result in an average LAR of 100/8 = 12.5%, which is close to our measurement of 17%. AutoNUMA's main goal is to improve the LAR. These results highlight the value of modifying these parameters, as First Touch with load balancing (the system default) is 86% slower than Interleave without load balancing.

4.3.2    Transparent Hugepages Experiments
   Next we evaluate the effect of the Transparent Hugepages (THP) configuration, which automatically merges groups of 4KB memory pages into 2MB memory pages. As shown in Figure 5c, THP's impact on the workload execution time ranges from detrimental in most cases to negligible in others. As THP alters the composition of the operating system's memory pages, support for THP within the memory allocators is the defining factor in whether it is detrimental to performance. tcmalloc, jemalloc, and tbbmalloc currently do not handle THP well. We hope that future versions of these memory allocators will rectify this issue out-of-the-box. Although most Linux distributions enable THP by default, our results indicate that it is generally worthwhile to disable THP for data analytics workloads.

4.3.3    Hardware Architecture Experiments
   Here we show how the performance of data analytics applications running on machines with different hardware architectures is affected by the memory placement strategies. For all machines, the default configuration uses the First Touch memory placement, and both AutoNUMA and THP are enabled. The results depicted in Figure 5d show that Machine A is slower than Machine B when both machines are using the default configuration. However, using the Interleave memory placement policy and disabling the operating system switches allows Machine A to outperform Machine B by up to 15%. Machine A shows the most significant improvement from the operating system and memory placement policy changes, and its workload runtime is reduced by up to 46%. The runtime for Machine C is reduced by up to 21%. The performance improvement on Machine B is around 7%, which is fairly modest compared to the other machines. Although Machines B and C have a similar inter-socket topology, the relative local and remote memory access latencies are much closer in Machine B (see Table 3). This, along with other hardware differences, plays a significant role in the benefit gained from altering the memory placement policy. Henceforth, we run our experiments with AutoNUMA and THP disabled, unless otherwise noted.
[Figure 6 consists of eight charts comparing execution time (billions of CPU cycles) across memory allocators: (a) W1 - Machine A; (b) W1 - Machine B; (c) W1 - Machine C; (d) W1 - Machine A - effect of dataset distribution; (e) W3 - Machine A; (f) W3 - Machine B; (g) W3 - Machine C; (h) W2 - Machine A.]
Figure 6: Comparison of memory allocators - variable memory placement policy

[Figure 7 consists of two charts: (a) W4 - index data structure comparison - build and join times; (b) W4 with ART index - impact of memory allocators.]
Figure 7: Index nested loop join experiments - Machine A

4.4     Memory Allocator Experiments
   In Section 3.1.8, we used a memory allocator microbenchmark to show that there are significant differences in both multi-threaded scalability and memory consumption overhead. In this section, we explore the performance impact of overriding the system default memory allocator, using four in-memory data analytics workloads. These experiments aim to reveal the relationship between workload, hardware architecture, and memory allocator.
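   Swapping allocators does not require changing the workload code; a standard mechanism (one reasonable setup, not necessarily the exact one used here) is to preload the allocator's shared library at launch. The sketch below is a small multi-threaded allocation loop that can be run unchanged under each allocator:

    // Small multi-threaded allocation stress loop (illustrative only; this is not
    // the microbenchmark from Section 3.1.8). Run it unchanged under different
    // allocators by preloading them, e.g.:
    //   LD_PRELOAD=/path/to/libjemalloc.so ./alloc_bench
    //   LD_PRELOAD=/path/to/libtcmalloc.so ./alloc_bench
    #include <cstddef>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    constexpr int kIterations = 1000000;
    constexpr std::size_t kBatch = 1024;

    int main() {
        const unsigned threads = std::thread::hardware_concurrency();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threads; ++t) {
            pool.emplace_back([] {
                std::vector<void*> live;
                live.reserve(kBatch);
                for (int i = 0; i < kIterations; ++i) {
                    live.push_back(std::malloc(64 + (i % 1024)));  // mixed small sizes
                    if (live.size() == kBatch) {                   // free in batches
                        for (void* p : live) std::free(p);
                        live.clear();
                    }
                }
                for (void* p : live) std::free(p);
            });
        }
        for (auto& t : pool) t.join();
        return 0;
    }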
4.4.1    Hashtable-based Experimental Workloads
   In Figure 6, we show our results for the holistic aggregation (W1), distributive aggregation (W2), and hash join (W3) workloads, running on each of our three machines. In addition to the memory allocators, we vary the memory placement policies for each workload. The results show significant runtime reductions on all three machines, particularly when using tbbmalloc in conjunction with the Interleave memory placement policy. The holistic aggregation workload (W1), shown in Figures 6a to 6c, makes extensive use of memory allocation during its runtime to store the tuples for each group and calculate their aggregate values. Utilizing tbbmalloc reduced the runtime of W1 by up to 62% on Machine A, 83% on Machine B, and 72% on Machine C, compared to the default allocator (ptmalloc). The results for the join query (W3), depicted in Figures 6e to 6g, also show significant improvements, with tbbmalloc reducing workload execution time by 70% on Machine A, 94% on Machine B, and 92% on Machine C. The distributive aggregation query (W2), shown in Figure 6h, does not gain much of a benefit, as it calculates a running count using a hash table and is therefore comparatively light on memory allocation.

4.4.2    Impact of Dataset Distribution
   The performance of query workloads and memory allocators can be sensitive to the access patterns induced by the dataset distribution. The datasets are the same size, and their key differentiating factor is the way their records are distributed (see Section 4.2 for more information). In our previous figures, we used the Heavy Hitter dataset as the default dataset for W1. In Figure 6d, we vary the dataset to see if overriding the default memory allocator is still beneficial. With the exception of the Hoard allocator, all of the alternative memory allocators improve W1's runtime on the
[Figure: query latency improvement (%) for each TPC-H query on MonetDB and PostgreSQL; two bars are marked "Took too long".]

systems are loading data on demand from the disk rather than keeping all the data memory resident. To ensure fair results, we clear the page cache before running the workload and report the average runtime, after disregarding the first (cold) run. In a similar vein to the other experiments, we evaluated the impact of the operating system configuration, memory placement policies, and memory allocators. First we ran all 22 TPC-H queries, and calculated the query latency reduction caused by disabling AutoNUMA and THP, compared to the system default. The results depicted in

                                                                                                                                                                           0.0

                                                                                                                                                                                                            0.0
                                                                                                                                                                                                                                Figure 8 show that MonetDB’s query latencies improved be-

                                                                                                                                                      -1.0

                                                                                                                                                                                                                    -1.0
                                                                                                                                     -1.1
                                                                                                        -1.7                                                                                                                    tween 2% and 43%, with an average improvement of 14.5%.

                                                                                                                                                                                 -3.4
                 -10.%                                                                                                                                                                                                          The results for PostgreSQL are less impressive, with an av-
                                   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
                                                    Query Number
                                                                                                                                                                                                                                erage improvement of 3% and five queries taking longer to
                                                                                                                                                                                                                                complete. We believe this is due to PostGreSQL’s rigid
   Figure 8: Query latency improvement gained from disabling                                                                                                                                                                    multi-process query processing approach. Next we evalu-
   AutoNUMA and THP - all 22 queries - Machine A                                                                                                                                                                                ate the effect of memory allocator overriding on MonetDB.
Figure 9: Effect of memory allocator on TPC-H query latency - MonetDB - Machine A. (Two bar charts: (a) Query 5, (b) Query 18; y-axis: query latency in seconds, x-axis: memory allocator.)
Zipf and Sequential datasets. In particular, jemalloc and tbbmalloc provide the largest benefits.
4.4.3  Effect on In-memory Indexing
  The index used to accelerate the nested loop join workload (W4) plays a major role in determining its efficiency. Although many data structures could be used for indexing, efficient concurrency is less trivial to implement. In Figure 7a, we evaluate the time to build the index and the time to run the join workload for three in-memory indexes: ART [46], Masstree [49], and Skip List [77]. Based on the results, we select ART as the index with the best overall performance. In W4, we are interested in the join time, given a pre-built index. We include the build time as an interesting side note, since ART performs well in this regard as well. In Figure 7b, we show the beneficial effect of overriding the memory allocators for W4 when using the ART index. The reduction in runtime is substantial, particularly with the jemalloc memory allocator, and further performance gains are obtained from memory interleaving.
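Memory interleaving does not require source changes (it can be applied externally, for example with the numactl utility), but it can also be requested programmatically. As an illustration only (this is not the code used in our experiments), the sketch below uses libnuma's numa_alloc_interleaved to back a large buffer with pages spread round-robin across all NUMA nodes; the 1 GiB buffer size is arbitrary.

/* Illustrative sketch: interleaved allocation via libnuma.
 * Build with: gcc interleave_sketch.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available\n");
        return 1;
    }
    size_t size = 1UL << 30;                  /* arbitrary 1 GiB buffer */
    void *buf = numa_alloc_interleaved(size); /* pages spread across all nodes */
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_interleaved failed\n");
        return 1;
    }
    memset(buf, 0, size); /* touch the pages so placement actually happens */
    /* ... build and probe the index inside buf ... */
    numa_free(buf, size);
    return 0;
}

Running an unmodified binary under numactl --interleave=all achieves a similar effect for all of its allocations.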

4.5  Database Engine Experiments
  In this section, we analyze the TPC-H workload (W5) on two database systems. Measuring NUMA-related effects on database systems like MonetDB or PostgreSQL is more difficult compared to synthetic workloads, as the database may keep some of its data on disk rather than keeping all the data memory resident. To ensure fair results, we clear the page cache before running the workload and report the average runtime, after disregarding the first (cold) run. In a similar vein to the other experiments, we evaluated the impact of the operating system configuration, memory placement policies, and memory allocators. First we ran all 22 TPC-H queries and calculated the query latency reduction caused by disabling AutoNUMA and THP, compared to the system default. The results depicted in Figure 8 show that MonetDB's query latencies improved between 2% and 43%, with an average improvement of 14.5%. The results for PostgreSQL are less impressive, with an average improvement of 3% and five queries taking longer to complete. We believe this is due to PostgreSQL's rigid multi-process query processing approach. Next, we evaluate the effect of memory allocator overriding on MonetDB. To do so, we selected queries 5 and 18 due to their usage of both joins and aggregation. The results shown in Figure 9 indicate that tbbmalloc can provide an average query latency reduction of up to 12% for Query 5 and 20% for Query 18, compared to ptmalloc.
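The operating system settings involved here are ordinary Linux procfs/sysfs knobs. The sketch below is a simplified stand-in for our benchmark harness (it requires root privileges): it disables AutoNUMA and THP and drops the page cache before a cold run. The knob paths are the standard ones on recent Linux kernels.

/* Simplified sketch (requires root): disable AutoNUMA and THP, then
 * drop the page cache before a cold run. Error handling is minimal.
 */
#include <stdio.h>
#include <unistd.h>

static void write_knob(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        perror(path);
        return;
    }
    fputs(value, f);
    fclose(f);
}

int main(void) {
    write_knob("/proc/sys/kernel/numa_balancing", "0");                 /* disable AutoNUMA */
    write_knob("/sys/kernel/mm/transparent_hugepage/enabled", "never"); /* disable THP */
    sync();                                                             /* flush dirty pages first */
    write_knob("/proc/sys/vm/drop_caches", "3");                        /* drop page cache, dentries, inodes */
    return 0;
}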

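Overriding the allocator of a database server such as MonetDB does not require recompilation; the standard mechanism is to preload the allocator's shared library, e.g. by setting LD_PRELOAD in the shell before starting the server. The tiny launcher below illustrates the same idea in C; the allocator library path is a placeholder and depends on which allocator is installed.

/* Illustrative launcher: run a command with an alternative allocator
 * preloaded. The allocator path is a placeholder; adjust it to the
 * installed jemalloc/tbbmalloc library.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
        return 1;
    }
    setenv("LD_PRELOAD", "/usr/lib/libjemalloc.so", 1); /* placeholder path */
    execvp(argv[1], &argv[1]);
    perror("execvp"); /* reached only if exec fails */
    return 1;
}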
4.6  Summary
  The strategies outlined in this paper, when carefully applied, can significantly speed up data analytics workloads without the need to modify the application source code. The effectiveness and applicability of these strategies to a workload depend on several factors. Starting with the operating system configuration, we showed that the default settings for AutoNUMA and THP can have a significant detrimental effect on performance. AutoNUMA's overhead has proven to be too costly for multi-threaded data analytics workloads. THP provides no benefit to these workloads because they rely on random rather than contiguous memory access patterns. Furthermore, some memory allocators do not support THP, potentially resulting in dramatic performance drops. Although root access is required to change the AutoNUMA setting, we observed that the Interleave memory policy (which can be used by a regular user) can largely nullify AutoNUMA's negative impact. We noted in our evaluation that the effects of the memory placement policies are less pronounced when AutoNUMA is disabled, with Machine A obtaining the most benefit from interleaved memory placement. Different dynamic memory allocators have targeted different use cases and systems, and our microbenchmark showed significant differences in terms of scalability and efficiency. In our evaluation, we demonstrated that these differences translate into real gains in data analytics workloads. Deciding whether to use an alternative memory allocator depends on the answer to the following question: does my workload frequently involve multiple threads concurrently allocating memory? If the answer is yes, then memory allocators are an avenue worth exploring for the application. We believe the combination of all of these findings can provide guidance to developers and practitioners.

5.  RELATED WORK
  The rising demand for high-performance parallel computing has motivated many works on leveraging NUMA architectures. We now explore some of the works in this context that are relevant to query processing and data analytics.