IT 21 002
Degree project 30 credits
January 2021

A Study of Page-Based Memory
Allocation Policies for the Argo
Distributed Shared Memory System

Ioannis Anevlavis

Department of Information Technology
Abstract

A Study of Page-Based Memory Allocation Policies
for the Argo Distributed Shared Memory System

Ioannis Anevlavis

Software distributed shared memory (DSM) systems have been one of the main areas
of research in the high-performance computing community. One of the many
implementations of such systems is Argo, a page-based, user-space DSM built on top
of MPI.

Researchers have dedicated considerable effort to making Argo easier to use and to
alleviating some of the shortcomings that hurt its performance and scaling. However,
several issues are left to be addressed, one of them concerning the simplistic
distribution of pages across the nodes of a cluster. Since Argo works at page
granularity, the page-based memory allocation, or placement of pages, in a
distributed system is of significant importance to performance, since it determines
the extent of remote memory accesses. To ensure high performance, it is essential to
employ memory allocation policies that allocate data in distributed memory modules
intelligently, thus reducing latencies and increasing memory bandwidth. In this
thesis, we incorporate several page placement policies into Argo and evaluate their
impact on performance with a set of benchmarks ported to that programming model.

Supervisor: Stefanos Kaxiras
Subject reader: Konstantinos Sagonas
Examiner: Philipp Rümmer
IT 21 002
Printed by: Reprocentralen ITC
Contents

1 Introduction
2 Background
  2.1 Argo Distributed Shared Memory
      2.1.1 System Design
      2.1.2 Memory Management
      2.1.3 Signal Handler
  2.2 MPI One-Sided Communication
      2.2.1 Memory Windows
      2.2.2 Basic Operations
      2.2.3 Atomic Operations
      2.2.4 Passive Target Synchronization
      2.2.5 Memory Models
3 Contributions
  3.1 Default Memory Allocation Policy: Drawbacks
      3.1.1 Performance and Scalability
      3.1.2 Ease of Programmability
  3.2 Page-Based Memory Allocation Policies
      3.2.1 Cyclic Group
      3.2.2 First-Touch
  3.3 Implementation Details
      3.3.1 MPI Backend
      3.3.2 Data Distribution
4 Benchmarks
  4.1 Stream Benchmark
  4.2 Himeno Benchmark
  4.3 Matrix Multiplication
  4.4 NAS Parallel Benchmarks
      4.4.1 Fast Fourier Transform
      4.4.2 Conjugate Gradient
  4.5 Bayesian Probabilistic Matrix Factorization
5 Evaluation
  5.1 Performance Characteristics
      5.1.1 Synthetic Benchmarks
      5.1.2 Application Benchmarks
6 Conclusion
Bibliography
A Additional Listings
1     Introduction

Nowadays, business and government organizations create large amounts of both un-
structured and structured information, which needs to be processed, analyzed, and
linked. Applications that address this need can generally be classified as compute-
intensive, data-intensive, or both. The most important reason for developing such
applications in parallel is the potential performance improvement, which can be
obtained by expanding either the memory or the compute capabilities of the device
they run on. Due to these characteristics, the typical hardware infrastructure for
large-scale applications is a group of multicore nodes connected via a high-bandwidth
commodity network, each with its own private memory and disk storage (a.k.a. a
cluster).
    For programming distributed memory multiprocessor systems, such as clusters of
workstations, message passing is usually used. However, message passing systems
require explicit coding of the inter-process communications, which makes parallel
programming difficult. This has led to a diverse ecosystem of programming models
that enable programming at a much larger scale than a single multicore or a single
symmetric multiprocessor (SMP) and that ease development by specializing to algo-
rithmic structure and dynamic behavior; however, applications that do not fit well
into one particular model suffer in performance. Software distributed shared mem-
ory (DSM) systems improve the programmability of message passing machines and
workstation clusters by providing a shared memory abstraction (i.e., a coherent global
address space) to programmers. One of the many implementations of such systems
is Argo [Kax+15], a page-based, user-space DSM built on top of the message passing
interface (MPI). Argo provides a transparent shared address space with scalable
performance on a cluster with fast network interfaces.
    Although this design preserves the abstraction of a single shared memory to the
programmer, it comes at the cost of load-balancing issues and remote accesses. In or-
der to guarantee high performance on these architectures, an efficient data placement
strategy becomes crucial. Due to this need, memory allocation policies for hierarchi-
cal shared memory architectures [IWB02; Rib+09; Ser+12a; Ser+12b] have attracted
considerable research efforts and have shown significant network and memory perfor-
mance benefits when benchmarking a variety of scientific applications.
    In this thesis, we investigate incorporating seven page-based memory allocation
policies into Argo. We begin by giving an overview of the Argo system, in terms of how
its global memory is laid out and managed, as well as an overview of MPI one-sided
communication, which composes a significant part of Argo's backend (Section 2). We
then explain why Argo's default way of managing memory is inefficient, and propose
data placement strategies to address these deficiencies (Section 3). Next, we present
the benchmarks ported to Argo for the sake of this thesis, which are used to evaluate
the impact of the implemented policies on performance (Section 4). Finally, we
present and elaborate on the execution results (Section 5).
    We deploy the policies and verify their correctness on an 8-node RDMA-enabled
cluster. Their performance, however, is evaluated on a larger distributed cluster.

2     Background

2.1     Argo Distributed Shared Memory
Argo [Kax+15] is a page-based, user-space DSM whose prototype implementation is
built on top of MPI. It ensures coherence in the global address space of a distributed
system, thus enabling shared memory programming at a much larger scale than a
single multicore or a single SMP [Kax+15]. Coherence can be accomplished either in
hardware or in software, but since there is no dedicated hardware support at this
scale, interest is focused on software solutions. Among the existing plethora of
software solutions that create a shared virtual address space among all nodes in a
distributed system, Argo introduces three innovative techniques in terms of coherence
and critical-section handling: a novel coherence protocol (Carina) based on passive
classification directories (Pyxis), and a new locking system (Vela) [Kax+15].

2.1.1    System Design
Similar to other DSM systems [Li88; BZS93; RH01; Kel+94], Argo implements shared
memory using the facilities provided by the virtual memory system. It is implemented
entirely in user space and uses MPI as the underlying messaging library as well as for
process setup and tear down, for portability and flexibility reasons.
    Figure 2.1 shows an overview of Argo's DSM system. In the Argo system, each
node contributes an equal share of memory to the globally shared memory of the
system.

                Figure 2.1: Argo Distributed Shared Memory Layout

The size of the shared memory space is user specified at the application level and has
to be large enough to fit the desired workload. For example, if the application code
includes the collective initializer call argo::init(10GB) and four nodes are used, then
every node will contribute 2.5GB of its physical memory in order for the global
address space of the system to be constructed.

2.1.2    Memory Management
Since a memory page is the smallest unit of data to which virtual memory can be
mapped, Argo works at a page granularity of 4KB. The API sets up a shared virtual
address space spanning all nodes using POSIX shared memory; each node first
initializes it by allocating the same range of virtual addresses using the mmap system
call. These addresses are then available for allocation at the page level using Argo's
own allocators.
    Argo being a home-based DSM, each virtual page is assigned a home node. The
term 'home node' refers to the actual node in the distributed system in whose physical
memory the logical page will be mapped. Argo's default memory management scheme
falls into the category of bind memory policies in general, and bind all in particular
[Rib+09]. That is, this policy uses all available (physical) memory contributed to the
global address space by the first node before using the next node's memory.
    Figure 2.2 depicts the bind all memory allocation policy in a cluster machine
using four nodes. The globally allocated application data is composed of M memory
pages, which are divided into four groups (each group is represented by a color). In
that setting, using the bind all policy, the first group of virtual pages (starting from
the left) begins to get mapped to the physical memory of node0, and when that node
runs out of physical page frames, the mapping continues to the next node by id,
which is node1. As mentioned above, the size of the shared memory space should be
sufficient to host all memory pages of the application data.

              Figure 2.2: Argo’s Default Memory Management Scheme

2.1.3    Signal Handler
Even though the default page placement scheme follows a static approach to asso-
ciate virtual memory ranges with nodes, the actual binding is not done at the
initialization phase, but at runtime. The mapping between virtual and physical
memory is taken care of by Argo's signal handler. Argo's signal handler is a user
SIGSEGV signal handler, implemented in the MPI backend of the system, which is
invoked when a memory region is accessed without valid permissions.
    In an application, by the time all operations issued by argo::init have finished
and execution resumes, all virtual memory addresses available for allocation have
no access permissions. Even after encountering Argo's allocation calls, the memory
addresses continue to have no access permissions, since no physical allocation takes
place at the allocation point, much like how memory allocation works on Linux.

From that point onwards, any first access to a memory page from the globally
allocated data structures results in a page fault (considered a read miss by default),
which is passed to the handler function via a SIGSEGV signal. The execution path
of the function is divided into two main branches, and which one is taken depends
on whether the home node of the faulting page is the current process1 executing
the function or not. If the memory page belongs to the node according to the memory
allocation policy, it is mapped to the backing memory of the local machine; otherwise,
the page data is fetched from the remote node and mapped to the local page cache.

   1 MPI processes are considered as nodes in the ArgoDSM system.
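For illustration, registering such a handler uses the standard sigaction mechanism.
The sketch below is our own (Argo's actual registration code lives in its MPI
backend), but the handler signature matches the one shown later in Listing 3.5:

    #include <signal.h>

    // The DSM's page-fault handler (see Section 3.3.1);
    // si->si_addr holds the faulting address.
    void handler(int sig, siginfo_t* si, void* unused);

    void install_handler() {
        struct sigaction sa;
        sa.sa_flags = SA_SIGINFO;      // request siginfo_t delivery
        sigemptyset(&sa.sa_mask);
        sa.sa_sigaction = handler;
        sigaction(SIGSEGV, &sa, NULL);
    }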

2.2        MPI One-Sided Communication
Argo does not have its own custom fine-tuned network layer to perform its underlying
communications; rather, it uses the passive one-sided communication of MPI [Kax+15].
    One-sided communication, also known as remote direct memory access (RDMA)
or remote memory access (RMA), was introduced in the MPI-2 standard [Gro+98].
This form of communication, unlike two-sided communication, decouples data
movement from process synchronization — hence its name. In essence, it allows a
process to have direct access to the memory address space of a remote process through
the use of non-blocking operations, without the intervention of that remote process.

2.2.1      Memory Windows
The fact that the target process does not perform an action that is the counterpart
of the action on the origin does not mean that the origin process can access and
modify arbitrary data on the target at arbitrary times. In order to allow processes to
have access into each other's memory, processes have to explicitly expose their own
memory to others. That is, one-sided communication in MPI is limited to accessing
only a specifically declared memory area on the target, known as a window.
    In the one-sided communication model, each process can make an area of its
memory, called a window, available to one-sided transfers. The variable type for
declaring a window is MPI_Win. The window is defined on a communicator, and
thus a process in that communicator can put arbitrary data from its own memory
into the window of another process, or get data from the other process' window into
its own memory, as seen in Figure 2.3.
    The memory for a window is at first sight ordinary data in user space. There are
multiple ways to associate data with a window, one of them being to pass a user buffer
to MPI_Win_create, along with its size measured in bytes, the displacement unit (the
size of its elements in bytes), and the relevant communicator.
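As an illustration, the following sketch (the buffer name and sizes are our own)
exposes a local integer array as a window, using the element size as the displacement
unit:

    MPI_Win win;
    int buffer[1024];

    // Expose 'buffer' to one-sided accesses by every process in the
    // communicator; displacements into it will be counted in ints.
    MPI_Win_create(buffer, 1024 * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // ... one-sided communication epochs ...

    MPI_Win_free(&win);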

        Figure 2.3: Remote Put & Get between Processes in a Communicator

2.2.2     Basic Operations
There are multiple routines for performing one-sided operations, but three of the most
basic ones are Put, Get, and Accumulate. These calls roughly correspond to the
Send, Receive, and Reduce of the two-sided communication model, except that of
course only one process makes the call.
    We shall denote by origin the process that performs the call, and by target the
process whose memory is accessed. Thus, in a put operation, source=origin
and destination=target; in a get operation, source=target and destination=origin.

2.2.2.1   Put and Get
The MPI_Put call can be considered a one-sided send and, as such, it must specify:

   • the target rank,

   • the data to be sent from the origin, and

   • the location where it is to be written on the target.

    The description of the data on the origin supplied to the call is the usual trio of
a pointer to the buffer, the number of its elements to be considered, and the type of
each individual element. The description of the data on the target is similar to that
of the origin, as the number of elements and the datatype also need to be specified;
but instead of a buffer address, a displacement relative to the start of the window
on the target must be supplied. This displacement can be given in bytes (if the
displacement unit is one), but in general it is a multiple of the displacement unit
(the element size in bytes) that was specified in the window definition.
    As an example, consider a window created with a displacement unit of four bytes
(sizeof(int)). Any access to that window with an MPI_INT datatype and a target
displacement of three provided to the call would read or write, depending on the
operation, the element at index three of the window memory, based on the calculation:
    window_base + target_disp × disp_unit.

                 Figure 2.4: Offset Calculation for an MPI Window

   The MPI_Get call has exactly the same parameters as MPI_Put; however, they
take on a different meaning, since now the origin buffer will host the data coming
from the remote window.
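A minimal sketch of both calls follows, reusing the window from Section 2.2.1 (the
ranks and values are our own; the synchronization calls that must delimit the access
epoch are deferred to Section 2.2.4):

    int value = 42, result;

    // Write 'value' into the window element at index 3 on rank 1,
    // i.e., at byte address window_base + 3 * disp_unit.
    MPI_Put(&value, 1, MPI_INT, 1, 3, 1, MPI_INT, win);

    // Read the same element back into 'result'.
    MPI_Get(&result, 1, MPI_INT, 1, 3, 1, MPI_INT, win);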

2.2.2.2   Accumulate
The third of the basic one-sided routines is MPI_Accumulate, which performs a
reduction operation on the data being put to the remote window, thus introducing
only one additional parameter to its call with respect to the put operation.
    Accumulate is a reduction with a remote result. As with MPI_Reduce, the same
predefined operators are available, but no user-defined ones. There is one extra
operator, MPI_REPLACE, which has the effect that only the last result to arrive is
retained.
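For instance, a sketch of a remote sum (the operands are our own, with the window
from Section 2.2.1) looks as follows:

    int contribution = 5;

    // Add 'contribution' into the window element at index 0 on rank 1:
    // the remote element becomes old_value + contribution.
    MPI_Accumulate(&contribution, 1, MPI_INT, 1, 0,
                   1, MPI_INT, MPI_SUM, win);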

2.2.3     Atomic Operations
One-sided calls are said to emulate shared memory in MPI, but the put and get
calls are not enough for certain scenarios with shared data. The problem is that
reading and updating shared data structures is not an atomic operation, which can
lead to inconsistent views of the data (a race condition).
    The MPI-3 standard [For12] added some atomic routines. These refer to a set of
"accumulate" operations that includes remote read-and-update operations and a re-
mote atomic swap operation. To the former group belong the routines
MPI_Get_accumulate and MPI_Fetch_and_op, which atomically retrieve the data
from the indicated window, apply an operator, and combine the data on the target
with the data on the origin. To the latter belongs the MPI_Compare_and_swap
routine, in which the origin data is swapped with the target data only if the target
data is equal to a user-specified value.
    All of the previously mentioned routines perform the same basic steps: they return
the data as it was before the operation and then atomically update the data on the
target. Among them, the most flexible in datatype handling is MPI_Get_accumulate.
The routines MPI_Fetch_and_op and MPI_Compare_and_swap, which operate on only
a single element, allow for faster implementations, in particular through hardware
support.
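As an example, a shared counter can be read and incremented atomically with
MPI_Fetch_and_op; the sketch below (the counter location and names are our own)
retrieves the value as it was before the update:

    long one = 1, previous;

    // Atomically read the counter stored at displacement 0 on rank 0
    // into 'previous' and add one to it on the target.
    MPI_Fetch_and_op(&one, &previous, MPI_LONG, 0, 0, MPI_SUM, win);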

2.2.4    Passive Target Synchronization
Within one-sided communication, MPI has two modes: active RMA and passive
RMA. In active RMA, or active target synchronization, the target sets boundaries on
the time period (the 'epoch') during which its window can be accessed. This type of
synchronization acts much like an asynchronous transfer with a concluding
MPI_Waitall in the two-sided communication model.
    In passive RMA, or passive target synchronization, the target puts no limitation
on when its window can be accessed. In this model, only the origin is actively
involved, allowing it to read from and write to a target at arbitrary times without
requiring the target to make any calls whatsoever. This means that the origin process
remotely locks the window on the target, performs a one-sided transfer, and releases
the window by unlocking it again.
    During an access epoch, also called a passive target epoch, a process can initiate
and finish a one-sided transfer by locking the window with the MPI_Win_lock call
and unlocking it with MPI_Win_unlock. The two lock types are:

   • MPI_LOCK_SHARED, which should be used for Get calls: since multiple processes
     are allowed to read from a window in the same epoch, the lock can be shared.

   • MPI_LOCK_EXCLUSIVE, which should be used for Put and Accumulate calls: since
     only one process is allowed to write to a window during one epoch, the lock
     should be exclusive.

    These routines make MPI behave like a shared memory system; the instructions
between locking and unlocking the window effectively become atomic operations.
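Putting the pieces together, one passive target epoch around the put from
Section 2.2.2 could look like this (a sketch with our own ranks and variables):

    // Lock the window on rank 1, transfer, and unlock. Unlocking also
    // forces remote completion of the put (see below).
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
    MPI_Put(&value, 1, MPI_INT, 1, 3, 1, MPI_INT, win);
    MPI_Win_unlock(1, win);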

Completion and Consistency
In one-sided communication one should be aware of the multiple instances of the data,
and of the various completions that affect their consistency.

   • The user data. This is the buffer that is passed to a Put or Get call. For
     instance, after a Put call, but still inside the access epoch, it is not safe to reuse.
     Making sure the buffer has been transferred is called local completion.

   • The window data. While this may be publicly accessible, it is not necessarily
     always consistent with internal copies.

   • The remote data. Even a successful Put does not guarantee that the other
     process has received the data. A successful transfer is a remote completion.

    We can force remote completion, that is, an update on the target, with
MPI_Win_unlock or some variant of it, concluding the epoch.
2.2.5    Memory Models
The window memory is not the same as the buffer that is passed to MPI_Win_create.
The memory semantics of one-sided communication are best understood through the
concept of public and private window copies. The former refers to the memory region
that a system has exposed, so that it is addressable by all processes, while the latter
refers to the fast private buffers (e.g., transparent caches or explicit communication
buffers) local to each process, where copies of the data elements from main memory
can be stored for faster access. The coherence between these two distinct memory
regions is determined by the memory model. One-sided communication in MPI
defines two distinct memory models: the separate and the unified memory model.
    In the separate memory model, the private buffers local to each process are not
kept coherent with all the updates to main memory. Thus, conflicting accesses to
main memory need to be synchronized and updated in all private copies explicitly.
This is achieved by explicitly calling one-sided functions in order to reflect updates
to the public window in the private memory.
    In the unified memory model, the public and private windows are identical. This
means that updates to the public window via put or accumulate calls will eventually
be observed by load operations in the private window. Conversely, local store
accesses are eventually visible to remote get or accumulate calls without additional
one-sided calls. These stronger semantics of the unified model make it possible to
omit some synchronization calls and can improve performance.

3     Contributions

In this chapter, we highlight the drawbacks of the default memory management ap-
proach used in Argo, present the contributions of our work to address these deficiencies
by introducing several page-based memory allocation policies, and then outline the
code modifications applied to incorporate these policies into Argo's backend.
    In the context of this thesis, as far as page placement policies are concerned, we
are particularly interested in static rather than dynamic techniques for managing
memory allocation. The reason for this choice is that no prior work has been done on
Argo's memory management; as this is the first step in that direction, we judged it a
good approach to implement simple allocation techniques and see whether they favor
performance, before moving on to more complicated ways of managing memory, such
as preallocation, data migration mechanisms, etc.

3.1     Default Memory Allocation Policy: Drawbacks
The default memory management scheme presented in Section 2.1.2 is a double-edged
sword: despite its simple implementation, it can be detrimental to the performance
and scaling of an application, especially if not exploited carefully.

3.1.1    Performance and Scalability
One of the drawbacks of the simplistic data placement is that it hurts performance
and scaling. The issue is caused by the relation between the specified size of the
global address space and the size of the allocated data structures in an application.
As an example, consider a case where the size of the global address space is specified
to be 10GB, while the globally allocated data structures in the application are of size
1GB. This results in all memory pages fitting in the physical memory of node0, since
each node has reserved a space of 2.5GB of its physical memory for the DSM to use.
In that setting, since the workload will be distributed across all nodes, there is
essentially no locality for any process except process zero, and a lot of network traffic
is generated for the purpose of fetching data from node0.

3.1.2    Ease of Programmability
Another drawback of the simplistic data placement is that it works against one of
the key concepts of Argo, which is ease of programmability. Argo's creators have
dedicated considerable effort to making this programming model easy to use by
keeping the level of abstraction as high as possible, thereby making it possible to
scale a shared memory application to the distributed system level with just a few
modifications. However, Argo will not offer its best performance, especially in
data-intensive applications, if the user is not aware of the default page placement
scheme. To acquire such information, the user has to look into the system's research
literature or source code, ask one of the system's creators, or figure it out by
conducting their own performance tests, since that information is not readily
available in the short programming tutorial1 on Argo's web page. If the user realizes
that the page placement policy affects performance and comes to understand its
functionality, then in each application, to strive for optimal performance, they will
have to take care of the relation between the size of the global address space and the
size of the globally allocated data, by providing the exact size of the globally
allocated data structures (plus some padding) to the initializer call argo::init.

   1 Porting a Pthread application on Argo: https://parapluu.github.io/argo/tutorial.html

3.2      Page-Based Memory Allocation Policies
In order to tackle the side effects of the default memory management scheme, and
keeping in mind that memory in Argo is handled at page granularity, we look into
page-based memory allocation policies. Considering the diversity of data-parallel
application characteristics, such as different memory access patterns, a single memory
policy is unlikely to enhance performance in all cases, so we incorporate seven static
page placement policies into Argo. We propose memory policies that address both
bandwidth and latency issues, and that also handle data at granularities other than
the default page granularity.

3.2.1     Cyclic Group
As mentioned above, a downside of the default memory policy is that it binds all
data to the physical memory of node0, thus causing network contention when the
workload is distributed. The cyclic group of memory policies addresses both the per-
formance and the programmability issues that the default memory policy elicits.
Specifically, it improves performance by spreading data across all physical memory
modules in the distributed system, thus balancing the usage of the memory modules
and improving network bandwidth, and it eases programmability, since whatever size
is provided to the initializer call argo::init does not affect the placement of data.
The cyclic group consists of six memory policies, namely cyclic, cyclic block, skew
mapp, skew mapp block, prime mapp, and prime mapp block.
    Figure 3.1 depicts the cyclic and the cyclic block memory policies, on the left and
right side of the figure respectively, in a cluster machine using four nodes. Each node
of the machine has a physical memory which will host the physical page frames of the
application data. The application data allocated in global memory is composed of M
memory pages, which are divided into four contiguous groups (each color represents
a group).

                     Figure 3.1: Cyclic & Cyclic Block Policies

    In general, the cyclic group of memory policies spreads memory pages over a
number of memory modules of the machine following a type of round-robin
distribution; in particular, the cyclic and cyclic block policies do so in a linear way.
The cyclic policy places one memory page per round: page i is placed in memory
module i mod N, where N is the number of nodes being used to run the application.
The cyclic block policy, on the other hand, places a block of pages of user-specified
size per round, so that block b is placed in memory module b mod N.
    The cyclic and cyclic block memory policies can be used in applications with
regular and irregular behavior that have a high level of sharing, since the distribution
of pages is extremely uniform, thus smoothing out the traffic generated in the
network, providing more bandwidth, and making better use of the memory modules.
However, the fact that these data placement techniques perform a linear distribution
of memory pages over a power-of-two number of nodes can still lead to contention
problems in some scientific applications. For example, in the field of numerical
scientific applications, the data structure sizes used are also powers of two, and thus
using the cyclic memory policy may lead to memory pages used by different processes
residing in the same memory modules [IWB02].
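To illustrate the arithmetic, a minimal sketch of the two placement functions follows;
the function and parameter names are our own and do not mirror Argo's code:

    #include <cstddef>

    // Cyclic: page i is placed on node i mod N.
    std::size_t cyclic_homenode(std::size_t page, std::size_t nodes) {
        return page % nodes;
    }

    // Cyclic block: pages are grouped into blocks of 'block_size'
    // pages, and block b = i / block_size is placed on node b mod N.
    std::size_t cyclic_block_homenode(std::size_t page, std::size_t nodes,
                                      std::size_t block_size) {
        return (page / block_size) % nodes;
    }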

    To overcome this phenomenon, Iyer et al. [IWB02] introduced two non-linear
round-robin allocation techniques: the skew mapp and prime mapp memory policies.
The basic idea behind these allocation schemes is to perform a non-linear page
placement over the machine's memory modules, in order to reduce concurrent
accesses directed to the same memory modules in parallel applications. The skew
mapp memory policy is a modification of the cyclic policy that has a linear page
skew. In this policy, a page i is allocated on the node (i + ⌊i/N⌋ + 1) mod N, where
N is the number of nodes used to run the application. In this way, the skew mapp
policy skips a node for every N pages allocated, resulting in a non-uniform
distribution of pages across the memory modules of the distributed system.
Figure 3.2 depicts the skew mapp memory policy as well as its corresponding block
implementation in a cluster machine using four nodes. Notice the red arrows pointing
to the node skipped in the first round of the data distribution.

                Figure 3.2: Skew Mapp & Skew Mapp Block Policies
    The prime mapp memory policy uses a two-phase round-robin strategy to better
distribute memory pages over a cluster machine. In the first phase, the policy places
data using the cyclic policy on P nodes, where P is a prime number greater than or
equal to N (the number of nodes used). Due to the condition that the prime number
has to satisfy, and also for ease of programmability, it is calculated at runtime and is
set to 3N/2. Aside from the reasons specified, the expression used to calculate it also
preserves a good ratio between the real and virtual nodes. In the second phase, the
memory pages previously placed on the virtual nodes are re-placed into the memory
modules of the real nodes, also using the cyclic policy. In this way, the memory
modules of the real nodes are not used in a uniform way to place memory pages.
Figure 3.3 depicts the prime mapp memory policy as well as its corresponding block
implementation in a cluster machine using four nodes. Notice the red arrows pointing
to the re-placement of pages from the virtual nodes to the memory modules of the
real nodes. The red arrows are two in the page-level allocation case and four in the
block-level case because, with four nodes in use, the computed number is equal to
six, which makes up two virtual nodes.

               Figure 3.3: Prime Mapp & Prime Mapp Block Policies
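The non-linear placements can be sketched in the same style. The prime mapp
sketch encodes one plausible reading of the two-phase scheme (a page landing on a
virtual node v >= N is re-placed on real node v mod N); the names are our own:

    #include <cstddef>

    // Skew mapp: page i is placed on node (i + i/N + 1) mod N,
    // skipping one node for every N pages allocated.
    std::size_t skew_mapp_homenode(std::size_t page, std::size_t nodes) {
        return (page + page / nodes + 1) % nodes;
    }

    // Prime mapp: phase one places page i cyclically over P = 3N/2
    // nodes; phase two re-places pages that landed on virtual nodes
    // (ids >= N) cyclically over the real nodes.
    std::size_t prime_mapp_homenode(std::size_t page, std::size_t nodes) {
        const std::size_t P = (3 * nodes) / 2;
        const std::size_t v = page % P;
        return (v < nodes) ? v : v % nodes;
    }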

3.2.2    First-Touch
First-touch is the default policy used by the Linux operating system to manage mem-
ory allocation on NUMA systems. This policy places data in the memory module of
the node that first accesses it. Due to this characteristic, data initialization must be
done with care, so that data is first accessed by the process that is later going to use
it. The two most common strategies to initialize data in parallel programming are
initialization only by the master thread, and having each worker thread initialize its
own data chunk. Figure 3.4 shows the difference between these two strategies in a
cluster machine using four nodes. Since we parallelize at the distributed system level,
we talk about master process and team process initialization, presented on the left
and right side of the figure respectively. In this example, global memory is composed
of two arrays, which are operated on in the computation part of the program with an
even workload distribution across the processes. Using the master process to
initialize the global arrays, the outcome is no different from the default memory
policy of Argo (if not handled correctly), where all data reside on node0. On the
contrary, using team process initialization, the memory pages are spread over the
four memory modules of the cluster, with each node hosting only the data that it
will need during the computation, thus exploiting locality and dramatically reducing
remote accesses.

              Figure 3.4: Master Process & Team Process Initialization
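The following sketch makes team process initialization concrete under first-touch; it
uses the public Argo tutorial API (argo::conew_array, argo::node_id,
argo::number_of_nodes, argo::barrier), and the array name and bounds are our
own illustration:

    const std::size_t n = 1 << 20;             // example array length
    double* a = argo::conew_array<double>(n);  // global shared array

    // Team process initialization: each node touches (and therefore,
    // under first-touch, hosts) only its own chunk of the array.
    std::size_t chunk = n / argo::number_of_nodes();
    std::size_t begin = argo::node_id() * chunk;
    std::size_t end = (argo::node_id() + 1 == argo::number_of_nodes())
                          ? n : begin + chunk;
    for (std::size_t i = begin; i < end; ++i)
        a[i] = 0.0;

    argo::barrier();  // make the initialization globally visible

With master process initialization, the loop above would instead run over the whole
array on node0 only, reproducing the bind all layout.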
    So if initialization is handled correctly in applications that have a regular access
pattern, first-touch will yield performance gains deriving from the short access
latencies to fetch data. In applications with irregular access patterns, however, this
allocation scheme may result in a high number of remote accesses, since processes
will not access the data that is bound to their memory modules. Regarding the
drawbacks that the default memory management scheme brings to the surface,
first-touch certainly addresses both the performance and the programmability issues,
and it becomes the best choice among the presented policies for regular applications.
For memory layouts such as the one in Figure 3.4, where the allocated arrays are of
equal size, cyclic block is also a favorable choice of policy. That choice, however,
comes with an ease-of-programmability cost, since the user will have to pre-calculate
the optimal page block size to set up Argo and then proceed with the execution of
the application.

3.3          Implementation Details
To incorporate the seven page placement policies mentioned in the previous section
into Argo, we attained a thorough knowledge of the programming model's backend, in
order to deliver the best solution in terms of code modifications and design quality.
With that said, we apply modifications and introduce new code in the source
directories backend/mpi and data_distribution of Argo's main repository2.

   2 Argo's repository on GitHub: https://github.com/etascale/argodsm

3.3.1         MPI Backend
As a starting point, we look into the file swdsm.cpp under the source directory
backend/mpi. This source file contains most of Argo's MPI backend implementation,
including the SIGSEGV signal handler.
    As mentioned in Section 2.1.3, the handler function is invoked on a cache miss,
which in Argo occurs on an access to an unmapped memory page or on an access
to a memory page without the right access permissions. First-time accesses to mem-
ory pages are especially important, since they determine whether memory pages have
to be mapped to the backing memory of the local machine, or be cached from another
node and mapped to the local page cache. The location to which a memory page
should be mapped is pointed out by the chosen memory policy.
    The functions that make up the functionality of the page placement policy are
getHomenode and getOffset, both defined in swdsm.cpp and called at the beginning
of the handler function. The first function returns the home node of the memory
page, while the latter returns the relevant offset in the backing memory in case the
page is mapped to the local machine. A prerequisite for the calculation of the home
node and offset is the page-aligned offset of the faulting address from the starting
point of the global address space. This offset is calculated at the very beginning
of the handler function and is passed as an argument to getHomenode and
getOffset, amongst other functions (Listing 3.1).
    In the vanilla version of Argo, the functionality of the bind all page placement pol-
icy is directly defined in the getHomenode and getOffset functions (Listing 3.2). This
is not a bad design approach in the original case, since it is the only policy available
to handle the placement of data across the nodes of a distributed system.

318  const std::size_t access_offset =
        static_cast<char*>(si->si_addr) - static_cast<char*>(startAddr);
321  const std::size_t aligned_access_offset =
        align_backwards(access_offset, CACHELINE*pagesize);
327  unsigned long homenode = getHomenode(aligned_access_offset);
328  unsigned long offset = getOffset(aligned_access_offset);

Listing 3.1: Argo: Function Invocation of getHomenode & getOffset (swdsm.cpp)
(Original Version)
498  unsigned long getHomenode(unsigned long addr) {
499      unsigned long homenode = addr / size_of_chunk;
500      if (homenode >= (unsigned long)numtasks) {
501          exit(EXIT_FAILURE);
502      }
503      return homenode;
504  }
505
506  unsigned long getOffset(unsigned long addr) {
508      unsigned long offset = addr - (getHomenode(addr)) * size_of_chunk;
509      if (offset >= size_of_chunk) {
510          exit(EXIT_FAILURE);
511      }
512      return offset;
513  }

Listing 3.2: Argo: Function Definition of getHomenode & getOffset (swdsm.cpp)
(Original Version)

However, since we incorporate seven other policies, we make use of the global_ptr
template class defined in data_distribution.hpp under the source directory
data_distribution, to improve readability by hiding the implementation details
(lines 518-9 and 534-5 of Listing 3.4).
    Other than hiding the internals of the page placement policies behind the
global_ptr class, we introduce an if-else construct: if its condition is satisfied, the
code that retrieves the home node and offset is protected by a semaphore and a
mutex lock, while in the opposite case it remains unprotected. Which branch of the
if-else statement is taken is decided by the cloc function parameter, whose purpose
is to identify whether the program is at a specific location while running under a
specific memory policy. If the branch is taken, it means that we are at that specific
location (analyzed later) in the program, running under the first-touch memory
policy, which corresponds to the function parameter cloc being seven.
    In the backend of the Argo system, almost all the functional code is enclosed by
different mutex locks and a semaphore. The pthread mutex locks protect from
concurrent accesses the data structures operated on, locally and globally, by multiple
threads; these structures serve the purpose of ensuring data coherency, but also of
mitigating some performance bottlenecks. Besides allowing only one thread at a time
to perform operations on these data structures, global operations that involve the
InfiniBand network also need to be serialized. This is due to the fact that either the
settings or the hardware itself of a cluster machine might not support concurrent one-
sided operations coming from the same node; if such operations are issued regardless,
execution might fail and, if it does not, unpredictable delays will be observed because
the network will have downgraded from InfiniBand to Ethernet. The abstract data
type used in the backend of Argo to serialize access to the InfiniBand network is a
semaphore.
334  const std::size_t access_offset =
        static_cast<char*>(si->si_addr) - static_cast<char*>(startAddr);
337  const std::size_t aligned_access_offset =
        align_backwards(access_offset, CACHELINE*pagesize);
343  unsigned long homenode =
        getHomenode(aligned_access_offset, MEM_POLICY);
344  unsigned long offset =
        getOffset(aligned_access_offset, MEM_POLICY);

Listing 3.3: Argo: Function Invocation of getHomenode & getOffset (swdsm.cpp)
(Modified Version)

514  unsigned long getHomenode(unsigned long addr, int cloc) {
515      if (cloc == 7) {
516          pthread_mutex_lock(&spinmutex);
517          sem_wait(&ibsem);
518          dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr +
                 reinterpret_cast<unsigned long>(startAddr)), 0);
519          addr = gptr.node();
520          sem_post(&ibsem);
521          pthread_mutex_unlock(&spinmutex);
522      } else {
523          dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr +
                 reinterpret_cast<unsigned long>(startAddr)), 0);
524          addr = gptr.node();
525      }
526
527      return addr;
528  }
529
530  unsigned long getOffset(unsigned long addr, int cloc) {
531      if (cloc == 7) {
532          pthread_mutex_lock(&spinmutex);
533          sem_wait(&ibsem);
534          dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr +
                 reinterpret_cast<unsigned long>(startAddr)), 1);
535          addr = gptr.offset();
536          sem_post(&ibsem);
537          pthread_mutex_unlock(&spinmutex);
538      } else {
539          dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr +
                 reinterpret_cast<unsigned long>(startAddr)), 1);
540          addr = gptr.offset();
541      }
542
543      return addr;
544  }

Listing 3.4: Argo: Function Definition of getHomenode & getOffset (swdsm.cpp)
(Modified Version)

The semaphore ibsem is shared between the threads of a process and ensures that
even though threads might be willing to concurrently process different global data
structures, only one of them will proceed at a time.
    The cloc parameter that determines which branch of the if-else statement is
taken in the getHomenode and getOffset functions is introduced because of the
internals of the first-touch memory policy, in conjunction with where the two functions
are called inside the handler. First-touch, standing apart from the other memory
policies, is the only one that makes use of a directory to keep track of the owner of
every page. That said, its implementation involves one-sided operations, which is
why it is protected by a dedicated directory lock and the relevant semaphore, as seen
in lines 516-7, 520-1 and 532-3, 536-7 of Listing 3.4. Note that the mutex lock
spinmutex used in that particular case serves solely to avoid the overhead of sleeping
imposed by the semaphore.
    A stripped-down version of the handler function is shown in Listing 3.5. Despite
the plethora of operations performed in the actual code, the workflow of the function
is rather simple. Initially, the faulting address is aligned at a 4KB page granularity
and is passed to the getHomenode and getOffset functions. Once the home node
and offset of the faulting address are retrieved, the handler checks whether the page
belongs to the node; in that case, it is mapped to the backing memory of the local
machine (globalData), as seen in line 384 of Listing 3.5. Otherwise, it is fetched from
its relevant home node and mapped to the local page cache (cacheData).
326  void handler(int sig, siginfo_t* si, void* unused) {
334      const std::size_t access_offset =
             static_cast<char*>(si->si_addr) - static_cast<char*>(startAddr);
337      const std::size_t aligned_access_offset =
             align_backwards(access_offset, CACHELINE*pagesize);
341      char* const aligned_access_ptr =
             static_cast<char*>(startAddr) + aligned_access_offset;

343      unsigned long homenode =
             getHomenode(aligned_access_offset, MEM_POLICY);
344      unsigned long offset =
             getOffset(aligned_access_offset, MEM_POLICY);

         // Protects globalData, cacheData and globalSharers.
348      pthread_mutex_lock(&cachemutex);

350      // If the page is local...
351      if (homenode == (getID())) {
353          sem_wait(&ibsem);

             // update the Pyxis directory (globalSharers) and
             ...

382          // map the page to the backing memory of the local machine.
384          vm::map_memory(aligned_access_ptr, pagesize*CACHELINE,
                 cacheoffset+offset, PROT_READ);

423          sem_post(&ibsem);
424          pthread_mutex_unlock(&cachemutex);
425          return;
426      }

         // If the page does not belong to the node,
         // fetch it from the relevant home node and
         // map it to the local page cache.
         ...

         // Update the Pyxis directory and perform further operations.
         ...

507      pthread_mutex_unlock(&cachemutex);
510      return;
511  }

Listing 3.5: Argo: Function Definition of the Signal Handler: handler (swdsm.cpp)
(Modified Version)

    Observe that operations that involve the two aforementioned data structures, as
well as the Pyxis directory (globalSharers), are enclosed by the lock cachemutex and
the semaphore ibsem when one-sided operations are about to take place. Clearly, if
we moved both of these locking structures to just before the invocation of the
getHomenode and getOffset functions, the branch and the locking structures
introduced in those functions would be unnecessary; however, we deliberately did not
change their original location, which is the very reason this code is injected.
    Since the first-touch memory policy requires a globally accessible directory to keep
track of the owner as well as the offset of every page, further code is added to the
initialization function argo_initialize in order to set up this data structure; this
code is executed only when the relevant policy is selected (Listing 3.6). We allocate
and initialize the first-touch directory in the same way as the rest of the global data
structures. In the beginning, the size of the directory is calculated and set to twice
the size of the total distributed shared memory (in pages), since the format for every
page is [home node, offset], analogous to the globalSharers directory, which is
[readers, writers]. Then, the implementation-specific directory globalOwners is
allocated at a 4KB alignment, mapped to Argo's virtual address space, and associated
with the newly created window ownerWindow.
935   void argo_initialize(std::size_t argo_size, std::size_t cache_size) {
977     #if MEM_POLICY == 7
978        ownerOffset = 0;
979     #endif

1019    #if MEM_POLICY == 7
1020       ownerSize = argo_size;
1021       ownerSize += pagesize;
1022       ownerSize /= pagesize;
1023       ownerSize *= 2;
1024       unsigned long ownerSizeBytes = ownerSize * sizeof(unsigned long);
1025
1026       ownerSizeBytes /= pagesize;
1027       ownerSizeBytes += 1;
1028       ownerSizeBytes *= pagesize;
1029    #endif

1056    #if MEM_POLICY == 7
1057       globalOwners = static_cast<unsigned long*>(
               vm::allocate_mappable(pagesize, ownerSizeBytes));
1058    #endif

1086    #if MEM_POLICY == 7
1087       current_offset += pagesize;
1088       tmpcache = globalOwners;
1089       vm::map_memory(tmpcache, ownerSizeBytes, current_offset,
               PROT_READ|PROT_WRITE);
1090    #endif

1107    #if MEM_POLICY == 7
1108       MPI_Win_create(globalOwners, ownerSizeBytes, sizeof(unsigned long),
               MPI_INFO_NULL, MPI_COMM_WORLD, &ownerWindow);
1109    #endif

1119    #if MEM_POLICY == 7
1120       memset(globalOwners, 0, ownerSizeBytes);
1121    #endif
1130  }

Listing 3.6: Argo: Function Definition of argo_initialize (swdsm.cpp)
(Modified Version)

Lastly, the buffer in the process space is initialized to zero.
    The memory region of ownerWindow is initialized in the argo_reset_coherence
function (Listing 3.7), which is invoked at the very end of the initialization function.
Notice that the initialization is done on the local memory region, but since we are
under the unified memory model, the local and public copies are kept coherent.

1240  void argo_reset_coherence(int n) {
1253    #if MEM_POLICY == 7
1254       MPI_Win_lock(MPI_LOCK_EXCLUSIVE, workrank, 0, ownerWindow);
1255       globalOwners[0] = 0x1;
1256       globalOwners[1] = 0x0;
1257       for (j = 2; j < ownerSize; j++)
1258           globalOwners[j] = 0;
1259       MPI_Win_unlock(workrank, ownerWindow);
1260       ownerOffset = (workrank == 0) ? pagesize : 0;
1261    #endif
1268  }

Listing 3.7: Argo: Function Definition of argo_reset_coherence (swdsm.cpp)
(Modified Version)

3.3.2    Data Distribution
For the implementation of the rest of the memory policies, we modify the files under
the source directory data_distribution. These files contain the two predefined
template classes which we modify and later use in order to hide the computational
part of the memory policies.
    In the official unmodified version of Argo, the source directory data_distribution
contains only one file, named data_distribution.hpp. Aside from the template class
definitions of global_ptr and naive_data_distribution, this file also contains the
definitions of their member functions. The code blocks that we are particularly
interested in are the constructor of global_ptr and the member functions homenode
and local_offset of naive_data_distribution.
    The interaction between the global_ptr constructor (Listing 3.8) and the member
functions of naive_data_distribution (Listing 3.9) is rather apparent. Once a
global_ptr object is created with the faulting address passed as an argument, as
previously seen in lines 518 and 534 of Listing 3.4, the constructor is invoked, which
in turn invokes the member functions homenode and local_offset of the
naive_data_distribution class to do the policy computation (lines 43-4 of
Listing 3.8). After the computation finishes, the private members of the global_ptr
class, homenode and local_offset, are retrieved with the public member functions
node and offset, respectively, as seen in lines 519 and 535 of Listing 3.4.
    As can be seen in Listing 3.9, the bodies of the member functions homenode and
local_offset are defined inside the class. The choice of not separating the definition
from the declaration is reasonable in that particular case and does not damage
readability or design quality, since only the bind all memory policy is implemented,
expressed as a one-liner in each of the functions.
    However, with the introduction of the other seven memory policies, we increase
the abstraction further by introducing a new implementation file to host the
computational part of the policies. Under the source directory data_distribution
we therefore introduce the implementation file data_distribution.cpp³ to host the
bodies of the functions homenode and local_offset of the naive_data_distribution class.
The computational part of each policy is selected through the use of the prepro-
cessor directive MEM_POLICY, defined in data_distribution.hpp as a number from
zero to seven, starting from bind all (Listings A.1 and A.2) and continuing with the
rest of the policies in the order presented in Section 3.2.

24    template<typename T, class Dist>
25    class global_ptr {
26        private:
27            /** @brief The node this pointer is pointing to.     */
28            node_id_t homenode;
30            /** @brief The offset in the node's backing memory.  */
31            std::size_t local_offset;
32
33        public:
42            global_ptr(T* ptr)
43                : homenode(Dist::homenode(reinterpret_cast<char*>(ptr))),
44                  local_offset(Dist::local_offset(reinterpret_cast<char*>(ptr)))
45                {}
88    };

Listing 3.8: Argo: Class Constructor of global_ptr (data_distribution.hpp) (Original Version)

    ³ CMakeLists.txt under argodsm/src was modified for the file to be included in the compilation.

99    template<int instance>
100   class naive_data_distribution {
101       private:
102           /** @brief Number of ArgoDSM nodes.                  */
103           static int nodes;
105           /** @brief Starting address of the memory space.     */
106           static char* start_address;
108           /** @brief Size of the memory space.                 */
109           static long total_size;
111           /** @brief One node's share of the memory space.     */
112           static long size_per_node;
113
114       public:
133           static node_id_t homenode(char* const ptr) {
134               return (ptr - start_address) / size_per_node;
135           }

142           static std::size_t local_offset(char* const ptr) {
143               return (ptr - start_address) - homenode(ptr) * size_per_node;
144           }
155   };

Listing 3.9: Argo: Class Member Functions homenode & local_offset of
naive_data_distribution (data_distribution.hpp) (Original Version)

    Along with MEM_POLICY, we also introduce the PAGE_BLOCK preprocessor directive
to set the block size for the policies working on varying granularities.
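
    Since both directives are plain compile-time constants, a minimal sketch of how
they might appear in data_distribution.hpp follows. The default values here are
illustrative assumptions; only the zero-to-seven numbering and the role of
PAGE_BLOCK are taken from the text above:

    // Illustrative sketch; the defaults are hypothetical.
    #ifndef MEM_POLICY
    #define MEM_POLICY 0   // 0: bind all, 1: cyclic, 2: cyclic block,
                           // 3-6: remaining policies (Section 3.2 order),
                           // 7: first touch
    #endif
    #ifndef PAGE_BLOCK
    #define PAGE_BLOCK 8   // pages per block for the block-based policies
    #endif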

3.3.2.1       Cyclic Memory Policies
The software implementation of the cyclic group of memory policies consists of a series
of simple mathematical expressions, in conjunction with conditions and loops (the
latter only in the case of the prime mapp policies), executed at runtime for the
global-address-space address provided to the relevant functions. In Listing 3.10, the
calculation of the home node for the cyclic and cyclic block policies is presented. We
do not present the rest of the policies in this category, since the method of
calculation is similar.
    In particular, for all the memory policies, in both the homenode and local_offset
functions, the starting address of the memory space is subtracted from the faulting
shared-virtual-address-space address that is passed as an argument, and the resulting
actual offset is held in the addr variable. Notice that in the cyclic group of memory
policies we do not use addr in the calculation of the homenode and offset variables,
but only in some conditions. The variable we use in those calculations is lessaddr,
which is addr minus granularity (the size of a page). In effect, the cyclic policies
treat the second page of the global address space as the first page and start the
distribution from there. We do this because the allocation of any global data
structure in an application starts from the second page (offset 0x1000) onwards,
since the very first page of globalData (offset 0x0000) is reserved by the system to
hold the amount of the memory pool currently allocated, as well as the data structure
for the TAS lock that guards updates to this variable. The first page of the global
address space is assigned to the master process (proc0) as its home node, since
execution stalls if the ownership is placed elsewhere.
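    To make the effect of lessaddr concrete, the following standalone sketch
(with illustrative values of four nodes and 4 KiB pages) reproduces the cyclic
home-node computation of Listing 3.10:

    #include <cstddef>
    #include <cstdio>

    // Standalone illustration of the cyclic home-node formula with
    // lessaddr; the node count and page size are example values.
    int main() {
        const std::size_t granularity = 0x1000;  // 4 KiB page
        const int nodes = 4;
        for (std::size_t addr = 0; addr < 6 * granularity; addr += granularity) {
            const std::size_t lessaddr =
                (addr >= granularity) ? addr - granularity : 0;
            const int homenode =
                static_cast<int>((lessaddr / granularity) % nodes);
            std::printf("page offset 0x%05zx -> node %d\n", addr, homenode);
        }
        // Output: the pages at 0x0000 and 0x1000 both map to node 0 (the
        // reserved first page stays with the master), then 1, 2, 3, 0, ...
    }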
    The first page of the global memory space will always be assigned to the master
process under the bind all memory policy, due to its distribution pattern, which is
why the lessaddr variable is not introduced in that case.
    The difference between the scalar and block implementations of the cyclic group of
memory policies is less pronounced in the homenode function than in local_offset.

67    template<int instance>
68    node_id_t naive_data_distribution<instance>::homenode(char* const ptr) {
72        #elif MEM_POLICY == 1
73           static constexpr std::size_t zero = 0;
74           const std::size_t addr = ptr - start_address;
75           const std::size_t lessaddr =
                 (addr >= granularity) ? addr - granularity : zero;
76           const std::size_t pagenum = lessaddr / granularity;
77           const node_id_t homenode = pagenum % nodes;
78        #elif MEM_POLICY == 2
79           static constexpr std::size_t zero = 0;
80           static const std::size_t pageblock = PAGE_BLOCK * granularity;
81           const std::size_t addr = ptr - start_address;
82           const std::size_t lessaddr =
                 (addr >= granularity) ? addr - granularity : zero;
83           const std::size_t pagenum = lessaddr / pageblock;
84           const node_id_t homenode = pagenum % nodes;
137   }

Listing 3.10: Argo: Class Member Function homenode of naive_data_distribution
(data_distribution.cpp) for the cyclic & cyclic_block memory policies (Modified Version)

More specifically, what the block implementation does differently from the scalar
one is to use the implementation-specific variable pageblock instead of granularity
in the calculation of pagenum, which happens in lines 76 and 83 for the cyclic and
cyclic block implementations, respectively. In the local_offset function, by contrast,
the calculation of the pagenum variable is not the only difference that sets the two
implementations apart; the calculation of the offset variable, which happens in lines
150 and 160 for cyclic and cyclic block, respectively, differs as well.
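    As an illustration of that difference, the two offset computations could be
sketched as follows under the structure of Listing 3.10. This is our reconstruction
from the formulas above, not the verbatim code of lines 150 and 160, and any
adjustment for the reserved first page on node 0 is omitted:

    // Hedged reconstruction of the offset step in local_offset; the
    // names mirror Listing 3.10. Not the verbatim thesis code.
    #if MEM_POLICY == 1       // cyclic
       const std::size_t pagenum = lessaddr / granularity;
       // rank of this page among the home node's pages, times the page
       // size, plus the byte offset within the page
       const std::size_t offset =
           (pagenum / nodes) * granularity + lessaddr % granularity;
    #elif MEM_POLICY == 2     // cyclic block
       const std::size_t pagenum = lessaddr / pageblock;
       // rank of this block among the home node's blocks, times the
       // block size, plus the byte offset within the block
       const std::size_t offset =
           (pagenum / nodes) * pageblock + lessaddr % pageblock;
    #endif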
    Aside from the differences between the scalar and block implementations, prime
mapp and prime mapp block generally work a bit differently from the other policies
in the way they calculate the offset. The peculiarity of their implementation is the
use of loops for calculating the offset of a page in certain address ranges. These
address ranges are the ones corresponding to the real nodes in the system, after
the very first cyclic distribution of pages. For these address ranges, no closed-form
expression based on the 3N/2 prime number was found that correctly calculates the
offset of the pages in the backing memory of the nodes. However, the offset of the
pages in the first cyclic distribution, as well as of those corresponding to the virtual
nodes in the system, is correctly calculated with the same statement as the one used
in the cyclic memory policies (lines 190 and 217 for prime mapp and prime mapp block,
respectively, Listing A.8). We can therefore take advantage of these offsets to
calculate the ones corresponding to real nodes. We accomplish this by iterating
backwards a page, or a block of pages, depending on the implementation, until we hit
a page of the same owner inside the correctly calculated offset address ranges,
counting all the pages of the same home node along the way (lines 195-199 and 222-226
for prime mapp and prime mapp block, respectively, Listing A.8). Once such a page is
hit, we calculate its offset and then add to it the number of pages counted,
multiplied by granularity, yielding the correct offset of a page corresponding to a
real node (lines 202-203 and 229-230 for prime mapp and prime mapp block,
respectively, Listing A.8).
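    For intuition, the distribution pattern underlying prime mapp (round-robin over
the smallest prime of at least 3N/2 virtual nodes, folded back onto the N real nodes)
can be sketched in a standalone program. This reflects our reading of the scheme as
presented in Section 3.2, not the thesis code of Listing A.8:

    #include <cstddef>
    #include <cstdio>

    // Trial-division primality test, adequate for the small values here.
    static bool is_prime(std::size_t n) {
        if (n < 2) return false;
        for (std::size_t d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    int main() {
        const int nodes = 4;                      // example node count
        std::size_t prime = (3 * nodes + 1) / 2;  // smallest candidate >= 3N/2
        while (!is_prime(prime)) ++prime;         // prime = 7 for N = 4
        for (std::size_t pagenum = 0; pagenum < 2 * prime; ++pagenum) {
            const int vnode = static_cast<int>(pagenum % prime); // virtual node
            const int homenode = vnode % nodes;   // fold virtual onto real
            std::printf("page %2zu -> virtual node %d -> real node %d\n",
                        pagenum, vnode, homenode);
        }
    }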

3.3.2.2      First-Touch Memory Policy
The software implementation of the first-touch memory policy uses a directory,
accessed through a very simple index function, to fetch the home node and offset of a page.
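    A minimal sketch of such a lookup follows; the layout of two unsigned long
entries per page (home node first, in-node offset second) is an assumption made for
illustration only, and the actual index function of Listing 3.11 may differ:

    // Hypothetical directory lookup; the two-entries-per-page layout is
    // an assumption for illustration, not taken from Listing 3.11.
    const std::size_t addr  = ptr - start_address;      // offset in global space
    const std::size_t index = 2 * (addr / granularity); // simple index function
    const node_id_t homenode = static_cast<node_id_t>(globalOwners[index]);
    const std::size_t offset =
        static_cast<std::size_t>(globalOwners[index + 1]) + addr % granularity;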
    The calculation of the index for fetching the home node and the offset of an
address happens in lines 119 and 238 of Listing 3.11, respectively, with the
corresponding accesses to the directory in lines 122 and 241. Notice that in the
local_offset function,
