Department Informatik
Technical Reports / ISSN 2191-5008

Stefan Reif and Wolfgang Schröder-Preikschat

Predictable Synchronisation Algorithms for
Asynchronous Critical Sections
Technical Report CS-2018-03

February 2018

Please cite as:
Stefan Reif and Wolfgang Schröder-Preikschat, “Predictable Synchronisation Algorithms for Asynchronous Critical
Sections,” Friedrich-Alexander-Universität Erlangen-Nürnberg, Dept. of Computer Science, Technical Reports,
CS-2018-03, February 2018.

Friedrich-Alexander-Universität Erlangen-Nürnberg
Department Informatik
Martensstr. 3 · 91058 Erlangen · Germany
www.cs.fau.de
Predictable Synchronisation Algorithms for Asynchronous Critical Sections

Stefan Reif, Wolfgang Schröder-Preikschat
Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Abstract—Multi-core processors are ubiquitous. Even embedded systems nowadays use processors with multiple cores. Such use cases often impose latency requirements because they interact with physical objects.

One consequence is a need for synchronisation algorithms that provide predictable latency in addition to high throughput. A promising approach is asynchronous critical sections, which avoid waiting even when a resource is occupied.

This paper introduces two algorithms that allow for both synchronous and asynchronous critical sections. Both algorithms are based on a novel wait-free queue. The evaluation shows that both algorithms outperform pre-existing synchronisation algorithms for asynchronous requests, and perform similarly to traditional lock-based algorithms for synchronous requests. In summary, our synchronisation algorithms can improve the throughput and predictability of parallel applications.

I. INTRODUCTION

Shared-memory multi-core processors have become omnipresent [1], [2]. Even small embedded and portable devices employ processors with more than one core [3].

Systems running on multi-core processors typically need coordination for interacting threads. Most applications ensure data-structure integrity by mutual exclusion of critical sections. The corresponding synchronisation overhead is often performance-critical in parallel applications and data structures [4], [5].

With embedded applications running on multi-core processors, the need for predictable synchronisation emerges. Seemingly small delays can accumulate and significantly harm the overall system performance and responsiveness [6], [7], [8]. Blocking operations of traditional lock algorithms are therefore problematic, especially for embedded systems that interact with physical entities [9]. Locks generally delay threads until an associated resource is available, which leads to situation-specific waiting times. In the worst case, a deadlock occurs and threads have to wait forever. Therefore, the goal is a synchronisation algorithm that is fast in the worst case, without neglecting average-case performance.

A promising concept that avoids blocking, while maintaining the convenient program structure of critical sections, is the asynchronous critical section [10], [11], [12]. This concept means that threads can request the execution of arbitrary critical sections without blocking. For mutual exclusion, only the execution of the critical section is delayed; the requesting thread can proceed. The critical section is hence decoupled from the requesting thread, and both control flows are potentially concurrent to each other. Traditional synchronous critical sections, in contrast, force threads to wait in case of contention.

For asynchronous critical sections, a run-time system has to provide a mechanism that ensures that each submitted critical section eventually runs. This system also enforces mutual exclusion of all requests.

The contributions of this paper are two general-purpose synchronisation algorithms that support asynchronous critical sections:

• A predictability-oriented synchronisation algorithm for shared-memory multi-core processors
• An adapted version of the above algorithm for many-core platforms where excess processor cores are available

Neither algorithm is limited to asynchronous critical sections; both support traditional synchronous critical sections as well.

The rest of the paper is structured as follows. Section II introduces existing concepts for delegation-based synchronisation, which is a prerequisite for asynchronous critical sections. Then, Section III presents both novel synchronisation algorithms for asynchronous critical sections. Section IV and Section V examine their correctness analytically and their performance empirically. Afterwards, Section VI discusses related work and Section VII concludes the paper.
II. BACKGROUND

Contention for shared resources generally requires coordination of concurrent threads. In case of conflict, threads have two options available. First, they can wait until the resource is available. This is the traditional lock-based synchronisation model. Second, the thread can encapsulate the critical operation in a dedicated job data structure and submit that job to a synchronisation entity that executes the critical section at the right moment in time [13]. This second concept has been generalised under the term delegation-based synchronisation [4]. Delegation-based synchronisation can be extended for asynchronous requests. By decoupling critical sections from the requesting thread, blocking is not mandatory. While the critical section is executed asynchronously and concurrently, the requesting thread can continue doing meaningful work.

A. Remote Core Locking

Remote Core Locking (RCL) [14] provides general-purpose delegation-based synchronisation. By concept, RCL migrates arbitrary critical sections to dedicated server threads. Other threads thereby take the role of client threads that encapsulate each critical section in a job, which they submit to the server. Such a job describes the critical section in a closure [15]. The server thread iteratively executes all incoming requests. Since only a single server thread exists, all critical sections are executed sequentially. RCL thus guarantees mutual exclusion of all requests.

While the control flow contrasts strongly with lock-based synchronisation, RCL is transparent in functional terms. To achieve compatibility, the RCL protocols force every client to wait for the completion of each request. On the one hand, waiting for completion simplifies the implementation because, at any moment in time, each client can submit at most one request. This limit allows for bounded data structures. On the other hand, RCL does not allow for asynchronous critical sections.

This paper presents an alternative implementation for RCL that differs in functional aspects. In contrast to the original solution [14], this version allows for asynchronous critical sections.

B. Queue Delegation Locking

Queue Delegation Locking (QDL) [11] combines a lock with a bounded queue. This queue collects delegated critical sections. At lock release, the old owner executes all pending requests. However, the queue is bounded, so the execution is only conditionally asynchronous—if the queue is full, threads are forced to wait. This paper presents an alternative algorithm that uses an unbounded queue. It thus never requires waiting for asynchronous requests.

C. Guards

Guards provide delegation-based synchronisation with a focus on predictable execution time. As far as we know, the original implementation [12] was the first to achieve a wait-free progress guarantee for the entry and exit protocols of critical sections.

Guards replace the dedicated server thread by an on-demand solution, the sequencer. Every thread that requests execution of a critical section can take the role of the sequencer, if the guard protocols demand it.

The guard protocols are accompanied by a programming convention, which is shown in Listing 1. Each thread submits critical sections using the vouch function. This function is the entry protocol for the critical section and negotiates the sequencer thread. If the current thread is supposed to take that role, the function returns a job handle. Otherwise, it returns NULL to indicate that no sequencing is required.

Listing 1: Guard sequencing loop

job_t *job = ...
job_t *cur;
if (NULL != (cur = vouch(guard, job))) do {
  run(cur);
} while (NULL != (cur = clear(guard)));

The guard concept requires that the sequencer executes all pending requests. This convention is integrated into the clear function. This function returns a handle to the next request, if available. Otherwise, the clear function returns NULL to indicate that no more jobs are pending.

An important aspect of the guard concept is the progress guarantee. Even if the guard is occupied, threads block neither inside vouch nor clear. For non-sequencer threads, this means that they continue their original control flows immediately and run in parallel to their critical sections. For the sequencer, however, the progress guarantee is more complex. It is possible that the sequencer remains in the sequencing loop forever, when other threads submit too many jobs too quickly. For the application, it is therefore mandatory that critical sections are rare. Variants of the guard concept that remove this restriction by safely renegotiating the role of the sequencer are beyond the scope of this paper.
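To make the convention concrete, the following sketch expands Listing 1 into a complete submission helper. The job layout (a function pointer plus argument behind the queue link) and the helper names run and submit are illustrative assumptions; the report itself only prescribes the vouch/clear protocol.

typedef struct job {
    struct job *next;       // queue linkage used by the guard
    void (*func)(void *);   // the critical section as a closure
    void *arg;              // argument to the critical section
} job_t;

// Execute the critical section encapsulated in a job.
static void run(job_t *job)
{
    job->func(job->arg);
}

// Submit a job; the caller becomes the sequencer only if the guard
// was free, and then drains all pending requests (cf. Listing 1).
static void submit(guard_t *guard, job_t *job)
{
    job_t *cur = vouch(guard, job);
    while (cur) {
        run(cur);
        cur = clear(guard);
    }
}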
Listing 2: Data structure definitions

typedef struct chain {
  struct chain *next;
} chain_t;

typedef struct {
  chain_t *head;
  chain_t *tail;
} guard_t;

typedef struct {
  chain_t *head;
  chain_t *tail;
  sleep_t  wait;
} actor_t;

Guards rely on an additional reply mechanism. It allows guards to signal that a request has been executed and has terminated, ensuring that all control-flow and data dependencies are fulfilled. A typical implementation of such a reply mechanism is a future [16] object that manages the results of critical sections. In consequence, guards allow threads to wait immediately (fully synchronous critical section), never (fully asynchronous critical section), or at any later moment in time.

This paper presents an alternative implementation for guards that differs in non-functional aspects. Compared to other solutions, it significantly improves performance and predictability.
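A future-based reply mechanism, as mentioned above, can be built from C11 atomics. The following is a minimal sketch of ours, not part of the original guard implementation; the names future_t, future_set, and future_await are illustrative.

#include <stdatomic.h>

typedef struct {
    atomic_int done;   // 0 = pending, 1 = completed
    void *result;      // filled in by the critical section
} future_t;

// Called by the sequencer after the critical section has terminated.
void future_set(future_t *self, void *result)
{
    self->result = result;
    atomic_store_explicit(&self->done, 1, memory_order_release);
}

// Called by the requesting thread immediately, never, or at any
// later moment in time.
void *future_await(future_t *self)
{
    while (!atomic_load_explicit(&self->done, memory_order_acquire))
        ;  // spin; a futex-based variant could wait passively
    return self->result;
}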

D. Actors

Actors [17] are a concept for structuring parallel programs [10]. Such a program consists of sequential entities that communicate via messages [18]. Typically, dedicated actor libraries [19] or languages support programmers in implementing software entities as actors. Actors are therefore accompanied by a run-time system that facilitates message passing and actor scheduling.

For the scope of this paper, an actor combines a mailbox data structure and a server thread. Similar to RCL, each thread encapsulates critical sections in jobs and enqueues them in the mailbox. The server thread executes all incoming requests sequentially and thus guarantees mutual exclusion for all critical sections.

This interpretation is a mixture of the RCL and guard concepts. It combines a dedicated server thread with the support for asynchronous requests. Therefore, this paper focusses on efficient message passing, and presents an algorithm that has higher throughput than existing alternatives, especially at high contention. Furthermore, it has lower request latency for asynchronous critical sections.

Listing 3: Guard protocols

void guard_setup(guard_t *self)
{
  self->head = self->tail = NULL;
}

chain_t *guard_vouch(guard_t *self, chain_t *item)
{
  item->next = NULL;
  chain_t *last = FAS(&self->tail, item);          // V1
  if (last) {
    if (CAS(&last->next, NULL, item))              // V2
      return NULL;
    // last->next == DONE
  }
  self->head = item;                               // V3
  return item;
}

chain_t *guard_clear(guard_t *self)
{
  chain_t *item = self->head;                      // C1
  // item != NULL
  chain_t *next = FAS(&item->next, DONE);          // C2
  if (!next)
    CAS(&self->tail, item, NULL);                  // C3
  CAS(&self->head, item, next);                    // C4
  return next;
}

III. ALGORITHMS

Both synchronisation algorithms presented in this paper operate on a wait-free [20] multiple-producer single-consumer (MPSC) queue. As a distinctive feature, the enqueue operation detects whether the queue was empty beforehand. The synchronisation algorithms utilise this speciality internally.

A. The Guard Algorithm

The guard data structure, as shown in Listing 2, is basically an MPSC queue. Therefore, it has a head pointer referencing the oldest element, and a tail pointer that indicates where new elements can be added.

Listing 3 summarises all guard-related functions, using atomic load/store, compare-and-swap (CAS), and fetch-and-set (FAS) operations. A setup function initialises the guard data structure, vouch enqueues a critical section to the guard, and clear removes a request after completion. Figure 1 shows how vouch and clear are mapped to queue operations, and how the queue represents critical-section states.
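The FAS and CAS primitives used in Listing 3 map directly onto C11 atomics. The following wrappers are a sketch of one possible mapping, assuming the pointer fields are declared _Atomic; the report does not prescribe a particular implementation.

#include <stdatomic.h>
#include <stdbool.h>

// FAS: atomically store a new value and return the previous one.
static inline void *fas(_Atomic(void *) *ptr, void *value)
{
    return atomic_exchange_explicit(ptr, value, memory_order_acq_rel);
}

// CAS: replace *ptr by value only if it still equals expected;
// returns true on success.
static inline bool cas(_Atomic(void *) *ptr, void *expected, void *value)
{
    return atomic_compare_exchange_strong_explicit(
        ptr, &expected, value,
        memory_order_acq_rel, memory_order_acquire);
}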
The entry protocol of the guard is the vouch function, which internally performs an enqueue operation. The FAS operation V1 orders requests and, consequently, critical sections. Then, V2 detects whether the queue was empty beforehand. If so, the vouch function returns non-NULL, indicating that the current thread is supposed to take the role of the sequencer. As sequencer, the thread is allowed to execute its request immediately. Otherwise, the critical section is already occupied, because the current job is not the first item in the queue. In this case, a sequencer must already be present. Therefore, vouch returns NULL, indicating that the current thread is not supposed to operate as sequencer.

[Fig. 1: Queue representation of critical-section states — an empty queue (free) becomes occupied by vouch(cs1) = cs1 (#1 running); a concurrent vouch(cs2) = NULL appends a pending request (#1 running, #2 pending); clear() = cs2 hands over to the next job (#2 running); a final clear() = NULL empties the queue.]

The exit protocol for the sequencer is implemented by the clear function. The sequencer calls this function to remove a request from the queue after completion. If another item is pending in the queue, clear returns a reference to that job. The sequencer is then obliged to execute it. Otherwise, clear returns NULL and the sequencer resumes its original control flow.

It is possible that calls to vouch and clear overlap. If the queue already contains multiple elements, they do not interfere, because vouch modifies only the tail, and clear operates on the head. However, if the queue contains only a single element, then the two functions interact. Figure 2 details the interaction between concurrent calls to vouch and clear. Inside the vouch operation, it is possible for a short moment that the tail pointer points to a new request while the update of the next pointer is still pending. In this case, the sequencer leaves the queue, because the next element is not yet available. To manage this situation, the FAS operation C2 signals job completion to V2 using a unique magic value, DONE.

B. The Actor Algorithm

The queue algorithms that implement the guard protocols also constitute, with minor modifications, an actor mailbox implementation. Listing 4 summarises the actor protocols, which contain the same MPSC queue as the guard algorithms. Conceptually, the difference to guards is that a server thread is permanently available, instead of an on-demand sequencer thread. In consequence, the interface differs.

Listing 4: Actor protocols

void actor_setup(actor_t *self)
{
  self->head = self->tail = NULL;
  sleep_setup(&self->wait);
  thrd_create(actor_serve, self);
}

void actor_submit(actor_t *self, chain_t *item)
{
  item->next = NULL;
  chain_t *last = FAS(&self->tail, item);
  if (last) {
    if (CAS(&last->next, NULL, item))
      return;
    // last->next == DONE
  }
  self->head = item;
  sleep_awake(&self->wait);
}

chain_t *actor_shift(actor_t *self, chain_t *item)
{
  chain_t *next = FAS(&item->next, DONE);
  if (!next)
    CAS(&self->tail, item, NULL);
  CAS(&self->head, item, next);
  return next;
}

void actor_serve(actor_t *self)
{
  chain_t *item = NULL;
  while (1) {
    if (!item)
      item = sleep_await(&self->wait, &self->head);
    // item != NULL
    run(item);
    item = actor_shift(self, item);
  }
}
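As a usage illustration, a client thread could submit an asynchronous critical section to an actor as follows. This sketch adapts the hypothetical job_t from Section II to the chain_t linkage; the concrete fields and names are assumptions, not part of the actor interface.

#include <stdlib.h>

typedef struct {
    chain_t link;            // queue linkage; passed to the actor
    void (*work)(void *);    // the critical section
    void *arg;
} job_t;

static void increment(void *arg)
{
    long *counter = arg;     // only the server thread touches this
    (*counter)++;
}

void client(actor_t *actor, long *shared_counter)
{
    job_t *job = malloc(sizeof(*job));
    job->work = increment;
    job->arg  = shared_counter;
    actor_submit(actor, &job->link);
    // The call returns immediately; the client continues while the
    // server thread executes the critical section asynchronously.
}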
The core component of the actor is the server thread. This thread runs the serve function, which executes all incoming requests sequentially. Internally, this function repeatedly calls shift, which dequeues a request from the mailbox. Similarly to the guard algorithm, the first queue element describes the currently running job, and trailing elements represent pending requests. While no requests are pending, the server thread waits passively until a client submits a job, to improve energy efficiency. To this end, the guard queue algorithm needs minor adaptations to signal the presence of requests in the mailbox.

On the client side, the submit function enqueues requests to the actor mailbox. It is equivalent to the guard vouch function, except for an additional wake-up notification for the server thread when work is available. The enqueue operation helps to avoid unnecessary signals, since it detects whether the queue was empty beforehand. Only if the queue was empty does submit send a wake-up signal. Otherwise, work is already pending, so no signal is required.

Ideally, the worker thread of each actor is pinned to an exclusive core to avoid scheduler interference. This interpretation of actors assumes that the system contains plenty of cores (“many-core system”). Therefore, the application can dedicate one or more processor cores to the execution of critical sections.
the execution of critical sections.

[Fig. 2: Complete state space of overlapping vouch and clear operations with NULL (N) and DONE (D) pointers.]

IV. CORRECTNESS CONSIDERATIONS

For the sake of comprehensibility and brevity, this paper sketches a proof of correctness for the guard algorithm only informally. To this end, two properties need consideration. First, it is essential that, at every moment in time, at most one critical section is under execution (mutual exclusion) at a given guard. Second, every critical section submitted to a guard must eventually be executed (liveness).

Mutual exclusion of critical sections is achieved by ensuring that, at every moment in time, at most one sequencer exists. Here, three scenarios need to be considered. First, mutual exclusion of sequencer threads initially holds because, at initialisation of the guard, no sequencer exists. Second, if the queue already contains at least one element, vouch returns NULL. In consequence, no further thread can take the role of the sequencer. Hence, the property of mutual exclusion remains when adding jobs to a non-empty queue. Third, if the sequencer leaves and another thread enters the sequencing loop, multiple complex overlapping patterns are possible. Figure 2 details the complete state space of interfering vouch and clear operations, considering every possible intermediate state. In the path through the bottom left node, the figure also covers the case where the sequencer leaves first and, afterwards, a new thread becomes sequencer. The top right node, in contrast, is the scenario where the vouch operation completes before clear begins. In summary, it is impossible that multiple sequencers co-exist at any moment in time.
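The following sketch shows the kind of two-thread test case against which such an argument can be checked mechanically (see the model checking described below); the harness and the in_cs instrumentation are illustrative, not the report's actual test code.

#include <assert.h>
#include <stdatomic.h>
#include <threads.h>

static guard_t guard;
static chain_t job_a, job_b;
static atomic_int in_cs;   // counts threads inside the critical section

static void critical(void)
{
    assert(atomic_fetch_add(&in_cs, 1) == 0);  // mutual exclusion holds
    atomic_fetch_sub(&in_cs, 1);
}

static int worker(void *arg)
{
    chain_t *cur = guard_vouch(&guard, arg);
    while (cur) {              // sequencing loop, cf. Listing 1
        critical();
        cur = guard_clear(&guard);
    }
    return 0;
}

int main(void)
{
    thrd_t t1, t2;
    guard_setup(&guard);
    thrd_create(&t1, worker, &job_a);
    thrd_create(&t2, worker, &job_b);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}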
Liveness of synchronisation based on mutual exclusion is not possible to prove in general, because malicious threads can acquire a resource which they never release. In this artificial case, every further acquisition attempt for that resource is necessarily delayed forever. Therefore, this paper only considers cooperating critical sections that certainly release the corresponding resource after a finite number of instructions. In contrast, non-cooperating critical sections that hold the resource forever are considered a programming mistake. Under this precondition, liveness of the guard algorithms can be shown by induction. Thereby, two situations need consideration. First, if the queue is empty, a job request is granted immediately when vouch returns. Since vouch is wait-free, the job is certainly running after a bounded number of instructions. Second, if the queue is not empty, a previous request exists. By induction, the previous request is eventually executed. Afterwards, the next job is executed, with nothing but a clear operation in between. Since clear is wait-free, every critical section is certainly executed after a finite number of instructions.

We have also verified both the mutual exclusion and liveness properties of the guard algorithm using CDSChecker [21], [22]. This tool applies exhaustive state-space exploration to multi-threaded test cases written in C or C++. For the actor algorithm, liveness considerations are identical because it uses the same queue, and mutual exclusion is trivial because only a single server thread executes critical sections.

V. EVALUATION

The evaluation compares both algorithms of Section III with pre-existing synchronisation algorithms. It targets the throughput and latency of each variant. For the evaluation, we have implemented micro-benchmarks in C. The benchmarks are tailored for this evaluation because, for delegation-based synchronisation, critical sections must be representable as closures.

A. Evaluation Setup

This evaluation examines the performance of multiple synchronisation algorithms. First, the GUARD and ACTOR synchronisation methods implement the algorithms presented in Section III. Second, the OTHERGUARD variant implements a pre-existing guard algorithm [23] that uses a general-purpose wait-free queue [24]. Third, a fast wait-free MPSC queue [25] is used for an alternative actor (OTHERACTOR) implementation. A variant of this queue is used, for instance, in the akka actor framework [19], [26]. Fourth, locks are represented by TICKET and MCS locks [27]. Furthermore, PTHREAD mutexes serve as a performance baseline. They are the only non-fair synchronisation algorithm considered in the evaluation. Additionally, the ACTOR, GUARD, OTHERACTOR, and OTHERGUARD algorithms support asynchronous critical sections and are therefore evaluated for both synchronous and asynchronous requests.

All experiments were conducted on two computers. The large system has 80 logical cores. It contains four Intel Xeon E5-4640 processors, where each processor has 10 cores and hyper-threading. The processors run at 2.2 GHz. The machine runs Ubuntu 16.04 and uses the performance cpufreq governor, which disables dynamic voltage and frequency scaling (DVFS) for consistent performance. The small system has an Intel Xeon E3-1275 v5 processor with 4 cores and hyper-threading. All eight logical cores run at 3.6 GHz. This machine runs Ubuntu 15.10 and also has DVFS disabled.

B. Throughput Evaluation

The first part of the evaluation focusses on average-case performance. The maximum throughput represents the overhead to synchronise critical sections.

To quantify the throughput, a micro-benchmark application spawns 1 to 79 threads, or 1 to 7 threads, for the large and the small system, respectively. Each thread is thereby pinned to a core. The last core is reserved for the actor server thread. For uniformity, we restrict all measurements to 79 or 7 cores even though guards and lock-based variants need no server thread. Every thread requests execution of critical sections in a tight loop. Thus, the only relevant work performed by the micro-benchmark is the synchronisation overhead to sequentialise the execution of critical sections.
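The inner loop of such a benchmark could look roughly as follows; a sketch, since the benchmark sources are not included in this report. It reuses the hypothetical job_t from Section III-B; the empty critical section and iteration count are placeholders.

#include <stdlib.h>

static void nop(void *arg) { (void)arg; }

// Per-thread measurement loop: submit critical sections back to back
// and derive the throughput from the elapsed wall-clock time.
static void benchmark_thread(actor_t *actor, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        job_t *job = malloc(sizeof(*job));
        job->work = nop;     // empty critical section: the measured
        job->arg  = NULL;    // work is pure synchronisation overhead
        actor_submit(actor, &job->link);
        // jobs are reclaimed via the free mechanism of Appendix B
    }
}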
[Fig. 3: Maximum throughput for synchronous and asynchronous critical sections — throughput (Ops/s, log scale) over thread count on (a) the large system and (b) the small system, for the actor, guard, mcs, pthread, otheractor, otherguard, and ticket variants plus the asynchronous actor, guard, otheractor, and otherguard variants.]

The synchronisation throughput, averaged over 10^7 requests, is shown in Figure 3. On the large system, the performance drops significantly at 10 cores due to the hardware architecture. In case of more than 10 cores, communication between NUMA nodes is required, which has a higher latency than local memory access operations. The consequence is a significant performance decline. For more than 30 threads, the performance is relatively constant amongst all synchronisation algorithms. On the small system, the performance is relatively constant for more than 4 threads.

For synchronous critical sections, the throughput of the GUARD and ACTOR algorithms is between a TICKET and an MCS lock. This is, however, a test where these algorithms cannot profit from the support for asynchronous execution. PTHREAD mutexes have the highest throughput because they put cores to sleep, which effectively reduces the degree of contention.

For asynchronous critical sections, the ACTOR variant outperforms all lock-based synchronisation algorithms, except for the passively waiting PTHREAD mutex. The GUARD variant is slightly slower. The OTHERACTOR implementation also has a high throughput, but it is always behind the ACTOR algorithm of Section III. The OTHERGUARD algorithm, however, is bottlenecked by the relatively slow wait-free queue algorithm. Surprisingly, it cannot profit from asynchronous execution because asynchrony effectively increases the degree of contention—as threads need not wait for the completion of jobs, they eagerly submit further requests. In consequence, the contention on the guard queue increases, and hence, performance decreases. The ACTOR and GUARD variants are unaffected because of the efficient queue algorithm.

C. Latency Evaluation

The second part of the evaluation examines the latency of synchronisation requests. Each request causes a specific overhead, compared to the potential non-synchronised execution of the same code. This part of the evaluation therefore measures the overhead associated with each individual synchronisation request.

Similarly to the throughput evaluation, the application consists of identical threads that eagerly request execution of 10^5 critical sections. The rdtsc instruction measures the associated costs with processor-cycle precision.
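Per-request costs can be taken with the time-stamp counter; a sketch, assuming x86 and the GCC/Clang __rdtsc intrinsic, and reusing the submit helper sketched in Section II:

#include <stdint.h>
#include <x86intrin.h>

// Measure the cycles spent in a single synchronisation request.
static uint64_t measure_request(guard_t *guard, job_t *job)
{
    uint64_t start = __rdtsc();
    submit(guard, job);   // entry protocol plus, for the sequencer,
                          // the sequencing loop
    return __rdtsc() - start;
}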
The average request costs are presented in Figure 4, in processor cycles per request. Similarly to the throughput evaluation, the synchronisation costs are comparable to lock-based synchronisation in the case of synchronous critical sections. On the large system, the latency grows significantly when using more than 10 cores because of the hardware architecture. The non-uniform memory access latency especially affects TICKET locks—they have a low latency at low contention, but they are relatively slow at high contention. For asynchronous requests, however, the GUARD and ACTOR algorithms outperform locks on both systems, because they do not need to wait while the critical section is unoccupied. Again, the OTHERGUARD algorithm is relatively slow because of the complex queue.

[Fig. 4: Average request latency for synchronous and asynchronous critical sections — latency (cycles, log scale) over thread count on (a) the large system and (b) the small system, for the same variants as in Figure 3.]

The worst-case request costs are presented in Figure 5. The 95 % quantile represents the worst-case latency, but ignores hardware unpredictability (such as hardware interrupts) and Linux scheduler interference. The results are similar to the average-case evaluation. As an exception, PTHREAD mutexes fall behind because they are not fair. Notably, the GUARD algorithm scales nearly perfectly: at more than 20 cores, the latency is relatively constant. The asynchronous ACTOR variant has the lowest worst-case latency in most scenarios.

[Fig. 5: 95 % quantile request latency for synchronous and asynchronous critical sections — latency (cycles, log scale) over thread count on (a) the large system and (b) the small system, for the same variants as in Figure 3.]

D. Analysis

In summary, the evaluation shows that the GUARD and ACTOR variants are competitive with existing synchronisation algorithms for synchronous requests, and they outperform locks for asynchronous requests.

For synchronous critical sections, the throughput of the ACTOR and GUARD variants is between TICKET and MCS locks. When the application supports asynchronous critical sections, however, delegation-based synchronisation methods outperform their lock-based competitors. Furthermore, the ACTOR algorithm outperforms an alternative, widely used queue algorithm. Locks are competitive at low contention, but at high contention, the ACTOR and GUARD algorithms are faster.

The average-case latency analysis shows similar results. For synchronous critical sections, the latency of the ACTOR and GUARD implementations is between TICKET and MCS locks. In this scenario, lock algorithms wait at the beginning of critical sections, while the ACTOR and GUARD implementations wait for the completion of critical sections. In consequence, the latency differences are small for synchronous requests. For asynchronous critical sections, however, the ACTOR and GUARD variants do not need to wait. In this scenario, both are faster than their competitors.

The worst-case latency evaluation highlights the importance of fairness. The non-fair PTHREAD mutex falls behind most competitors, even though its average-case performance is relatively good. Asynchronous variants are very fast, except for the OTHERGUARD algorithm. Asynchronous GUARD requests scale nearly perfectly.
E. Threats to Validity

The performance of synchronisation algorithms always depends on the actual hardware. The evaluation has used two different systems with varying processor speed, core count, and memory uniformity (NUMA and UMA). Other hardware platforms can possibly perform differently. In particular, hardware vendors have started to provide dedicated mechanisms for efficient synchronisation, such as transactional memory [28]. Future instruction sets might therefore exhibit different performance characteristics.

Real-world applications probably use a mixture of synchronous and asynchronous critical sections. Furthermore, they can submit requests before they need the result, so that they can use a combination of both. Then, threads can continue doing meaningful work while the critical section runs concurrently. In the ideal case, the result is already available when the application needs it. However, this effect depends entirely on the application and the degree to which it can utilise asynchronous critical sections. Therefore, writing asynchronous programs remains a challenge for the application programmer. However, we are optimistic that, in many cases, applications can benefit from this form of micro-parallelism.

Actor languages and frameworks can impose an additional run-time overhead to map computations to actor operations. Similarly, run-time overhead related to the program transformations required to represent critical sections as closures is outside the scope of the evaluation. However, actor frameworks like akka are successful in industry and academia. Especially on many-core systems, the cost of thread coordination likely dominates the overall system performance.

VI. RELATED WORK

Asynchronous critical sections and similar synchronisation techniques have been used in operating systems [29], [30]. Similarly, the guard concept [12], [23] was originally introduced as a “structuring aid” for multi-core real-time systems.

Many synchronisation concepts support request delegation, but no asynchronous requests. For instance, flat combining [31] delegates data-structure operations to an on-demand combiner thread, but enforces request synchronicity. Similarly, RCL [14] and FFWD [32] force threads to wait until requests have completed. The reason is that, without asynchronous requests, every client thread has at most one pending request. This limitation allows for bounded internal data structures, which are often relatively fast and simple. In contrast, the synchronisation algorithms in this paper are based on an unbounded queue to fully support asynchronous requests, but they nevertheless achieve competitive performance.

Previous work on general-purpose synchronisation algorithms has often focussed on scalability [27], [33], [34] and average-case performance [4], [5], [35]. Many lock algorithms have shown good performance in the average case, but they inherently suffer from blocking delays. Similarly, concurrent queue algorithms are typically optimised for the average case [36], [37], even wait-free implementations [24], [38]. In contrast, both queue-based algorithms in this paper were designed with the worst-case latency in mind.

Recent research on scalable synchronisation algorithms considers timing predictability at the hardware level. For instance, locality-aware locking improves the worst-case memory access time by avoiding the overhead of non-local cache operations in NUMA systems: hierarchical NUMA locks [33], [39], [40] prefer lock transitions between cores of the same NUMA node. They thus avoid the additional delay of remote lock transitions, if possible. This optimisation results in higher throughput, but it actually comes at the cost of increased worst-case waiting time. Similarly, thread scheduling can improve system performance by exploiting data locality [41]. The actor concept goes even further and restricts data accesses to a single thread, and thus avoids remote data access operations.

Predictability for synchronisation algorithms often refers to fairness and linear blocking time with respect to the number of waiting threads [42]. FIFO locks, such as the Ticket Lock [27], ensure that every thread can eventually enter the critical section. They thus avoid starvation, which is an extreme case of unpredictability. Similarly to the algorithms in this paper, some FIFO locks use a queue internally, such as MCS and LH [43]. The difference to the algorithms in this paper is the synchronicity of requests—locks force threads to wait if the critical section is occupied. In consequence, the waiting time can depend on the number of contenders. In contrast, asynchronous requests always complete in a finite number of operations, regardless of the degree of contention.

In general, predictability is important for real-time systems. Hard real-time systems do not depend on average-case performance. Instead, they are designed to meet deadlines, using a sound worst-case execution time (WCET) analysis [9], [44], [45]. Deriving a safe worst-case blocking bound, however, is known to be difficult [46], [47]. Asynchronous critical sections can help
to eliminate the need for a blocking-bound estimation because requests cannot block. Then, the WCET of synchronisation algorithms is independent of the durations of critical sections. For synchronous requests, blocking is performed on an optional future variable. In that case, a real-time system designer can apply existing blocking-bound estimation techniques to derive the waiting time for the future object.

Besides real-time systems, predictability has also become an issue for high-performance computing (HPC). At large scale, seemingly minor delays can decrease system performance significantly [6], [7], [8]. In consequence, HPC systems often employ application-specific operating systems that minimise system noise [48]. The synchronisation algorithms presented in this paper can help avoid synchronisation-related jitter.

VII. CONCLUSION

The trend towards embedded multi-core processors is accompanied by a need for fast and predictable thread coordination. This paper has presented two general-purpose synchronisation algorithms that allow for asynchronous critical sections. The benefit of asynchronous execution of critical sections is that threads are not forced to wait if a resource is occupied. Instead, they submit a request that will be executed later. Job management is implemented as an efficient wait-free MPSC queue.

Both algorithms are not limited to asynchronous critical sections. They also support traditional synchronous critical sections by waiting for the completion of requests. This extension mimics the behaviour of locking protocols, if required.

The guard algorithm assumes that the number of processor cores is relatively small and therefore negotiates a sequencer thread on demand. This role demands the execution of all pending requests. The actor algorithm, in contrast, assumes that plenty of processor cores are available and therefore permanently occupies a dedicated core for request processing. The evaluation, however, shows that both algorithms are fast at any degree of contention, on large and small systems.

For both algorithms presented in this paper, the number of instructions per critical section has an upper bound. For the guard, the worst-case costs are seven atomic memory operations per asynchronous critical section. The actor has an additional system-specific overhead when the server thread waits passively. Both variants thus provide wait-free progress guarantees to all interacting threads, assuming that all critical sections terminate eventually.

The evaluation shows that both synchronisation algorithms are competitive for synchronous critical sections, and they outperform lock-based variants for asynchronous critical sections. Both throughput and latency are better than lock-based alternatives. On an 80-core machine, the worst-case latency of guard requests is nearly constant at more than 20 cores. Actors and guard-based variants also scale better than locks in the worst-case latency evaluation. In summary, both algorithms therefore offer high performance and timing predictability. The actor algorithm furthermore outperforms an alternative implementation based on a pre-existing MPSC queue.

Overall, both algorithms show that asynchronous critical sections improve the performance and predictability of parallel programs significantly. Future work will examine how distributed storage systems, video streaming and processing, and other latency-critical compute-intensive applications benefit from this form of micro-parallelism.

ACKNOWLEDGEMENTS

This work is supported by the German Research Foundation (DFG) under grants no. SCHR 603/8-2, SCHR 603/13-1, SCHR 603/15-1, and the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR89, Project C1). We further thank Timo Hönig for his insightful comments.

APPENDIX A
PASSIVE WAITING

For the actor algorithm, a server thread exists permanently. In consequence, it requires an efficient mechanism to wait for synchronisation requests while the actor is idle.

Listing 5 sketches an implementation that lets the server thread sleep while it is idle. In this example, the Linux futex system call [49] is used to wait passively until a request is present. A flag variable indicates whether the server thread is currently waiting for further requests, and is therefore sensitive to a wake-up signal.

The awake function first checks the sensitivity flag before it sends a wake-up signal. On the other side, await first sets the flag before checking the actual sleeping condition. This combination avoids lost-wake-up problems because, between checking the condition and calling futex_wait, the flag is set.

Passive waiting, in general, is not specific to the Linux futex system call. Alternative implementations could use UNIX signals or hardware interrupts to wait for synchronisation requests.
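Linux provides no libc wrapper for futex, so the futex_wait and futex_wake helpers used by Listing 5 can be thin syscall wrappers; a minimal sketch, assuming Linux:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep while *addr still equals expected (otherwise return at once).
static void futex_wait(int *addr, int expected)
{
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

// Wake at most one thread sleeping on addr.
static void futex_wake(int *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}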
Listing 5: Sleep operations                      Listing 6: Link element with deallocation function
typedef struct {                                              typedef struct {
  int state;                                                    chain_t *next;
} sleep_t;                                                      work_t   work;
                                                                free_t   free;
void sleep_setup(sleep_t *self)                               } chain_t;
{
  self->state = 0;
}

void *sleep_await(sleep_t *self, void **expr)                  Listing 7: Guard protocols with request deallocation
{
  while (1) {                                                 chain_t *guard_vouch(guard_t *self)
    self->state = 1;                                          {
    void *data = *exp;                                          item->next = NULL;
    if (data) {                                                 chain_t *last = FAS(&self->tail, item);      // V1
      self->state = 0;                                          if (last) {
      return data;                                                if (CAS(&last->next, NULL, item))          // V2
    }                                                               return NULL;
    futex_wait(&self->state, 1);                                  // last->next == DONE
  }                                                               last->free(last);
}                                                               }
                                                                self->head = item;                           // V3
void sleep_awake(sleep_t *self)                                 return item;
{                                                             }
  if (CAS(&self->state, 1, 0))
    futex_wake(&self->state);                                 chain_t *guard_clear(guard_t *self)
}                                                             {
                                                                chain_t *item = self->head;                  // C1
                                                                // item != NULL
                                                                chain_t *next = FAS(&item->next, DONE);      // C2
                                                                bool mine = true;
                                                                if (!next)
                    APPENDIX B
               MEMORY MANAGEMENT

   Memory management is an important issue for parallel algorithms. Since
multiple control flows access shared data structures simultaneously, deallocation
needs coordination. Otherwise, control flows operate on possibly invalid data.
   Both algorithms presented in this paper use a chain_t data structure that
represents jobs. Memory management for these queue elements has to consider that
multiple control flows access them.
   Memory management further has to support both synchronous and asynchronous
critical sections. The chain_t data structure in Listing 6 therefore embeds a
function pointer (free) which selects the deallocation procedure, depending on
the request type.
   For asynchronous critical sections, the free function can deallocate the
request data structure. Since the critical section is asynchronous, it is safe to
assume that no other thread accesses the link data structure afterwards.
   For synchronous critical sections, however, a thread awaits termination of the
critical section. Then, the sequencer must not deallocate the request, since
another thread can still access it. In that case, the free function pointer
notifies a potentially waiting thread that the critical section has completed. It
is then up to the waiting thread to actually deallocate the link element when it
is no longer needed. The notification implies that deallocation is safe since no
sequencer or server thread accesses the link element any more.
   Deallocation is straightforward for the actor algorithm, since a dedicated
server thread exists. All queue node deallocations can be performed by this
thread. Since only a single server thread exists, it intrinsically knows when a
link element is no longer needed. The server thread can thus ensure that every
link element is deallocated exactly once.
   For the guard algorithm, however, deallocation is more complex, since threads
take the role of the sequencer only temporarily. Therefore, the code in Listing 7
identifies the situations where request deallocation is safe. Importantly, the
link element is not accessed afterwards, even in the case of a sequencer change.
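   To make the two cases concrete, the following sketch shows plausible free
implementations. The names async_free and sync_free, the sync_request_t layout,
and the embedded sleep_t are illustrative assumptions that merely tie the
mechanism back to the protocol from Listing 5.

#include <stdlib.h>

void async_free(chain_t *item)
{
  free(item);                       /* safe: no other thread uses the request any more */
}

typedef struct {
  chain_t  link;                    /* queue element, including the free pointer */
  sleep_t  sleep;                   /* the waiting thread sleeps here (Listing 5) */
  void    *result;                  /* condition observed via sleep_await */
} sync_request_t;

void sync_free(chain_t *item)
{
  sync_request_t *req = (sync_request_t *)item;
  req->result = req;                /* make the wake-up condition true ... */
  sleep_awake(&req->sleep);         /* ... then notify the waiting thread, which
                                       deallocates the request itself */
}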
   In most cases, the sequencer is allowed to deallocate a request directly after
executing it. However, there is one notable exception, when concurrent vouch and
clear operations interfere in such a way that the role of the sequencer
transitions. Since the vouch function accesses the next pointer in V2, a
concurrent clear must not deallocate the corresponding request. However, clear
can detect the concurrent vouch. If the sequencer manages to reset the tail
pointer in C3, no concurrent vouch operation is in progress. Since the tail
pointer is reset to NULL, the current job is no longer accessible through the
guard data structure, in particular not for future vouch operations. Deallocation
is therefore safe. If C3 fails, however, a concurrent vouch operation is
certainly in progress, and that vouch operation has finished V1 but not V2.
Later, the CAS operation V2 will encounter the DONE value, which signals the
sequencer change. In this case, the new sequencer deallocates a request that its
predecessor executed.
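   This exception can be illustrated by the following interleaving, sketched from
Listing 7: T1 is the current sequencer clearing its last request A, and T2
concurrently vouches a new request B.

/*
 * T2: V1  last = FAS(&guard->tail, B);  // last == A, but B is not linked yet
 * T1: C2  next = FAS(&A->next, DONE);   // next == NULL, T2 has not linked B
 * T1: C3  CAS(&guard->tail, A, NULL);   // fails: tail already points to B
 *         // mine == false, so T1 must not deallocate A
 * T2: V2  CAS(&A->next, NULL, B);       // fails: encounters DONE
 * T2:     A->free(A);                   // the new sequencer deallocates A
 * T2: V3  guard->head = B;              // T2 continues as sequencer
 */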
   In summary, both algorithms reliably detect when link element deallocation is
safe, without additional communication. A general-purpose memory management
scheme, such as hazard pointers [50], is not needed.
                         REFERENCES

 [1] D. Geer, “Chip makers turn to multicore processors,” IEEE Computer, vol. 38, no. 5, pp. 11–13, May 2005.
 [2] J. Parkhurst, J. Darringer, and B. Grundmann, “From single core to multi-core: Preparing for a new exponential,” in Proceedings of the 25th IEEE/ACM International Conference on Computer-aided Design (ICCAD 2006). ACM Press, 2006, pp. 67–72.
 [3] A. Carroll and G. Heiser, “Mobile multicores: Use them or waste them,” ACM SIGOPS Operating Systems Review, vol. 48, no. 1, pp. 44–48, May 2014.
 [4] H. Guiroux, R. Lachaize, and V. Quéma, “Multicore locks: The case is not closed yet,” in Proceedings of the USENIX Annual Technical Conference (ATC 2016). USENIX Association, 2016, pp. 649–662.
 [5] V. Gramoli, “More than you ever wanted to know about synchronization: Synchrobench, measuring the impact of the synchronization on concurrent algorithms,” in Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM Press, 2015, pp. 1–10.
 [6] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan, “The influence of operating systems on the performance of collective operations at extreme scale,” in Proceedings of the 8th Annual International Conference on Cluster Computing, 2006, pp. 1–12.
 [7] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013.
 [8] L. Barroso, M. Marty, D. Patterson, and P. Ranganathan, “Attack of the killer microseconds,” Communications of the ACM, vol. 60, no. 4, pp. 48–54, Mar. 2017.
 [9] M. Yang, A. Wieder, and B. Brandenburg, “Global real-time semaphore protocols: A survey, unified analysis, and comparison,” in Proceedings of the 36th Real-Time Systems Symposium (RTSS 2015). IEEE Computer Society Press, 2015, pp. 1–12.
[10] G. Agha, “Concurrent object-oriented programming,” Communications of the ACM, vol. 33, no. 9, pp. 125–141, Sep. 1990.
[11] D. Klaftenegger, K. Sagonas, and K. Winblad, “Brief announcement: Queue delegation locking,” in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2014). ACM Press, 2014, pp. 70–72.
[12] G. Drescher and W. Schröder-Preikschat, “Guarded sections: Structuring aid for wait-free synchronisation,” in Proceedings of the 18th International Symposium On Real-Time Computing (ISORC 2015). IEEE Computer Society Press, 2015, pp. 280–283.
[13] Y. Oyama, K. Taura, and A. Yonezawa, “Executing parallel programs with synchronization bottlenecks efficiently,” in Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications (PDSIA 1999). World Scientific, 1999, pp. 182–204.
[14] J.-P. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, “Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications,” in Proceedings of the USENIX Annual Technical Conference (ATC 2012). USENIX Association, 2012, pp. 65–76.
[15] P. J. Landin, “The mechanical evaluation of expressions,” The Computer Journal, vol. 6, no. 4, pp. 308–320, 1964.
[16] H. C. Baker, Jr. and C. Hewitt, “The incremental garbage collection of processes,” in Proceedings of the 1977 Symposium on Artificial Intelligence and Programming Languages. ACM Press, 1977, pp. 55–59.
[17] C. Hewitt, P. Bishop, and R. Steiger, “A universal modular ACTOR formalism for artificial intelligence,” in Proceedings of the 3rd International Joint Conference on Artificial Intelligence (IJCAI 1973). Morgan Kaufmann Publishers Inc., 1973, pp. 235–245.
[18] C. A. R. Hoare, “Communicating sequential processes,” Communications of the ACM, vol. 26, no. 1, pp. 100–106, Jan. 1983.
[19] Akka Project, “Akka,” https://github.com/akka/akka, 2009.
[20] M. Herlihy, “Wait-free synchronization,” Transactions on Programming Languages and Systems (TOPLAS), vol. 13, no. 1, pp. 124–149, Jan. 1991.
[21] B. Norris and B. Demsky, “CDSchecker: Checking concurrent data structures written with C/C++ atomics,” in Proceedings of the 28th International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA 2013). ACM Press, 2013, pp. 131–150.
[22] ——, “A practical approach for model checking C/C++11 code,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 38, no. 3, pp. 10:1–10:51, May 2016.
[23] S. Reif, T. Hönig, and W. Schröder-Preikschat, “In the heat of conflict: On the synchronisation of critical sections,” in Proceedings of the 20th International Symposium on Real-Time Distributed Computing (ISORC 2017). IEEE Computer Society Press, 2017, pp. 42–51.
[24] A. Kogan and E. Petrank, “Wait-free queues with multiple enqueuers and dequeuers,” in Proceedings of the 16th Symposium on Principles and Practice of Parallel Programming (PPoPP 2011). ACM Press, 2011, pp. 223–233.
[25] D. Vyukov, “Intrusive mpsc node-based queue,” http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue, 2010.
[26] V. Klang, “Adding high-performance mpsc queue based mailbox to akka,” https://github.com/akka/akka/commit/fb2decbcda5dd2b18a2abfbc0425d18e1e780f24, 2013.
[27] J. M. Mellor-Crummey and M. L. Scott, “Algorithms for scalable synchronization on shared-memory multiprocessors,” Transactions on Computer Systems (TOCS), vol. 9, no. 1, pp. 21–65, 1991.
[28] M. Herlihy and E. Moss, “Transactional memory: Architectural support for lock-free data structures,” in Proceedings of the 20th ACM Annual International Symposium on Computer Architecture (ISCA 1993). ACM Press, 1993, pp. 289–300.
[29] C. Pu and H. Massalin, “An overview of the Synthesis operating system,” Department of Computer Science, Columbia University, New York, NY, USA, Tech. Rep., 1989.
[30] F. Schön, W. Schröder-Preikschat, O. Spinczyk, and U. Spinczyk, “On interrupt-transparent synchronization in an embedded object-oriented operating system,” in Proceedings of the 3rd IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2000). IEEE Computer Society Press, 2000, pp. 270–277.
[31] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir, “Flat combining and the synchronization-parallelism tradeoff,” in Proceedings of the 22nd Annual Symposium on Parallelism in Algorithms and Architectures (SPAA 2010). ACM Press, 2010, pp. 355–364.
[32] S. Roghanchi, J. Eriksson, and N. Basu, “ffwd: delegation is (much) faster than you think,” in Proceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017). ACM Press, 2017, pp. 342–358.
[33] D. Dice, V. J. Marathe, and N. Shavit, “Lock cohorting: A general technique for designing NUMA locks,” in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2012). ACM Press, 2012, pp. 247–256.
[34] A. Morrison, “Scaling synchronization in multicore programs,” ACM Queue, vol. 14, no. 4, pp. 56–79, Aug. 2016.
[35] T. David, R. Guerraoui, and V. Trigonakis, “Everything you always wanted to know about synchronization but were afraid to ask,” in Proceedings of the 24th ACM Symposium on Operating System Principles (SOSP 2013). ACM Press, 2013, pp. 33–48.
[36] M. M. Michael and M. L. Scott, “Simple, fast, and practical non-blocking and blocking concurrent queue algorithms,” in Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC 1996). ACM Press, 1996, pp. 267–275.
[37] A. Morrison and Y. Afek, “Fast concurrent queues for x86 processors,” in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013). ACM Press, 2013, pp. 103–112.
[38] C. Yang and J. Mellor-Crummey, “A wait-free queue as fast as fetch-and-add,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). ACM Press, 2016, pp. 16:1–16:13.
[39] M. Chabbi and J. Mellor-Crummey, “Contention-conscious, locality-preserving locks,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). ACM Press, 2016, pp. 22:1–22:14.
[40] M. Chabbi, A. Amer, S. Wen, and X. Liu, “An efficient abortable-locking protocol for multi-level NUMA systems,” in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017). ACM Press, 2017, pp. 61–74.
[41] S. Boyd-Wickizer, R. Morris, and F. Kaashoek, “Reinventing scheduling for multicore systems,” in Proceedings of the 12th Workshop on Hot Topics in Operating Systems (HotOS 2009). USENIX Association, 2009, pp. 1–5.
[42] L. Molesky, C. Shen, and G. Zlokapa, “Predictable synchronization mechanisms for multiprocessor real-time systems,” University of Massachusetts, Tech. Rep., 1990.
[43] P. Magnusson, A. Landin, and E. Hagersten, “Queue locks on cache coherent multiprocessors,” in Proceedings of the 8th International Parallel Processing Symposium (IPPS 1994). IEEE Computer Society Press, 1994, pp. 165–171.
[44] B. C. Ward and J. H. Anderson, “Fine-grained multiprocessor real-time locking with improved blocking,” in Proceedings of the 21st International Conference on Real-Time Networks and Systems (RTNS 2013). ACM Press, 2013, pp. 67–76.
[45] ——, “Supporting nested locking in multiprocessor real-time systems,” in Proceedings of the 24th Euromicro Conference on Real-Time Systems (ECRTS 2012), 2012, pp. 223–232.
[46] A. Biondi, B. Brandenburg, and A. Wieder, “A blocking bound for nested FIFO spin locks,” in Proceedings of the 37th Real-Time Systems Symposium (RTSS 2016). IEEE Computer Society Press, 2016, pp. 291–302.
[47] A. Wieder and B. Brandenburg, “On the complexity of worst-case blocking analysis of nested critical sections,” in Proceedings of the 35th Real-Time Systems Symposium (RTSS 2014). IEEE Computer Society Press, 2014, pp. 106–117.
[48] D. Tsafrir, Y. Etsion, D. Feitelson, and S. Kirkpatrick, “System noise, OS clock ticks, and fine-grained parallel applications,” in Proceedings of the 19th Annual International Conference on Supercomputing (ICS 2005). ACM Press, 2005, pp. 303–312.
[49] M. Kerrisk et al., “The Linux man-pages project,” https://www.kernel.org/doc/man-pages, 2017, version 4.14.
[50] M. Michael, “Hazard pointers: Safe memory reclamation for lock-free objects,” Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 491–504, 2004.