SOLUTION DONUT PAXOS: A RECONFIGURABLE CONSENSUS PROTOCOL


                                                        Anonymous authors
                                                   Paper under double-blind review

Submitted to the Journal of Systems Research (JSys), 2021

Abstract

State machine replication protocols, like MultiPaxos and Raft, are at the heart of numerous distributed systems. To tolerate machine failures, these protocols must replace failed machines with new machines, a process known as reconfiguration. Reconfiguration has become increasingly important over time as the need for frequent reconfiguration has grown. Despite this, reconfiguration has largely been neglected in the literature. In this paper, we present Donut Paxos and Donut MultiPaxos, a reconfigurable consensus protocol and a reconfigurable state machine replication protocol, respectively. Our protocols can perform a reconfiguration with little to no impact on the latency or throughput of command processing; they can perform a reconfiguration in a few milliseconds; and they present a framework that can be generalized to other replication protocols in a way that previous reconfiguration techniques cannot. We provide proofs of correctness for the protocols and optimizations, and present empirical results from an open source implementation.

1   Introduction

Many distributed systems [4, 6, 7, 11, 13] rely on a state machine replication protocol, like MultiPaxos [14] or Raft [29], to keep multiple replicas of their data in sync. Over time, machines fail, and if too many machines in a state machine replication protocol fail, the protocol grinds to a halt. Thus, state machine replication protocols have to replace failed machines with new machines as the protocol runs, a process known as reconfiguration.

Reconfiguration is an essential component of state machine replication. It is not an optimization or an afterthought. Without a reconfiguration protocol in place, a state machine replication protocol will inevitably stop working; it's just a matter of when. Despite this, reconfiguration has largely been neglected by current academic literature. Researchers have invented dozens of state machine replication protocols, yet many papers either discuss reconfiguration briefly with no evaluation [27, 31–33], propose theoretically safe but inefficient reconfiguration protocols [15, 22], or do not discuss reconfiguration at all [2, 3, 16, 24, 25].

Ignoring reconfiguration has never been ideal, but we have largely been able to get away with it. Historically, state machine replication protocols were deployed on a fixed set of machines, and reconfiguration was used only to replace failed machines with new machines, an infrequent occurrence. This made it easy to leave reconfiguration out of sight, out of mind. Recently, however, systems have become increasingly elastic, and the need for frequent reconfiguration has grown. These elastic systems don't just perform reconfigurations reactively when machines fail; they reconfigure proactively. For example, cloud databases can proactively request more resources to handle workload spikes, and orchestration tools like Kubernetes [12] are making it easier to build these types of elastic systems. Similarly, in environments with short-lived cloud instances (as with serverless computing and spot instances) and in mobile edge and Internet of Things settings, protocols must adapt to a changing set of machines much more frequently. This frequent need for reconfiguration makes it hard to ignore reconfiguration any longer.

In this paper, we present a reconfigurable consensus protocol and a reconfigurable state machine replication protocol: Donut Paxos and Donut MultiPaxos. Compared to existing reconfigurable protocols, our protocols have the following desirable properties.

Little to No Performance Degradation. Donut MultiPaxos can perform a reconfiguration without significantly degrading the throughput or latency of processing client commands. For example, we show that reconfiguration has less than a 4% effect on median throughput and median latency (Section 7).

Quick Reconfiguration. Donut MultiPaxos can perform a reconfiguration quickly. Reconfiguring to a new set of machines takes one round trip of communication in the normal case (Section 4). Empirically, this requires only a few milliseconds within a single data center (Section 7).

Theoretical Insights. Donut Paxos generalizes Vertical Paxos [19], is the first protocol to achieve the theoretical lower bound on Fast Paxos [16] quorum sizes, and corrects errors in DPaxos [28] (Section 6).

Proven Safe. We describe Donut Paxos and Donut MultiPaxos precisely and prove that both are safe (Sections 3, 4, 5, A, B). Unfortunately, this is not often done for reconfiguration protocols [26, 31–33].

In a nutshell, our protocols work by leveraging two key design ideas. The first is to decouple reconfiguration from the standard processing path. Many replication protocols [20, 22, 27, 29] have machines that are responsible for both processing commands and orchestrating reconfigurations. By contrast, Donut Paxos introduces a set of distinguished matchmaker machines that are solely responsible for managing reconfigurations. These matchmakers act as a source of truth; they always know the current configuration. This decoupling, along with a number of novel protocol optimizations, allows us to perform reconfiguration quickly in the background, without degrading performance.

The second design point is to reconfigure across rounds, a technique known as vertical reconfiguration [19]. With vertical reconfiguration, every round of consensus can execute using a different configuration. Replication protocols based on classical MultiPaxos instead assume a totally ordered log of chosen commands and reconfigure across log entries, a technique known as horizontal reconfiguration. Many state machine replication protocols do not have logs and cannot perform horizontal reconfiguration [2, 8, 27, 30, 33]. Vertical reconfiguration, on the other hand, is more generally applicable and can be more easily used by other replication protocols.

2     Background

2.1    System Model

Throughout the paper, we assume an asynchronous network model in which messages can be arbitrarily dropped, delayed, and reordered. We assume machines can fail by crashing but do not act maliciously. We assume that machines operate at arbitrary speeds, and we do not assume clock synchronization. We assume a discovery service that nodes can use to find each other, but do not require that this service be strongly consistent. A node can safely communicate with outdated nodes. A system like DNS would suffice. Every protocol discussed in this paper assumes (for liveness) that at most f machines will fail for some configurable f.

2.2    Paxos

A consensus protocol is a protocol that selects a single value from a set of proposed values. Paxos [14, 17] is one of the oldest and most popular consensus protocols. A Paxos deployment that tolerates f faults consists of an arbitrary number of clients, f + 1 nodes called proposers, and 2f + 1 nodes called acceptors, as illustrated in Figure 1. To reach consensus on a value, an execution of Paxos is divided into a number of rounds, each round having two phases: Phase 1 and Phase 2. Every round is orchestrated by a single pre-determined proposer. The set of rounds can be any unbounded, totally ordered set. It is common to let the set of rounds be the set of lexicographically ordered integer pairs (r, id) where r is an integer and id is a unique proposer id, where a proposer is responsible for executing every round that contains its id.

Figure 1: Paxos communication diagram (f = 1). Panels: (a) Phase 1, (b) Phase 2. The diagram shows clients c1 to c3, f + 1 proposers p1 and p2, and 2f + 1 acceptors a1 to a3.

When a proposer executes a round, say round i, it attempts to get some value x chosen in that round. Paxos is a consensus protocol, so it must only choose a single value. Thus, Paxos must ensure that if a value x is chosen in round i, then no other value besides x can ever be chosen in any round less than i. This is the purpose of Paxos' two phases. In Phase 1 of round i, the proposer contacts the acceptors to (a) learn of any value that may have already been chosen in any round less than i and (b) prevent any new values from being chosen in any round less than i. In Phase 2, the proposer proposes a value to the acceptors, and the acceptors vote on whether or not to choose it. In Phase 2, the proposer will only propose a value x if it learned through Phase 1 that no other value has been or will be chosen in a previous round.

More concretely, Paxos executes as follows, as illustrated in Figure 1. When a client wants to propose a value x, it sends x to a proposer p. Upon receiving x, p begins executing one round of Paxos, say round i. First, it executes Phase 1. It sends Phase1A⟨i⟩ messages to the acceptors. An acceptor ignores a Phase1A⟨i⟩ message if it has already received a message in a larger round. Otherwise, it replies with a Phase1B⟨i, vr, vv⟩ message containing the largest round vr in which the acceptor voted and the value it voted for, vv. If the acceptor hasn't voted yet, then vr = −1 and vv = null. When the proposer receives Phase1B messages from a majority of the acceptors, Phase 1 ends and Phase 2 begins.

At the start of Phase 2, the proposer uses the Phase1B messages that it received in Phase 1 to select a value x such that no value other than x has been or will be chosen in any round less than i. Specifically, x is the vote value associated with the largest received vote round, or any value if no acceptor had voted (see [17] for details). Then, the proposer sends Phase2A⟨i, x⟩ messages to the acceptors. An acceptor ignores a Phase2A⟨i, x⟩ message if it has already received a message in a larger round. Otherwise, it votes for x and sends back a Phase2B⟨i⟩ message to the proposer. If a majority of acceptors vote for the value, then the value is chosen, and the proposer informs the client.

2.3      Flexible Paxos

Paxos deploys a set of 2f + 1 acceptors, and proposers communicate with at least a majority of the acceptors in Phase 1 and in Phase 2. Flexible Paxos [10] is a Paxos variant that generalizes the notion of a majority to that of a quorum.
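As a concrete baseline for the majority-based protocol that Flexible Paxos generalizes, the round structure of Section 2.2 can be sketched in a few lines of Python. This is a minimal, single-process illustration under simplifying assumptions that are ours rather than the paper's: rounds are plain integers, messages are synchronous method calls that are never dropped, and the proposer contacts every acceptor instead of only a majority. The names Acceptor and run_round are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    """One Paxos acceptor, following the prose rules of Section 2.2."""
    r: int = -1     # largest round seen in any message
    vr: int = -1    # largest round voted in (-1 means "never voted")
    vv: Any = None  # value voted for in round vr

    def phase1a(self, i: int) -> Optional[Tuple[int, int, Any]]:
        # Ignore the message if a larger round has already been seen.
        if i < self.r:
            return None
        self.r = i
        return (i, self.vr, self.vv)  # stands in for Phase1B<i, vr, vv>

    def phase2a(self, i: int, x: Any) -> Optional[int]:
        # Ignore the message if a larger round has already been seen.
        if i < self.r:
            return None
        self.r, self.vr, self.vv = i, i, x  # vote for x in round i
        return i  # stands in for Phase2B<i>


def run_round(acceptors: List[Acceptor], i: int, x: Any) -> Optional[Any]:
    """Run round i proposing x; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect Phase1B replies from at least a majority.
    replies = [m for m in (a.phase1a(i) for a in acceptors) if m is not None]
    if len(replies) < majority:
        return None

    # Adopt the vote value of the largest vote round, if any acceptor voted.
    _, k, vv = max(replies, key=lambda m: m[1])
    if k != -1:
        x = vv

    # Phase 2: the value is chosen once a majority votes for it.
    votes = [m for m in (a.phase2a(i, x) for a in acceptors) if m is not None]
    return x if len(votes) >= majority else None
```

With three fresh acceptors, run_round(acceptors, 0, "a") chooses "a", and a later run_round(acceptors, 1, "b") returns "a" again, because Phase 1 of round 1 learns of the earlier vote and the proposer must re-propose it.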


Specifically, Flexible Paxos introduces the notion of a configuration C = (A; P1; P2). A is a set of acceptors. P1 and P2 are sets of quorums, where each quorum is a subset of A. A configuration satisfies the property that every quorum in P1 (known as a Phase 1 quorum) intersects every quorum in P2 (known as a Phase 2 quorum). For a configuration to tolerate f failures, there must exist some Phase 1 quorum and some Phase 2 quorum of non-failed machines despite an arbitrary set of f failures.

Flexible Paxos is identical to Paxos with the exception that proposers now communicate with an arbitrary Phase 1 quorum in Phase 1 and an arbitrary Phase 2 quorum in Phase 2. In the remainder of this paper, we assume that all protocols operate using quorums from an arbitrary configuration rather than majorities from a fixed set of 2f + 1 acceptors.
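The two conditions on a configuration (quorum intersection and tolerance of f failures) can be checked mechanically. The following Python sketch is illustrative only: the names Configuration, quorums_of_size, is_valid, and tolerates are hypothetical, and quorum sets are enumerated explicitly, which is practical only for small deployments.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet

Quorum = FrozenSet[str]


@dataclass(frozen=True)
class Configuration:
    """A Flexible Paxos configuration C = (A; P1; P2)."""
    acceptors: FrozenSet[str]  # A
    p1: FrozenSet[Quorum]      # Phase 1 quorums
    p2: FrozenSet[Quorum]      # Phase 2 quorums


def quorums_of_size(acceptors: FrozenSet[str], k: int) -> FrozenSet[Quorum]:
    """All subsets of `acceptors` with exactly k members."""
    return frozenset(frozenset(c) for c in combinations(sorted(acceptors), k))


def is_valid(c: Configuration) -> bool:
    """Every quorum is a subset of A, and every P1 quorum meets every P2 quorum."""
    subsets_ok = all(q <= c.acceptors for q in c.p1 | c.p2)
    intersect_ok = all(q1 & q2 for q1 in c.p1 for q2 in c.p2)
    return subsets_ok and intersect_ok


def tolerates(c: Configuration, f: int) -> bool:
    """True if some P1 and some P2 quorum survive every set of f failures."""
    f = min(f, len(c.acceptors))
    for failed in combinations(sorted(c.acceptors), f):
        alive = c.acceptors - set(failed)
        if not any(q <= alive for q in c.p1):
            return False
        if not any(q <= alive for q in c.p2):
            return False
    return True
```

For example, with four acceptors, taking all 3-member subsets as Phase 1 quorums and all 2-member subsets as Phase 2 quorums yields a valid configuration that tolerates one failure, since any 3-member subset and any 2-member subset of four acceptors must intersect. Checking only failure sets of size exactly f suffices, because fewer failures leave a superset of machines alive.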

3     Donut Paxos

We now present Donut Paxos. To ease understanding, we first describe a simplified version of Donut Paxos that is easy to understand but is also naively inefficient. We then upgrade the protocol to the complete, efficient version by way of a number of optimizations.

3.1    Overview and Intuition

Donut Paxos is largely identical to Paxos. Like Paxos, a Donut Paxos deployment includes an arbitrary number of clients, a set of at least f + 1 proposers, and some set of acceptors, as illustrated in Figure 2. Paxos assumes that a single, fixed configuration of acceptors is used for every round. The big difference between Paxos and Donut Paxos is that Donut Paxos allows every round to have a different configuration of acceptors. Round 0 may use some configuration C_0, while round 1 may use some completely different configuration C_1. This idea was first introduced by Vertical Paxos [19].

Figure 2: Donut Paxos (f = 1). Message legend: 2/3 Matchmaking phase, 4/5 Phase 1, 6/7 Phase 2.

Recall from Section 2 that a Paxos proposer in round i executes Phase 1 in order to (1) learn of any value that may have been chosen in a round less than i and (2) prevent any new values from being chosen in any round less than i. To do so, the proposer contacts the fixed set of acceptors. A Donut Paxos proposer must also execute Phase 1 to establish that these two properties hold. The difference is that there is no longer a single fixed configuration of acceptors to contact. Instead, a Donut Paxos proposer has to contact all of the configurations used in rounds less than i.

However, every round may use a different configuration of acceptors, so how does the proposer of round i know which acceptors to contact in Phase 1? To resolve this question, a Donut Paxos deployment also includes a set of 2f + 1 matchmakers. When a proposer begins executing round i, it selects a configuration C_i. It sends the configuration C_i to the matchmakers, and the matchmakers reply with the configurations used in previous rounds. We call this the Matchmaking phase. The proposer then executes Phase 1 of Paxos with these prior configurations, and then executes Phase 2 with configuration C_i, as illustrated in Figure 2. At first, the extra round trip of communication with the matchmakers and the large number of configurations in Phase 1 make Donut Paxos look slow. This is for ease of explanation. Later, we will eliminate these costs (Sections 3.4 to 3.6).

3.2     Details

Every matchmaker maintains a log L of configurations indexed by round. That is, L[i] stores the configuration of round i. When a proposer receives a request x from a client and begins executing round i, it first selects a configuration C_i to use in round i. It then sends a MatchA⟨i, C_i⟩ message to all of the matchmakers.

When a matchmaker receives a MatchA⟨i, C_i⟩ message, it checks to see if it had previously received a MatchA⟨j, C_j⟩ message for some round j ≥ i. If so, the matchmaker ignores the MatchA⟨i, C_i⟩ message. Otherwise, it inserts C_i in log entry i and computes the set H_i of previous configurations in the log: $H_i = \{(j, C_j) \mid j < i,\ C_j \in L\}$. It then replies to the proposer with a MatchB⟨i, H_i⟩ message. Matchmaker pseudocode is given in Algorithm 1. An example execution of a matchmaker is illustrated in Figure 3.

When the proposer in round i receives MatchB⟨i, H_i^1⟩, ..., MatchB⟨i, H_i^{f+1}⟩ from f + 1 matchmakers, it computes $H_i = \bigcup_{j=1}^{f+1} H_i^j$. For example, with f = 1 and i = 2, if the proposer in round 2 receives MatchB⟨2, {(0, C_0)}⟩ and MatchB⟨2, {(1, C_1)}⟩, it computes H_2 = {(0, C_0), (1, C_1)}. Note that every round is statically assigned to a single proposer and that a proposer selects a single configuration for a


round, so if two matchmakers return configurations for the same round, they are guaranteed to be the same.

   Log entry    (a)    (b)    (c)    (d)
   3                                 C_3
   2                          C_2    C_2
   1
   0                   C_0    C_0    C_0

Figure 3: A matchmaker's log over time. (a) Initially, the matchmaker's log is empty. (b) Then, the matchmaker receives MatchA⟨0, C_0⟩. It inserts C_0 in log entry 0 and returns MatchB⟨0, ∅⟩ since the log does not contain any configuration in any round less than 0. (c) The matchmaker then receives MatchA⟨2, C_2⟩. It inserts C_2 in log entry 2 and returns MatchB⟨2, {(0, C_0)}⟩. (d) It then receives MatchA⟨3, C_3⟩, inserts C_3 in log entry 3, and returns MatchB⟨3, {(0, C_0), (2, C_2)}⟩. At this point, if the matchmaker were to receive MatchA⟨1, C_1⟩, it would ignore it.

Algorithm 1 Matchmaker Pseudocode
State: a log L indexed by round, initially empty
 1: upon receiving MatchA⟨i, C_i⟩ from proposer p do
 2:     if ∃ a configuration C_j in round j ≥ i in L then
 3:         ignore the MatchA⟨i, C_i⟩ message
 4:     else
 5:         H_i ← {(j, C_j) | C_j ∈ L}
 6:         L[i] ← C_i
 7:         send MatchB⟨i, H_i⟩ to p

The proposer then ends the Matchmaking phase and begins Phase 1. It sends Phase1A messages to every acceptor in every configuration in H_i and waits to receive Phase1B messages from a Phase 1 quorum from every configuration. Using the previous example, the proposer sends Phase1A messages to every acceptor in C_0 and C_1 and waits for Phase1B messages from a Phase 1 quorum of C_0 and a Phase 1 quorum of C_1. The proposer then runs Phase 2 with C_i.

Acceptor and proposer pseudocode are shown in Algorithm 2 and Algorithm 3 respectively. To keep things simple, we assume that round numbers are integers, but generalizing to an arbitrary totally ordered set is straightforward. A Donut Paxos acceptor is identical to a Paxos acceptor. A Donut Paxos proposer is nearly identical to a Flexible Paxos proposer with the exception of the Matchmaking phase and the configurations used in Phase 1 and Phase 2. For clarity of exposition, we omit straightforward details surrounding re-sending dropped messages and nacking ignored messages.

Algorithm 2 Acceptor Pseudocode
State: the largest seen round r, initially −1
State: the largest round vr voted in, initially −1
State: the value vv voted for in round vr, initially null
 1: upon receiving Phase1A⟨i⟩ from p with i > r do
 2:     r ← i
 3:     send Phase1B⟨i, vr, vv⟩ to p
 4: upon receiving Phase2A⟨i, x⟩ from p with i ≥ r do
 5:     r, vr, vv ← i, i, x
 6:     send Phase2B⟨i⟩ to p

Algorithm 3 Proposer Pseudocode. Modifications to a Paxos proposer are underlined and shown in blue.
State: a value x, initially null
State: a round i, initially −1
State: the configuration C_i for round i, initially null
State: the prior configurations H_i for round i, initially null
 1: upon receiving value y from a client do
 2:     i ← next largest round owned by this proposer
 3:     x ← y
 4:     C_i ← an arbitrary configuration
 5:     send MatchA⟨i, C_i⟩ to all of the matchmakers
 6: upon receiving MatchB⟨i, H_i^1⟩, ..., MatchB⟨i, H_i^{f+1}⟩ from f + 1 matchmakers do
 7:     H_i ← ⋃_{j=1}^{f+1} H_i^j
 8:     send Phase1A⟨i⟩ to every acceptor in H_i
 9: upon receiving Phase1B⟨i, −, −⟩ from a Phase 1 quorum from every configuration in H_i do
10:     k ← the largest vr in any Phase1B⟨i, vr, vv⟩
11:     if k ≠ −1 then
12:         x ← the corresponding vv in round k
13:     send Phase2A⟨i, x⟩ to every acceptor in C_i
14: upon receiving Phase2B⟨i⟩ from a Phase 2 quorum do
15:     x is chosen, inform the client

3.3     Proof of Safety

We now prove that Donut Paxos is safe; i.e., every execution of Donut Paxos chooses at most one value.

Proof. Our proof is based on the Paxos safety proof in [16]. We prove, for every round i, the statement P(i): "if a proposer proposes a value v in round i (i.e., sends a Phase2A message for value v in round i), then no value other than v has been or will be chosen in any round less than i." At most one value is ever proposed in a given round, so at most one value is ever chosen in a given round. Thus, P(i) suffices to prove that Donut Paxos is safe for the following reason. Assume for contradiction that Donut Paxos chooses distinct values x and y in rounds j and i with j < i. Some proposer must have proposed y in round i, so P(i) assures us that no value other


than y could have been chosen in round j. But, x was chosen, a contradiction.

We prove P(i) by strong induction on i. P(0) holds vacuously because there are no rounds less than 0. For the general case P(i), we assume P(0), ..., P(i − 1). We perform a case analysis on the proposer's pseudocode (Algorithm 3). Either k is −1 or it is not (line 11). First, assume it is not. In this case, the proposer proposes x, the value proposed in round k (line 12). We perform a case analysis on round j to show that no value other than x has been or will be chosen in any round j < i.

Case 1: j > k. We show that no value has been or will be chosen in round j. Recall that at the end of the Matchmaking phase, the proposer computed the set H_i of prior configurations using responses from a set M_i of f + 1 matchmakers. Either H_i contains a configuration C_j in round j or it doesn't.

First, suppose it does. Then, the proposer sent Phase1A⟨i⟩ messages to all of the acceptors in C_j. A Phase 1 quorum of these acceptors, say Q, all received Phase1A⟨i⟩ messages and replied with Phase1B messages. Thus, every acceptor in Q set its round r to i, and in doing so, promised to never vote in any round less than i. Moreover, none of the acceptors in Q had voted in any round greater than k. So, every acceptor in Q has not voted and never will vote in round j. For a value v′ to be chosen in round j, it must receive votes from some Phase 2 quorum Q′ of round j acceptors. But, Q and Q′ necessarily intersect, so this is impossible. Thus, no value has been or will be chosen in round j.

Now suppose that H_i does not contain a configuration for round j. H_i is the union of f + 1 MatchB messages from the f + 1 matchmakers in M_i. Thus, if H_i does not contain a configuration for round j, then none of the MatchB messages did either. This means that for every matchmaker m ∈ M_i, when m received MatchA⟨i, C_i⟩, it did not contain a configuration for round j in its log. Moreover, by processing the MatchA⟨i, C_i⟩ request, the matchmaker is guaranteed to never process a MatchA⟨j, C_j⟩ request in the future. Thus, every matchmaker in M_i has not processed a MatchA re-

3.4    Garbage Collection (How)

We've discussed how a proposer can change its round and introduce a new configuration. Now, we explain how to shut down old configurations. At the beginning of round i, a proposer p executes the Matchmaking phase and computes a set H_i of configurations in rounds less than i. The proposer then executes Phase 1 with the acceptors in these configurations. Assume H_i contains a configuration C_j for a round j < i. If we prematurely shut down the acceptors in C_j, then proposer p will get stuck in Phase 1, waiting for Phase1B messages from a quorum of nodes that have been shut down. Therefore, we cannot shut down the acceptors in a configuration C_j until we are sure that the matchmakers will never again return C_j during the Matchmaking phase.

Thus, we extend Donut Paxos to allow matchmakers to garbage collect configurations from their logs, ensuring that the garbage collected configurations will not be returned during any future Matchmaking phase. More concretely, a proposer p can now send a GarbageA⟨i⟩ command to the matchmakers instructing them to garbage collect all configurations in rounds less than i. When a matchmaker receives a GarbageA⟨i⟩ message, it deletes log entry L[j] for every round j < i. It then updates a garbage collection watermark w to the maximum of w and i and sends back a GarbageB⟨i⟩ message to the proposer. See Algorithm 4.

Algorithm 4 Matchmaker Pseudocode (with GC). Changes to Algorithm 1 are underlined and shown in blue.
State: a log L indexed by round, initially empty
State: a garbage collection watermark w, initially 0
 1: upon receiving GarbageA⟨i⟩ from proposer p do
 2:     delete L[j] for all j < i
 3:     w ← max(w, i)
 4:     send GarbageB⟨i⟩ to p
 5: upon receiving MatchA⟨i, C_i⟩ from proposer p do
 6:     if i < w or ∃ C_j in round j ≥ i in L then
 7:         ignore the MatchA⟨i, C_i⟩ message
 8:     else
 9:         H_i ← {(j, C_j) | C_j ∈ L}
quest in round j and never will. For a value to be chosen
                                                                        10:         L[i] ← Ci
in round j, the proposer executing round j must first receive
                                                                        11:         send M ATCH Bhi, w, Hi i to p
replies from f + 1 matchmakers, say M j , in round j. But, Mi
and M j necessarily intersect, so this is impossible. Thus, no
value has been or will be chosen in round j.                               We also update the Matchmaking phase in three ways.
                                                                        First, a matchmaker ignores a M ATCH Ahi,Ci i message if
  Case 2: j = k. In a given round, at most one value is pro-
                                                                        i has been garbage collected (i.e. if i < w). Second, a
posed, let alone chosen. x is the value proposed in round k,
                                                                        matchmaker returns its garbage collection watermark w in
so no other value could be chosen in round k.
                                                                        every M ATCH B that it sends. Third, when a proposer
  Case 3: j < k. By induction, P(k) states that no value other                                                                         f +1
                                                                        receives M ATCH Bhi, w1 , Hi1 i, . . ., M ATCH Bhi, w f +1 , Hi i
than x has been or will be chosen in any round less than k.                                                                      f +1 j
                                                                        from f + 1 matchmakers, it again computes Hi = ∪ j=1 Hi . It
This includes round j.                                                                             f +1
                                                                        then computes w = max j=1 w j and prunes every configura-
  Finally, if k is −1, then we are in the same situation as in          tion in Hi in a round less than w. In other words, if any of the
Case 1. No value has or will be chosen in a round j < i.                 f + 1 matchmakers have garbage collected round j, then the

proposer also garbage collects round j.

Once a proposer receives GarbageB⟨i⟩ messages from at least f + 1 matchmakers M, it is
guaranteed that all future Matchmaking phases will not include any configuration in any round
less than i. Why? Consider a future Matchmaking phase run with f + 1 matchmakers M′. M and M′
intersect, so some matchmaker in the intersection has a garbage collection watermark at least
as large as i. Thus, once a configuration has been garbage collected by f + 1 matchmakers, we
can shut down the acceptors in the configuration.

3.5    Garbage Collection (When)

Once a configuration has been garbage collected, it is safe to shut it down, but when is it
safe to garbage collect a configuration? It is not always safe. For example, if we
prematurely garbage collect configuration C_j in round j, a future proposer in round i > j
may not learn about a value v chosen in round j and then erroneously get a value other than v
chosen in round i. There are three situations in which it is safe for a proposer p_i in round
i to issue a GarbageA⟨i⟩ command. We explain the three situations and provide intuition on
why they are safe. Later, we’ll see that all three scenarios are important for Donut
MultiPaxos. See Section A for a safety proof.

Scenario 1. If the proposer p_i gets a value x chosen in round i, then it can safely issue a
GarbageA⟨i⟩ command. Why? When a proposer p_j in round j > i executes Phase 1, it will learn
about the value x and propose x in Phase 2. But first, it must establish that no value other
than x has been or will be chosen in any round less than j. The proposer p_i already
established this fact for all rounds less than i, so any communication with the
configurations in these rounds is redundant. Thus, we can garbage collect them.

Scenario 2. If the proposer p_i executes Phase 1 in round i and finds k = −1 (see Algorithm
3), then it can safely issue a GarbageA⟨i⟩ command. Recall that if k = −1, then no value has
been or will be chosen in any round less than i. This situation is similar to Scenario 1. Any
future proposer p_j in round j > i does not have to redundantly communicate with the
configurations in rounds less than i since p_i already established that no value has been
chosen in these rounds.

Scenario 3. If the proposer p_i learns that a value x has already been chosen and has been
stored on f + 1 non-acceptor machines (e.g., f + 1 proposers), then the proposer can safely
issue a GarbageA⟨i⟩ command after it informs a Phase 2 quorum of acceptors in C_i of this
fact. Any future proposer p_j in round j > i will contact a Phase 1 quorum of C_i and
encounter at least one acceptor that knows the value x has already been chosen. When this
acceptor informs p_j that a value x has already been chosen, p_j stops executing the protocol
entirely and simply fetches the value x from one of the f + 1 machines that store the value.
Note that storing the value on f + 1 machines ensures that some machine will store the value
despite f failures. The decision of exactly which f + 1 machines is not important.

Later, we’ll extend this garbage collection protocol to Donut MultiPaxos (Section 4) and see
empirically that matchmakers usually return just a single configuration (Section 7).

3.6    Optimizations

We now present a couple of protocol optimizations. First, note that a proposer can
proactively run the Matchmaking phase in round i before it hears from a client. This is
similar to proactively executing Phase 1, a standard optimization [9]. We call this
optimization proactive matchmaking.

Second, assume that the proposer in round i has executed the Matchmaking phase and Phase 1.
Through Phase 1, it finds that k = −1 and thus learns that no value has been or will be
chosen in any round less than i (see the safety proof above). Assume that before executing
Phase 2, the proposer transitions from round i to round i + 1 as part of a reconfiguration.
After executing the Matchmaking phase in round i + 1, the proposer can skip Phase 1 and
proceed directly to Phase 2. Why? The proposer established in round i that no value has been
or will be chosen in any round less than i. Moreover, because it did not run Phase 2 in round
i, it also knows that no value has been or will be chosen in round i. Together, these imply
that no value has been or will be chosen in any round less than i + 1. Normally, the proposer
would run Phase 1 in round i + 1 to establish this fact, but since it has already established
it, it can instead proceed directly to Phase 2. We call this optimization Phase 1 bypassing.

Phase 1 bypassing depends on a proposer being the leader of round i and the leader of the
next round i + 1. We can construct a set of rounds such that this is always the case. Let the
set of rounds be the set of lexicographically ordered tuples (r, id, s) where r and s are
both integers and id is a proposer id. A proposer is responsible for all the rounds that
contain its id. With this set of rounds, the proposer p in round (r, p, s) always owns the
next round (r, p, s + 1). For example, given two proposers a and b, we have the following
ordering on rounds:

      (0, a, 0) < (0, a, 1) < (0, a, 2) < (0, a, 3) < ···
    < (0, b, 0) < (0, b, 1) < (0, b, 2) < (0, b, 3) < ···
    < (1, a, 0) < (1, a, 1) < (1, a, 2) < (1, a, 3) < ···

In the next section, we’ll see that this optimization is essential for implementing Donut
MultiPaxos with good performance. Also note that this optimization is not particular to Donut
Paxos. Paxos and MultiPaxos can both take advantage of this optimization.
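
As a minimal sketch of the round scheme described above, the following Python snippet models
a round as a tuple (r, id, s) and relies on Python's built-in lexicographic comparison of
tuples. The names Round, next_owned_round, and takeover_round are illustrative assumptions
and do not appear in the protocol description. In particular, takeover_round is only one
plausible way for a different proposer to choose a larger round that it owns.

```python
from typing import NamedTuple


class Round(NamedTuple):
    """A round (r, id, s); NamedTuples compare lexicographically, like tuples."""
    r: int
    proposer: str
    s: int


def next_owned_round(rnd: Round) -> Round:
    """The next round (r, id, s + 1), owned by the same proposer.

    Moving to this round is what lets a leader reconfigure while remaining
    the leader, the precondition of Phase 1 bypassing.
    """
    return rnd._replace(s=rnd.s + 1)


def takeover_round(current: Round, proposer: str) -> Round:
    """Hypothetical helper: a round owned by `proposer` that exceeds `current`."""
    candidate = Round(current.r, proposer, 0)
    if candidate > current:
        return candidate
    return Round(current.r + 1, proposer, 0)


# The ordering from the example with proposers a and b.
rounds = [Round(1, "a", 0), Round(0, "b", 1), Round(0, "a", 3), Round(0, "b", 0)]
ordered = sorted(rounds)
```

Under this sketch, every round (0, a, s) sorts below (0, b, 0), and next_owned_round never
changes the owner, so the same proposer leads both round i and round i + 1.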


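Before moving on to Donut MultiPaxos, the garbage-collecting matchmaker of Algorithm 4 and
the proposer-side pruning rule from Section 3.4 can be sketched in Python as follows. This is
a simplified illustration under stated assumptions, not a definitive implementation: rounds
are plain integers, configurations are opaque values, the log is an in-memory dict, and
messages are modeled as return values. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Matchmaker:
    log: Dict[int, Any] = field(default_factory=dict)  # L, indexed by round
    w: int = 0                                          # GC watermark

    def on_garbage_a(self, i: int) -> Tuple[str, int]:
        """Handle GarbageA<i>: drop rounds below i and raise the watermark."""
        for j in [j for j in self.log if j < i]:
            del self.log[j]
        self.w = max(self.w, i)
        return ("GarbageB", i)

    def on_match_a(self, i: int, config: Any) -> Optional[Tuple[str, int, int, Dict[int, Any]]]:
        """Handle MatchA<i, C_i>; returns None when the message is ignored."""
        if i < self.w or any(j >= i for j in self.log):
            return None
        prior = dict(self.log)       # H_i: configurations of earlier rounds
        self.log[i] = config
        return ("MatchB", i, self.w, prior)


def merge_match_b(replies: List[Tuple[str, int, int, Dict[int, Any]]]) -> Tuple[int, Dict[int, Any]]:
    """Proposer side: union the H_i sets, then prune rounds below the max watermark."""
    w = max(reply[2] for reply in replies)
    merged: Dict[int, Any] = {}
    for _, _, _, prior in replies:
        merged.update(prior)
    return w, {j: c for j, c in merged.items() if j >= w}


# Two matchmakers see rounds 0 and 1; only m1 has processed GarbageA<1>.
m1, m2 = Matchmaker(), Matchmaker()
for m in (m1, m2):
    m.on_match_a(0, "C0")
    m.on_match_a(1, "C1")
m1.on_garbage_a(1)
w, h2 = merge_match_b([m1.on_match_a(2, "C2"), m2.on_match_a(2, "C2")])
```

In this run the proposer of round 2 prunes round 0 even though m2 still returned it, because
one of its f + 1 replies carried a watermark of 1.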
4     Donut MultiPaxos

4.1    MultiPaxos

First, we summarize MultiPaxos. Whereas Paxos is a consensus protocol that agrees on a single
value, MultiPaxos [14,35] is a state machine replication protocol that agrees on a sequence,
or “log” of values. MultiPaxos manages multiple replicas of a state machine. Clients send
state machine commands to MultiPaxos, MultiPaxos places the commands in a totally ordered
log, and state machine replicas execute the commands in log order. By beginning in the same
initial state and executing the same commands in the same order, all deterministic state
machine replicas are kept in sync.

[Figure 4 diagram: clients c1, c2, c3; f + 1 proposers p1, p2; a configuration C of acceptors
a1, a2, a3; f + 1 replicas r1, r2; numbered message flows 1 through 5.]
Figure 4: An example execution of MultiPaxos (f = 1). The leader is adorned with a crown.

To agree on a log of commands, MultiPaxos implements one instance of Paxos for every log
entry. The ith instance of Paxos chooses the command in log entry i. More concretely, a
MultiPaxos deployment that tolerates f faults consists of an arbitrary number of clients, at
least f + 1 proposers, a configuration C of acceptors, and at least f + 1 replicas, as
illustrated in Figure 4.

One of the proposers is elected leader in some round, say round i. We assume the leader knows
that log entries up to and including log entry k_c have already been chosen (e.g., by
communicating with the replicas). We call this log entry the commit index. The leader then
runs Phase 1 of Paxos in round i for every log entry larger than k_c. Note that even though
there are an infinite number of log entries larger than k_c, the leader can execute Phase 1
using a finite amount of information. In particular, the leader sends a single Phase1A⟨i⟩
message that acts as the Phase1A message for every log entry larger than k_c. Also, an
acceptor replies with a Phase1B⟨i, vr, vv⟩ message only for log entries in which the acceptor
has voted. The infinitely many log entries in which the acceptor has not yet voted do not
yield an explicit Phase1B message.

The leader’s knowledge about the log after Phase 1 can be characterized by the commit index
k_c and a pending index k_p with k_c ≤ k_p, as shown in Figure 5. The commit index and
pending index divide the log into three regions: a prefix of chosen log entries (Region 1), a
suffix of unchosen log entries (Region 3), and a middle region of pending log entries
(Region 2). More specifically:

• Region 1 [0, k_c]: The leader knows that a command has been chosen in every log entry less
  than or equal to k_c.

• Region 3 [k_p + 1, ∞): The leader knows that no command has been chosen (in any round less
  than i) in any log entry larger than k_p.

• Region 2 [k_c + 1, k_p]: If there is a command that may have already been chosen, then it
  appears between k_c and k_p. Region 2 may also contain some log entries in which the leader
  knows a value has already been chosen, and it may contain some log entries in which the
  leader knows that no value has been chosen (we call these “holes”).

    Log entry:  0    1    k_c  3    4    k_p  6    7    8
    Command:    a    b    c    d?        e?                  ···
    Region 1 (already chosen): entries 0 to k_c
    Region 2 (maybe chosen):   entries k_c + 1 to k_p
    Region 3 (not chosen):     entries after k_p

Figure 5: A leader’s knowledge of the log after Phase 1.

After Phase 1, the leader sends a Phase2A message for every unchosen log entry in Region 2,
proposing a “no-op” command for the holes. Simultaneously, the leader begins accepting client
requests. When a client wants to propose a state machine command, it sends the command to the
leader. The leader assigns log entries to commands in increasing order, beginning at k_p + 1.
It then runs Phase 2 of Paxos to get the command chosen in that entry in round i. Once the
leader learns that a command has been chosen in a given log entry, it informs the replicas.
Replicas insert chosen commands into their logs and execute the logs in prefix order, sending
the results of execution back to the clients. This execution is illustrated in Figure 4.

It is critical to note that a leader performs Phase 1 of Paxos only once per round, not once
per command. In other words, Phase 1 is not performed during normal operation. It is
performed only when the leader fails and a new leader is elected in a larger round, an
uncommon occurrence.

4.2    Donut MultiPaxos

We first extend Donut Paxos to Donut MultiPaxos with proactive matchmaking but without Phase
1 bypassing or garbage collection. We’ll see how to incorporate these two momentarily. The
extension from Donut Paxos to Donut MultiPaxos is analogous to the extension of Paxos to
MultiPaxos. Donut MultiPaxos reaches consensus on a totally ordered log of state machine
commands, one log entry at a time, using one instance of Donut Paxos for every log entry.

More concretely, a Donut MultiPaxos deployment consists of an arbitrary number of clients, at
least f + 1 proposers, a
set of 2f + 1 matchmakers, a dynamic set of acceptors (one configuration per round), and a
set of at least f + 1 state machine replicas. We assume, as is standard, that a leader
election algorithm is used to select one of the proposers as a stable leader in some round,
say round i. The leader selects a configuration C_i of acceptors that it will use for every
log entry. The mechanism by which the configuration is chosen is an orthogonal concern. A
system administrator, for example, could send the configuration to the leader, or the
configuration could be read from an external service.

The leader then executes the Matchmaking phase in the same way as in Donut Paxos (i.e. it
sends MatchA⟨i, C_i⟩ messages to the matchmakers and awaits MatchB⟨i, H_i⟩ responses). After
the Matchmaking phase completes, the leader executes Phase 1 for every log entry. This is
identical to MultiPaxos, except that the leader uses the configurations returned by the
matchmakers rather than assuming a fixed configuration. Note that proactive matchmaking
allows the leader to execute the Matchmaking phase and Phase 1 before receiving any client
requests.

The leader then enters Phase 2 and operates exactly as it would in MultiPaxos. It executes
Phase 2 with C_i for the log entries in Region 2. Moreover, when it receives a state machine
command from a client, it assigns the command a log entry in Region 3, runs Phase 2 with the
acceptors in C_i, and informs the replicas when the command is chosen. Replicas execute
commands in log order and send the results of executing commands back to the clients.

4.3    Discussion

To reconfigure from some old configuration C_old in round i to some new configuration C_new,
the Donut MultiPaxos leader of round i simply advances to round i + 1 and selects the new
configuration C_new. The new configuration is active immediately after the Matchmaking phase,
a one round trip delay. Note that the acceptors in the new configuration C_new do not have to
undergo any sort of warm up or bootstrapping and do not have to contact any other acceptors
in any other configuration.

The new configuration is active immediately, but it is not safe to deactivate the acceptors
in the old configuration immediately, as we saw in Section 3.5. We extend Donut Paxos’s
garbage collection to Donut MultiPaxos momentarily.

Also note that Donut MultiPaxos does not perform the Matchmaking phase or Phase 1 on the
critical path of normal execution. Similar to how MultiPaxos executes Phase 1 only once per
leader change (and not once per command), Donut MultiPaxos runs the Matchmaking phase and
Phase 1 only when a new leader is elected or when a leader changes its round (e.g., when a
leader transitions from round i to round i + 1 as part of a reconfiguration). In the normal
case (i.e. during Phase 2), Donut MultiPaxos and MultiPaxos are identical, and Donut
MultiPaxos does not introduce any overheads.

Further note that configurations do not have to be unique across rounds. The leader in round
i is free to re-use a configuration C_j that was used in some round j < i.

4.4    Optimization

Ideally, Donut MultiPaxos’ performance would be unaffected by a reconfiguration. The latency
of every client request and the protocol’s overall throughput would remain constant
throughout a reconfiguration. Donut MultiPaxos as we’ve described it so far, however, does
not meet this ideal. During a reconfiguration, a leader must temporarily stop processing
client commands and wait for the reconfiguration to finish before resuming normal operation.

This is illustrated in Figure 6. Figure 6 shows a leader p1 reconfiguring from a
configuration of acceptors C_old consisting of acceptors a1, a2, and a3 in round i to a new
configuration of acceptors C_new consisting of acceptors b1, b2, and b3 in round i + 1. While
the leader performs the reconfiguration, clients continue to send state machine commands to
the leader. We consider such a command and perform a case analysis on when the command
arrives at the leader to see whether or not the command has to be stalled.

Case 1: Matchmaking (Figure 6a). If the leader receives a command during the Matchmaking
phase, then the leader can process the command as normal in round i using the acceptors in
C_old. Even though the leader is executing the Matchmaking phase in round i + 1 and is
communicating with the matchmakers, the acceptors in C_old are oblivious to this and can
process commands in Phase 2 in round i.

Case 2: Phase 1 (Figure 6b). If the leader receives a command during Phase 1, then the leader
cannot process the command. It must delay the processing of the command until Phase 1
finishes. Here’s why. Once an acceptor in C_old receives a Phase1A⟨i + 1⟩ message, it will
reject any future commands in rounds less than i + 1, so the leader is unable to send the
command to C_old. The leader also cannot send the command to C_new in round i + 1 because it
has not yet finished executing Phase 1.

Case 3: Phase 2 (Figure 6c). If the leader receives a command during Phase 2, then the leader
can send the command to the new acceptors in C_new in round i + 1. This is the normal case of
execution.

In summary, any commands received during Phase 1 of a reconfiguration are delayed.
Fortunately, we can eliminate this problem by using Phase 1 bypassing. Consider a leader
performing a reconfiguration from C_i in round i to C_{i+1} in round i + 1. At the end of the
Matchmaking phase and at the beginning of Phase 1 (in round i + 1), let k be the largest log
entry that the leader has assigned to a command. That is, all log entries after entry k are
empty. These log entries satisfy the preconditions of Phase 1 bypassing, so it is safe for
the leader to bypass Phase 1 in round i + 1 for these log entries in the following way. When
a leader receives a command after
the Matchmaking phase, it assigns the command a log entry larger than k, skips Phase 1, and
executes Phase 2 in round i + 1 with C_new immediately.

[Figure 6 diagrams: (a) Matchmaking, (b) Phase 1, (c) Phase 2. Each panel shows clients c1,
c2, c3, proposers p1 and p2, matchmakers m1, m2, m3, old acceptors a1, a2, a3, and new
acceptors b1, b2, b3.]
Figure 6: An example Donut MultiPaxos reconfiguration without Phase 1 bypassing. The leader
p1 reconfigures from the acceptors a1, a2, a3 to the acceptors b1, b2, b3. Client commands
are drawn as gray dashed lines. Note that every subfigure shows one phase of a
reconfiguration using solid colored lines, but the dashed lines show the complete execution
of a client request that runs concurrently with the reconfiguration. For simplicity, we
assume that every proposer also serves as a replica.

With this optimization and the round scheme described in Section 3.6, no state machine
commands are delayed. Commands received during the Matchmaking phase or earlier are chosen in
round i by C_old in log entries up to and including k. Commands received during Phase 1,
Phase 2, or later are chosen in round i + 1 by C_new in log entries k + 1, k + 2, k + 3, and
so on. With this optimization Donut MultiPaxos can be reconfigured with minimal performance
degradation.

4.5     Garbage Collection

Recall that the Donut MultiPaxos leader p_i in round i uses a single configuration C_i for
every log entry. The leader p_i can safely issue a GarbageA⟨i⟩ command to the matchmakers
once it ensures that every log entry satisfies one of the three scenarios described in
Section 3.5. Recall from Figure 5 that at the end of Phase 1 and at the beginning of Phase 2,
the log can be divided into three regions. Each of the three garbage collection scenarios
applies to one of the regions.

Scenario 2 applies to Region 3. These are the log entries for which k = −1. Scenario 1
applies to Region 2, once the leader has successfully chosen commands in all of the log
entries in Region 2. Scenario 3 applies to Region 1 if we make the following adjustments.
First, we deploy 2f + 1 replicas instead of f + 1. Second, the leader ensures that the prefix
of previously chosen log entries is stored on at least f + 1 of the 2f + 1 replicas. Third,
the leader informs a Phase 2 quorum of C_i acceptors that these commands have been stored on
the replicas.

In summary, the leader p_i of round i executes as follows. It executes the Matchmaking phase
to get the prior configurations H_i. It executes Phase 1 with the configurations in H_i. It
enters Phase 2 and chooses commands in Region 2. It informs a Phase 2 quorum of C_i acceptors
once the commands in Region 1 have been stored on f + 1 replicas. It issues a GarbageA⟨i⟩
command to the matchmakers and awaits f + 1 GarbageB⟨i⟩ responses. At this point, all
previous configurations can be shut down.

Note that the leader can begin processing state machine commands from clients as soon as it
enters Phase 2. It does not have to stall commands during garbage collection. Note also that
during normal operation, old configurations are garbage collected very quickly. In Section 7,
we show that H_i almost always contains a single configuration (i.e. C_{i−1}).

5     Reconfiguring Matchmakers

We’ve discussed how Donut MultiPaxos allows us to reconfigure the set of acceptors. In this
section, we discuss how to reconfigure proposers, replicas, and matchmakers.

Reconfiguring proposers and replicas is straightforward. In fact, Donut MultiPaxos
reconfigures proposers and replicas in exactly the same way as MultiPaxos [35], so we do not
discuss it at length. In short, a proposer can be safely added or removed at any time.
Replicas can also be safely added or removed at any time so long as we ensure that commands
replicated on f + 1 replicas remain replicated on f + 1 replicas. For performance, a newly
introduced proposer should contact an existing proposer or replica to learn about the prefix
of already chosen commands, and a newly introduced replica should copy the log from an
existing replica.

Reconfiguring matchmakers is a bit more involved, but still relatively straightforward.
Recall that proposers perform the Matchmaking phase only during a change in round. Thus, for
the vast majority of the time (specifically, when there is a single, stable leader) the
matchmakers are completely idle. This means that the way we reconfigure the matchmakers has
to be safe, but it doesn’t have to be efficient. The matchmak-
ers can be reconfigured at any time between round changes
without any impact on the performance.
   Thus, we use the simplest approach to reconfiguration:
we shut down the old matchmakers and replace them with                    [Figure 8 diagram: log indices 0 to 8 hold a, b, c, N′, d, no-op, no-op, e, f, ···;
new ones, making sure that the new matchmakers’ initial                   earlier entries are chosen with N, later entries with N′.]
state is the same as the old matchmakers’ final state. More               Figure 8: A MultiPaxos log during reconfiguration (α = 4).
concretely, we reconfigure from a set Mold of matchmakers
to a new set Mnew as follows. First, a proposer (or any other
node) sends a STOPA⟨⟩ message to the matchmakers in Mold.                 MultiPaxos seems simple, but the protocol has a number of
When a matchmaker mi receives a STOPA⟨⟩ message, it stops                 hidden subtleties [23]. For example, a newly elected Horizon-
processing messages (except for other STOPA⟨⟩ messages)                   tal MultiPaxos leader with a stale log may not know the latest
and replies with STOPB⟨Li, wi⟩ where Li is mi’s log and wi is             configuration of nodes. It may not even know which config-
its garbage collection watermark. When the proposer receives              uration of nodes to contact to learn the latest configuration
STOPB messages from f + 1 matchmakers, it knows that the                  of nodes. This makes it unclear when it is safe to shut down
matchmakers have effectively been shut down. It computes                  old configurations because a newly elected Horizontal Multi-
w as the maximum of every returned wi . It computes L as                  Paxos leader can be arbitrarily out of date. These subtleties
the union of the returned logs, and removes all entries of L              and the many others described in [23] make Horizontal Mul-
that appear in a round less than w. An example of this log                tiPaxos significantly more complicated than it initially seems.
merging is illustrated in Figure 7.                                       Donut Paxos addresses these subtleties directly. The match-
                                                                          makers can always be used to learn the latest configuration,
                                                                          and our garbage collection protocol details exactly when and
                                                                          how to shut down old configurations safely.
                                                                             Second, horizontal reconfiguration is not generally applica-
                                                                          ble. It is fundamentally incompatible with replication proto-
                                                                          cols that do not have a log. Moreover, researchers are finding
                                                                          that avoiding a log can often be advantageous [2, 15, 27, 33,
                                                                          34, 36]. For example, protocols like Generalized Paxos [15],
                                                                          EPaxos [27], Atlas [8], and Caesar [2] arrange commands in
                                                                          a partially ordered graph instead of a totally ordered log to
Figure 7: An example of merging three matchmaker logs (L0,                exploit commutativity between commands. CASPaxos [33]
L1, and L2) during a matchmaker reconfiguration. Garbage                  maintains a single value, instead of a log or graph, for sim-
collected log entries are shown in red.                                   plicity. Databases like TAPIR [36] avoid ordering transac-
                                                                          tions in a log for improved performance, and databases like
   The proposer then sends L and w to all of the matchmakers              Meerkat [34] do the same to improve scalability. Even some
in Mnew . Each matchmaker adopts these values as its initial              protocols with logs cannot use the ideas behind Horizontal
state. At this point, the matchmakers in Mnew cannot begin                MultiPaxos. For example, Raft cannot safely perform hori-
processing commands yet. Naively, it is possible that two                 zontal reconfigurations [29].
different nodes could simultaneously attempt to reconfigure                  Because these protocols do not have logs, they cannot use
to two disjoint sets of matchmakers, say Mnew and M′new. To
                                                                          MultiPaxos’ horizontal reconfiguration protocol. However,
avoid this, we use an instance of Paxos (the matchmakers in               while none of the protocols have logs, all of them have rounds.
Mold are the acceptors) to choose the new matchmakers Mnew .              This means that the protocols can either use Donut Paxos
See Section B for a safety proof.                                         directly, or at least borrow ideas from Donut Paxos for recon-
                                                                          figuration. For example, we are developing a protocol called
6    Insights and Generality                                              BPaxos that is an EPaxos [27] variant which partially orders
                                                                          commands into a graph. BPaxos is a modular protocol that
MultiPaxos To reconfigure from a set of nodes N to a new                  uses Paxos as a black box subroutine. Due to this modularity,
set of nodes N′, a MultiPaxos leader gets the value N′ chosen             we can directly replace Paxos with Donut Paxos to support
in the log at some index i. All commands in the log starting              reconfiguration. The same idea can also be applied to EPaxos.
at position i + α are chosen using the nodes in N′ instead of             CASPaxos [33] is similar to Paxos and can be extended to
the nodes in N, where α is some configurable parameter. This              Matchmaker CASPaxos in the same way we extended Paxos
protocol is called Horizontal MultiPaxos.                                 to Donut Paxos. These are two simple examples, and we don’t
   Donut MultiPaxos has the following advantages over Hor-                claim that extending Donut Paxos to some of the other more
izontal MultiPaxos. First, the core idea behind Horizontal                complicated protocols is always easy. But, the universality


of rounds makes Donut Paxos an attractive foundation on top             machine replication protocol like MultiPaxos, but this would
of which other non-log based protocols can build their own              be both slow and overly complex. Plus, we would have to
reconfiguration protocols.                                              implement a reconfiguration protocol for the master as well.
   One could argue that these other protocols are not used as           Our matchmakers are analogous to the external master but
much in industry, so it’s not that important for them to have           show that such a master does not require a nested invocation
reconfiguration protocols, but we think the causation is in the         of state machine replication.
reverse direction! Without reconfiguration, these protocols                Third, Vertical Paxos requires that a proposer execute Phase
cannot be used in industry.                                             1 in order to perform a reconfiguration. Thus, Vertical Paxos
   Third, optimizing Horizontal MultiPaxos is not easy. A               cannot be extended to MultiPaxos without causing perfor-
MultiPaxos leader can process at most α unchosen commands               mance degradation during reconfiguration. This is not the
at a time. This makes α an important parameter to tune. If we           case for matchmakers thanks to Phase 1 bypassing.
set α too low, then we limit the protocol’s pipeline parallelism           Fourth, Vertical Paxos does not describe how proposers
and the throughput suffers. Note that a small α reduces the             learn the configurations used in previous rounds and instead
normal case throughput of Horizontal MultiPaxos, not just               assumes that configurations are fixed in advance by an oracle.
the throughput during reconfiguration. If we set α too high,            Donut Paxos shows that this assumption is not necessary, as
then we have to wait a long time for a reconfiguration to               the matchmakers store every configuration.
complete. If we are reconfiguring because of a failed node,
then we might have to endure a long reconfiguration with
reduced throughput. Donut MultiPaxos has no α parameter
to tune. Note that Horizontal MultiPaxos can be implemented             Fast Paxos Fast Paxos [16] is a Paxos variant that shaves off
with an optimization in which we select a very large α and              one network delay from Paxos in the best case, but can have
then get a sequence of α no-ops in the log to force a quick             higher delays if concurrently proposed commands conflict.
reconfiguration. This optimization helps avoid the difficulties         While Paxos quorums consist of f + 1 out of 2 f + 1 acceptors,
of finding a good value of α, but the optimization introduces           Fast Paxos requires larger quorums. Many protocols have
a new set of subtleties into the protocol.                              reduced Fast Paxos quorum sizes a bit, but to date, Fast Paxos
   Horizontal MultiPaxos also requires a Phase 1 and Phase              quorum sizes have remained larger than classic Paxos quorum
2 quorum of acceptors from an old configuration in order to             sizes [8, 27]. Using matchmakers, we can implement Fast
perform a reconfiguration after a leader failure, but Donut             Paxos with a fixed set of f + 1 acceptors (and hence with (f + 1)-
MultiPaxos only requires a Phase 1 quorum. Some read-                   sized quorums). Specifically, we deploy Fast Paxos with f + 1
optimized MultiPaxos variants perform reads against Phase 1             acceptors, with a single unanimous Phase 2 quorum, and with
quorums [5]. These protocols benefit from having very small             singleton Phase 1 quorums. A full description of the protocol
Phase 1 quorums and very large Phase 2 quorums, requiring               and a proof of correctness are given in Section C.
Horizontal MultiPaxos to contact far more nodes than Donut
MultiPaxos during a reconfiguration.
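
As a rough illustration of the Fast Paxos deployment described above, the following Python sketch builds the quorum system from the text (f + 1 acceptors, singleton Phase 1 quorums, one unanimous Phase 2 quorum) and checks that every Phase 1 quorum intersects every Phase 2 quorum. The function names and acceptor labels are hypothetical, and the check covers only this basic intersection property, not the full correctness argument in Section C.

```python
from typing import FrozenSet, List

Quorum = FrozenSet[str]


def singleton_phase1_quorums(acceptors: List[str]) -> List[Quorum]:
    """Every individual acceptor forms a Phase 1 quorum on its own."""
    return [frozenset([a]) for a in acceptors]


def unanimous_phase2_quorums(acceptors: List[str]) -> List[Quorum]:
    """The only Phase 2 quorum is the full acceptor set."""
    return [frozenset(acceptors)]


def quorums_intersect(phase1: List[Quorum], phase2: List[Quorum]) -> bool:
    """True if every Phase 1 quorum shares an acceptor with every Phase 2 quorum."""
    return all(q1 & q2 for q1 in phase1 for q2 in phase2)


f = 2  # number of tolerated failures (illustrative value)
acceptors = [f"a{i}" for i in range(f + 1)]  # hypothetical names: a0, a1, a2

phase1 = singleton_phase1_quorums(acceptors)
phase2 = unanimous_phase2_quorums(acceptors)
print(len(acceptors), quorums_intersect(phase1, phase2))
```

One consequence of this layout, under the assumptions above, is that the single Phase 2 quorum needs every acceptor, so a failed acceptor would have to be replaced through reconfiguration before new values can be chosen.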
   Finally, we clarify that if Horizontal MultiPaxos is imple-          DPaxos DPaxos is a Paxos variant that allows every round
mented with all of its subtleties ironed out, is deployed with          to use a different subset of acceptors from some fixed set of
a good choice of α, and is run with small Phase 2 quorums,              acceptors. Donut Paxos obviates the need for a fixed set of
then it can perform a reconfiguration without performance               nodes. DPaxos’ scope is limited to a single instance of con-
degradation. In this case, Horizontal MultiPaxos and Donut              sensus, whereas Donut MultiPaxos shows how to efficiently
MultiPaxos both reconfigure, in some sense, “optimally”.                reconfigure across multiple instances of consensus simulta-
                                                                        neously. We also discovered that DPaxos’ garbage collection
Vertical Paxos Donut MultiPaxos significantly improves                  algorithm is unsafe. Donut MultiPaxos fixes the bug. See
the practicality of Vertical Paxos [19] in a number of ways.            Section D for details.
First, Vertical Paxos is a consensus protocol, not a state ma-          Cheap Paxos Cheap Paxos [21] is a MultiPaxos variant
chine replication protocol, and it’s not easy to extend Vertical        that consists of a fixed set of f + 1 main acceptors and f
Paxos’ garbage collection protocol to a state machine replica-          auxiliary acceptors. During failure-free execution (the normal
tion protocol. Vertical Paxos garbage collects old configura-           case), only the main acceptors are contacted. The auxiliary
tions in situations similar to Scenario 1 and Scenario 2 from           acceptors perform MultiPaxos’ horizontal reconfiguration
Section 3.5. It does not include Scenario 3. Without this, old          protocol to replace failed main acceptors. As with Fast Paxos,
configurations cannot be garbage collected, which means that            we can deploy Donut MultiPaxos with only f + 1 acceptors,
it is never safe to shut down old configurations.                        f fewer than Cheap Paxos. Donut Paxos does require 2 f +
    Second, Vertical Paxos requires an external master but              1 matchmakers, but matchmakers do not act as acceptors
does not describe how to implement the master in an efficient           and have to process only a single message (i.e., a MATCHA
way. We could implement the master using another state                  message) to perform a reconfiguration.
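
The matchmaker log merging from the matchmaker reconfiguration procedure described earlier (compute w as the maximum returned watermark, take the union of the returned logs, and drop entries in rounds below w) can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: the dictionary representation of a log, the function name merge_matchmaker_state, and the sample values (loosely modeled on Figure 7) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Log = Dict[int, str]  # round -> configuration label (assumed representation)


def merge_matchmaker_state(replies: List[Tuple[Log, int]], f: int) -> Tuple[Log, int]:
    """Combine STOPB replies (log, watermark) into the initial state for the new matchmakers."""
    if len(replies) < f + 1:
        raise ValueError("need STOPB replies from at least f + 1 matchmakers")
    # w is the maximum of the returned garbage collection watermarks.
    w = max(watermark for _, watermark in replies)
    # L is the union of the returned logs. This sketch assumes that logs never
    # disagree about the configuration stored for a given round.
    merged: Log = {}
    for log, _ in replies:
        merged.update(log)
    # Remove entries that appear in a round less than w.
    merged = {rnd: cfg for rnd, cfg in merged.items() if rnd >= w}
    return merged, w


# Sample replies from three matchmakers with f = 1 (values are illustrative).
replies = [
    ({0: "C0", 1: "C1", 4: "C4"}, 0),
    ({2: "C2"}, 2),
    ({1: "C1", 2: "C2"}, 1),
]
L, w = merge_matchmaker_state(replies, f=1)
print(w, sorted(L.items()))
```

With these sample replies the merged watermark is 2, so only the entries for rounds 2 and 4 survive. The resulting pair (L, w) is what the proposer would send to every matchmaker in Mnew as its initial state.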


7     Evaluation

We now evaluate Donut MultiPaxos. Donut MultiPaxos is
implemented in Scala using the Netty networking library.
We deployed Donut MultiPaxos on m5.xlarge AWS EC2
instances within a single availability zone. We deploy Donut
MultiPaxos with f = 1, f + 1 proposers, 2 f + 1 acceptors,
2 f + 1 matchmakers, and 2 f + 1 replicas. For simplicity,
every node is deployed on its own machine, but in practice,
nodes can be physically co-located. In particular, any two
logical roles can be placed on the same machine, so long
as the two roles are not the same. For example, we can co-
locate a leader, an acceptor, a replica, and a matchmaker,
but we can’t co-locate two acceptors (without reducing the
fault tolerance of the system). All of our results hold in a           [Figure 9 plots: latency (ms) and throughput (cmds/second)
co-located deployment as well. For simplicity, we deploy               over time (0:00 to 0:35) for 1, 4, and 8 clients.]
Donut MultiPaxos with a trivial no-op state machine in which
every state machine command is a one byte no-op. All of our            Figure 9: Donut MultiPaxos’ latency and throughput ( f = 1).
results generalize to more complex state machines as well              Median latency is shown using solid lines, while the 95%
(the choice of state machine is orthogonal to reconfiguration).        latency is shown as a shaded region above the median latency.
                                                                       The vertical black lines show reconfigurations. The vertical
                                                                       dashed red line shows an acceptor failure.
7.1    Reconfiguration
Experiment Description. We run a benchmark with 1, 4, and
8 clients. Every client repeatedly proposes a state machine            Table 1. Figure 12 includes violin plots of the same data. The
command, waits to receive a response, and then immediately             white circles show the median values, while the thick black
proposes another command. Every benchmark runs for 35                  rectangles show the 25th and 75th percentiles. For latency, re-
seconds. During the first 10 seconds, we perform no recon-             configuration has little to no impact (roughly 2% changes) on
figurations. From 10 seconds to 20 seconds, the leader recon-          the medians, IQRs, or standard deviations. The one exception
figures the set of acceptors once every second. In practice,           is that the 8 client standard deviation is significantly larger.
we would reconfigure much less often. This is a worst case             This is due to a small number of outliers. Reconfiguration has
stress test for Donut MultiPaxos. For each of the ten reconfig-        little impact on median throughput, with all differences being
urations, the leader selects a random set of 2 f + 1 acceptors         statistically insignificant. The IQRs and standard deviations
from a pool of 2 × (2 f + 1) acceptors. At 25 seconds, we fail         sometimes increase and sometimes decrease. The IQR is al-
one of the acceptors. Five seconds later, the leader performs a        ways less than 1% of the median throughput, and the standard
reconfiguration to replace the failed acceptor. The delay of 5         deviation is always less than 4%.
seconds is completely arbitrary. The leader can reconfigure                For every reconfiguration, the new acceptors become ac-
sooner if desired.                                                     tive within a millisecond. The old acceptors are garbage
   We also perform this experiment with an implementation of           collected within five milliseconds. This means that only one
MultiPaxos with horizontal reconfiguration. As with Donut              configuration is ever returned by the matchmakers. We imple-
MultiPaxos, we deploy MultiPaxos with f + 1 proposers,                 ment Donut MultiPaxos with an optimization called thrifti-
2 f + 1 acceptors, and 2 f + 1 replicas. We set α to 8. Because        ness [27], in which PHASE2A messages are sent to a randomly
α is at least the number of clients, MultiPaxos never stalls           selected Phase 2 quorum, so the throughput and latency de-
because of an insufficiently large α.                                  grade as expected after we fail an acceptor. After we replace
   Results. The latency and throughput of Donut MultiPaxos             the failed acceptor, throughput and latency return to normal
are shown in Figure 9. Throughput and latency are both                 within two seconds.
computed using sliding one second windows. Median latency                  The latency and throughput of MultiPaxos is shown in Fig-
is shown using solid lines, while the 95% latency is shown as             The latency and throughput of MultiPaxos are shown in Fig-
a shaded region above the median latency. The black vertical           ure 10. As with Donut MultiPaxos, MultiPaxos can perform a
lines denote reconfigurations, and the red dashed vertical line        tion. We include the comparison to MultiPaxos for the sake of
denotes the acceptor failure.                                          having some baseline against which we can compare Donut
   The medians, interquartile ranges (IQR), and standard de-           MultiPaxos, but the comparison is shallow. For this reason,
viations of the latency and throughput (a) during the first 10         we do not elaborate on the results much.
seconds and (b) between 10 and 20 seconds are shown in                     While Donut MultiPaxos does provide performance bene-
