HotStuff: BFT Consensus in the Lens of Blockchain

Maofan Yin (1,2), Dahlia Malkhi (2), Michael K. Reiter (2,3), Guy Golan Gueta (2), and Ittai Abraham (2)
(1) Cornell University, (2) VMware Research, (3) UNC-Chapel Hill

arXiv:1803.05069v4 [cs.DC] 2 Apr 2019

                                                    Abstract

We present HotStuff, a leader-based Byzantine fault-tolerant replication protocol for the partially synchronous
model. Once network communication becomes synchronous, HotStuff enables a correct leader to drive the pro-
tocol to consensus at the pace of actual (vs. maximum) network delay—a property called responsiveness—and with
communication complexity that is linear in the number of replicas. To our knowledge, HotStuff is the first par-
tially synchronous BFT replication protocol exhibiting these combined properties. HotStuff is built around a novel
framework that forms a bridge between classical BFT foundations and blockchains. It allows the expression of other
known protocols (DLS, PBFT, Tendermint, Casper), and ours, in a common framework.
    Our deployment of HotStuff over a network with over 100 replicas achieves throughput and latency comparable
to that of BFT-SMaRt, while enjoying linear communication footprint during leader failover (vs. quadratic with BFT-
SMaRt).

1    Introduction
Byzantine fault tolerance (BFT) refers to the ability of a computing system to endure arbitrary (i.e., Byzantine) failures
of its components while taking actions critical to the system’s operation. In the context of state machine replication
(SMR) [33, 46], the system as a whole provides a replicated service whose state is mirrored across n deterministic
replicas. A BFT SMR protocol is used to ensure that non-faulty replicas agree on an order of execution for client-
initiated service commands, despite the efforts of f Byzantine replicas. This, in turn, ensures that the n − f non-faulty
replicas will run commands identically and so produce the same response for each command. As is common, we are
concerned here with the partially synchronous communication model [23], whereby a known bound ∆ on message
transmission holds after some unknown global stabilization time (GST). In this model, n ≥ 3f + 1 is required
for non-faulty replicas to agree on the same commands in the same order (e.g., [10]) and progress can be ensured
deterministically only after GST [25].
    When BFT SMR protocols were originally conceived, a typical target system size was n = 4 or n = 7, deployed
on a local-area network. However, the renewed interest in Byzantine fault-tolerance brought about by its application
to blockchains now demands solutions that can scale to much larger n. In contrast to permissionless blockchains such
as the one that supports Bitcoin, for example, so-called permissioned blockchains involve a fixed set of replicas that
collectively maintain an ordered ledger of commands or, in other words, that support SMR. Despite their permis-
sioned nature, numbers of replicas in the hundreds or even thousands are envisioned (e.g., [41, 28]). Additionally,
their deployment to wide-area networks requires setting ∆ to accommodate higher variability in communication
delays.
The scaling challenge. Since the introduction of PBFT [19], the first practical BFT replication solution in the
partial synchrony model, numerous BFT solutions were built around its core two-phase paradigm. The practical
aspect is that a stable leader can drive a consensus decision in just two rounds of message exchanges. The first phase
guarantees proposal uniqueness through the formation of a quorum certificate (QC) consisting of (n − f ) votes. The
second phase guarantees that the next leader can convince replicas to vote for a safe proposal.
    The algorithm for a new leader to collect information and propose it to replicas—called a view-change—is the
epicenter of replication. Unfortunately, view-change based on the two-phase paradigm is far from simple [37], is
bug-prone [3], and incurs a significant communication penalty for even moderate system sizes. It requires the new
leader to relay information from (n − f ) replicas, each reporting its own highest known QC. Even counting just
authenticators (digital signatures or message authentication codes), conveying a new proposal has a communication
footprint of O(n^3) authenticators in PBFT, and variants that combine multiple authenticators into one via threshold
digital signatures (e.g., [16, 28]) still send O(n^2) authenticators. The total number of authenticators transmitted if
O(n) view-changes occur before a single consensus decision is reached is O(n^4) in PBFT, and even with threshold
signatures is O(n^3). This scaling challenge plagues not only PBFT, but many other protocols developed since then,
e.g., Prime [8], Zyzzyva [32], Upright [20], BFT-SMaRt [11], 700BFT [9], and SBFT [28].
     HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of.
It introduces a second phase that allows replicas to “change their mind” after voting in this phase, without requiring
a leader proof at all. This alleviates the above complexity, and at the same time considerably simplifies the leader
replacement protocol. Last, having (almost) canonized all the phases, it is very easy to pipeline HotStuff, and to
frequently rotate leaders.
     To our knowledge, only BFT protocols in the blockchain arena like Tendermint [13, 14] and Casper [15] fol-
low such a simple leader regime. However, these systems are built around a synchronous core, wherein proposals
are made in pre-determined intervals that must accommodate the worst-case time it takes to propagate messages
over a wide-area peer-to-peer gossip network. In doing so, they forego a hallmark of most practical BFT SMR solu-
tions (including those listed above), namely optimistic responsiveness [41]. Informally, responsiveness requires that a
non-faulty leader, once designated, can drive the protocol to consensus in time depending only on the actual mes-
sage delays, independent of any known upper bound on message transmission delays [7]. More appropriate for
our model is optimistic responsiveness, which requires responsiveness only in beneficial (and hopefully common)
circumstances—here, after GST is reached. Optimistic or not, responsiveness is precluded with designs such as Ten-
dermint/Casper. The crux of the difficulty is that there may exist an honest replica that has the highest QC, but the
leader does not know about it. One can build scenarios where this prevents progress ad infinitum. Indeed, failing to
incorporate necessary delays at crucial protocol steps can result in losing liveness outright, as has been reported in
several existing deployments, e.g., see [2, 17, 1].
Our contributions. To our knowledge, we present the first BFT SMR protocol, called HotStuff, to achieve the
following two properties:
    • Linear View Change: After GST, any correct leader, once designated, sends only O(n) authenticators to
      drive a consensus decision. This includes the case where a leader is replaced. Consequently, communication
      costs to reach consensus after GST are O(n^2) authenticators in the worst case of cascading leader failures.
    • Optimistic Responsiveness: After GST, any correct leader, once designated, needs to wait just for the first
      n − f responses to guarantee that it can create a proposal that will make progress. This includes the case
      where a leader is replaced.
    Another feature of HotStuff is that the cost for a new leader to drive the protocol to consensus is no greater
than that for the current leader. As such, HotStuff supports frequent succession of leaders, which has been argued
to be useful in blockchain contexts for ensuring chain quality [26].
    HotStuff achieves these properties by adding another phase to each view, a small price in latency in return
for considerably simplifying the leader replacement protocol. This exchange incurs only the actual network delays,
which are typically far smaller than ∆ in practice. As such, we expect this added latency to be much smaller than that
incurred by previous protocols that forgo responsiveness to achieve linear view-change. Furthermore, throughput
is not affected due to the efficient pipeline we introduce in Section 5.
    In addition to the theoretical contribution, HotStuff also provides insights into understanding BFT replication in
general and into instantiating the protocol in practice (see Section 6):
    • A framework for BFT replication over graphs of nodes. Safety is specified via voting and commit graph rules.
      Liveness is specified separately via a Pacemaker that extends the graph with new nodes.
    • A casting of several known protocols (DLS [23], PBFT [18], Tendermint [13], Casper [15]) and one new (ours,
      HotStuff), in this framework.
HotStuff has the additional benefit of being remarkably simple, owing in part to its economy of mechanism: There
are only two message types and simple rules to determine how a replica treats each. Safety is specified via voting
and commit rules over graphs of nodes. The mechanisms needed to achieve liveness are encapsulated within a
Pacemaker, cleanly separated from the mechanisms needed for safety. At the same time, it is expressive in that it
allows the representation of several known protocols (DLS [23], PBFT [18], Tendermint [13], and Casper [15]) as
minor variations. In part this flexibility derives from its operation over a graph of nodes, in a way that forms a
bridge between classical BFT foundations and modern blockchains.
    We describe a prototype implementation and a preliminary evaluation of HotStuff. Deployed over a network with
over a hundred replicas, HotStuff achieves throughput and latency comparable to, and sometimes exceeding, those
of mature systems such as BFT-SMaRt, whose code complexity far exceeds that of HotStuff. We further demonstrate
that the communication footprint of HotStuff remains constant in face of frequent leader replacements, whereas
BFT-SMaRt grows quadratically with the number of replicas.

                                          Authenticator complexity
     Protocol                             Correct leader    Leader failure (view-change)    f leader failures    Responsiveness
     DLS [23]                             O(n^4)            O(n^4)                          O(n^4)
     PBFT [19]                            O(n^2)            O(n^3)                          O(f n^3)             ✓
     SBFT [28]                            O(n)              O(n^2)                          O(f n^2)             ✓
     Tendermint [14] / Casper [15]        O(n^2)            O(n^2)                          O(f n^2)
     Tendermint* / Casper*                O(n)              O(n)                            O(f n)
     HotStuff                             O(n)              O(n)                            O(f n)               ✓
     * Signatures can be combined using threshold signatures, though this optimization is not mentioned in their original works.

                                        Table 1: Performance of selected protocols after GST.

2    Related work
Reaching consensus in face of Byzantine failures was formulated as the Byzantine Generals Problem by Lamport et
al. [36], who also coined the term “Byzantine failures”. The first synchronous solution was given by Pease et al. [42],
and later improved by Dolev and Strong [22]. The improved protocol has O(n^3) communication complexity, which
was shown optimal by Dolev and Reischuk [21]. A leader-based synchronous protocol that uses randomness was
given by Katz and Koo [30], showing an expected constant-round solution with (n − 1)/2 resilience.
     Meanwhile, in the asynchronous setting, Fischer et al. [25] showed that the problem is unsolvable determin-
istically in the face of even a single failure. Furthermore, an (n − 1)/3 resilience bound for any
asynchronous solution was proven by Ben-Or [10]. Two approaches were devised to circumvent the impossibility.
One relies on randomness, initially shown by Ben-Or [10], using independently random coin flips by processes until
they happen to converge to consensus. Later works used cryptographic methods to share an unpredictable coin and
drive complexities down to constant expected rounds, and O(n^3) communication [16].
     The second approach relies on partial synchrony, first shown by Dolev, Lynch, and Stockmeyer (DLS) [23]. This
protocol preserves safety during asynchronous periods, and after the system becomes synchronous, DLS guarantees
termination. Once synchrony is maintained, DLS incurs O(n^4) total communication and O(n) rounds per decision.
     State machine replication (SMR) [34, 46] relies on consensus at its core to order client requests so that correct
replicas execute them in this order. The recurring need for consensus in SMR led Lamport to devise Paxos [35], a
protocol that operates an efficient pipeline in which a stable leader drives decisions with linear communication and
one round-trip. A similar emphasis led Castro and Liskov [18] to develop an efficient leader-based Byzantine SMR
protocol named PBFT, whose stable leader requires O(n^2) communication and two round-trips per decision, and the
leader replacement protocol incurs O(n^3) communication. PBFT has been deployed in several systems, including
BFT-SMaRt [11]. Kotla et al. introduced an optimistic linear path into PBFT in a protocol named Zyzzyva [32], which
was utilized in several systems, e.g., Upright [20] and Byzcoin [31]. The optimistic path has linear complexity, while
the leader replacement protocol remains O(n^3). Abraham et al. [3] later exposed a safety violation in Zyzzyva, and
presented fixes [4, 28]. On the other hand, to also reduce the complexity of the protocol itself, Song et al. proposed
Bosco [48], a simple one-step protocol with low latency on the optimistic path, requiring 5f + 1 replicas. SBFT [28]
further reduces the communication complexity by a factor of n by harnessing two methods: a collector-
based communication paradigm by Reiter [44], and signature combining via threshold cryptography on protocol
votes by Cachin et al. [16].
     A leader-based Byzantine SMR protocol that employs randomization was presented by Cachin et al. [43], and
a leaderless variant named HoneyBadgerBFT was developed by Miller et al. [38]. At their core, these randomized
Byzantine solutions employ randomized asynchronous Byzantine consensus, whose best known communication
complexity is O(n^3) (see above), amortizing the cost via batching.
    Bitcoin’s core is a protocol known as Nakamoto Consensus [39], a synchronous protocol with only probabilistic
consensus guarantee and no finality (see analysis in [26, 40, 5]). It operates in a permissionless model where par-
ticipants are unknown, and resilience is kept via Proof-of-Work. As described above, recent blockchain solutions
hybridize Proof-of-Work solutions with classical BFT solutions in various ways [24, 31, 6, 15, 27, 29, 41]. The need
to address rotating leaders in these hybrid solutions and others provides the motivation behind HotStuff.

3    Model
We consider a system consisting of a fixed set of n = 3f + 1 replicas, indexed by i ∈ [n] where [n] = {1, . . . , n}. A
set F ⊂ [n] of up to f = |F | replicas are Byzantine faulty, and the remaining ones are correct. We will often refer
to the Byzantine replicas as being coordinated by an adversary, which learns all internal state held by these replicas
(including their cryptographic keys, see below).
    Network communication is point-to-point, authenticated and reliable: one correct replica receives a message
from another correct replica if and only if the latter sent that message to the former. When we refer to a “broadcast”,
it involves the broadcaster, if correct, sending the same point-to-point messages to all replicas, including itself.
We adopt the partial synchrony model of Dwork et al. [23], where there is a known bound ∆ and an unknown
Global Stabilization Time (GST), such that after GST, all transmissions between two correct replicas arrive within
time ∆. Our protocol will ensure safety always, and will guarantee progress within a bounded duration after GST.
(Guaranteeing progress before GST is impossible [25].) In practice, our protocol will guarantee progress if the system
remains stable (i.e., if messages arrive within ∆ time) for sufficiently long after GST, though assuming that it does
so forever simplifies discussion.
Cryptographic primitives. HotStuff makes use of threshold signatures [47, 16, 12]. In a (k, n)-threshold signa-
ture scheme, there is a single public key held by all replicas, and each of the n replicas holds a distinct private
key. The i-th replica can use its private key to contribute a partial signature ρi ← tsign i (m) on message m.
Partial signatures {ρi }i∈I , where |I| = k and each ρi ← tsign i (m), can be used to produce a digital signature
σ ← tcombine(m, {ρi }i∈I ) on m. Any other replica can verify the signature using the public key and the func-
tion tverify. We require that if ρi ← tsign i (m) for each i ∈ I, |I| = k, and if σ ← tcombine(m, {ρi }i∈I ), then
tverify(m, σ) returns true. However, given oracle access to oracles {tsign i (·)}i∈[n]\F , an adversary who queries
tsign i (m) on strictly fewer than k − f of these oracles has negligible probability of producing a signature σ for the
message m (i.e., such that tverify(m, σ) returns true). Throughout this paper, we use a threshold of k = 2f + 1.
Again, we will typically leave invocations of tverify implicit in our protocol descriptions.
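To make the interface concrete, the following is a minimal Python sketch of the (k, n)-threshold signing API assumed above. The names tsign, tcombine, and tverify follow the paper; the PartialSig record, the Protocol class, and any concrete scheme backing them (e.g., a BLS-based threshold scheme) are assumptions of this sketch rather than part of HotStuff itself.

# Minimal sketch of the (k, n)-threshold signature interface assumed here.
# The method names follow the paper; the concrete scheme backing them is
# left abstract and is an assumption of this sketch.
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class PartialSig:
    signer: int        # index i of the replica that produced this share
    share: bytes       # partial signature rho_i over the message


class ThresholdScheme(Protocol):
    k: int             # threshold; HotStuff uses k = 2f + 1
    n: int             # total number of replicas

    def tsign(self, i: int, m: bytes) -> PartialSig:
        """Replica i signs message m with its private key share."""
        ...

    def tcombine(self, m: bytes, shares: Iterable[PartialSig]) -> bytes:
        """Combine k distinct shares on m into a single signature sigma."""
        ...

    def tverify(self, m: bytes, sigma: bytes) -> bool:
        """Verify sigma on m against the single public key held by all replicas."""
        ...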
    We also require a cryptographic hash function h (also called a message digest function), which maps an arbitrary-
length input to a fixed-length output. The hash function must be collision resistant [45], which informally requires
that the probability of an adversary producing inputs m and m0 such that h(m) = h(m0 ) is negligible. As such,
h(m) can serve as an identifier for a unique input m in the protocol.
Complexity measure. The complexity measure we care about is authenticator complexity, which specifically is
the sum, over all replicas i ∈ [n], of the number of authenticators received by replica i in the protocol to reach
a consensus decision after GST. (Again, before GST, a consensus decision might not be reached at all in the worst
case [25].) Here, an authenticator is either a partial signature or a signature. Authenticator complexity is a useful
measure of communication complexity for several reasons. First, like bit complexity and unlike message complexity,
it hides unnecessary details about the transmission topology. For example, n messages carrying one authenticator
count the same as one message carrying n authenticators. Second, authenticator complexity is better suited than bit
complexity for capturing costs in protocols like ours that reach consensus repeatedly, where each consensus decision
(or each view proposed on the way to that consensus decision) is identified by a monotonically increasing counter.
That is, because such a counter increases indefinitely, the bit complexity of a protocol that sends such a counter
cannot be bounded. Third, since in practice, cryptographic operations to produce or verify digital signatures and to
produce or combine partial signatures are typically the most computationally intensive operations in protocols that
use them, the authenticator complexity provides insight into the computational burden of the protocol, as well.

4     Basic HotStuff
HotStuff solves the State Machine Replication (SMR) problem. At the core of SMR is a protocol for deciding on a
growing log of command requests by clients. A group of state-machine replicas apply commands in sequence order
consistently. A client sends a command request to all replicas, and waits for responses from (f + 1) of them. For the
most part, we omit the client from the discussion, and defer to the standard literature for issues regarding numbering
and de-duplication of client requests.
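As an illustration of this client rule, here is a minimal sketch in which the client broadcasts the command and accepts a result once f + 1 replicas report the same response; send_to_all and next_response are hypothetical transport helpers, not part of the protocol.

from collections import Counter

def submit(cmd, send_to_all, next_response, f):
    # Broadcast the command request to all n replicas.
    send_to_all(cmd)
    tally = Counter()
    while True:
        replica_id, result = next_response()   # blocking receive of one reply
        tally[result] += 1
        if tally[result] >= f + 1:             # f + 1 matching replies: at least one correct replica agrees
            return result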
    The Basic HotStuff solution is presented in Algorithm 2. The protocol works in a succession of views numbered
with monotonically increasing view numbers. Each viewNumber has a unique dedicated leader known to all. Each
replica stores a tree of pending commands as its local data structure. Each tree node contains a proposed command
(or a batch of them), metadata associated with the protocol, and a parent link. The branch led by a given node is
the path from the node all the way to the tree root by visiting parent links. During the protocol, a monotonically
growing branch becomes committed. To become committed, the leader of a particular view proposing the branch
must collect votes from a quorum of (n − f ) replicas in three phases, prepare, pre-commit, and commit.
    A key ingredient in the protocol is a collection of (n − f ) votes over a leader proposal, referred to as a quorum
certificate (or “QC” in short). The QC is associated with a particular node and a view number. The tcombine utility
employs a threshold signature scheme to generate a representation of (n − f ) signed votes as a single authenticator.
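For illustration, a leader might assemble a QC from (n − f) matching votes roughly as in the sketch below, reusing the threshold interface sketched in Section 3. The vote field names (type, view_number, node_digest, partial_sig) are assumptions of the sketch, not the paper's message format.

def try_form_qc(scheme, votes, n, f):
    # `votes` is assumed to already be filtered to matching type, view and
    # node (cf. matchingMsg in Algorithm 1); `scheme` is a ThresholdScheme.
    if len(votes) < n - f:
        return None                                # not yet a quorum
    v0 = votes[0]
    message = repr((v0.type, v0.view_number, v0.node_digest)).encode()
    sigma = scheme.tcombine(message, [v.partial_sig for v in votes[: n - f]])
    return {"type": v0.type, "viewNumber": v0.view_number,
            "node": v0.node_digest, "sig": sigma}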
    Below we give an operational description of the protocol logic by phases, followed by a precise specification in
Algorithm 2, and conclude the section with safety, liveness, and complexity arguments.

4.1    Phases
prepare phase. The protocol for a new leader starts by collecting new-view messages from (n − f ) replicas.
The new-view message is sent by a replica as it transitions into viewNumber (including the first view) and carries
the highest prepareQC that the replica received (⊥ if none), as described below.
    The leader processes these messages in order to select a branch that has the highest preceding view in which
a prepareQC was formed. The leader selects the prepareQC with the highest view, denoted highQC , among the
new-view messages. Because highQC is the highest among (n − f ) replicas, no higher view could have reached a
commit decision. The branch led by highQC .node is therefore safe.
    Collecting new-view messages to select a safe branch may be omitted by an incumbent leader, who may simply
select its own highest prepareQC as highQC . We defer this optimization to Section 6 and only describe a single,
unified leader protocol in this section. Note that, different from PBFT-like protocols, including this step in the leader
protocol is straightforward, and it incurs the same, linear overhead as all the other phases of the protocol, regardless
of the situation.
    The leader uses the createLeaf method to extend the tail of highQC .node with a new proposal. The method
creates a new leaf node as a child and embeds a digest of the parent in the child node. The leader then sends the new
node in a prepare message to all other replicas. The proposal carries highQC for safety justification.
    Upon receiving the prepare message for the current view from the leader, replica r uses the safeNode predicate
to determine whether to accept it. If it is accepted, the replica sends a prepare vote with a partial signature (produced
by tsign r ) for the proposal to the leader.
safeNode predicate. The safeNode predicate is a core ingredient of the protocol. It examines a proposal message
m carrying a QC justification m.justify, and determines whether m.node is safe to accept. The safety rule for accepting
a proposal is that the branch of m.node extends from the currently locked node lockedQC .node. The liveness rule is
that the replica will accept m if m.justify has a higher view than the current lockedQC . The predicate is true as long
as either rule holds.
pre-commit phase. When the leader receives (n − f ) prepare votes for the current proposal curProposal , it
combines them into a prepareQC . The leader broadcasts the prepareQC in pre-commit messages. A replica responds
to the leader with a pre-commit vote carrying a signed digest of the proposal.
commit phase. The commit phase is similar to the pre-commit phase. When the leader receives (n − f ) pre-commit
votes, it combines them into a precommitQC . The leader broadcasts precommitQC in commit messages; replicas
respond to it with a commit vote. Importantly, a replica becomes locked on the precommitQC at this point by setting
its lockedQC to precommitQC (Line 25 of Algorithm 2). This is crucial to guard the safety of the proposal in case it
becomes a consensus decision.

decide phase. When the leader receives (n − f ) commit votes, it combines them into a commitQC . Once
the leader has assembled a commitQC , it sends it in a decide message to all other replicas. Upon receiving a
decide message, a replica considers the proposal embodied in the commitQC a committed decision, and executes
the commands in the committed branch. The replica increments viewNumber and starts the next view.
nextView interrupt. In all phases, a replica waits for some message at view viewNumber for some timeout
period, determined by an auxiliary nextView(viewNumber ) utility. If nextView(viewNumber ) interrupts waiting,
the replica also increments viewNumber and starts the next view.

4.2   Data Structures
Messages. A message m in the protocol has a fixed set of fields that are populated using the Msg() utility shown
in Algorithm 1. m is automatically stamped with curView , the sender’s current view number. Each message has
a type m.type ∈ {new-view, prepare, pre-commit, commit, decide}. m.node contains a proposed node (the leaf
node of a proposed branch). There is an optional field m.justify. The leader always uses this field to carry the QC
for the different phases. Replicas use it in new-view messages to carry the highest prepareQC . Each message sent
in a replica role contains a partial signature m.partialSig by the sender over the tuple ⟨m.type, m.viewNumber ,
m.node⟩, which is added in the voteMsg() utility.
Quorum certificates. A Quorum Certificate (QC) over a tuple ⟨type, viewNumber , node⟩ is a data type that com-
bines a collection of signatures for the same tuple signed by (n − f ) replicas. Given a QC qc, we use qc.type,
qc.viewNumber , qc.node to refer to the matching fields of the original tuple.
Tree and branches. Each command is wrapped in a node that additionally contains a parent link which could be
a hash digest of the parent node. We omit the implementation details from the pseudocode. During the protocol, a
replica delivers a message only after the branch led by the node is already in its local tree. In practice, a recipient
who falls behind can catch up by fetching missing nodes from other replicas. For brevity, these details are also
omitted from the pseudocode. Two branches are conflicting if neither one is an extension of the other. Two nodes
are conflicting if the branches led by them are conflicting.
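The branch relations used above can be sketched as follows; parent links are modeled as direct references here, whereas an implementation could store a hash digest of the parent instead.

class Node:
    def __init__(self, cmd, parent=None):
        self.cmd = cmd
        self.parent = parent          # None for the tree root

def extends(descendant, ancestor):
    # True if `ancestor` lies on the branch led by `descendant`
    # (the path from `descendant` to the root via parent links).
    b = descendant
    while b is not None:
        if b is ancestor:
            return True
        b = b.parent
    return False

def conflicting(a, b):
    # Two nodes conflict if neither branch is an extension of the other.
    return not extends(a, b) and not extends(b, a)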
Bookkeeping variables. A replica makes use of several additional local variables for bookkeeping the protocol
state. (i) a viewNumber , initially 1 and incremented either by finishing a decision or by a nextView interrupt. (ii)
a locked quorum certificate lockedQC , initially ⊥, storing the highest QC for which a replica voted commit. (iii) a
highest QC prepareQC , initially ⊥, storing the highest QC for which a replica voted pre-commit. Additionally, in
order to incrementally execute a committed log of commands, the replica maintains the highest node whose branch
has been executed. This is omitted below for brevity.
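Putting the bookkeeping variables together with the safeNode rule of Algorithm 1, a replica's state can be sketched as below; the QC record layout and the handling of an unset (⊥) lock are assumptions of this sketch, and extends is the branch relation sketched earlier.

from dataclasses import dataclass
from typing import Optional

@dataclass
class QC:
    view_number: int
    node: "Node"                      # Node as in the earlier branch sketch
    # the type field and the combined signature are omitted for brevity

@dataclass
class ReplicaState:
    view_number: int = 1              # incremented on a decision or a nextView interrupt
    locked_qc: Optional[QC] = None    # highest QC this replica voted commit for
    prepare_qc: Optional[QC] = None   # highest QC this replica voted pre-commit for

    def safe_node(self, node, justify: QC) -> bool:
        # safeNode predicate of Algorithm 1; an unset lock accepts by convention here.
        if self.locked_qc is None:
            return True
        return (extends(node, self.locked_qc.node)                     # safety rule
                or justify.view_number > self.locked_qc.view_number)   # liveness rule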

4.3   Protocol Specification
The protocol given in Algorithm 2 is described as an iterated view-by-view loop. In each view, a replica performs
phases in succession based on its role, described as a succession of “as” blocks. A replica can have more than one role.
For example, a leader is also a (normal) replica. Execution of as blocks across roles can proceed concurrently.
The execution of each as block is atomic. A nextView interrupt aborts all operations in any as block, and jumps to
the “Finally” block.

Algorithm 1 Utilities (for replica r).
 1: function Msg(type, node, qc)
 2:    m.type ← type
 3:    m.viewNumber ← curView
 4:    m.node ← node
 5:    m.justify ← qc
 6:    return m
 7: function voteMsg(type, node, qc)
 8:    m ← Msg(type, node, qc)
 9:    m.partialSig ← tsign r (⟨m.type, m.viewNumber , m.node⟩)
10:    return m
11: procedure createLeaf(parent, cmd )
12:    b.parent ← parent
13:    b.cmd ← cmd
14:    return b
15: function QC(type, V )
16:    qc.type ← type
17:    qc.viewNumber ← m.viewNumber : m ∈ V
18:    qc.node ← m.node : m ∈ V
19:    qc.sig ← tcombine(⟨qc.type, qc.viewNumber , qc.node⟩, {m.partialSig | m ∈ V })
20:    return qc
21: function matchingMsg(m, t, v) return (m.type = t) ∧ (m.viewNumber = v)
22: function matchingQC(qc, t, v) return (qc.type = t) ∧ (qc.viewNumber = v)
23: function safeNode(node, qc)
24:    return (node extends from lockedQC .node) ∨ // safety rule
25:          (qc.viewNumber > lockedQC .viewNumber ) // liveness rule

Algorithm 2 Basic HotStuff protocol (for replica r).
 1: for curView ← 1, 2, 3, . . . do
    . prepare phase
 2:     as a leader // r = leader(curView )
        // we assume all replicas send special new-view messages in view “0” to the first leader
 3:           wait for (n − f ) new-view messages: M ← {m | matchingMsg(m, new-view, curView − 1)}
 4:           highQC ← ( arg max_{m∈M} {m.justify.viewNumber} ).justify
 5:           curProposal ← createLeaf(highQC .node, client’s command)
 6:           broadcast Msg(prepare, curProposal , highQC )
 7:       as a replica
 8:           wait for message m : matchingMsg(m, prepare, curView ) from leader(curView )
 9:           if m.node extends from m.justify.node ∧ safeNode(m.node, m.justify) then
10:               send voteMsg(prepare, m.node, ⊥) to leader(curView )
      . pre-commit phase
11:       as a leader
12:           wait for (n − f ) votes: V ← {v | matchingMsg(v, prepare, curView )}
13:           prepareQC ← QC(prepare, V )
14:           broadcast Msg(pre-commit, ⊥, prepareQC )
15:       as a replica
16:           wait for message m : matchingQC(m.justify, prepare, curView ) from leader(curView )
17:           prepareQC ← m.justify
18:           send voteMsg(pre-commit, m.justify.node, ⊥) to leader(curView )
      . commit phase
19:       as a leader
20:           wait for (n − f ) votes: V ← {v | matchingMsg(v, pre-commit, curView )}
21:           precommitQC ← QC(pre-commit, V )
22:           broadcast Msg(commit, ⊥, precommitQC )
23:       as a replica
24:           wait for message m : matchingQC(m.justify, pre-commit, curView ) from leader(curView )
25:           lockedQC ← m.justify
26:           send voteMsg(commit, m.justify.node, ⊥) to leader(curView )
      . decide phase
27:       as a leader
28:           wait for (n − f ) votes: V ← {v | matchingMsg(v, commit, curView )}
29:           commitQC ← QC(commit, V )
30:           broadcast Msg(decide, ⊥, commitQC )
31:       as a replica
32:           wait for message m : matchingQC(m.justify, commit, curView ) from leader(curView )
33:           execute new commands through m.justify.node, respond to clients
      . Finally
34:    nextView interrupt: goto this line if nextView(curView ) is called during “wait for” in any phase
35:    send Msg(new-view, ⊥, prepareQC ) to leader(curView + 1)

4.4    Basic HotStuff Safety, Liveness, and Complexity
Safety.    A quorum certificate qc is valid if tverify(⟨qc.type, qc.viewNumber , qc.node⟩, qc.sig) is true.

Lemma 1. For any valid qc 1 , qc 2 in which qc 1 .type = qc 2 .type and qc 1 .node conflicts with qc 2 .node, we have
qc 1 .viewNumber ≠ qc 2 .viewNumber .
Proof. For a contradiction, suppose qc 1 .viewNumber = qc 2 .viewNumber = v. Because a valid QC can be formed
only with n − f = 2f + 1 votes (i.e., partial signatures) for it, there must be a correct replica who voted twice in the
same phase of v. This is impossible because the pseudocode allows voting only once for each phase in each view.

Theorem 2. If w and b are conflicting nodes, then they cannot be both committed, each by a correct replica.
Proof. We prove this important theorem by contradiction. Let qc 1 denote a valid commitQC (i.e., qc 1 .type =
commit) such that qc 1 .node = w, and qc 2 denote a valid commitQC such that qc 2 .node = b. Denote v1 =
qc 1 .viewNumber and v2 = qc 2 .viewNumber . By Lemma 1, v1 ≠ v2 . W.l.o.g. assume v1 < v2 .
     We will now denote by vs the lowest view higher than v1 for which there is a valid prepareQC , qc s (i.e.,
qc s .type = prepare) where qc s .viewNumber = vs , and qc s .node conflicts with w. Formally, we define the fol-
lowing predicate for any prepareQC :

          E(prepareQC ) := (v1 < prepareQC .viewNumber ≤ v2 ) ∧ (prepareQC .node conflicts with w).

We can now set the first switching point qc s :

                qc s := arg min_{prepareQC} {prepareQC .viewNumber | prepareQC is valid ∧ E(prepareQC )}.

Note that, by assumption such a qc s must exist; for example, qc s could be the prepareQC formed in view v2 .
     Of the correct replicas that contributed tsign r (⟨qc 1 .type, qc 1 .viewNumber , qc 1 .node⟩), let r be the first that
contributed tsign r (⟨qc s .type, qc s .viewNumber , qc s .node⟩); such an r must exist since otherwise, one of qc 1 .sig
and qc s .sig could not have been created. During view v1 , replica r updates its lock lockedQC to a precommitQC on
w at Line 25 of Algorithm 2. Due to the minimality of vs , the lock that replica r has on w is not changed before qc s
is formed. Now consider the invocation of safeNode in the prepare phase of view vs by replica r, with a message
m carrying m.node = qc s .node. By assumption, m.node conflicts with lockedQC .node, and so the disjunct at
Line 24 of Algorithm 1 is false. Moreover, m.justify.viewNumber > v1 would violate the minimality of vs , and so
the disjunct in Line 25 of Algorithm 1 is also false. So, safeNode(m) must return false and r cannot cast a prepare
vote on the conflicting branch in view vs , a contradiction.
Liveness. There are two functions left undefined in the previous section: leader and nextView. Their definition
will not affect safety of the protocol, though they do matter to liveness. Before giving candidate definitions for them,
we first show that after GST, there is a bounded duration Tf such that if all correct replicas remain in view v during
Tf and the leader for view v is correct, then a decision is reached. Below, we say that QCs qc 1 and qc 2 match if qc 1
and qc 2 are valid, qc 1 .node = qc 2 .node, and qc 1 .viewNumber = qc 2 .viewNumber .
Lemma 3. If a correct replica is locked with lockedQC = precommitQC , then at least f + 1 correct replicas voted for
a prepareQC matching lockedQC .

Proof. Suppose replica r is locked on precommitQC . Then, (n − f ) votes were cast for the matching prepareQC in
the prepare phase (Line 10 of Algorithm 2), out of which at least f + 1 were from correct replicas.
Theorem 4. After GST, there exists a bounded time period Tf such that if all correct replicas remain in view v during
Tf and the leader for view v is correct, then a decision is reached.

Proof. Starting in a new view, the leader collects (n − f ) new-view messages and updates its highQC before
broadcasting a prepare message. Suppose among all replicas (including the leader itself), the highest kept lock
is lockedQC = precommitQC ∗ . By Lemma 3, we know there are at least f + 1 correct replicas that voted for a
prepareQC ∗ matching precommitQC ∗ , and have already sent them to the leader in their new-view messages. Thus,
the leader must learn a matching prepareQC ∗ in at least one of these new-view messages and use it as highQC in
its prepare message. By the assumption, all correct replicas are synchronized in their view and the leader is non-
faulty. Therefore, all correct replicas will vote in the prepare phase, since in safeNode, the condition on Line 25
of Algorithm 1 is satisfied (even if the node in the message conflicts with a replica’s stale lockedQC .node, and so
Line 24 is not). Then, after the leader assembles a valid prepareQC for this view, all replicas will vote in all the
following phases, leading to a new decision. After GST, the duration Tf for these phases to complete is of bounded
length.

    We now provide simple constructions for leader and nextView that suffice to ensure that after GST, eventually a
view will be reached in which the leader is correct and all correct replicas remain in this view for Tf time. It suffices
for leader to return some deterministic mapping from view number to a replica, eventually rotating through all
replicas. A possible solution for nextView is to utilize an exponential back-off mechanism that maintains a timeout
interval. Then a timer is set upon entering each view. When the timer goes off without making any decision, the
replica doubles the interval and calls nextView to advance the view. Since the interval is doubled each time, the
waiting intervals of all correct replicas will eventually overlap for at least Tf , during which the leader
could drive a decision.
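The constructions just described can be sketched as follows: a round-robin leader map plus an exponentially growing view timeout. The class and callback names are illustrative only; on_timeout would trigger the protocol's nextView interrupt.

import threading

def leader(view_number, n):
    # Deterministic round-robin mapping from view numbers to replicas.
    return view_number % n

class ViewTimer:
    def __init__(self, base_timeout, on_timeout):
        self.interval = base_timeout      # current timeout interval (seconds)
        self.on_timeout = on_timeout      # callback that triggers the nextView interrupt
        self.timer = None

    def enter_view(self):
        # (Re)arm the timer upon entering a view.
        if self.timer is not None:
            self.timer.cancel()
        self.timer = threading.Timer(self.interval, self.expire)
        self.timer.start()

    def decided(self):
        # A decision was reached in time; stop the timer (the reset policy
        # for the interval is left open here).
        if self.timer is not None:
            self.timer.cancel()

    def expire(self):
        # No decision within the interval: double it and advance the view, so
        # that correct replicas eventually overlap in one view for Tf time.
        self.interval *= 2
        self.on_timeout()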
Complexity. In each phase of HotStuff, only the leader broadcasts to all replicas while the replicas respond to
the sender once with a partial signature to certify the vote. In the leader’s message, the QC consists of a proof of
(n − f ) votes collected previously, which can be encoded by a single threshold signature. In a replica’s response, the
partial signature from that replica is the only authenticator. Therefore, in each phase, there are O(n) authenticators
received in total. As there is a constant number of phases, the overall complexity per view is O(n).

5    Chained HotStuff
It takes three phases for a Basic HotStuff leader to commit a proposal. These phases are not doing “useful” work
except collecting votes from replicas, and they are all very similar. In Chained HotStuff, we improve the Basic
HotStuff protocol utility while at the same time considerably simplifying it. The idea is to change the view on every
prepare phase. Thus, every proposal has its own view.
     More specifically, in Chained HotStuff the votes over a prepare phase are collected in a view by the leader into a
prepareQC . Then the prepareQC is relayed to the leader of the next view, essentially delegating responsibility for
the next phase, which would have been pre-commit, to the next leader. However, the next leader does not actually
carry a pre-commit phase, but instead initiates a new prepare phase and adds its own proposal. This prepare
phase for view v + 1 simultaneously serves as the pre-commit phase for view v. The prepare phase for view v + 2
simultaneously serves as the pre-commit phase for view v + 1 and as the commit phase for view v. This is possible
because all the phases have identical structure.
     The pipeline of Basic HotStuff protocol phases embedded in a chain of Chained HotStuff proposals is depicted in
Figure 1. Views v1 , v2 , v3 of Chained HotStuff serve as the prepare, pre-commit, and commit Basic HotStuff phases
for cmd1 proposed in v1 . This command becomes committed by the end of v4 . Views v2 , v3 , v4 serve as the three
Basic HotStuff phases for cmd2 proposed in v2 , and it becomes committed by the end of v5 . Additional proposals
generated in these phases continue the pipeline similarly, and are denoted by dashed boxes. In Figure 1, a single
arrow denotes the b.parent field for a node b, and a double arrow denotes b.justify.node.
     Hence, there are only two types of messages in Chained HotStuff: a new-view message and a generic-phase
generic message. The generic QC functions in all logically pipelined phases. We next explain the mechanisms
in the pipeline to take care of locking and committing, which occur only in the commit and decide phases of Basic
HotStuff.
Dummy nodes. The prepareQC used by a leader in some view viewNumber may not directly reference the pro-
posal of the preceding view (viewNumber − 1). The reason is that the leader of the preceding view may have failed to obtain a QC,
either because there are conflicting proposals, or due to a benign crash. To simplify the tree structure, createLeaf
extends prepareQC .node with blank nodes up to the height (the number of parent links on a node’s branch) of the

                                                           9
v1                v2               v3                  v4             v5

            ···     QC cmd 1           QC v1 cmd 2       QC v2 cmd 3          QC v3 cmd 4     QC v4 cmd 5        ···

            cmd1   prepare            pre-commit        commit               decide     cmd5 prepare
                   decide        cmd2 prepare           pre-commit           commit          decide
                   commit             decide       cmd3 prepare              pre-commit      commit
                   pre-commit         commit            decide          cmd4 prepare         pre-commit

Figure 1: Chained HotStuff is a pipelined Basic HotStuff where a QC can serve in different phases simultaneously.
proposing view, so view-numbers are equated with node heights. As a result, the QC embedded in a node b may not
refer to its parent, i.e., b.justify.node may not equal b.parent.
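A sketch of the blank-padding step performed by createLeaf could look as follows, assuming a hypothetical make_node constructor producing nodes with cmd, parent, height, and justify fields.

def create_leaf(parent, cmd, qc, cur_view, make_node):
    # Fill the gap between the parent's height and the proposing view with
    # blank (command-less) nodes, so that a node's height equals the view in
    # which it was proposed.
    node = parent
    while node.height + 1 < cur_view:
        node = make_node(cmd=None, parent=node, height=node.height + 1, justify=None)
    return make_node(cmd=cmd, parent=node, height=cur_view, justify=qc)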
One-Chain, Two-Chain, and Three-Chain. When a node b∗ carries a QC that refers to a direct parent, i.e.,
b∗ .justify.node = b∗ .parent, we say that it forms a “One-Chain”. Denote b′′ = b∗ .justify.node. Node b∗ forms a
Two-Chain, if in addition to forming a One-Chain, b′′ .justify.node = b′′ .parent. It forms a Three-Chain, if b′′ forms
a Two-Chain.

[Figure 2 diagram: nodes b (view v3 ), b′ (v4 ), b′′ (v5 ), and b∗ (v6 ), each linked to its parent and carrying a QC for it, followed by a later node at view v8 whose QC does not refer to its direct parent.]

Figure 2: The nodes at views v4 , v5 , v6 form a Three-Chain. The node at view v8 does not make a valid One-Chain
in Chained HotStuff (but it is a valid One-Chain after relaxation in the algorithm of Section 6).
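These chain rules can be expressed directly over the parent and justify links, mirroring the direct-parent conditions used in Algorithm 3 (lines 15-19); the attribute names below are those of the node sketches above.

def one_chain(b_star):
    # b*'s QC refers to its direct parent.
    return b_star.justify.node is b_star.parent

def two_chain(b_star):
    b2 = b_star.justify.node                      # b''
    return one_chain(b_star) and b2.justify.node is b2.parent

def three_chain(b_star):
    # b* forms a Three-Chain when it forms a One-Chain and b'' = b*.justify.node
    # forms a Two-Chain (cf. the decide condition at Line 19 of Algorithm 3).
    return one_chain(b_star) and two_chain(b_star.justify.node)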

    Looking at the chain b = b′ .justify.node, b′ = b′′ .justify.node, b′′ = b∗ .justify.node, ancestry gaps might occur at
any one of the nodes. These situations are similar to a leader of Basic HotStuff failing to complete any one of three
phases, and getting interrupted to the next view by nextView.
    If b∗ forms a One-Chain, the prepare phase of b′′ has succeeded. Hence, when a replica votes for b∗ , it should
remember prepareQC ← b∗ .justify. We remark that it is safe to update prepareQC even when a “One-Chain” is
not direct, so long as it is higher than the current prepareQC. In the implementation code described in Section 6,
we indeed update prepareQC in this case.
    If b∗ forms a Two-Chain, the pre-commit phase of b′ has succeeded. The replica should therefore update
lockedQC ← b′′ .justify. Again, we remark that the lock can be updated even when a “Two-Chain” is not direct—
safety will not break—and indeed, this is given in the implementation code in Section 6.
    Finally, if b∗ forms a Three-Chain, the commit phase of b has succeeded, and b becomes a committed decision.
     Algorithm 3 shows the pseudocode for Chained HotStuff. The proof of safety given by Theorem 5 in Appendix A
is similar to the one for Basic HotStuff. We require that the QC in a valid node always refer to one of the node’s
ancestors. For brevity, we assume the constraint always holds and omit the check from the code.

Algorithm 3 Chained HotStuff protocol.
 1: procedure createLeaf(parent, cmd , qc)
 2:    b.parent ← branch extending with blanks from parent to height curView ;
 3:    b.cmd ← cmd ; b.justify ← qc; return b

 4: for curView ← 1, 2, 3, . . . do
      . generic phase
 5:      as a leader // r = leader(curView )
 6:          wait for (n − f ) new-view messages: M ← {m | matchingMsg(m, new-view, curView − 1)}
              // M includes the previous leader new-view message, if received while waiting in previous view
 7:           prepareQC ← ( arg max_{m∈M} {m.justify.viewNumber} ).justify
 8:           curProposal ← createLeaf(prepareQC .node, client’s command, prepareQC )
 9:           broadcast Msg(generic, curProposal , ⊥) // prepare phase (leader-half)
10:       as a replica
11:           wait for message m : matchingMsg(m, generic, curView ) from leader(curView )
12:           b∗ ← m.node; b′′ ← b∗ .justify.node; b′ ← b′′ .justify.node; b ← b′ .justify.node
13:           if safeNode(m.node, m.node.justify) then
14:               send voteMsg(generic, b∗ , ⊥) to leader(curView )
15:           if b∗ .parent = b′′ then
16:               prepareQC ← b∗ .justify // start pre-commit phase on b∗ ’s parent
17:           if (b∗ .parent = b′′ ) ∧ (b′′ .parent = b′ ) then
18:               lockedQC ← b′′ .justify // start commit phase on b∗ ’s grandparent
19:           if (b∗ .parent = b′′ ) ∧ (b′′ .parent = b′ ) ∧ (b′ .parent = b) then
20:               execute new commands through b, respond to clients // start decide phase on b∗ ’s great-grandparent
21:       as a leader // pre-commit phase (leader-half)
22:           wait for (n − f ) votes: V ← {v | matchingMsg(v, generic, curView )}
23:           prepareQC ← QC(generic, V )
24:       as the next leader // for liveness, the message here counts as (n − f ) at Line 6
25:           wait for message m : matchingMsg(m, new-view, curView ) from leader(curView )
      . Finally
26:       nextView interrupt: goto this line if nextView(curView ) is called during “wait for” in any phase
27:       send Msg(new-view, ⊥, prepareQC ) to leader(curView + 1)

6      Implementation
HotStuff is a practical protocol for building efficient SMR systems. Because of its simplicity, we can easily turn
Algorithm 3 into an event-driven-style specification that is almost like the code skeleton for a prototype.
    As shown in Algorithm 4, the code is further simplified and generalized by extracting the liveness mechanism
from the body into a module named Pacemaker. Instead of the next leader always waiting for a prepareQC at the
end of the generic phase before starting its reign, this logic is delegated to the Pacemaker. A stable leader can skip
this step and streamline proposals across multiple heights. Additionally, we relax the direct parent constraint for
maintaining the highest prepareQC and lockedQC , while still preserving the assumption that the QC in a valid node
always refers to its ancestor.
Data structures. Each replica u keeps track of the following main state variables:
• b_0          // genesis node known by all correct replicas. To bootstrap, it contains a hard-coded QC for itself.
• V [ ] ← {}   // mapping from a node to its votes.
• vheight      // height of last voted node.
• b_lock       // locked node (similar to lockedQC ).
• b_exec       // last executed node.
• qc_high      // highest known QC (similar to prepareQC ) kept by a Pacemaker.
• b_leaf       // leaf node kept by a Pacemaker.
Pacemaker. A Pacemaker is a mechanism that guarantees progress after GST. The Pacemaker achieves this through
two ingredients.
    The first one is “synchronization”, bringing all correct replicas, and a unique leader, into a common height for a
sufficiently long period. The usual synchronization mechanism in the literature [23, 18, 14] is for replicas to increase
the count of ∆’s they spend at larger heights, until progress is being made. A common way to deterministically elect
a leader is to use a rotating leader scheme in which all correct replicas keep a predefined leader schedule and rotate
to the next one when the leader is demoted.
    Second, a Pacemaker needs to provide the leader with a way to choose a proposal that will be supported by
correct replicas. As shown in Algorithm 5, after a view change, in onReceiveNewView, the new leader collects
new-view messages sent by replicas through onNextSyncView to discover the highest QC, in order to satisfy the
second part of the condition in onReceiveProposal (Line 18 of Algorithm 4) for liveness. During the same view, however, the
incumbent leader will chain the new node to the end of the leaf last proposed by itself, in which case no new-view
message or update to qc_high is needed. Based on some application-specific heuristics (to wait until the previously
proposed node gets a QC, for example), the current leader invokes onBeat to propose a new node carrying the
command to be executed.
    It is worth noting that even if a bad Pacemaker invokes onPropose arbitrarily, or selects a parent and a QC
capriciously, and regardless of any scheduling delays, safety is always guaranteed. Therefore, the safety guaranteed
by Algorithm 4 alone is entirely decoupled from the liveness provided by any potential instantiation of Algorithm 5.

Algorithm 4 Event-driven HotStuff (for replica u).

 1: procedure createLeaf(parent, cmd , qc, height)
 2:    b.parent ← parent; b.cmd ← cmd ;
 3:    b.justify ← qc; b.height ← height; return b
 4: procedure update(b∗ )
 5:    b′′ ← b∗ .justify.node; b′ ← b′′ .justify.node
 6:    b ← b′ .justify.node
       // pre-commit phase on b′′
 7:    updateQCHigh(b∗ .justify)
 8:    if b′ .height > b_lock .height then
 9:        b_lock ← b′ // commit phase on b′
10:    if (b′′ .parent = b′ ) ∧ (b′ .parent = b) then
11:        onCommit(b)
12:        b_exec ← b // decide phase on b
13: procedure onCommit(b)
14:    if b_exec .height < b.height then
15:        onCommit(b.parent); execute(b.cmd )
16: procedure onReceiveProposal(Msg_v (generic, b_new , ⊥))
17:    if b_new .height > vheight ∧ (b_new extends b_lock ∨
18:        b_new .justify.node.height > b_lock .height) then
19:        vheight ← b_new .height
20:        send(getLeader(), voteMsg_u (generic, b_new , ⊥))
21:    update(b_new )
22: procedure onReceiveVote(m = voteMsg_v (generic, b, ⊥))
23:    if ∃⟨v, σ′⟩ ∈ V [b] then return // avoid duplicates
24:    V [b] ← V [b] ∪ {⟨v, m.partialSig⟩} // collect votes
25:    if |V [b]| ≥ 2f + 1 then
26:        qc ← QC(generic, {σ : ⟨v′, σ⟩ ∈ V [b]})
27:        updateQCHigh(qc)
28: function onPropose(b_leaf , cmd , qc_high )
29:    b_new ← createLeaf(b_leaf , cmd , qc_high , b_leaf .height + 1)
       // send to all replicas, including u itself
30:    broadcast(Msg_u (generic, b_new , ⊥))
31:    return b_new

Algorithm 5 Code skeleton for a Pacemaker (for replica u).

      // We assume Pacemaker in all correct replicas will have synchronized leadership after GST.
 1:   function getLeader // . . . specified by the application
 2:   procedure updateQCHigh(qc′_high )
 3:       if qc′_high .node.height > qc_high .node.height then
 4:           qc_high ← qc′_high
 5:           b_leaf ← qc_high .node
 6:   procedure onBeat(cmd )
 7:       if u = getLeader() then
 8:           b_leaf ← onPropose(b_leaf , cmd , qc_high )
 9:   procedure onNextSyncView
10:       send Msg(new-view, ⊥, qc_high ) to getLeader()
11:   procedure onReceiveNewView(Msg(new-view, ⊥, qc′_high ))
12:       updateQCHigh(qc′_high )

Algorithm 6 update replacement for Two-Chain HotStuff (for replica u).

 1: procedure update(b∗ )
 2:    b′ ← b∗ .justify.node; b ← b′ .justify.node
 3:    updateQCHigh(b∗ .justify)
 4:    if b′ .height > b_lock .height then b_lock ← b′
 5:    if (b = b′ .parent) then
 6:        onCommit(b)
 7:        b_exec ← b

Two-Chain HotStuff variant. To further demonstrate the flexibility of the HotStuff framework, Algorithm 6
shows the Two-Chain variant of HotStuff. Only the update procedure is affected: a Two-Chain is required for
reaching a commit decision, and a One-Chain determines the lock. The Two-Chain variant loses Optimistic Respon-
siveness and gains faster commitment, making it similar to Tendermint/Casper. Thus, an extra wait based on the
maximum network delay must be incorporated into the Pacemaker in order to ensure liveness. See Section 7.3 for
further discussion.

[Figure 3 panels: (a) One-Chain (DLS, 1988); (b) Two-Chain (PBFT, 1999); (c) Two-Chain w/ delay (Tendermint, 2016); (d) Two-Chain w/ delay (Casper, 2017); (e) Three-Chain (HotStuff, 2018).]

                                          Figure 3: Commit rules for different BFT protocols.

7     One-Chain and Two-Chain BFT Protocols
In this section, we examine four BFT replication protocols spanning four decades of research in Byzantine fault
tolerance, casting them into a chained framework similar to Chained HotStuff.
    Figure 3 provides a bird's-eye view of the commit rules of the five protocols we consider, including HotStuff.
    In a nutshell, the commit rule in DLS [23] is a One-Chain, allowing a node to be committed only by its own leader.
The commit rules in PBFT [18], Tendermint [13], and Casper [15] are almost identical and consist of Two-Chains.
They differ in the mechanisms they introduce for liveness: PBFT has leader “proofs” of quadratic size (no Linearity),
while Tendermint and Casper introduce a mandatory ∆ delay before each leader proposal (no Optimistic Responsiveness).
HotStuff uses a Three-Chain rule and has a linear leader protocol without delay.
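
The contrast among these rules can be made concrete with a small sketch. The following Python fragment is our own
illustration, not code from the paper: it models a block-tree node with the parent and justify fields used in the
pseudocode and, for simplicity, requires each QC to reference the direct parent; the commit rules then differ only
in the required chain length.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        """Simplified block-tree node with the fields used in the pseudocode."""
        height: int
        parent: Optional["Node"] = None
        justify: Optional["Node"] = None   # the node referenced by this node's QC

    def chain_length(b_star: Node) -> int:
        """Length of the chain of consecutive (direct-parent) QC links ending at b_star."""
        length, node = 1, b_star
        while node.justify is not None and node.justify is node.parent:
            length += 1
            node = node.justify
        return length

    def commits_under(b_star: Node, k: int) -> bool:
        # k = 1: One-Chain (DLS); k = 2: Two-Chain (PBFT, Tendermint, Casper); k = 3: Three-Chain (HotStuff)
        return chain_length(b_star) >= k

    # Example: a chain b -> b' -> b* in which each QC references its direct parent.
    b = Node(height=1)
    b1 = Node(height=2, parent=b, justify=b)
    b2 = Node(height=3, parent=b1, justify=b1)
    assert commits_under(b2, 3)   # the Three-Chain rule commits b once b* is justified
    assert commits_under(b1, 2)   # a Two-Chain rule would already commit b at b'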

7.1     DLS
The simplest commit rule is a One-Chain. Modeled after Dwork, Lynch, and Stockmeyer (DLS) [23], the first known
asynchronous Byzantine Consensus solution, this rule is depicted in Figure 3(a). A replica becomes locked in DLS
on the highest node it voted for.
    Unfortunately, this rule can easily lead to a deadlock if, at some height, a leader equivocates and two correct
replicas become locked on the conflicting proposals at that height. Relinquishing either lock is unsafe unless
2f + 1 replicas indicate that they did not vote for the locked value.
    Indeed, in DLS only the leader of each height can itself reach a commit decision by the One-Chain commit rule.
Thus, only the leader itself is harmed if it has equivocated. Replicas can relinquish a lock either if 2f + 1 replicas did
not vote for it, or if there are conflicting proposals (signed by the leader). The unlocking protocol occurring at the
end of each height in DLS turns out to be fairly complex and expensive. Together with the fact that only the leader
for a height can decide, in the best scenario where no fault occurs and the network is timely, DLS requires n leader
rotations, and O(n⁴) message transmissions, per single decision. While it broke new ground in demonstrating a safe
asynchronous protocol, DLS was not designed as a practical solution.

7.2     PBFT
Modeled after PBFT [18, 19], a more practical approach uses a Two-Chain commit rule; see Figure 3(b). When a
replica votes for a node that forms a One-Chain, it becomes locked on it. Conflicting One-Chains at the same height
are simply not possible, as each carries a QC; hence the deadlock situation of DLS is avoided.

However, if one replica holds a higher lock than others, a leader may not learn about it even if it collects informa-
tion from n − f replicas. This could prevent leaders from reaching decisions ad infinitum, purely due to scheduling.
To get “unstuck”, PBFT has the leader unlock all replicas by carrying a proof consisting of the highest One-Chains
reported by 2f + 1 replicas. This proof is quite involved, as explained below.
     In the original PBFT, which has been open-sourced [18] and adopted in several follow-up works [11, 32], a leader
proof contains a set of messages collected from n − f replicas, each reporting the highest One-Chain that member voted
for. Each One-Chain contains a QC, hence the total communication cost is O(n³). Harnessing the signature-combining
methods of [44, 16], SBFT [28] reduces this cost to O(n²) by turning each QC into a single value.
     In the PBFT variant in [19], a leader proof includes only once the highest One-Chain the leader collected from the
quorum. It also includes one signed value from each member of the quorum, proving that the member did not vote for a
higher One-Chain. Broadcasting this proof incurs communication complexity O(n²). Note that whereas the signatures on
a QC may be combined into a single value, the proof as a whole cannot be reduced to constant size, because messages
from different members of the quorum may carry different values.
     In both variants, a correct replica unlocks even if it holds a higher One-Chain than the one in the leader's proof.
Thus, a correct leader can force its proposal to be accepted during periods of synchrony, and liveness is guaranteed.
The cost is quadratic communication per leader replacement.
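
A back-of-the-envelope sketch may clarify the cost comparison. The Python fragment below is purely illustrative (the
function and variable names are ours, not PBFT's): it counts how many signature-sized values a leader proof carries
when built from the highest One-Chains reported by a quorum, first with plain QCs of n − f signatures each and then
with threshold-combined QCs.

    def proof_size_in_sigs(n: int, f: int, combined_qc: bool) -> int:
        """Signature-sized values in a leader proof assembled from a quorum's
        highest One-Chains: one reported One-Chain per quorum member, where a
        plain QC holds n - f signatures and a combined QC is a single value."""
        quorum = n - f
        qc_size = 1 if combined_qc else quorum
        return quorum * qc_size

    n, f = 100, 33
    plain = proof_size_in_sigs(n, f, combined_qc=False)     # grows as O(n^2)
    combined = proof_size_in_sigs(n, f, combined_qc=True)   # grows as O(n)

    # Broadcasting the proof to all n replicas multiplies the cost by n:
    print(plain * n)      # on the order of n^3 transmissions (original PBFT style)
    print(combined * n)   # on the order of n^2 transmissions (combined-QC style, as in SBFT)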

7.3    Tendermint and Casper
Tendermint [13, 14] has a Two-Chain commit rule identical to PBFT's, and Casper [15] has a Two-Chain rule in which
the leaf does not need to have a QC to its direct parent; that is, in Casper the two nodes of the chain need not be at
consecutive heights. Figures 3(c) and 3(d) depict the commit rules for Tendermint and Casper, respectively.
    In both methods, a leader simply sends the highest One-Chain it knows along with its proposal. A replica unlocks
a One-Chain if it receives from the leader a higher one.
    However, because correct replicas may not vote for a leader's node, to guarantee progress a new leader must
obtain the highest One-Chain in the system, which requires waiting for the maximal network delay. Otherwise, if
leaders wait only for the first n − f messages to start a new height, there is no progress guarantee. Such leader
delays are inherent both in Tendermint and in Casper in order to provide liveness.
    This simple leader regime brings the communication complexity of the leader protocol down to linear, a property
that HotStuff borrows. As already mentioned above, a QC can be captured in a single value using threshold signatures,
hence a leader can collect and disseminate the highest One-Chain with linear communication complexity.
Crucially, however, due to the extra QC step, HotStuff does not require the leader to wait for the maximal network delay.
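
The linear leader step that HotStuff borrows can be sketched in a few lines. Each replica sends a single new-view
message carrying its qc_high, and the leader simply keeps the highest one and attaches it to its proposal, for O(n)
messages per leader replacement. The message shape below (a height paired with a combined-signature payload) and the
helper name are our own assumptions.

    from typing import Iterable, Tuple

    # A QC is modeled abstractly as (height, payload); with threshold signatures the
    # payload is a single constant-sized value.
    QC = Tuple[int, bytes]

    def highest_qc(new_view_msgs: Iterable[QC]) -> QC:
        """Leader-side step: keep the highest QC among the received new-view messages.

        In Tendermint/Casper the leader must also wait out the maximum network delay
        before acting; in HotStuff, thanks to the extra QC phase, collecting the first
        n - f messages suffices."""
        return max(new_view_msgs, key=lambda qc: qc[0])

    # Example: a leader hearing from three replicas attaches the height-7 QC to its proposal.
    msgs = [(5, b"sig-a"), (7, b"sig-b"), (6, b"sig-c")]
    assert highest_qc(msgs)[0] == 7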

8     Evaluation
We have implemented HotStuff as a library in roughly 4K lines of C++ code. Most notably, the core consensus
logic specified in the pseudocode consumes only around 200 lines. In this section, we first examine baseline
throughput and latency by comparing against a state-of-the-art system, BFT-SMaRt [11]. We then focus on the message
cost of view changes, the scenario in which HotStuff's advantages are most pronounced.

8.1    Setup
We conducted our experiments on Amazon EC2 using c5.4xlarge instances. Each instance had 16 vCPUs supported
by Intel Xeon Platinum 8000 processors. All cores sustained a Turbo CPU clock speed up to 3.4GHz. We ran each
replica on a single VM instance, and so BFT-SMaRt, which makes heavy use of threads, was allowed to utilize 16
cores per replica, as in their original evaluation [11]. The network bandwidth on each machine was 3500Mbps, and
we did not limit bandwidth in any runs of either system.
    Our prototype implementation of HotStuff uses secp256k1 for all digital signatures in both votes and quorum
certificates. BFT-SMaRt uses hmac-sha1 for MACs (Message Authentication Codes) in the messages during normal
operation and uses digital signatures in addition to MACs during a view change.
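
For a sense of the two cryptographic regimes, the snippet below signs and verifies a vote with secp256k1 (using the
third-party Python ecdsa package, which is our illustrative choice and says nothing about the HotStuff codebase
itself) and computes an hmac-sha1 tag with the standard library, the kind of authenticator BFT-SMaRt relies on
during normal operation.

    import hashlib
    import hmac

    from ecdsa import SECP256k1, SigningKey   # third-party package; illustrative choice

    vote = b"generic|block-hash|height-42"    # hypothetical serialized vote

    # secp256k1 digital signature: publicly verifiable, so it can be included in a QC.
    sk = SigningKey.generate(curve=SECP256k1)
    signature = sk.sign(vote)
    assert sk.get_verifying_key().verify(signature, vote)

    # hmac-sha1 MAC: cheaper to compute, but verifiable only with the shared key.
    shared_key = b"pairwise-session-key"      # hypothetical key shared by two replicas
    tag = hmac.new(shared_key, vote, hashlib.sha1).digest()
    assert hmac.compare_digest(tag, hmac.new(shared_key, vote, hashlib.sha1).digest())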
    All results for HotStuff reflect end-to-end measurement from the clients. For BFT-SMaRt, we used the micro-
benchmark programs ThroughputLatencyServer and ThroughputLatencyClient from the BFT-SMaRt website
(https://github.com/bft-smart/library). The client program measures end-to-end latency but not throughput, while

the server-side program measures both throughput and latency. We used the throughput results from servers and
the latency results from clients.

8.2                 Base Performance
We first measured throughput and latency in a setting commonly seen in the evaluation of other BFT replication
systems. We ran 4 replicas in a configuration that tolerates a single failure, i.e., f = 1, while varying the operation
request rate until the system saturated. This benchmark used empty (zero-sized) operation requests and responses
and triggered no view changes; we expand to other settings below.
[Figure 4: Throughput vs. latency with different choices of batch size, 4 replicas, 0/0 payload. Series: BFT-SMaRt and HotStuff at batch sizes 100, 400, and 800; axes: Throughput (Kops/sec) vs. Latency (ms).]
[Figure 5: Throughput vs. latency with different choices of payload size with 4 replicas, batch size 400. Series: BFT-SMaRt and HotStuff at payloads p0, p128, and p1024; axes: Throughput (Kops/sec) vs. Latency (ms).]

     Figure 4 depicts three batch sizes for both systems, 100, 400, and 800, though because the two systems have
different batching schemes, these numbers mean slightly different things for each. BFT-SMaRt drives a separate
consensus decision for each operation and batches the messages from multiple consensus instances; it therefore has
a typical L-shaped latency/throughput curve. HotStuff batches multiple operations in each node, and in this way
mitigates the per-decision cost of digital signatures. Above 400 operations per batch, however, the latency introduced
by batching outweighs the cryptographic savings. Despite these differences, HotStuff achieves performance comparable
to BFT-SMaRt for all three batch sizes.
     For batch sizes of 100 and 400, the lowest-latency HotStuff point provides latency and throughput comparable
to the latency and throughput simultaneously achievable by BFT-SMaRt at its highest throughput. Recall
that HotStuff makes use of the first phase of the next decision to piggyback the final phase of the previous one, and
so requires a moderate rate of submitted operations to achieve its lowest-latency point. In other words, reducing
the rate of submitted operations further only increases latency. In practice, this operation rate could be ensured by
generating dummy nodes to finalize a decision when there are insufficient outstanding operations to do so naturally.
For these runs, however, we did not generate dummy nodes, but instead ensured that operations were submitted
at a rate sufficient to demonstrate low latency, which also happened to roughly maximize throughput.
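
A minimal sketch of that practice (our own illustration, not the library's batching code): a leader gathers up to a
batch's worth of pending commands, and if too few arrive within a short timeout it emits an empty batch, standing in
for a dummy node, so that the piggybacked phases of earlier decisions keep advancing.

    import queue
    import time

    BATCH_SIZE = 400          # operations per node, matching the batch size used above
    BATCH_TIMEOUT = 0.005     # illustrative wait before falling back to a dummy node, in seconds

    def next_batch(pending: "queue.Queue[bytes]") -> list:
        """Return up to BATCH_SIZE pending commands, or an empty (dummy) batch on timeout."""
        batch = []
        deadline = time.monotonic() + BATCH_TIMEOUT
        while len(batch) < BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(pending.get(timeout=remaining))
            except queue.Empty:
                break
        return batch              # an empty list stands in for a dummy node

    # On each beat the leader would propose next_batch(pending), so a node is produced
    # (and earlier decisions are finalized) even when few client operations are outstanding.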
     Figure 5 depicts three client request/reply payload sizes (in bytes) of 0/0, 128/128, and 1024/1024, designated by
“p0”, “p128”, and “p1024”. At a payload size of 1024 bytes, HotStuff already (slightly) outperforms BFT-SMaRt, owing
to the reduced communication complexity of its linear protocol and its use of signatures.
     Below we evaluate both systems in more challenging situations, where the performance advantages of HotStuff
will become more pronounced.

8.3                 Scalability
To evaluate the scalability of HotStuff in various dimensions, we performed three experiments. For the baseline,
we used zero-size request/response payloads while varying the number of replicas. The second evaluation repeated
the baseline experiment with 1024-byte request/response payloads. The third test repeated the baseline (with empty
payloads) while introducing network delays between replicas that were uniformly distributed in 5ms ± 0.5ms or in
10ms ± 1.0ms, implemented using NetEm (see https://www.linux.org/docs/man8/tc-netem.html).
