Simple High-Level Code For Cryptographic Arithmetic – With Proofs, Without Compromises

Andres Erbsen    Jade Philipoom    Jason Gross    Robert Sloan    Adam Chlipala
MIT CSAIL, Cambridge, MA, USA
{andreser, jadep, jgross}@mit.edu, rob.sloan@alum.mit.edu, adamc@csail.mit.edu

Abstract—We introduce a new approach for implementing cryptographic arithmetic in short high-level code with machine-checked proofs of functional correctness. We further demonstrate that simple partial evaluation is sufficient to transform such code into the fastest-known C code, breaking the decades-old pattern that the only fast implementations are those whose instruction-level steps were written out by hand.

These techniques were used to build an elliptic-curve library that achieves competitive performance for 80 prime fields and multiple CPU architectures, showing that implementation and proof effort scales with the number and complexity of conceptually different algorithms, not their use cases. As one outcome, we present the first verified high-performance implementation of P-256, the most widely used elliptic curve. Implementations from our library were included in BoringSSL to replace existing specialized code, for inclusion in several large deployments for Chrome, Android, and CloudFlare.

I. MOTIVATION

The predominant practice today is to implement cryptographic primitives at the lowest possible level of abstraction: usually flat C code, ideally assembly language if engineering time is plentiful. Low-level implementation has been the only way to achieve good performance, ensure that execution time does not leak secret information, and enable the code to be called from any programming language. Yet placing the burden of every detail on the programmer also leaves a lot to be desired. The demand for implementation experts' time is high, resulting in an ever-growing backlog of algorithms to be reimplemented in specialized code "as soon as possible." Worse, bugs are also common enough that it is often not economical to go beyond fixing the behavior for problematic inputs and understand the root cause and impact of the defect.

There is hope that these issues can eventually be overcome by redoubling community efforts to produce high-quality crypto code. However, even the most renowned implementors are not infallible, and if a typo among thousands of lines of monotonous code can only be recognized as incorrect by top experts,¹ it is reasonable to expect diminishing returns for achieving correctness through manual review. A number of tools for computer-aided verification have been proposed [1]–[6], enabling developers to eliminate incorrect-output bugs conclusively. Three solid implementations of the X25519 Diffie-Hellman function were verified. In the beautiful world where X25519 was the only arithmetic-based crypto primitive we need, now would be the time to declare victory and go home. Yet most of the Internet still uses P-256, and the current proposals for post-quantum cryptosystems are far from Curve25519's combination of performance and simplicity.

¹We refer to the ed25519-amd64-64-24k bug that is discussed along with other instructive real-world implementation defects in Appendix A.

A closer look into the development and verification process (with an eye towards replicating it for another function) reveals a discouraging amount of duplicated effort. First, expert time is spent on mechanically writing out the instruction-level steps required to perform the desired arithmetic, drawing from experience and folklore. Then, expert time is spent to optimize the code by applying a carefully chosen combination of simple transformations. Finally, expert time is spent on annotating the now-munged code with assertions relating values of run-time variables to specification variables, until the SMT solver (used as a black box) is convinced that each assertion is implied by the previous ones. As the number of steps required to multiply two numbers is quadratic in the size of those numbers, effort grows accordingly: for example, "Verifying Curve25519 Software" [1] reports that multiplication modulo 2^255 − 19 required 27 times the number of assertions used for 2^127 − 1. It is important to note that both examples use the same simple modular multiplication algorithm, and the same correctness argument should apply, just with different parameters. This is just one example of how manual repetition of algorithmic steps inside primitive cryptographic arithmetic implementations makes them notoriously difficult to write, review, or prove correct.

Hence we suggest a rather different way of implementing cryptographic primitives: to implement general algorithms in the style that best illustrates their correctness arguments and then derive parameter-specific low-level code through partial evaluation. For example, the algorithm used for multiplication modulo 2^127 − 1 and 2^255 − 19 is first encoded as a purely functional program operating on lists of digit-weight pairs: it generates a list of partial products, shifts down the terms with weight greater than the modulus, and scales them accordingly. It is then instantiated with lists of concrete weights and length, for example ten limbs with weights 2^⌈25.5i⌉, and partially evaluated so that no run-time manipulation of lists and weights remains, resulting in straight-line code very similar to what an expert would have written. As the optimization pipeline is proven correct once and for all, correctness of all code generated from a high-level algorithm can be proven directly
on its high-level implementation. A nice side benefit of this approach is that the same high-level algorithm and proof can be used for multiple target configurations with very little effort, enabling the use of fast implementations in cases that were previously not worth the effort.

In practice, we use the fast arithmetic implementations reported on here in conjunction with a library of verified cryptographic algorithms, the rest of which deviates much less from common implementation practice. All formal reasoning is done in the Coq proof assistant, and the overall trusted computing base also includes a simple pretty-printer and the C language toolchain. The development, called Fiat Cryptography, is available under the MIT license at:

https://github.com/mit-plv/fiat-crypto

The approach we advocate for is already used in widely deployed code. Most applications relying on the implementations described in this paper are using them through BoringSSL, Google's OpenSSL-derived crypto library that includes our Curve25519 and P-256 implementations. A high-profile user is the Chrome Web browser, so today about half of HTTPS connections opened by Web browsers worldwide use our fast verified code (Chrome versions since 65 have about 60% market share [7], and 90% of connections use X25519 or P-256 [8]). BoringSSL is also used in Android and on a majority of Google's servers, so a Google search from Chrome would likely be relying on our code on both sides of the connection. Maintainers appreciate the reduced need to worry about bugs.

Input:

Definition wwmm_step A B k S ret :=
  divmod_r_cps A (λ '(A, T1),
  @scmul_cps r _ T1 B _ (λ aB,
  @add_cps r _ S aB _ (λ S,
  divmod_r_cps S (λ '(_, s),
  mul_split_cps' r s k (λ '(q, _),
  @scmul_cps r _ q M _ (λ qM,
  @add_longer_cps r _ S qM _ (λ S,
  divmod_r_cps S (λ '(S, _),
  @drop_high_cps (S R_numlimbs) S _ (λ S,
  ret (A, S) ))))))))).

Fixpoint wwmm_loop A B k len_A ret :=
  match len_A with
  | O => ret
  | S len_A' => λ '(A, S),
    wwmm_step A B k S (wwmm_loop A B k len_A' ret)
  end.

Output:

void wwmm_p256(u64 out[4], u64 x[4], u64 y[4]) {
  u64 x17, x18 = mulx_u64(x[0], y[0]);
  u64 x20, x21 = mulx_u64(x[0], y[1]);
  u64 x23, x24 = mulx_u64(x[0], y[2]);
  u64 x26, x27 = mulx_u64(x[0], y[3]);
  u64 x29, u8 x30 = addcarryx_u64(0x0, x18, x20);
  u64 x32, u8 x33 = addcarryx_u64(x30, x21, x23);
  u64 x35, u8 x36 = addcarryx_u64(x33, x24, x26);
  u64 x38, u8 _ = addcarryx_u64(0x0, x36, x27);
  u64 x41, x42 = mulx_u64(x17, 0xffffffffffffffff);
  u64 x44, x45 = mulx_u64(x17, 0xffffffff);
  u64 x47, x48 = mulx_u64(x17, 0xffffffff00000001);
  u64 x50, u8 x51 = addcarryx_u64(0x0, x42, x44);
  // 100 more lines...

Fig. 1: Word-by-word Montgomery multiplication, output specialized to P-256 (2^256 − 2^224 + 2^192 + 2^96 − 1)
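To make the shape of the algorithm in Fig. 1 concrete, here is an illustrative, unverified Python model of word-by-word Montgomery multiplication. All names here are ours, not the library's; Python's unbounded integers stand in for the limb-level operations, and, unlike the generated constant-time code, this sketch branches on its data in the final subtraction.

```python
# Illustrative, unverified Python model of word-by-word Montgomery
# multiplication (the shape of the algorithm in Fig. 1); names are ours.
R_BITS = 64
MASK = (1 << R_BITS) - 1

def from_limbs(limbs):
    """Evaluate a little-endian base-2^64 digit list as an integer."""
    return sum(l << (R_BITS * i) for i, l in enumerate(limbs))

def wwmm(a_limbs, b_limbs, m_limbs, k):
    """Return a*b*R^-n mod m, where R = 2^64 and n = len(a_limbs).

    Assumes k = -m^-1 mod 2^64 (m odd) and a, b < m.  Unlike the
    generated code, this model is not constant-time.
    """
    n = len(a_limbs)
    b, m = from_limbs(b_limbs), from_limbs(m_limbs)
    S = 0
    for i in range(n):
        S = S + a_limbs[i] * b           # absorb one limb of a
        q = ((S & MASK) * k) & MASK      # chosen so S + q*m ≡ 0 mod 2^64
        S = (S + q * m) >> R_BITS        # exact shift: low word is zero
    return S - m if S >= m else S        # invariant keeps S < 2*m
```

Each loop iteration corresponds roughly to one `wwmm_step`; in the actual library, the overflow-freedom that Python's big integers paper over is established by the certified pipeline described later.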
                         II. OVERVIEW
   We will explain how we write algorithm implementations
in a high-level way that keeps the structural principles as           this parameter-agnostic code is written in continuation-passing
clear and readable as possible, proving functional correctness.       style where functions do not return normally but instead take
Our explanation builds up to showing how other security               as arguments callbacks to invoke when answers become avail-
constraints can be satisfied without sacrificing the appeal of        able. The notation λ ’(x, y), ... encodes an anonymous
our new strategy. Our output code follows industry-standard           function with arguments x and y.
“constant-time” coding practices and is only bested in through-          The bottom half of the figure shows the result of spe-
put by platform-specific assembly-language implementations            cializing the general algorithm to P-256, the elliptic curve
(and only by a modest margin). This section will give a               most commonly used in TLS and unfortunately not supported
tour of the essential ingredients of our methodology, and the         by any of the verified libraries discussed earlier. Note that
next section will explain in detail how to implement field            some operations influence the generated code by creating
arithmetic for implementation-friendly primes like 2255 − 19.         explicit instruction sequences (highlighted in matching colors),
Building on this intuition (and code!), Section III-F culmi-          while other operations instead control which input expressions
nates in a specializable implementation of the word-by-word           should be fed into others. The number of low-level lines
Montgomery reduction algorithm for multiplication modulo              per high-level line varies between curves and target hardware
any prime. Section IV will explain the details of the compiler        architectures. The generated code is accompanied by a Coq
that turns the described implementations into the kind of code        proof of correctness stated in terms of the straightline-code
that implementation experts write by hand for each arithmetic         language semantics and the Coq standard-library definitions
primitive. We report performance measurements in Section V            of integer arithmetic.
and discuss related work and future directions in Sections VI
and VII.                                                              A. High-Level Arithmetic Library Designed For Specialization
   Fig. 1 gives a concrete example of what our framework                 At the most basic level, all cryptographic arithmetic algo-
provides. The algorithm in the top half of the figure is              rithms come down to implementing large mathematical objects
a transcription of the word-by-word Montgomery reduction              (integers or polynomials of several hundred bits) in terms of
algorithm, following Gueron’s Algorithm 4 [9]. As is required         arithmetic provided by the hardware platform (for example
by the current implementation of our optimization pipeline,           integers modulo 264 ). To start out with, let us consider the
example of addition, with the simplifying precondition that no digits exceed half of the hardware-supported maximum.

  type num = list N

  add : num → num → num
  add (a :: as) (b :: bs) = let x = a + b in x :: add as bs
  add as [] = as
  add [] bs = bs

To give the specification of this function acting on a base-2^64 little-endian representation, we define an abstraction function evaluating each digit list ℓ into a single number.

  ⌊·⌋ : num → N
  ⌊a :: as⌋ = a + 2^64 ⌊as⌋
  ⌊[]⌋ = 0

To prove correctness, we show (by naive induction on a) that evaluating the result of add behaves as expected:

  ∀a, b. ⌊add a b⌋ = ⌊a⌋ + ⌊b⌋

Using the standard list type in Coq, the theorem becomes

  Lemma eval_add : ∀a b, ⌊add a b⌋ = ⌊a⌋ + ⌊b⌋.
   induction a,b; cbn; rewrite ?IHa; nia. Qed.

The above proof is only possible because we are intentionally avoiding details of the underlying machine arithmetic – if the list of digits were defined to contain 64-bit words instead of N, repeated application of add would eventually result in integer overflow. This simplification is an instance of the pattern of anticipating low-level optimizations in writing high-level code: we do expect to avoid overflow, and our choice of a digit representation is motivated precisely by that aim. It is just that the proofs of overflow-freedom will be injected in a later stage of our pipeline, as long as earlier stages like our current one are implemented correctly. There is good reason for postponing this reasoning: generally we care about the context of higher-level code calling our arithmetic primitives. In an arbitrary context, an add implemented using 64-bit words would need to propagate carries from each word to the next, causing an unnecessary slowdown when called with inputs known to be small.

B. Partial Evaluation For Specific Parameters

It is impossible to achieve competitive performance with arithmetic code that manipulates dynamically allocated lists at runtime. The fastest code will implement, for example, a single numeric addition with straightline code that keeps as much state as possible in registers. Expert implementers today write that straightline code manually, applying various rules of thumb. Our alternative is to use partial evaluation in Coq to generate all such specialized routines, beginning with a single library of high-level functional implementations that generalize the patterns lurking behind hand-written implementations favored today.

Consider the case where we know statically that each number we add will have 3 digits. A particular addition in our top-level algorithm may have the form add [a1, a2, a3] [b1, b2, b3], where the a_i's and b_i's are unknown program inputs. While we cannot make compile-time simplifications based on the values of the digits, we can reduce away all the overhead of dynamic allocation of lists. We use Coq's term-reduction machinery, which allows us to choose λ-calculus-style reduction rules to apply until reaching a normal form. Here is what happens with our example, when we ask Coq to leave let and + unreduced but apply other rules.

  add [a1, a2, a3] [b1, b2, b3] ⇓ let n1 = a1 + b1 in n1 ::
                                  let n2 = a2 + b2 in n2 ::
                                  let n3 = a3 + b3 in n3 :: []

We have made progress: no run-time case analysis on lists remains. Unfortunately, let expressions are intermixed with list constructions, leading to code that looks rather different than assembly. To persuade Coq's built-in term reduction to do what we want, we first translate arithmetic operations to continuation-passing style. Concretely, we can rewrite add.

  add' : ∀α. num → num → (num → α) → α
  add' (a :: as) (b :: bs) k =
    let n = a + b in
    add' as bs (λℓ. k (n :: ℓ))
  add' as [] k = k as
  add' [] bs k = k bs

Reduction turns this function into assembly-like code.

  add' [a1, a2, a3] [b1, b2, b3] (λℓ. ℓ) ⇓ let n1 = a1 + b1 in
                                           let n2 = a2 + b2 in
                                           let n3 = a3 + b3 in
                                           [n1, n2, n3]

When add' is applied to a particular continuation, we can also reduce away the result list. Chaining together sequences of function calls leads to idiomatic and efficient assembly-style code, based just on Coq's normal term reduction, preserving sharing of let-bound variables. This level of inlining is common for the inner loops of crypto primitives, and it will also simplify the static analysis described in the next subsection.

C. Word-Size Inference

Up to this point, we have derived code that looks almost exactly like the assembly code we want to produce. The code is structured to avoid overflows when run with fixed-precision integers, but so far it is only proven correct for natural numbers. The final major step is to infer a range of possible values for each variable, allowing us to assign each one a register or stack-allocated variable of the appropriate bit width.

The bounds-inference pass works by standard abstract interpretation with intervals. As inputs, we require lower and upper bounds for the integer values of all arguments of a function. These bounds are then pushed through all operations to infer bounds for temporary variables. Each temporary is assigned the smallest bit width that can accommodate its full interval.
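As a small illustration of this pass (with our own hypothetical names, not the library's), intervals can be propagated through the three-limb add and each temporary assigned the smallest available machine width:

```python
# Unverified sketch of interval-based bounds inference (names are ours):
# abstract interpretation of + on intervals, then width selection.
def add_interval(x, y):
    """Abstract counterpart of +: add lower and upper bounds pointwise."""
    return (x[0] + y[0], x[1] + y[1])

def width_for(interval, available=(32, 64)):
    """Smallest available bit width whose value range covers the interval."""
    for w in available:
        if interval[1] <= (1 << w) - 1:
            return w
    raise OverflowError("no available temporary size fits")

# The artificial bounds discussed in this section:
a1 = a2 = a3 = b1 = (0, 2**31)
b2 = b3 = (0, 2**30)
n1 = add_interval(a1, b1)   # upper bound 2^32: just misses 32 bits
n2 = add_interval(a2, b2)   # upper bound 2^31 + 2^30: fits in 32 bits
n3 = add_interval(a3, b3)
```

The certified compiler performs this analysis on ASTs with proofs; the sketch only shows why n1 is forced into a 64-bit temporary while n2 and n3 fit in 32 bits.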
  prime                                    | architecture | # limbs | base    | representation (distributing large x into x0…xn)
  2^256 − 2^224 + 2^192 + 2^96 − 1 (P-256) | 64-bit       | 4       | 2^64    | x = x0 + 2^64 x1 + 2^128 x2 + 2^192 x3
  2^255 − 19 (Curve25519)                  | 64-bit       | 5       | 2^51    | x = x0 + 2^51 x1 + 2^102 x2 + 2^153 x3 + 2^204 x4
  2^255 − 19 (Curve25519)                  | 32-bit       | 10      | 2^25.5  | x = x0 + 2^26 x1 + 2^51 x2 + 2^77 x3 + ... + 2^204 x8 + 2^230 x9
  2^448 − 2^224 − 1 (p448)                 | 64-bit       | 8       | 2^56    | x = x0 + 2^56 x1 + 2^112 x2 + ... + 2^392 x7
  2^127 − 1                                | 64-bit       | 3       | 2^42.5  | x = x0 + 2^43 x1 + 2^85 x2

Fig. 2: Examples of big-integer representations for different primes and integer widths
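The weight pattern in Fig. 2 can be reproduced mechanically. The following unverified Python sketch (names are ours; the greedy split is an illustration, not the library's conversion algorithm) builds the ten-limb base-2^25.5 Curve25519 weights 2^⌈25.5i⌉ and checks that splitting and re-evaluating an integer round-trips:

```python
import math

# Unverified sketch (names ours): limb weights 2^ceil(25.5 * i) as in
# Fig. 2, plus conversion to and from the multi-limb representation.
def weights(base_bits, n):
    """Weight of limb i is 2^ceil(base_bits * i)."""
    return [1 << math.ceil(base_bits * i) for i in range(n)]

def to_limbs(x, ws):
    """Greedy little-endian split of x >= 0 along the weight boundaries."""
    limbs = []
    for i, w in enumerate(ws):
        if i + 1 < len(ws):
            limbs.append((x % ws[i + 1]) // w)
        else:
            limbs.append(x // w)   # everything left goes in the top limb
    return limbs

def eval_limbs(limbs, ws):
    """The abstraction function: sum of limb * weight."""
    return sum(l * w for l, w in zip(limbs, ws))
```

For x < 2^255 the resulting limbs alternate between at most 26 and 25 bits, matching the mixed-radix pattern described in Section III-A.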

As an artificial example, assume the input bounds a1, a2, a3, b1 ∈ [0, 2^31] and b2, b3 ∈ [0, 2^30]. The analysis concludes n1 ∈ [0, 2^32] and n2, n3 ∈ [0, 2^30 + 2^31]. The first temporary is just barely too big to fit in a 32-bit register, while the second two will fit just fine. Therefore, assuming the available temporary sizes are 32-bit and 64-bit, we can transform the code with precise size annotations.

  let n1 : N_2^64 = a1 + b1 in
  let n2 : N_2^32 = a2 + b2 in
  let n3 : N_2^32 = a3 + b3 in
  [n1, n2, n3]

Note how we may infer different temporary widths based on different bounds for arguments. As a result, the same primitive inlined within different larger procedures may get different bounds inferred. World-champion code for real algorithms takes advantage of this opportunity.

This phase of our pipeline is systematic enough that we chose to implement it as a certified compiler. That is, we define a type of abstract syntax trees (ASTs) for the sorts of programs that earlier phases produce, we reify those programs into our AST type, and we run compiler passes written in Coq's Gallina functional programming language. Each pass is proved correct once and for all, as Section IV explains in more detail.

D. Compilation To Constant-Time Machine Code

What results is straightline code very similar to that written by hand by experts, represented as ASTs in a simple language with arithmetic and bitwise operators. Our correctness proofs connect this AST to specifications in terms of integer arithmetic, such as the one for add above. All operations provided in our lowest-level AST are implemented with input-independent execution time in many commodity compilers and processors, and if so, our generated code is trivially free of timing leaks. Each function is pretty-printed as C code and compiled with a normal C compiler, ready to be benchmarked or included in a library. We are well aware that top implementation experts can translate C to assembly better than the compilers, and we do not try to compete with them: while better instruction scheduling and register allocation for arithmetic-heavy code would definitely be valuable, it is outside the scope of this project. Nevertheless, we are excited to report that our library generates the fastest-known C code for all operations we have benchmarked (Section V).

III. ARITHMETIC TEMPLATE LIBRARY

Recall Section II-A's toy add function for little-endian big integers. We will now describe our full-scale library, starting with the core unsaturated arithmetic subset, the foundation for all big-integer arithmetic in our development. For those who prefer to read code, we suggest src/Demo.v in the framework's source code, which contains a succinct standalone development of the unsaturated-arithmetic library up to and including prime-shape-aware modular reduction. The concrete examples derived in this section are within established implementation practice, and an expert would be able to reproduce them given an informal description of the strategy. Our contribution is to encode this general wisdom in concrete algorithms and provide these with correctness certificates without sacrificing the programmer's sanity.

A. Multi-Limbed Arithmetic

Cryptographic modular arithmetic implementations distribute very large numbers across smaller "limbs" of 32- or 64-bit integers. Fig. 2 shows a small sample of fast representations for different primes. Notice that many of these implementations use bases other than 2^32 or 2^64, leaving bits unused in each limb: these are called unsaturated representations. Conversely, the ones using all available bits are called saturated.

Another interesting feature shown in the examples is that the exponents of some bases, such as the original 32-bit Curve25519 representation [10], are not whole numbers. In the actual representation, this choice corresponds to an alternating pattern, so "base 2^25.5" uses 26 bits in the first limb, 25 in the second, 26 in the third, and so on. Because of the alternation, these are called mixed-radix bases, as opposed to uniform-radix ones.

These unorthodox big integers are fast primarily because of a specific modular-reduction trick, which is most efficient when the number of bits in the prime corresponds to a limb boundary. For instance, reduction modulo 2^255 − 19 is fastest when the bitwidths of a prefix of limbs sum to 255. Every unsaturated representation in our examples is designed to fulfill this criterion.

B. Partial Modular Reduction

Suppose we are working modulo a k-bit prime m. Multiplying two k-bit numbers can produce up to 2k bits, potentially much larger than m. However, if we only care about what the result is mod m, we can perform a partial modular
reduction to reduce the upper bound while preserving modular equivalence. (The reduction is "partial" because the result is not guaranteed to be the minimal residue.)

The most popular choices of primes in elliptic-curve cryptography are of the form m = 2^k − c_l 2^(t_l) − ... − c_0 2^(t_0), encompassing what have been called "generalized Mersenne primes," "Solinas primes," "Crandall primes," "pseudo-Mersenne primes," and "Mersenne primes." Although any number could be expressed this way, and the algorithms we describe would still apply, choices of m with relatively few terms (l ≪ k) and small c_i facilitate fast arithmetic.

Set s = 2^k and c = c_l 2^(t_l) + ... + c_0 2^(t_0), so m = s − c. To reduce x mod m, first split x by finding a and b such that x = a + sb. Then, a simple derivation yields a division-free procedure for partial modular reduction.

  x mod m = (a + sb) mod (s − c)
          = (a + (s − c)b + cb) mod (s − c)
          = (a + cb) mod m

The choice of a and b does not further affect the correctness of this formula, but it does influence how much the input is reduced: picking a = x and b = 0 would make this formula a no-op. One might pick a = x mod s, although the formula does not require it. Here is where careful choices of big-integer representation help: having s be at a limb boundary allows for a good split without any computation!

Our Coq proof of this trick is reproduced here:

Lemma reduction_rule a b s c (_: s - c <> 0)
 : (a + s * b) mod (s - c) =
   (a + c * b) mod (s - c).
Proof.
 replace (a+s*b) with ((a+c*b)+b*(s-c)).
 rewrite add_mod,mod_mult,add_0_r,mod_mod.
 all: auto; nsatz.
Qed.

  s × t = 1 × s0t0      + 2^43 × s0t1    + 2^85 × s0t2
        + 2^43 × s1t0   + 2^86 × s1t1    + 2^128 × s1t2
        + 2^85 × s2t0   + 2^128 × s2t1   + 2^170 × s2t2
        = s0t0 + 2^43(s0t1 + s1t0) + 2^85(s0t2 + 2s1t1 + s2t0) + 2^127(2s1t2 + 2s2t1) + 2^170 × s2t2

Fig. 3: Distributing terms for multiplication mod 2^127 − 1

We format the first intermediate term suggestively: down each column, the powers of two are very close together, differing by at most one. Therefore, it is easy to add down the columns to form our final answer, split conveniently into digits with integral bit widths.

At this point we have a double-wide answer for multiplication, and we need to do modular reduction to shrink it down to single-wide. For our example, the last two digits can be rearranged so that the modular-reduction rule applies:

    2^127(2s1t2 + 2s2t1) + 2^170 s2t2           (mod 2^127 − 1)
  = 2^127((2s1t2 + 2s2t1) + 2^43 s2t2)          (mod 2^127 − 1)
  = 1 · ((2s1t2 + 2s2t1) + 2^43 s2t2)           (mod 2^127 − 1)

As a result, we can merge the second-last digit into the first and merge the last digit into the second, leading to this final formula specifying the limbs of a single-width answer:

  (s0t0 + 2s1t2 + 2s2t1) + 2^43(s0t1 + s1t0 + s2t2) + 2^85(s0t2 + 2s1t1 + s2t0)

D. Associational Representation

As is evident by now, the most efficient code makes use of sophisticated and specific big-number representations, but many concrete implementations operate on the same set of underlying principles. Our strategy of writing high-level templates allows us to capture the underlying principles in a fully general way, without committing to a run-time representation. Abstracting away the implementation-specific details, like the exact number of limbs or whether the base system is mixed- or uniform-radix, enables simple high-level proofs of all necessary properties.

A key insight that allows us to simplify the arithmetic proofs is to use multiple different representations for big integers, as long as changes between these representations can be simplified away once the prime shape is known. We have two main representations of big integers in the core library:
C. Example: Multiplication Modulo 2127 − 1                               associational and positional. A big integer in associational
   Before describing the general modular-reduction algorithm             form is represented as a list of pairs; the first element in each
implemented in our library, we will walk through multiplica-             pair is a weight, known at compile time, and the second is
tion specialized to just one modulus and representation. To              a runtime value. The decimal number 95 might be encoded
simplify matters a bit, we use the (relatively) small modulus            as [(16, 5); (1, 15)], or equivalently as [(1, 5);
2127 −1. Say we want to multiply 2 numbers s and t in its field,         (10, 9)] or [(1, 1); (1, 6); (44, 2)]. As long
with those inputs broken up as s = s0 + 243 s1 + 285 s2 and              as the sum of products is correct, we do not care about
t = t0 + 243 t1 + 285 t2 . Distributing multiplication repeatedly        the order, whether the weights are powers of each other, or
over addition gives us the answer form shown in Fig. 3.                  whether there are multiple pairs of equal weight.
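To make the associational encoding concrete, here is a small Python sketch of ours (not the paper's Coq code) checking that the encodings of 95 above all evaluate alike, and that pairwise schoolbook multiplication respects evaluation:

```python
# Sketch of associational big integers: lists of (compile-time weight,
# runtime value) pairs, where only the sum of weight*value products matters.
def eval_assoc(p):
    return sum(w * v for w, v in p)

def mul_assoc(p, q):
    # schoolbook multiplication: weights multiply, values multiply
    return [(wp * wq, vp * vq) for wp, vp in p for wq, vq in q]

# the three equivalent encodings of 95 from the text
encodings = [
    [(16, 5), (1, 15)],
    [(1, 5), (10, 9)],
    [(1, 1), (1, 6), (44, 2)],
]
assert all(eval_assoc(p) == 95 for p in encodings)

# multiplication is correct no matter which encodings we pick
assert eval_assoc(mul_assoc(encodings[0], encodings[1])) == 95 * 95
```

Note that the product list may contain several pairs of equal weight and weights outside any fixed base; the representation tolerates both.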
Definition mul (p q:list(Z*Z)) : list(Z*Z)
 := flat_map (λt,
     map (λt',(fst t*fst t', snd t*snd t'))
    q) p.

Lemma eval_map_mul (a x:Z) (p:list (Z*Z))
  : eval (map (λt, (a*fst t, x*snd t)) p)
  = a*x * eval p.
Proof. induction p; push; nsatz. Qed.

Lemma eval_mul p q
  : eval (mul p q) = eval p * eval q.
Proof. induction p; cbv [mul]; push; nsatz. Qed.

Fig. 4: Definition and correctness proof of multiplication

Fixpoint place (t:Z*Z) (i:nat) : nat*Z :=
 if dec (fst t mod weight i = 0) then
  let c:=fst t / weight i in (i, c*snd t)
 else match i with
      | S i' => place t i'
      | O => (O, fst t*snd t) end.

Definition from_associational n p :=
 List.fold_right (λt,
    let wx := place t (pred n) in
    add_to_nth (fst wx) (snd wx) )
   (zeros n) p.

Lemma place_in_range (t:Z*Z) (n:nat)
 : (fst (place t n) < S n)%nat.
Proof.
 cbv [place]; induction n; break_match;
 autorewrite with cancel_pair; omega. Qed.

Lemma weight_place t i
 : weight (fst(place t i)) * snd(place t i)
 = fst t * snd t.
Proof. (* ... 4 lines elided ... *) Qed.

Lemma eval_from_associational {n} p
 : eval (from_associational (S n) p)
 = Associational.eval p.
Proof.
 cbv [from_associational]; induction p;
 push; try pose proof place_in_range a n;
 try omega; try nsatz. Qed.

Fig. 5: Coq definition and selected correctness proofs of conversion from associational to positional representation

In associational format, arithmetic operations are extremely easy to reason about. Addition is just concatenation of the lists. Schoolbook multiplication (see Fig. 4) is also simple: (a1·x1 + ...)(b1·y1 + ...) = (a1b1·x1y1 + ...), where a1b1 is the new compile-time-known weight. Because of the flexibility allowed by associational representation, we do not have to worry about whether the new compile-time weight produced is actually in the base. If we are working with the base 2^25.5 in associational representation, and we multiply two pairs that both have weight 2^26, we can just keep the 2^52 rather than the base's 2^51 weight.

Positional representation, on the other hand, uses lists of runtime values with weight functions. There is only one runtime value for each weight, and order matters. This format is closer to how a number would actually be represented at runtime. Separating associational from positional representations lets us separate reasoning about arithmetic from reasoning about making weights line up. Especially in a system that must account for mixed-radix bases, the latter step is nontrivial and is better not mixed with other reasoning.

Fig. 5 shows the code and proofs for conversion from associational to positional format, including the fixing of misaligned weights. The helper function place takes a pair t = (w, x) (compile-time weight and runtime value) and recursively searches for the highest-index output of the weight function that w is a multiple of. After finding it, place returns an index i and the runtime value to be added at that index: (w/weight_i · x). Then, from_associational simply adds each limb at the correct index.

The modular-reduction trick we explained in Section III-B can also be succinctly represented in associational format (Fig. 6). First we define split, which partitions the list, separating out pairs with compile-time weights that are multiples of the number at which we are splitting (s, in the code). Then it divides all the weights that are multiples of s by s and returns both lists. The reduce function simply multiplies the "high" list by c and adds the result to the "low" list. Using the mul we defined earlier recreates the folklore modular-reduction pattern for prime shapes in which c is itself multi-limb.

E. Carrying

In unsaturated representations, it is not necessary to carry immediately after every addition. For example, with 51-bit limbs on a 64-bit architecture, it would take 13 additions to risk overflow. Choosing which limbs to carry and when is part of the design and is critical for keeping the limb values bounded. Generic operations are easily parameterized on carry strategies, for example "after each multiplication carry from limb 4 to limb 5, then from limb 0 to limb 1, then from limb 5 to limb 6," etc. The library includes a conservative default.

F. Saturated Arithmetic

The code described so far was designed with unsaturated representations in mind. However, unsaturated-arithmetic performance degrades rapidly when used with moduli that it is not a good match for, so older curves such as P-256 need to be implemented using saturated arithmetic. Basic arithmetic on saturated digits is complicated by the fact that addition and multiplication outputs may not fit in single words, necessitating the use of two-output addition and multiplication.
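Two-output primitives of this kind can be modeled concretely. The following Python sketch (our illustration with hypothetical names, not the library's code) chains two-word addition through a carry flag and splits a full multiplication into low and high words:

```python
M = 2**64  # one 64-bit machine word

def add_with_carry(x, y, c):
    """Three-input addition returning (sum word, carry-out flag)."""
    s = x + y + c
    return s % M, s // M

def mul2(x, y):
    """Full multiplication returning (low word, high word)."""
    p = x * y
    return p % M, p // M

def add_2words(a, b):
    """Add two 2-word numbers, chaining the carry flag between words."""
    lo, c = add_with_carry(a[0], b[0], 0)
    hi, c = add_with_carry(a[1], b[1], c)
    return (lo, hi), c

a, b = (M - 1, 123), (5, 456)
(lo, hi), c = add_2words(a, b)
# the 3-word result represents exactly the sum of the 2-word inputs
assert lo + hi * M + c * M * M == (a[0] + a[1] * M) + (b[0] + b[1] * M)
assert mul2(M - 1, M - 1) == (1, M - 2)  # (2^64-1)^2 = (2^64-2)*2^64 + 1
```

In assembly, the carry flag returned by add_with_carry lives in a flag register and feeds the next add-with-carry instruction directly.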
Definition split (s:Z) (p:list (Z*Z))
  : list(Z*Z)*list(Z*Z) :=
 let hl :=
   partition (λt, fst t mod s =? 0) p in
 (snd hl,
  map (λt, (fst t / s, snd t)) (fst hl)).

Definition reduce(s:Z)(c:list _)(p:list _)
  : list(Z*Z) :=
 let lh := split s p in
 fst lh ++ mul c (snd lh).

Lemma eval_split s p (s_nz: s <> 0) :
    eval (fst (split s p))
     + s * eval (snd (split s p))
  = eval p.
Proof. (* ... (7 lines elided) *) Qed.

Lemma eval_reduce s c p
       (s_nz: s <> 0) (m_nz: s - eval c <> 0) :
    eval (reduce s c p) mod (s - eval c)
  = eval p mod (s - eval c).
Proof. cbv [reduce]; push.
 rewrite (* ... elided ... *) Qed.

Lemma S2_mod
 : eval (sat_add S1 (scmul q m)) mod m
 = eval S1 mod m.
Proof. push; zsimplify; auto. Qed.

Fig. 8: Proof that adjusting S to be 0 mod m preserves the correct value mod m

Positional representations are lists of integers (list Z); columns representations are lists of lists of integers (list (list Z)). Columns representations are very similar to positional ones, except that the values at each position have not yet been added together. Starting from the columns representation, we can add terms together in a specific order so that carry bits are used right after they are produced, enabling them to be stored in flag registers in assembly code. The code to convert from associational to columns is very similar to associational-to-positional conversion (Fig. 7).

With these changes in place, much of the code shown before for associational works on positional representations as well. For instance, the partial modular-reduction
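The behavior of split and reduce can be spot-checked with a small Python model (ours, not the library's code); eval_assoc sums weight-value products, and the check uses the Curve25519 prime 2^255 − 19:

```python
def eval_assoc(p):
    return sum(w * v for w, v in p)

def split(s, p):
    # separate pairs whose compile-time weight is a multiple of s,
    # dividing those weights by s
    lo = [(w, v) for w, v in p if w % s != 0]
    hi = [(w // s, v) for w, v in p if w % s == 0]
    return lo, hi

def reduce(s, c, p):
    # low list plus c times the high list; c may itself be multi-limb
    lo, hi = split(s, p)
    return lo + [(wc * wh, vc * vh) for wc, vc in c for wh, vh in hi]

s, c = 2**255, [(1, 19)]                  # m = 2^255 - 19
m = s - eval_assoc(c)
p = [(1, 3), (2**255, 7), (2**510, 11)]   # a double-wide value
assert eval_assoc(reduce(s, c, p)) % m == eval_assoc(p) % m
```

As in the Coq development, the result is only partially reduced: it is congruent to the input modulo m, and a second application of reduce (plus carrying) shrinks any remaining high-weight pairs.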
Given eval S < eval m + eval B;
To show:
  eval (div_r (add (add S (scmul a B)) (scmul q m))) < eval m + eval B
  (eval S + a * eval B + q * eval m)/r < eval m + eval B
  (eval m * r + r * eval B − 1)/r < eval m + eval B
  eval m + eval B − 1 < eval m + eval B

Fig. 9: Intermediate goals of the proof that the word-by-word Montgomery reduction state value remains bounded by a constant, generated by single-stepping our proof script

more significant limbs of a commutes with adding multiples of m. Even better, adding a multiple of m and a multiple of b can only increase the length of S by one limb, balancing out the loss of one limb when dividing S by r and allowing for an implementation that loops through limbs of a while operating in-place on S. The proof of the last fact is delicate: for example, it is not true that one of the additions always produces a newly nonzero value in the most significant limb and the other never does. In our library, this subtlety is handled by a 15-line/100-word proof script; the high-level steps, shown in Fig. 9, admittedly fail to capture the amount of thought that was required to find the invariant that makes the bounds line up exactly right.

IV. CERTIFIED BOUNDS INFERENCE

Recall from Section II-B how we use partial evaluation to specialize the functions from the last section to particular parameters. The resulting code is elementary enough that it becomes more practical to apply relatively well-understood ideas from certified compilers. That is, as sketched in Section II-C, we can define an explicit type of program abstract syntax trees (ASTs), write compiler passes over it as Coq functional programs, and prove those passes correct.

A. Abstract Syntax Trees

The results of partial evaluation fit, with minor massaging, into this intermediate language:

   Base types b
       Types τ ::= b | unit | τ × τ
   Variables x
   Operators o
 Expressions e ::= x | o(e) | () | (e, e)
                 | let (x1, ..., xn) = e in e

Types are trees of pair-type operators × where the leaves are one-element unit types and base types b, the latter of which come from a domain that is a parameter to our compiler. It will be instantiated differently for different target hardware architectures, which may have different primitive integer types. When we reach the certified compiler's part of the pipeline, we have converted earlier uses of lists into tuples, so we can optimize away any overhead of such value packaging.

Another language parameter is the set of available primitive operators o, each of which takes a single argument, which is often a tuple of base-type values. Our let construct bakes in destructuring of tuples, in fact using typing to ensure that all tuple structure is deconstructed fully, with variables bound only to the base values at a tuple's leaves. Our deep embedding of this language in Coq uses dependent types to enforce that constraint, along with usual properties like lack of dangling variable references and type agreement between operators and their arguments.

Several of the key compiler passes are polymorphic in the choices of base types and operators, but bounds inference is specialized to a set of operators. We assume that each of the following is available for each type of machine integers (e.g., 32-bit vs. 64-bit).

Integer literals: n
Unary arithmetic operators: −e
Binary arithmetic operators: e1 + e2, e1 − e2, e1 × e2
Bitwise operators: e1 << e2, e1 >> e2, e1 & e2, e1 | e2
Conditionals: if e1 ≠ 0 then e2 else e3
Carrying: addWithCarry(e1, e2, c), carryOfAdd(e1, e2, c)
Borrowing: subWithBorrow(c, e1, e2), borrowOfSub(c, e1, e2)
Two-output multiplication: mul2(e1, e2)

We explain only the last three categories, since the earlier ones are familiar from C programming. To chain together multiword additions, as discussed in the prior section, we need to save overflow bits (i.e., carry flags) from earlier additions, to use as inputs into later additions. The addWithCarry operation implements this three-input form, while carryOfAdd extracts the new carry flag resulting from such an addition. Analogous operators support subtraction with borrowing, again in the grade-school-arithmetic sense. Finally, we have mul2 to multiply two numbers to produce a two-number result, since multiplication at the largest available word size may produce outputs too large to fit in that word size.

All operators correspond directly to common assembly instructions. Thus the final outputs of compilation look very much like assembly programs, just with unlimited supplies of temporary variables, rather than registers.

    Operands O ::= x | n
 Expressions e ::= let (x1, ..., xn) = o(O, ..., O) in e
                 | (O, ..., O)

We no longer work with first-class tuples. Instead, programs are sequences of primitive operations, applied to constants and variables, binding their (perhaps multiple) results to new variables. A function body, represented in this type, ends in the function's (perhaps multiple) return values.

Such functions are easily pretty-printed as C code, which is how we compile them for our experiments. Note also that the language enforces the constant-time security property by construction: the running time of an expression leaks no information about the values of the free variables. We do hope, in follow-on work, to prove that a particular compiler preserves the constant-time property in generated assembly. For now, we presume that popular compilers like GCC and Clang do preserve constant time, modulo the usual cycle
of occasional bugs and bug fixes. (One additional source-level restriction is important, forcing conditional expressions to be those supported by native processor instructions like conditional move.)

B. Phases of Certified Compilation

To begin the certified-compilation phase of our pipeline, we need to reify native Coq programs as terms of this AST type. To illustrate the transformations we perform on ASTs, we walk through what the compiler does to an example program:

    let (x1, x2, x3) = x in
    let (y1, y2) = ((let z = x2 × 1 × x3 in z + 0), x2) in
    y1 × y2 × x1

The first phase is linearize, which cancels out all intermediate uses of tuples and immediate let-bound variables and moves all lets to the top level.

    let (x1, x2, x3) = x in
    let z = x2 × 1 × x3 in
    let y1 = z + 0 in
    y1 × x2 × x1

Next is constant folding, which applies simple arithmetic identities and inlines constants and variable aliases.

    let (x1, x2, x3) = x in
    let z = x2 × x3 in
    z × x2 × x1

At this point we run the core phase, bounds inference, the one least like the phases of standard C compilers. The phase is parameterized over a list of available fixed-precision base types with their ranges; for our example, assume the hardware supports bit sizes 8, 16, 32, and 64. Intervals for program inputs, like x in our running example, are given as additional inputs to the algorithm. Let us take them to be as follows: x1 ∈ [0, 2^8], x2 ∈ [0, 2^13], x3 ∈ [0, 2^18]. The output of the algorithm has annotated each variable definition and arithmetic operator with a finite type.

    let (x1 : N2^16, x2 : N2^16, x3 : N2^32) = x in
    let z : N2^32 = x2 ×_{N2^32} x3 in
    z ×_{N2^64} x2 ×_{N2^64} x1

Our biggest proof challenge here was in the interval rules for bitwise operators applied to negative numbers, a subject mostly missing from Coq's standard library.

C. Important Design Choices

Most phases of the compiler use a term encoding called parametric higher-order abstract syntax (PHOAS) [11]. Briefly, this encoding uses variables of the metalanguage (Coq's Gallina) to encode variables of the object language, to avoid most kinds of bookkeeping about variable environments; and for the most part we found that it lived up to that promise. However, we needed to convert to a first-order representation (de Bruijn indices) and back for the bounds-inference phase, essentially because it calls for a forward analysis followed by a backward transformation: calculate intervals for variables, then rewrite the program bottom-up with precise base types for all variables. We could not find a way with PHOAS to write a recursive function that returns both bounds information and a new term, taking better than quadratic time, while it was trivial to do with first-order terms. We also found that the established style of term well-formedness judgment for PHOAS was not well-designed for large, automatically generated terms like ours: proving well-formedness would frequently take unreasonably long, as the proof terms are quadratic in the size of the syntax tree. The fix was to switch well-formedness from an inductive definition into an executable recursive function that returns a linear number of simple constraints in propositional logic.

D. Extensibility with Nonobvious Algebraic Identities

Classic abstract interpretation with intervals works surprisingly well for most of the cryptographic code we have generated. However, a few spans of code call for some algebra to establish the tightest bounds. We considered extending our abstract interpreter with general algebraic simplification, perhaps based on E-graphs as in SMT solvers [12], but in the end we settled on an approach that is both simpler and more accommodating of unusual reasoning patterns. The best-known among our motivating examples is Karatsuba's multiplication, which expresses a multi-digit multiplication in a way with fewer single-word multiplications (relatively slow) but more additions (relatively fast), where bounds checking should recognize algebraic equivalence to a simple multiplication. Naive analysis would treat (a + b)(c + d) − ac equivalently to (a + b)(c + d) − xy, where x and y are known to be within the same ranges as a and c but are not necessarily equal to them. The latter expression can underflow. We add a new primitive operator equivalently such that, within the high-level functional program, we can write the above expression as equivalently((a + b)(c + d) − ac, ad + bc + bd). The latter argument is used for bounds analysis. The soundness theorem of bounds analysis requires, as a side condition, equality for all pairs of arguments to equivalently. These side conditions are solved automatically during compilation using Coq tactics, in this case ring algebra. Compilation then replaces any equivalently(e1, e2) with just e1.

V. EXPERIMENTAL RESULTS

The first purpose of this section is to confirm that implementing optimized algorithms in high-level code and then separately specializing to concrete parameters actually achieves the expected performance. Given the previous sections, this conclusion should not be surprising: as code generation is extremely predictable, it is fair to think of the high-level implementations as simple templates for low-level code. Thus, it would be possible to write new high-level code to mimic every low-level implementation. Such a strategy would, of course, be incredibly unsatisfying, although perhaps still worthwhile as a means to reduce effort required for correctness proofs. Thus, we wish additionally to demonstrate that the two simple
generic-arithmetic implementations discussed in Sections III-B   multiplications by the modular-reduction coefficient c = 19
and III-F are sufficient to achieve good performance in all      with one of the inputs of the multiplication instead of the
cases. Furthermore, we do our best to call out differences       output (19 × (a × b) −→ (19 × a) × b). This algebra pays off
that enable more specialized implementations to be faster,       because 19 × a can be computed using a 64-bit multiplication,
speculating on how we might modify our framework in the          whereas 19×(ab) would require a significantly more expensive
future based on those differences.                               128-bit multiplication. We currently implement this trick for
                                                                 just 2255 − 19 as an ad-hoc code replacement after partial
A. X25519 Scalar Multiplication                                  evaluation but before word-size inference, using the Coq
   We measure the number of CPU cycles different imple-          ring algebra tactic to prove that the code change does not
mentations take to multiply a secret scalar and a public         change the output. We are planning on incorporating a more
Curve25519 point (represented by the x coordinate in Mont-       general version of this optimization into the main pipeline,
gomery coordinates). Despite the end-to-end task posed for       after a planned refactoring to add automatic translation to
this benchmark, we believe the differences between implemen-     continuation-passing style (see Section VII).
tations we compare against lie in the field-arithmetic imple-       The results of a similar benchmark on an earlier version of
mentation – all use the same scalar-multiplication algorithm     our pipeline were good enough to convince the maintainers
and two almost-identical variants of the Montgomery-curve        of the BoringSSL library to adopt our methodology, resulting
x-coordinate differential-addition formulas.                     in this Curve25519 code being shipped in Chrome 64 and
   The benchmarks were run on an Intel Broadwell i7-5600U        used by default for TLS connection establishment in other
processor in a kernel module with interrupts, power manage-      Google products and services. Previously, BoringSSL included
ment, Hyper Threading, and Turbo Boost features disabled.        the amd64-64 assembly code and a 32-bit C implementation
Each measurement represents the average of a batch of 15,000     as a fallback, which was the first to be replaced with our
consecutive trials, with time measured using the RDTSC           generated code. Then, the idea was raised of taking advantage
instruction and converted to CPU cycles by multiplying by        of lookup tables to optimize certain point ECC multiplications.
the ratio of CPU and timestamp-counter clock speeds. C code      While the BoringSSL developers had not previously found
was compiled using gcc 7.3 with -O3 -march=native                it worthwhile to modify 64-bit assembly code and review
-mtune=native -fwrapv (we also tried clang 5.0, but              the changes, they made use of our code-generation pipeline
it produced ∼10% slower code for the implementations here).      (without even consulting us, the tool authors) and installed
    Implementation          CPU cycles                           a new 64-bit C version. The new code (our generated code
    amd64-64, asm             151586                             linked with manually written code using lookup tables) was
    this work B, 64-bit       152195                             more than twice as fast as the old version and was easily
    sandy2x, asm              154313                             chosen for adoption, enabling the retirement of amd64-64
    hacl-star, 64-bit         154982                             from BoringSSL.
    donna64, 64-bit C         168502
    this work A, 64-bit         174637
    this work, 32-bit           310585
    donna32, 32-bit C           529812

   In order, we compare against amd64-64 and sandy2x, the fastest assembly implementations from SUPERCOP [13] that use scalar and vector instructions respectively; the verified X25519 implementation from the HACL∗ project [4]; and the best-known high-performance C implementation, curve25519-donna, in both 64-bit and 32-bit variants. The field arithmetic in both amd64-64 and hacl-star has been verified using SMT solvers [1], [3]. We do not claim that the ranking in this table represents inherent differences between the 5 fastest implementations – we would be entirely unsurprised if a seemingly irrelevant change in the CPU, compiler, or source code rearranged them.
   We report on our code generated using the standard representations for both 32-bit and 64-bit, though we are primarily interested in the latter, since we benchmark on a 64-bit processor. We actually report on two of our 64-bit variants, both of which have correctness proofs of the same strength. Variant A is the output of the toolchain described in Section III; variant B instead expresses double-width multiplications using uint128 casts and bit shifts rather than the _mulx_u64 intrinsic.

B. P-256 Mixed Addition

   Next, we benchmark our Montgomery modular arithmetic as used for in-place point addition on the P-256 elliptic curve with one precomputed input (Jacobian += affine). A scalar-multiplication algorithm using precomputed tables would use some number of these additions, depending on the table size. The assembly-language implementation nistz256 was reported on by Gueron and Krasnov [14] and included in OpenSSL; we also measure its newer counterpart that makes use of the ADX instruction-set extension. The 64-bit C code we benchmark is also from OpenSSL and uses unsaturated-style modular reduction, carefully adding a couple of multiples of the prime each time before performing a reduction step with a negative coefficient to avoid underflow. These P-256 implementations are unverified. The measurement methodology is the same as for our X25519 benchmarks, except that we did not manage to get nistz256 running in a kernel module and report userspace measurements instead.
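To give a flavor of the saturated arithmetic being benchmarked here, the following C sketch (our illustration, not the paper's generated code) implements a branch-free addition modulo the P-256 prime on four full 64-bit limbs: a carry chain, a trial subtraction of the prime, and a masked selection between the two results. It assumes a compiler that supports the unsigned __int128 extension (gcc, clang).

```c
#include <stdint.h>

/* Little-endian 64-bit limbs of the P-256 prime
   p = 2^256 - 2^224 + 2^192 + 2^96 - 1. */
static const uint64_t P256[4] = {
    0xffffffffffffffffULL, 0x00000000ffffffffULL,
    0x0000000000000000ULL, 0xffffffff00000001ULL};

/* out = (a + b) mod p for a, b < p, with no secret-dependent branches. */
void p256_add(uint64_t out[4], const uint64_t a[4], const uint64_t b[4]) {
    uint64_t sum[4], diff[4];
    unsigned __int128 acc = 0;
    for (int i = 0; i < 4; i++) {           /* saturated carry chain */
        acc += (unsigned __int128)a[i] + b[i];
        sum[i] = (uint64_t)acc;
        acc >>= 64;
    }
    uint64_t carry = (uint64_t)acc;         /* 1 iff a + b >= 2^256 */
    uint64_t borrow = 0;
    for (int i = 0; i < 4; i++) {           /* trial subtraction of p */
        unsigned __int128 d =
            (unsigned __int128)sum[i] - P256[i] - borrow;
        diff[i] = (uint64_t)d;
        borrow = (uint64_t)(d >> 64) & 1;   /* 1 iff the limb underflowed */
    }
    /* Keep diff when a + b >= p (carry out, or no final borrow). */
    uint64_t mask = 0 - (carry | (borrow ^ 1));
    for (int i = 0; i < 4; i++)
        out[i] = (diff[i] & mask) | (sum[i] & ~mask);
}
```

For example, adding 2 to p − 1 wraps around to 1. The masked final select, rather than an if, is what keeps the code straightline and its timing independent of the operands.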
   Implementation        fastest    clang    gcc     icc
   nistz256 +ADX          ~550
   nistz256 AMD64         ~650
   this work A            1143      1811     1828    1143
   OpenSSL, 64-bit C      1151      1151     2079    1404
   this work B            1343      1343     2784    1521

   Saturated arithmetic is a known weak point of current compilers, resulting in implementors either opting for alternative arithmetic strategies or switching to assembly language. Our programs are not immune to these issues: when we first ran our P-256 code, it produced incorrect output because gcc 7.1.1 had generated incorrect code2; clang 4.0 exited with a mere segmentation fault.3 Even in later compiler versions where these issues have stopped appearing, the code generated for saturated arithmetic varies a lot between compilers and is obviously suboptimal: for example, there are ample redundant moves of carry flags, perhaps because the compilers do not consider flag registers during register allocation. Furthermore, expressing the same computation using intrinsics such as _mulx_u64 (variant A in the table) or using uint128 and a bit shift (variant B) can produce a large performance difference, in different directions on different compilers.
   The BoringSSL team had a positive enough experience with adopting our framework for Curve25519 that they decided to use our generated code to replace their P-256 implementation as well. First, they replaced their handwritten 64-bit C implementation. Second, while they had never bothered to write a specialized 32-bit P-256 implementation before, they also generated one with our framework. nistz256 remains as an option for use in server contexts where performance is critical and where patches can be applied quickly when new bugs are found. The latter is not a purely theoretical concern – Appendix A contains an incomplete list of issues discovered in previous nistz256 versions.
   The two curves thus generated with our framework for BoringSSL together account for over 99% of ECDH connections. Chrome version 65 is the first release to use our P-256 code.

C. Benchmarking All the Primes

   Even though the primes supported by the modular-reduction algorithms we implemented vastly outnumber those actually used for cryptography, we could not resist the temptation to do a breadth-oriented benchmark for plausible cryptographic parameters. The only other implementation of constant-time modular arithmetic for arbitrary modulus sizes we found is the mpn_sec section of the GNU Multiple Precision Arithmetic Library4. We also include the non-constant-time mpn interface for comparison. The list of primes to test was generated by scraping the archives of the ECC discussion list curves@moderncrypto.org, finding all supported textual representations of prime numbers.
   One measurement in Fig. 10 corresponds to 1000 sequential computations of a 256-bit Montgomery ladder, aiming to simulate cryptographic use accurately while still constraining the source of performance variation to field arithmetic. A single source file was used for all mpn_sec-based implementations; the modulus was specified as a preprocessor macro to allow compiler optimizations to try their best to specialize GMP code. Our arithmetic functions were instantiated using standard heuristics for picking the compile-time parameters based on the shorthand textual representation of the prime, picking a champion configuration out of all cases where correctness proof succeeds, with two unsaturated and one Montgomery-based candidates. The 64-bit trials were run on an x86 Intel Haswell processor and 32-bit trials were run on an ARMv7-A Qualcomm Krait (LG Nexus 4).
   We see a significant performance advantage for our code, even compared to the GMP version that “cheats” by leaking secrets through timing. Speedups range between 1.25X and 10X. Appendix B includes the full details, with Tables I and II recording all experimental data points, including for an additional comparison implementation in C++. In our current experiments, compilation of saturated arithmetic in Coq times out for a handful of larger primes. There is no inherent reason for this slowness, but our work-in-progress reimplementation of this pipeline (Section VII) is designed to be executable using the Haskell extraction feature of Coq, aiming to sidestep such performance unpredictability conclusively.

  2 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81300, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81294
  3 https://bugs.llvm.org/show_bug.cgi?id=24943
  4 https://gmplib.org/

   Fig. 10: Performance comparison of our generated C code vs. handwritten using libgmp. (Two panels, 64-bit and 32-bit field arithmetic benchmarks: time in seconds against log2(prime) from roughly 100 to 550, for the GMP mpn_sec API, the GMP mpn API, and this work.)

                      VI. RELATED WORK

A. Verified Elliptic-Curve Cryptography

   Several past projects have produced verified ECC implementations. We first summarize them and then give a unified comparison against our new work.
   Chen et al. [1] verified an assembly implementation of Curve25519, using a mix of automatic SAT solving and manual Coq proof for remaining goals. A later project by Bernstein and Schwabe [2], described as an “alpha test,” explores an alternative workflow using the Sage computer-algebra system. This line of work benefits from the chance for direct analysis of common C and assembly programs for unsaturated arithmetic, so that obviously established standards of performance can be maintained.
   Vale [5] supports compile-time metaprogramming of assembly code, with a cleaner syntax to accomplish the same tasks done via Perl scripts in OpenSSL. There is a superficial similarity to the flexible code generation used in our own work. However, Vale and OpenSSL use comparatively shallow metaprogramming, essentially just doing macro substitution, simple compile-time offset arithmetic, and loop unrolling. Vale has not been used to write code parameterized on a prime modulus (and OpenSSL includes no such code). A verified static analysis checks that assembly code does not leak secrets, including through timing channels.
   HACL∗ [4] is a cryptographic library implemented and verified in the F∗ programming language, providing all the functionality needed to run TLS 1.3 with the latest primitives.

Primitives are implemented in the Low∗ imperative subset of F∗ [15], which supports automatic semantics-preserving translation to C. As a result, while taking advantage of F∗’s high-level features for specification, HACL∗ performs comparably to leading C libraries. Additionally, abstract types are used for secret data to avoid side-channel leaks.
   Jasmin [6] is a low-level language that wraps assembly-style straightline code with C-style control flow. It has a Coq-verified compiler to 64-bit x86 assembly (with other targets planned), along with support for verification of memory safety and absence of information leaks, via reductions to Dafny. A Dafny reduction for functional-correctness proof exists but has not yet been used in a significant case study.
   We compare against these past projects in a few main claims.
   Advantage of our work: first verified high-performance implementation of P-256. Though this curve is the most widely used in TLS today, no prior projects had demonstrated performance-competitive versions with correctness proofs. A predecessor system to HACL∗ [3] verified more curves, including P-256, but in high-level F∗ code, incurring performance overhead above 100X.
   Advantage of our work: code and proof reuse across algorithms and parameters. A more fundamental appeal of our approach is that it finally enables the high-performance-cryptography domain to benefit from pleasant software-engineering practices of abstraction and code reuse. The code becomes easier to understand and maintain (and therefore to prove), and we also gain the ability to compile fast versions of new algorithm variants automatically, with correctness proofs. Considering the cited past work, it is perhaps worth emphasizing how big of a shift it is to do implementation and proving at a parameterized level: as far as we are aware, no past project on high-performance verified ECC has ever verified implementations of multiple elliptic curves or verified multiple implementations of one curve, targeted at different hardware characteristics. With our framework, all of those variants can be generated automatically, at low developer cost; and indeed the BoringSSL team took advantage of that flexibility.
   Advantage of our work: small trusted code base. Every past project we mentioned includes either an SMT solver or a computer-algebra system in the trusted code base. Furthermore, each past project also trusts some program-verification tool that processes program syntax and outputs symbolic proof obligations. We trust only the standard Coq theorem prover, which has a proof-checking kernel significantly smaller than the dependencies just mentioned, and the (rather short, whiteboard-level) statements of the formal claims that we make (plus, for now, the C compiler; see below).
   Disadvantage of our work: stopping at C rather than assembly. Several past projects apply to assembly code rather than C code. As a result, they can achieve higher performance and remove the C compiler from the trusted base. Our generated code is already rather close to assembly, differing only in the need for some register allocation and instruction scheduling. In fact, it seems worth pointing out that every phase of our code-generation pipeline is necessary to get code low-level enough to be accepted as input by Jasmin or Vale. We have to admit to nontrivial frustration with popular C compilers in the crypto domain, keeping us from using them to bridge this gap with low performance cost. We wish compiler authors would follow the maxim that, if a programmer takes handwritten assembly and translates it line-by-line to the compiler’s input language, the compiler should not generate code slower than the original assembly. Unfortunately, no popular compiler seems to obey that maxim today, so we are looking into writing our own (with correctness proof) or integrating with one of the projects mentioned above.
   Disadvantage of our work: constant-time guarantee mechanized only for straightline code. Our development also includes higher-level cryptographic code that calls the primitives described here, with little innovation beyond comprehensive functional-correctness proofs. That code is written following
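The constant-time discipline discussed above replaces secret-dependent branches with masked data movement. As a generic illustration (ours, not code from the library or the paper), a Montgomery-ladder-style conditional swap can be written branch-free like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the n-limb values a and b iff bit == 1, executing the same
   instruction sequence and memory accesses either way, so the secret
   bit does not influence timing. */
void ct_cswap(uint64_t *a, uint64_t *b, size_t n, uint64_t bit) {
    uint64_t mask = 0 - bit;            /* all-zeros or all-ones */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}
```

When bit is 0, the XOR masks are zero and both arrays are rewritten unchanged; the loop's work is identical in either case.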