Polymorphisation: Improving Rust compilation times through intelligent monomorphisation
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Polymorphisation: Improving Rust compilation times
through intelligent monomorphisation
David Wood (2198230W)
April 15, 2020
MSci Software Engineering with Work Placement (40 Credits)
ABSTRACT In addition, Rust’s compilation model is different from
Rust is a new systems programming language, designed to other programming languages. In C++, the compilation
provide memory safety while maintaining high performance. unit is a single file, whereas Rust’s compilation unit is a
One of the major priorities for the Rust compiler team is crate1 . While compilation times of entire C++ and entire
to improve compilation times, which is regularly requested Rust projects are generally comparable, modification of a
by users of the language. Rust’s high compilation times are single C++ file requires far less recompilation than modifi-
a result of many expensive compile-time analyses, a com- cation of a single Rust file. In recent years, implementation
pilation model with large compilation units (thus requiring of incremental compilation into rustc has improved recom-
incremental compilation techniques), use of LLVM to gen- pilation times after small modifications to code.
erate machine code, and the language’s monomorphisation Rust generates machine code using LLVM, a collection of
strategy. This paper presents a code size optimisation, im- modular and reusable compiler and toolchain technologies.
plemented in rustc, to reduce compilation times by intelli- At LLVM’s core is a language-independent optimiser, and
gently reducing monomorphisation through “polymorphisa- code-generation support for multiple processors. Languages
tion” - identifying where functions could remain polymor- including Rust, Swift, Julia and C/C++ have compilers us-
phic and producing fewer monomorphisations. By reduc- ing LLVM as a “backend”. By de-duplicating the effort of
ing the quantity of LLVM IR generated, and thus reduc- efficient code generation, LLVM enables compiler engineers
ing code generation time spent in LLVM by the compiler, to focus on implementing the parts of their compiler that are
this project achieves 3-5% compilation time improvements specific to the language (in the “frontend” of the compiler).
in benchmarks with heavy use of closures in highly-generic, Compiler frontends are expected to transform their internal
frequently-instantiated functions. representation of the user’s source code into LLVM IR, an
intermediate representation from LLVM which enables opti-
misation and code-generation to be written without knowl-
1. INTRODUCTION edge of the source language.
While LLVM enables Rust to have world-class runtime
Rust is a multi-paradigm systems programming language, performance, LLVM is a large framework which is not fo-
designed to be syntactically similar to C++ but with an ex- cused on compile-time performance. This is exacerbated by
plicit goal of providing memory safety and safe concurrency technical debt in rustc which results in the generation of
while maintaining high performance. The first stable release low-quality LLVM IR. Recent work on MIR2 optimisations
of the language was in 2015 and each year since 2016, Rust have improved the quality of rustc’s generated LLVM IR,
has been the “most loved programming language” in Stack and reduced the work required by LLVM to optimise and
Overflow’s Developer Survey [11, 12, 13, 14]. In the years produce machine code.
since, Rust has been used by many well-known companies, Finally, rustc monomorphises generic functions. Monomor-
such as Atlassian, Braintree, Canonical, Chef, Cloudflare, phisation is a strategy for compiling generic code, which du-
Coursera, Deliveroo, Dropbox, Google and npm. plicates the definition of generic functions for every instan-
Each year, the Rust project runs a “State of Rust” sur- tiation, producing significantly more generated code than
vey, asking users of the language on their opinions to help other translation strategies. As a result, LLVM has to per-
establishment development priorities for the project. Since form significant work to eliminate or optimise redundant IR.
2017 [22, 19, 20], faster compilation times have consistently This project improves the compilation times of Rust projects
been an area where users of the language say Rust needs to by implementing a code-size optimisation in rustc to elim-
improve. inate redundant monomorphisation, which will reduce the
There are a handful of reasons why rustc, the Rust com- quantity of generated LLVM IR and eliminate some of the
piler, is slow. Compared to other systems programming lan- work previously required of LLVM and thus improve compi-
guages, Rust has a moderately-complex type system and lation times. In particular, the analysis implemented by this
has to enforce the constraints that make Rust safe at com-
pilation time - analyses which take more time than those 1
Crates are entire Rust projects, including every Rust file
required for a language with a simpler type system. Lots in the project
of effort is being invested in optimising rustc’s analyses to 2
MIR is rustc’s “middle intermediate representation” and is
improve performance. introduced in Section 3.1.3
1project will detect where generic parameters to functions, compared to homogeneous translation, with only modest in-
closures and generators are unused and fewer monomorphi- creases in code size.
sations could be generated as a result. Dragos et al. [3] contribute an approach for compila-
Detection of unused generic parameters, while relatively tion of polymorphic code which allows the user to decide
simple, was deliberately chosen, as the primary contribution which code should be specialised (monomorphised). This
of this work is the infrastructure within rustc to support approach allows for specialised code to be compiled sepa-
polymorphic MIR during code generation, which will enable rately and mixed with generic code. Their approach, again
more complex extensions to the initial analysis (see further implemented in Scala, introduces an annotation which can
discussion in Section 5.1). Despite this, there is ample ev- be applied to type parameters. This annotation instructs the
idence in the Rust ecosystem, predating this project, that compiler to specialise code for that type parameter. Generic
detection of unused generic parameters could yield signifi- class definitions require that the approach presented must be
cant compilation time improvements. able to specialise entire class definitions while maintaining
Within Rayon, a data-parallelism library, a pull request subclass relationships so specialised and generic code can
[17] was submitted to reduce the usage of closures by moving interoperate. Evaluation is performed on the Scala stan-
them into nested functions, which do not inherit generic pa- dard library and a series of benchmarks where runtime per-
rameters (and thus would have fewer unnecessary monomor- formance is improved by over 20x for code size increases
phisations). Similarly, in rustc, the same author submitted of between 16% and 161%, however this approach requires
a pull request [16] to apply the same optimisation manually programmers correctly identify functions to annotate.
to the language’s standard library. Stucki et al. [18] identify that specialisation of generic
Likewise, in Serde, a popular serialisation library, it was code, which can improve performance by an order of magni-
observed [21] that a significant proportion of LLVM IR gen- tude, is only effective when called from monomorphic sites
erated was the result of a tiny closure within a generic func- or other specialised code. Performance of specialised code
tion (which inherited and did not use generic parameters of within these “islands” regresses to that of generic code when
the parent function, resulting in unnecessary monomorphisa- invoked from generic code. Stucki et al’s implementation
tions). In this particular case, the closure contributed more provides a Scala macro which creates a “specialised body”
LLVM IR than all but five significantly larger functions. containing the original generic code, where type parameters
This project makes all of these changes to rustc, Rayon have Scala’s specialisation annotation. Dispatch code is then
and Serde unnecessary, as the compiler will apply this opti- added, which matches on the incoming type and invokes the
misation automatically. correct specialisation. While this approach induces runtime
In Section 2, this paper reviews the existing research on overhead, this is made up for in the performance improve-
optimisations which aim to reduce compilation times or code ments achieved through specialisation. Their approach is
size through monomorphisation strategies. Next, Section 3 implementable in a library without compiler modifications
describes the polymorphisation analysis implemented by this and achieves speedups of up to 30x, averaging 12x when
project in more detail and Section 4 presents a thorough benchmarked with ScalaMeter.
analysis of the compilation time impact of polymorphisa- Ureche et al. [23] discuss efforts by the Java platform ar-
tion implemented in rustc. Finally, Section 5 summarises chitects in Project Valhalla to update the Java Virtual Ma-
the result of this research and describes future work that it chine (JVM)’s bytecode format to allow load-time class spe-
enables. cialisation, which will enable opt-in specialisation in Java.
Java is contrasted to Scala, which has three generic compila-
2. BACKGROUND tion schemes (erasure, specialisation and Ureche et al’s mini-
boxing). The interaction of these schemes can cause subtle
2.0.1 Monomorphisation-related Optimisations performance issues which also affect Project Valhalla. In
particular, miniboxed functions are still erased when invoked
Optimisation of monomorphisation to improve compila-
from erased functions, and boxing is required at generic com-
tion time is a subject that has received little attention. How-
pilation scheme boundaries. This paper presents three tech-
ever, within the Java ecosystem, lots of literature exists on
niques to eliminate the slowdowns associated with minibox-
addressing runtime performance impacts that result from
ing. The authors validate their technique in the miniboxing
monomorphisation (or a lack thereof).
plugin for Scala, producing performance improvements of
Ureche et al. [24] present a similar optimisation, “Mini-
2-4x.
boxing”, designed to target the Java Virtual Machine and
These papers focus on languages based on the JVM due to
implemented in Scala. Homogeneous and heterogeneous ap-
the performance impact that using primitives in generic code
proaches to generic code compilation are presented. Hetero-
can have. In order to preserve bytecode backwards com-
geneous approaches duplicate and adapt code for each type
patibility, the JVM requires that value types be converted
individually, increasing code size (also known as monomor-
to heap objects, typically through boxing, when interacting
phisation, the approach taken in Rust and C++). Ho-
with generic code. Scala has been of particular focus in this
mogeneous translation generates a single function but re-
research because of the inclusion of specialisation annota-
quires data to have a common representation, irrespective of
tions as a language feature.
type, which is typically achieved through pointers to heap-
allocated objects. Ureche et al. identify that larger value
types (e.g. integers) can contain smaller value types (e.g 2.0.2 Code Size Optimisations
bytes) and that this can be exploited to reduce the dupli- In order to improve Rust’s compilation times, this project
cation necessary in homogeneous translations. When im- implements an optimisation to reduce the quantity of gen-
plemented in the Scala compiler, performance matched het- erated LLVM IR. It is therefore relevant to review research
erogeneous translation and obtained speedups of over 22x into optimisations which aim to reduce code-size.
2Edler von Koch et al. [4] discuss a compiler optimisation code as when hand-optimised with annotations.
technique for reducing code size by exploiting the structural
similarity of functions. Structurally similar functions have 2.0.3 Compilation Time Optimisations
equivalent signatures and equivalent control flow graphs. Much like research which aims to reduce code-size, re-
Edler von Koch et al. introduce a technique for merging search into optimisations to reduce compilation time in more
two structurally similar functions by combining the bodies varied scenarios can be valuable to identify optimisation op-
of both functions and inserting control flow constructs to se- portunities being missed.
lect the correct instructions where the functions differ. The Han et al. [5] discuss a technique for reducing compi-
original functions are replaced with stubs which then call lation time in parallelising compilers. In this paper, the
the merged function with an additional argument to select authors inspect the OSCAR compiler, which parallelises C
the correct code path. The authors implemented this opti- or Fortran77 programs. OSCAR repeatedly applies multiple
misation in LLVM and achieved code size reductions of an program analysis passes and restructuring passes. After OS-
average 4% in Spec CPU2006 benchmarks. CAR’s second iteration, Han et al. identify that the OSCAR
Debray et al. [2] present a series of transformations which will apply an analysis to a function irrespective of whether
can be performed by the compiler to reduce code size. Trans- the restructuring passes changed the function. Han et al.
formations are split into two categories, classical analyses propose an optimisation to reduce compilation time by re-
and optimisations and techniques for code factoring. Most moving redundant analyses. Compilation time reductions of
relevant to the techniques employed by this project is 27%, 13% and 19% are achieved for three large-scale propri-
redundant-code elimination. Redundant computations ex- etary real applications.
ist where the same computation has been completed pre- Leopoldseder et al. [9] present a technique to determine
viously and that result is guaranteed to still be available. when it is beneficial to duplicate instructions from merge
In their implementation, redundant code elimination tech- blocks to their predecessors in order to attain further optimi-
niques were particularly effective at removing redundant sation. The authors implement their approach, Dominance-
computation of global pointer values (on the Alpha proces- based duplication simulation (DBDS), in GraalVM JIT com-
sor where the techniques were implemented) when enter- piler, which introduces a constraint that the optimisation
ing functions. Debray et al. evaluate the optimisations by must consider the impacts on compilation time. Leopoldseder
implementing them in a post-link-time optimiser, alto, and et al. identify that constant folding, conditional elimina-
comparing the code size of a set of benchmarks before-and- tion, partial escape analysis/scalar replacement and read
after optimisation by alto, achieving over 30% reductions. elimination are achievable after code duplication has taken
Kim et al. [7] present an iterative approach to the proce- place. Other code duplication approaches use backtracking
dural abstraction techniques described by Debray et al. Pro- to tentatively perform duplication while maintaining the op-
cedural abstraction identifies sequences of instructions (in- tion to reverse the duplication if it is determined not prof-
struction patterns) which are repeated and creates a proce- itable. Backtracking approaches are compile-time intensive
dure containing the pattern, replacing the original instances and thus not suitable for JIT compilation. Leopoldseder et
with a call to the procedure. In traditional procedural ab- al. choose to simulate duplication opportunities and then
straction approaches, such as those discussed in the orig- fit those opportunities into a cost model which maximises
inal paper [2], pattern identification is followed by a sav- peak performance while minimising compilation time and
ing estimation before patterns are replaced. In contrast, code size. The authors achieve performance improvements
the approach presented by Kim et al. iterates global sav- of up to 40%, mean performance improvements of 5.89%,
ing estimation and pattern replacement following pattern mean code size increases of 9.93% and mean compilation
identification. In eight benchmarks, an implementation tar- time increases of 18.84% across their benchmarks.
geting a CalmRISC8 embedded processor, averaged a 14.9% Unfortunately, none of these projects address the prob-
reduction in code size compared to traditional procedural lem of reducing code-size when monomorphisation is used
abstraction techniques. However, due to the small number exclusively as a strategy for compiling generic code.
of instructions and registers on the CalmRISC8 processor,
generated code would present many opportunities to the op- 3. POLYMORPHISATION
timisation which could skew results higher than might be
Polymorphisation, a concept introduced by this paper, is
achieved on more common instruction sets.
an optimisation which determines when functions, closures
Petrashko et al. [15] present context-sensitive call-graph
and generators could remain polymorphic during code gen-
construction algorithms which exploits the types provided to
eration.
type parameters as contexts. They demonstrate through a
Rust supports parameterising functions by constants or
motivational example that context-sensitive call graph anal-
types - these are known as generic functions and this feature
yses result in many calls being considered highly polymor-
is similar to generics in other languages, like Java’s gener-
phic and not able to be inlined. Through running analyses
ics or C++’s templates. Generic functions and types are
separately in the context of each possible type argument
a desirable feature for programmers as they enable greater
(possible through their context-sensitive call graph), the au-
code reuse. When generating machine code, there are two
thors are able to remove all boxing and unboxing in ad-
approaches to dealing with generic functions - monomorphi-
dition to producing monomorphic calls which enable Java’s
sation and boxing.
JIT to inline and aggressively optimise functions. Petrashko
C++ and Rust perform monomorphisation, where multi-
et al. find that through use of the context-sensitive analy-
ple copies of a function are generated for each of the types or
sis, performance improves by 1.4x and discovers 20% more
constants that the function was instantiated with. In con-
monomorphic call sites. Generated bytecode size was re-
trast, Java performs dynamic dispatch, where each object is
duced by up to 5x and achieved the same performance on
heap-allocated and a single copy of a function is generated,
3which takes an address. tion of polymorphisation.
In addition, Rust compiles to LLVM IR, the intermediate
representation of LLVM. LLVM IR doesn’t have any concept 3.1.1 Query System
of generics, so Rust must perform either dynamic dispatch Traditionally, compilers are implemented in passes, where
or monomorphisation. source code, an abstract syntax tree or intermediate repre-
The initial polymorphisation analysis implemented in this sentation is processed multiple times. Each pass takes the
project determines when a type or constant parameter to a result of the previous pass and performs an analysis or op-
function, closure or generator is unused, and thus when this eration - such as lexical analysis, parsing, semantic analysis,
would result in multiple redundant copies of the function optimisation and code generation. Multi-pass compilers are
being generated. By generating fewer redundant monomor- well-established in the literature, but the expectations on
phisations of functions in the LLVM IR, there would be less compilers have changed in recent years.
work for LLVM to do, reducing compilation times and code Modern programming languages are expected to have high-
size. quality integration into development environments [10], and
Types with unused generic parameters are disallowed by assist in powering code completion engines and convenience
rustc, but there are no checks for unused generic parameters features such as jump-to-definition. Moreover, users ex-
in functions. Despite this, it is assumed that it is rare for pect this information quickly and while they are writing
programmers to write functions which have unused generic their code (when it might not parse correctly or type-check).
parameters. However, closures inherit the generic parame- In addition, many programming languages’ designs make it
ters of their parent functions and often don’t make use of hard or impossible to statically determine the correct pro-
these parameters. cessing order, as expected by traditional multi-pass compiler
For example, consider the code shown in Listing 1, which architectures.
was taken from Serde, a popular Rust serialisation library. rustc’s query system enables incremental and “demand-
driven” compilation, and is integral to satisfying these expec-
1 fn parse_value( tations. Polymorphisation, as implemented in this project,
2 &mut self, uses the query system and limitations of the query system
3 visitor: V, are reflected in its design (see Section 3.1.4).
4 ) -> Result In the query system, the compiler can be thought of as
5 where having a “database” of knowledge about the current crate,
6 V: de::Visitoractual parsing from input data. As such, queries form a parameter is an integer which determines its order in the
directed-acyclic graph. list of generic parameters in scope. To perform a substitu-
Queries could form a cyclic graph by invoking themselves. tion, the subst function on Ty is invoked with a SubstsRef
To prevent this, the query engine checks for cyclic invoca- which replaces each TyKind::Param with the type from the
tions and aborts execution. SubstsRef with the corresponding index.
As all queries are invoked through the query context, The details of how generic parameters are implemented
accesses can be recorded and a dependency graph can be within rustc are directly leveraged by the polymorphisation
produced. To avoid unnecessary computation when inputs implemented in this project, as well as the TypeFoldable
change, the query utilises the dependency graph in the “red- infrastructure used to implement susbts.
green algorithm” used for query re-evaluation. TypeFoldable is a trait (Rust’s equivalent of a Java in-
try_mark_green is the primary operation in the red-green terface) implemented by types that embed type informa-
algorithm. Before evaluating a query, the red-green algo- tion, allowing the recursive processing of the contents of the
rithm checks each of the dependencies of the query. Each TypeFoldable. TypeFolder is another trait used in con-
dependent query will have a colour: red, green or unknown. junction with TypeFoldable. TypeFolder is implemented
Green queries’ inputs haven’t changed, and the cached re- on types (like SubstsFolder, which is used by subst) and
sult can be re-used. Red queries’ inputs have changed, and is used to implement behaviour when a type, const, re-
the result must be recomputed. gion or binder is encountered in a type. Used together, the
If the dependency’s colour is unknown then TypeFolder and TypeFoldable traits allow for traversal and
try_mark_green is invoked recursively. Should the depen- modification of complex recursive type structures.
dency be successfully marked as green, then the algorithm
continues with the next of the current query’s dependencies. 3.1.3 MIR
However, if the dependency was marked as red, then the de- MIR (mid-level IR) is an intermediate representation of
pendent query must be recomputed. When the result of the functions, closures, generators and statics used in rustc, con-
dependent query was the same as before re-execution then structed from the HIR and type inference results. It is a sim-
it would not invalidate the current node. plified version of Rust that is better suited to flow-sensitive
analyses. For example, all loops are simplified through use
3.1.2 Types and Substitutions of a “goto” construct; match expressions are simplified by
rustc represents types with the ty::Ty type, which repre- introducing a “discriminant” statement; method calls are re-
sents the semantics of a type (in contrast to placed with explicit calls to trait functions; and drops/panics
hir::Ty from the HIR (high-level IR), rustc’s desugared are made explicit.
AST, which represents the syntax of a type). MIR is based on a control-flow graph, composed of basic
ty::Ty is produced during type inference on the HIR, and blocks. Each basic block contains zero or more statements
then used for type-checking. Ty in the HIR assumes that any with a single successor, and end with a terminator, which
two types are distinct until proven otherwise, u32 (Rust’s 32- can have multiple successors.
bit unsigned integer type) at line 10, column 20 would be Memory locations on the stack are known as “locals” in
considered different to u32 at line 20, column 4. the MIR. Locals include function arguments, temporaries
ty::Ty is used to represent the “substitutions” for 8 }
a type (conceptually, a list of types that are to be substituted
with the generic parameters of the type). SubstsRef>, where
GenericArg is a space-efficient wrapper around Within rustc, the MIR is a set of data structures, but
GenericArgKind. GenericArgKind represents what kind of can be dumped textually for debugging. Consider the MIR
generic parameter is present - type, lifetime or const. for the Rust code in Listing 2, it starts with the name and
Ty contains TyKind::Param instances, each of which has signature of the function in a pseudo-Rust syntax, shown in
a name and index. Only the index is required, the name is Listing 3.
for debugging and error reporting. The index of the generic Following the function signature, the variable declarations
51 // WARNING: This output format is intended for 1 bb0: {
2 // human consumers only and is subject to 2 StorageLive(_1);
3 // change without notice. Knock yourself out. 3 _1 = const
4 fn main() -> () { ,→ std::collections::HashSet::::new() ->
5 ... ,→ bb2;
6 } 4 }
Listing 3: Function prelude in MIR Listing 6: bb0 in MIR
1 bb2: {
for all locals are shown (Listing 4). Locals are printed with 2 StorageLive(_3);
a leading underscore, followed by their index. Temporaries 3 StorageLive(_3);
are intermingled with user-defined variables. 4 _3 = &mut _1;
5 _2 = const
1 let mut _0: (); ,→ std::collections::HashSet::::insert(
2 let _2: bool; ,→ move _3, const 1i32) -> [return: bb3,
3 let mut _3: &mut std::collections::HashSet; ,→ unwind: bb4];
4 let _4: bool; 6 }
5 let mut _5: &mut std::collections::HashSet;
6 let _6: bool; Listing 7: bb2 in MIR
7 let mut _7: &mut std::collections::HashSet;
Listing 4: Variables in MIR There are other basic blocks in the MIR for Listing 2, but
bb0 and bb2 are sufficient for understanding the key concepts
To distinguish which locals are temporaries and which are of the MIR.
user-defined variables, the MIR output then displays debug- In addition to StorageLive, Assign, StorageDead, there
info for user-defined variables, such as set, shown in Listing are a selection of other statements:
5.
Within each scope block, user-defined variables are listed • Assign statements write an rvalue into a place.
with the user’s name from the variable and the variable’s
• FakeRead statements represent the reading that a pat-
place. In this example, the mapping is direct, but the com-
tern match might do, and exists to improve diagnos-
piler can store multiple user variables in a single local, and
tics.
perform field accesses or dereferences.
Scope blocks represent the lexical structure of the source • SetDiscriminant statements write the discriminant of
program (used in generating debuginfo). Each statement an enum variant (its index) to the representation of an
and terminator in the MIR output (omitted in these snip- enum.
pets) is followed by a comment which states which scope it
exists in. • StorageLive and StorageDead starts and ends the live
range for storage of a local.
1 scope 1 {
2 debug set => _1; • InlineAsm executes inline assembly.
3 }
• Retag retags references in a place - this is part of the
“Stacked Borrows” aliasing model by Jung et al. [6].
Listing 5: Scope in MIR
• AscribeUserType encodes a user’s type ascription so
All of the remaining MIR output is used by the basic that they can be respected by the borrow checker.
blocks of the current MIR body, starting with bb0, shown
in Listing 6. Basic blocks are numbered from zero upwards, • Nop is a no-op, and is useful for deleting instructions
and contain statements followed by terminator on the last without affecting statement indices.
line.
bb0 starts with a StorageLive statement, which indi- Likewise, in addition to Call, there are various other ter-
cates that the local _1 is live (until a matching StorageDead minators:
statement, not included in this listing). Code generation
uses StorageLive statements to allocate stack space. At • Goto jumps to a single successor block.
the end of bb0, there is a Call terminator, which invokes
• SwitchInt evaluates an operand to an integer and jumps
HashSet::new and resumes execution in bb2. While termi-
to a target depending on the value.
nators can have multiple successors, because HashSet::new
does not require cleanup if it panicked, there is no panic • Resume and Abort indicate that a landing pad is fin-
edge. ished and unwinding should continue or the process
bb2 has two StorageLive statements followed by an Assign should be aborted.
statement on line 4 into a temporary, creating a mutable bor-
row of _1, before ending the block with a Call terminator • Return indicates a return from the function, assumes
invoking HashSet::insert. that _0 has had an assignment.
6• Drop and DropAndReplace drops a place (and option- During code generation, the MIR being translated into
ally replaces it). LLVM IR (or Cranelift IR) has no substitutions applied,
therefore all of the generic parameters in the function are
• Call invokes a function. still present. Where required, substitutions (those collected
during monomorphisation, as will be discussed in Section
• Assert panics if a condition doesn’t hold. 3.1.5) are applied to types. However, for shims, these sub-
stitutions are typically redundant as the MIR of the shim
• Yield indicates a suspend point in a generator.
is generated specifically for a type, and do not have any
• GeneratorDrop indicates the end of dropping a gener- remaining generic parameters.
ator.
3.1.5 Monomorphisation
• FalseEdges and FalseUnwind are used for control flow
In order to determine which items will be included in the
that is impossible, but required for borrow check to be
generated LLVM IR for a crate, rustc performs monomor-
conservative.
phisation collection. Non-generic, non-const functions map
to one LLVM artefact, whereas generic functions can con-
Closures - similar to anonymous functions or lambdas in
tribute zero to N artefacts depending on the number of in-
other languages - which can capture variables from the par-
stantiations of the function. Monomorphisations can be
ent environment, are also represented in the MIR. Functions
produced from instantiations of functions defined in other
and closures are represented almost identically in the MIR.
crates.
Closures inherit the generic parameters of the parent func-
“Mono items” are anything that results in a function or
tion, and have additional generic parameters tracking the
global in the LLVM IR of a codegen unit. Mono items can
closure’s inferred kind, signature and upvars (captured vari-
reference other mono items, for example, if a function baz
ables).
references a function quux then the mono item for baz will
Generators are suspendable functions, which operate sim-
reference the mono item for quux (typically this results in the
ilarly to closures, but can yield values in addition to return-
generated LLVM IR for baz referencing the generated LLVM
ing them. Like closures, generators are represented almost
IR for quux too). Therefore, mono items form a directed
identically to functions in the MIR, but have extra generic
graph, known as the “mono item graph”, which contains all
parameters which track the generator’s inferred resume type,
items necessary for codegen of the entire program.
yield type, return type and witness (an inferred type which
In order to compute the mono item graph, the collector
is a tuple of all types that could up in a generator frame).
starts with the roots - non-generic syntactic items in the
rustc provides a pair of visitor traits (depending on whether
source code. Roots are found by walking the HIR of the
mutability is required) which types can implement to sim-
crate, and whenever a non-generic function, method or static
plify traversal of the MIR.
item is found, a mono item is created with the DefId of
3.1.4 Shims the item and an empty SubstsRef (since the item is non-
generic).
There are some circumstances where rustc will generate From the roots, neighbours are discovered by inspecting
functions during compilation, these functions are known as the MIR of each item. Whenever a reference to another
shims, and they are often specialised for specific types. mono item is found, it is added to the set of all mono items.
Drop glue is an example of a shim generated by the com- This process is repeated until the entire mono item graph
piler. Rust doesn’t require that the user write a destruc- has been discovered. By starting with non-generic roots,
tor for their type, any type will be dropped recursively in the current mono item will always be monomorphic, so all
a specified order. To implement this, rustc transforms all generic parameters of neighbours will always be known. Ref-
drops into a call to drop_in_place for the given type. erences to other mono items can take the form of function
drop_in_place’s implementation is generated by the com- calls, taking references to functions, closures, drop glue, un-
piler for each T. There are a variety of shims generated by sizing casts and boxes.
the compiler:
• Drop shims implements the drop glue required to deal- 1 fn main() {
locate a given type. 2 bar(2u64);
3 }
• Clone shims implement cloning for a builtin type, like 4
arrays, tuples and closures. 5 fn foo() {
6 bar(2u32);
• Closure-once shims generate a call to an associated 7 }
method of the FnMut trait. 8
• Reify shims are for fn() pointers where the function 9 fn bar(t: T) {
cannot be turned into a pointer. 10 baz(t, 8u16);
11 }
• FnPtr shims generate a call after a dereference for a 12
function pointer. 13 fn baz(f: F, g: G) {
14 }
• Vtable shims generate a call after a dereference and
move for a vtable. Listing 8: Monomorphisation Example
7For example, consider the code in Listing 8. Monomorphi- 1 query used_generic_params(key: DefId) ->
sation will start by walking the HIR of the crate and looking ,→ BitSet {
for non-generic syntactic items. After this step, the set of 2 desc {
mono items will contain the functions main and foo, both 3 |tcx| "determining which generic
of which have no generic parameters. ,→ parameters are used by `{}`",
Next, monomorphisation will traverse the MIR of each ,→ tcx.def_path_str(key)
root, looking for neighbours. In main, the terminator of bb0 4 }
(shown in Listing 9) references the bar function, with the 5 }
substitutions u64.
Listing 12: Query definition
1 bb0: {
2 StorageLive(_1); 1 fn used_generic_params(
3 _1 = const bar::(const 2u64) -> bb1; 2 tcx: TyCtxt1 fn mark_used_by_default_parameters, is invoked recursively on the closure or generator, and any
3 def_id: DefId, parameters from the parent which were used in the closure
4 generics: &'tcx ty::Generics, or generator are marked as used in the parent.
5 used_parameters: &mut BitSet, The analysis was significantly more complex during imple-
6 ) { mentation and had to special-case some casts, call and drop
7 if !tcx.is_trait(def_id) terminators to avoid marking parameters as used unneces-
8 && (tcx.is_closure(def_id) sarily, often in order to avoid parts of the compiler which
9 || tcx.type_of(def_id).is_generator()) were previously always monomorphic. However, during in-
10 { tegration of the analysis, as support for polymorphic MIR
11 for param in &generics.params { during codegen was improved, and these special-cases were
12 used_parameters.insert(param.index); removed.
13 } Some generic parameters are only used in the bounds of
14 } else { another generic parameter. After the analysis has detected
15 for param in &generics.params { which generic parameters are used in the body of the item,
16 if param.is_lifetime() { the predicates of the item are checked for uses of unused
17 used_parameters.insert( generic parameters in the bounds on used generic parame-
18 param.index); ters.
19 }
20 } 1 fn bar() {}
21 } 2
22 3 fn foo(_: I)
23 if let Some(parent) = generics.parent { 4 where
24 mark_used_by_default_parameters( 5 I: Iterator,
25 tcx, parent, tcx.generics_of(parent), 6 {
26 used_parameters); 7 bar::()
27 } 8 }
28 }
Listing 15: Example of generic parameter use in predicate
Listing 14: Used-by-default parameters
Consider the example shown in Listing 15. foo has two
generic parameters, I and T, and uses I in the substitutions
the MIR body for the current item and uses a MIR visitor of a call to bar. If only the body of foo was analysed then
to traverse it. Only three components of the MIR need to T would be considered unused despite it being clear that T
be visited by the analysis: locals, types and constants. is used in the iterator bound on I.
visit_local_decl is invoked for each local variable in the
item’s MIR before the MIR is traversed. This analysis’ im-
3.2.1 Testing
plementation of visit_local_decl calls super_local_decl To test of the analysis, a debugging flag, -Z polymorphize-
(which proceeds with traversal as normal) for all but one errors, is added, which emits compilation errors on items
case: If the current item is a closure or generator and if the with unused generic parameters, an example of which is
local currently being visited is the first local, then it returns shown in Listings 16 and 17.
early. By emitting compilation “errors”, the project leverages the
In the MIR, the first local for a closure or generator is existing error diagnostics infrastructure to enable introspec-
a reference to the closure of generator itself. If the anal- tion into the compiler internals from a test written at the
ysis proceeded to visit this local as normal, then it would Rust source-code level.
eventually visit the substitutions of the closure or genera-
tor. Unfortunately, this would result in all inherited generic 1 struct Foo(F);
2
parameters for a closure always being considered used, de-
feating the point of the analysis. 3 impl Foo {
visit_const and visit_ty are invoked for each constant 4 // Function has an unused generic
and type in the MIR, and are where the visitor “bottoms 5 // parameter from impl.
out”. This analysis’ implementation of visit_const and 6 pub fn unused_impl() {
visit_ty call a visit_foldable function (which isn’t part 7 //~^ ERROR item has unused generic parameters
of the MIR visitor). 8 }
visit_foldable takes any type which implements 9 }
TypeFoldable and checks if the type has any type or con-
stant parameters before visiting it with this analysis’ type Listing 16: Example of an analysis test
visitor.
In the type visitor, the analysis traverses the structure of a In addition, this approach integrates well into rustc’s ex-
type or constant and looks for ty::Param and isting test infrastructure, which writes all of the error output
ConstKind::Param, setting the index from each within the for a test into a stderr file and checks that the output hasn’t
BitSet, marking the parameter as used. In addition, the changed by comparison with that file. In addition, anywhere
type visitor also special-cases closures and generators. In- an error is expected is annotated in the test source.
91 error: item has unused generic parameters • foo::
2 --> $DIR/functions.rs:38:12
3 | • foo::
4 LL | impl Foo { • foo::
5 | - generic parameter `F` is unused
6 LL | // Function has an unused generic However, with the analysis, only a single mono item would
7 LL | // parameter from impl. be produced: foo::.
8 LL | pub fn unused_impl() { rustc’s code generation iterates over the mono items col-
9 | ^^^^^^^^^^^ lected and generates LLVM IR for each, adding the result
to an LLVM module. By de-duplicating during monomor-
Listing 17: Example of an analysis test error phisation collection, polymorphisation “falls out” of the way
the compiler is structured (though some more changes are
required).
3.3 Implementing polymorphisation in rustc In particular, various changes to code generation were re-
quired wherever another Instance is referenced. For exam-
Integration of the analysis’ results into the compiler re-
ple, when generating LLVM IR for a TerminatorKind::Call
quired significant debugging and many targeted modifica-
(a call or invoke instruction in LLVM), an Instance is
tions where previously only monomorphic types and MIR
generated to determine the exact item which is being in-
were expected.
voked and its mangled symbol name. Without applying the
Throughout rustc’s code generation infrastructure, the
Instance::polymorphize function, the instruction would
ty::Instance type is used. Instance is constructed through
refer to the mangled symbol for the unpolymorphised func-
use of Instance::resolve which takes a
tion, which no longer exists.
(DefId, SubstsRef, the matching in-
is handled by a should_codegen_locally function which
dex in the analysis’ BitSet is checked to determine whether
checked whether upstream monomorphisations where avail-
the parameter was used. If the parameter was used, then
able for a given Instance. should_codegen_locally is in-
the current parameter is kept. If the parameter was unused,
voked before a mono item is created (and polymorphisation
then the parameter is replaced with an identity substitu-
occurs), so it looks for an unpolymorphised monomorphi-
tion. During implementation, replacing the parameter with
sation in the upstream crate. However, if the item would
a new ty::Unused type was considered, but it was hypoth-
be polymorphised then the upstream crate would only con-
esised that this would require more changes to the compiler
tain a polymorphised copy, which would result in unnec-
than an identity substitution.
essary code generation when compiling the current crate.
For example, given an unsubstituted type containing
should_codegen_locally was changed to always check for
ty::Param(0), if the parameter were considered used by the
a polymorphised monomorphisation in the upstream crate.
analysis, then it would be substituted by the real type, but,
if the parameter were considered unused, then it would be 3.3.2 Specialisation
substituted with ty::Param(0) and remain the same. In Rust, data structures are defined separately from their
During monomorphisation collection, as described in Sec- methods. Multiple impl blocks can exist which implement
tion 3.1.5, when a mono item is identified, it is added to a methods on a type. For generic types, impl blocks can pick
set. This project concerns itself with one kind of mono item, concrete types for parameters or remain generic, but mul-
a MonoItem::Fn which contains an Instance type. When a tiple impls cannot overlap. Specialisation is an unstable
new MonoItem::Fn is created, Instance::polymorphize is feature in Rust which enables impl blocks to overlap if one
invoked before it is added to the final mono item set. This block is clearly “more specific” than another.
has the effect of reducing the number of mono items. Instance::resolve can only resolve specialised functions
of traits under certain circumstances and detects whether
1 fn foo(_: B) { } further specialisation can occur by whether the item needs
2 substitution (i.e. has unsubstituted type or constant param-
3 fn main() { eters). After integration of polymorphisation, this check had
4 foo::(1); to be more robust. In particular, polymorphisation meant
5 foo::(2); that closure types in Instance::resolve could need sub-
6 foo::(3); stitution (by containing a ty::Param from an unused par-
7 } ent parameter) and would thus “need substitution” for the
purposes of the specialisation check, despite this having no
Listing 18: Example of an polymorphisation deduplication impact on whether the item was further specialisable.
Initially, this was resolved by implementing a fast type
Consider the example in Listing 18, monomorphisation visitor which checked for generic parameters, but would not
would normally produce three mono items (and three copies visit the parent substitutions of a closure or generator. How-
of the function): ever, this check is in the “fast path” and a visitor would
10have been too expensive, so this was replaced by a addition generated to drop. InstanceDef::DropGlue’s DefId always
to the TypeFlags infrastructure within the compiler (after referred to the drop_in_place function.
some refactoring [1] enabled this change), shown partially in With polymorphisation, the Ty which contains the same identity param-
2 &ty::Closure(_, ref substs) => { eters would be applied to the Ty and a InstanceDef. For normal functions,
InstanceDef::Item(DefId) is the variant of InstanceDef
which is used. However, for drop glue, a 4. EVALUATION
InstanceDef::DropGlue(DefId, Option is the type that the shim is being on a set of benchmarks is compared with a build of the
11compiler without this project’s changes applied. mono items generated for these benchmarks was also com-
For the purposes of evaluation, the compiler is built in pared and the results are shown in Table 2.
release configuration from the Git commit with the hash
2f2ccd2 [26] (PR #69749 [28] at 9ef1d94 [27] merged with • script-servo is the script crate from Servo, the par-
master, 150322f [8]). 150322f is also built in release config- allel browser engine used by Firefox.
uration to act as a baseline for comparisons.
• regression-31157 is a small program which caused
Benchmarking has been performed two times previously
large performance regressions in earlier versions of the
and performance improvements identified from those earlier
Rust compiler.
runs are implemented in the versions being compared in this
section, as discussed in Section 4.1. • ctfe-stress-4 is a small program which is used to
There are 40 benchmarks that are run in rustc’s stan- stress compile-time function evaluation.
dard performance benchmarking suite [29]. Benchmarks are
either real Rust programs from the ecosystem which are im- In applicable benchmarks, compilation time performance
portant; real Rust programs from the ecosystem which are on clean runs generally improves by 3-5%. Incremental runs
known to stress the compiler; or artificial stress tests. generally have lower performance improvements, if any. script-
All benchmarks are compiled with both debug and release servo in the release configuration’s results show a 11.4%
profiles, some benchmarks are run with the check profile. slowdown in patched incremental runs. Query profiling shows
Each benchmark is run at least four times: that this slowdown comes from LLVM LTO’s optimisations
- this could be due to LTO being more effective and thus
• clean: a non-incremental build. more optimisation taking place or that the polymorphised
functions make LTO more expensive to perform. However,
• baseline incremental: an incremental build starting
LTO does not appear to impact performance in the debug
with empty cache.
configuration or on other benchmarks. script-servo in the
• clean incremental: an incremental build starting debug configuration shows a 5.1% performance improvement
with complete cache, and clean source directory – the which corresponds to an 11 second decrease in compilation
“perfect” scenario for incremental. time.
ctfe-stress-4’s mono-item count shows no reduction.
• patched incremental: an incremental build starting This isn’t surprising as the test contains mostly compile-
with complete cache, and an altered source directory. time evaluated functions (as expected for a test which aims
The typical variant of this is “println” which represents to stress compile-time function evaluation). This suggests
the addition of a println! macro somewhere in the that that polymorphisation may have reduced the number
source code. of functions which had to be evaluated at compile-time.
Memory usage across the benchmarks was very varied,
perf is used to collect information on the execution of each
ranging from 19% increases to 15% decreases, depending on
benchmark, notably: CPU clock; clock cycles; page faults;
the benchmark. Like instructions executed, memory usage
memory usage; instructions executed; task clock and wall
increases typically happened when incremental compilation
time. In addition, for each run, the execution count and
was enabled, while decreases happened when incremental
total time spent on each query in rustc’s query system is
compilation was disabled.
also captured - this allows for detailed comparisons between
Earlier benchmarking revealed that re-execution of the
runs to better determine why there has been a performance
used_generic_params query resulted in increased loads from
impact.
crate metadata (intermediate data from compilation of de-
Furthermore, for a subset of the benchmarks which had an
pendencies) so the version of the query benchmarked in
appreciable performance impact, the number of mono-items
this paper’s results were added to crate metadata and in-
generated are compared.
cremental compilation storage. In addition, the BitSet re-
Benchmarks were performed on a Linux host with an
turned from the query was replaced with a u64 as a micro-
AMD Ryzen 5 3600 6-Core Processor and 64GB of RAM
optimisation.
with ASLR disabled in both the kernel and userspace.
4.1 Discussion 5. CONCLUSIONS AND FUTURE WORK
For a majority of the benchmarks, there was no apprecia- This paper has presented a compilation time optimisation
ble difference in compilation time performance. This result in the Rust compiler which reduces redundant monomor-
is expected, the optimisation would have limited impact in phisation. Currently, this implementation from this project
programs which do not contain highly generic code contain- is being reviewed as PR #69749 [28] in the upstream Rust
ing closures. project and preliminary work, such as PR #69935 [25], has
In those benchmarks that did not show any significant landed.
improvement or regression, any performance differences be- There are no aspects of the Rust language itself which
tween the compiler versions are considered noise. The thresh- makes implementation of this optimisation particularly chal-
old used for noise is determined by comparing compiler ver- lenging or complex, most of the difficulty arises from imple-
sions that had no functional changes (“noise runs”) and ob- mentation details of rustc. With that said, rustc is well
serving the impact on performance results on each bench- suited to having this optimisation implemented.
mark. Due to the structure of the compiler, in an ideal scenario,
Three benchmarks were appreciably impacted by the opti- only modifications to monomorphisation and call genera-
misation implemented in this PR, results for these in Table tion are necessary to enable polymorphisation. Alternate
1 (complete results available online [29]). The number of approaches, such as changing the the MIR of closures and
12generators to only inherit the parameters that they use from [4] T. J. Edler von Koch, B. Franke, P. Bhandarkar, and
their parents, would be significantly more invasive and chal- A. Dasgupta. Exploiting function similarity for code
lenging, as the current system has emergent properties which size reduction. In Proceedings of the 2014
enables simplifications in other areas of the compiler. SIGPLAN/SIGBED Conference on Languages,
However, rustc will not always fail eagerly when an invalid Compilers and Tools for Embedded Systems, LCTES
case is encountered, which can significantly lengthen debug- ’14, page 85–94, New York, NY, USA, 2014.
ging sessions as the root cause of a failure is determined. Association for Computing Machinery.
This project has provided an excellent opportunity to learn [5] J. Han, R. Fujino, R. Tamura, M. Shimaoka,
more about the structure of the Rust compiler, and the tech- H. Mikami, M. Takamura, S. Kamiya, K. Suzuki,
niques used in modern, production compilers to implement T. Miyajima, K. Kimura, and et al. Reducing
LLVM IR generation. parallelizing compilation time by removing redundant
Implementation of polymorphisation resulted in 3-5% com- analysis. In Proceedings of the 3rd International
pilation time improvements in some benchmarks. While Workshop on Software Engineering for Parallel
there are further improvements that could be made to im- Systems, SEPS 2016, page 1–9, New York, NY, USA,
prove the integration and improve performance when incre- 2016. Association for Computing Machinery.
mental compilation is enabled (see Section 5.1), this is a [6] R. Jung, H.-H. Dang, J. Kang, and D. Dreyer. Stacked
promising result. borrows: An aliasing model for rust. Proc. ACM
Program. Lang., 4(POPL), Dec. 2019.
5.1 Future Work [7] D.-H. Kim and H. J. Lee. Iterative procedural
Investigation into the interaction between polymorphisa- abstraction for code size reduction. In Proceedings of
tion and LTO should be performed to determine the source the 2002 International Conference on Compilers,
of the compile-time regression noticed in the script-servo Architecture, and Synthesis for Embedded Systems,
benchmark. In addition, tweaking of the circumstances un- CASES ’02, page 277–279, New York, NY, USA, 2002.
der which the query’s results are cached (both incremen- Association for Computing Machinery.
tally and cross-crate) should be performed to impact on [8] T. R. P. Language. commit 150322f, 2020 (accessed
performance experienced when incremental compilation is March 25, 2020).
enabled. https://github.com/rust-lang/rust/commit/
With infrastructure in place to support polymorphic code 150322f86d441752874a8bed603d71119f190b8b.
generation, the polymorphisation analysis can be extended [9] D. Leopoldseder, L. Stadler, T. Würthinger, J. Eisl,
to detect more advanced cases. For example, when only the D. Simon, and H. Mössenböck. Dominance-based
size of a type is used then monomorphisation can occur for duplication simulation (dbds): Code duplication to
the distinct sizes of instantiated type; or when the memory enable compiler optimizations. In Proceedings of the
representation of a type is independent of the representation 2018 International Symposium on Code Generation
of its generic parameters, such as Vec (because the T is and Optimization, CGO 2018, page 126–137, New
behind indirection), functions like Vec::len can be poly- York, NY, USA, 2018. Association for Computing
morphised. Machinery.
Furthermore, the integration currently doesn’t fully work [10] Microsoft. Language server protocol, 2020 (accessed
with Rust’s new symbol mangling scheme, which isn’t yet March 29, 2020). https://microsoft.github.io/
enabled by default - support for this could be added. language-server-protocol/.
[11] S. Overflow. Stack overflow developer survey 2016,
6. ACKNOWLEDGEMENTS 2016 (accessed March 25, 2020).
Thanks to Eduard-Mihai Burtescu, Niko Matsakis and ev- https://insights.stackoverflow.com/survey/
eryone else in the Rust compiler team for their assistance 2016#technology-most-loved-dreaded-and-wanted.
and time during this project. In addition, thanks to Michel [12] S. Overflow. Stack overflow developer survey 2017,
Steuwer for his comments and guidance throughout the year. 2017 (accessed March 25, 2020).
https://insights.stackoverflow.com/survey/
2017#most-loved-dreaded-and-wanted.
7. REFERENCES [13] S. Overflow. Stack overflow developer survey 2018,
[1] E.-M. Burtescu. rustc: keep upvars tupled in 2018 (accessed March 25, 2020).
Closure,Generatorsubsts., 2020 (accessed March 25, https://insights.stackoverflow.com/survey/
2020). 2018/#most-loved-dreaded-and-wanted.
https://github.com/rust-lang/rust/pull/69968. [14] S. Overflow. Stack overflow developer survey 2019,
[2] S. K. Debray, W. Evans, R. Muth, and B. De Sutter. 2019 (accessed March 25, 2020). https://insights.
Compiler techniques for code compaction. ACM Trans. stackoverflow.com/survey/2019#technology-_
Program. Lang. Syst., 22(2):378–415, Mar. 2000. -most-loved-dreaded-and-wanted-languages.
[3] I. Dragos and M. Odersky. Compiling generics through [15] D. Petrashko, V. Ureche, O. Lhoták, and M. Odersky.
user-directed type specialization. In Proceedings of the Call graphs for languages with parametric
4th Workshop on the Implementation, Compilation, polymorphism. In Proceedings of the 2016 ACM
Optimization of Object-Oriented Languages and SIGPLAN International Conference on
Programming Systems, ICOOOLPS ’09, page 42–47, Object-Oriented Programming, Systems, Languages,
New York, NY, USA, 2009. Association for and Applications, OOPSLA 2016, page 394–409, New
Computing Machinery. York, NY, USA, 2016. Association for Computing
13You can also read