Reticle: A Virtual Machine for Programming Modern FPGAs

 
CONTINUE READING
Reticle: A Virtual Machine
 for Programming Modern FPGAs
 Luis Vega Joseph McMahan Adrian Sampson
 University of Washington University of Washington Cornell University
 USA USA USA
 vegaluis@cs.washington.edu jmcmahan@cs.washington.edu asampson@cs.cornell.edu

 Dan Grossman Luis Ceze
 University of Washington University of Washington
 USA USA
 djg@cs.washington.edu luisceze@cs.washington.edu

Abstract ACM Reference Format:
Modern field-programmable gate arrays (FPGAs) have re- Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman,
 and Luis Ceze. 2021. Reticle: A Virtual Machine for Programming
cently powered high-profile efficiency gains in systems from
 Modern FPGAs. In Proceedings of the 42nd ACM SIGPLAN Inter-
datacenters to embedded devices by offering ensembles of
 national Conference on Programming Language Design and Imple-
heterogeneous, reconfigurable hardware units. Programming mentation (PLDI ’21), June 20ś25, 2021, Virtual, Canada. ACM, New
stacks for FPGAs, however, are stuck in the pastÐthey are York, NY, USA, 16 pages. https://doi.org/10.1145/3453483.3454075
based on traditional hardware languages, which were appro-
priate when FPGAs were simple, homogeneous fabrics of 1 Introduction
basic programmable primitives. We describe Reticle, a new Field-programmable gate arrays (FPGAs) have emerged as
low-level abstraction for FPGA programming that, unlike a relief to stagnating performance on CPUs and GPUs [16,
existing languages, explicitly represents the special-purpose 17, 38]. Their key advantage is their ASIC-like ability to cus-
units available on a particular FPGA device. Reticle has two tomize data paths, control logic, and memory hierarchies for
levels: a portable intermediate language and a target-specific specific applications. Unlike an ASIC, however, deploying an
assembly language. We show how to use a standard instruc- FPGA-based accelerator merely requires buying off-the-shelf
tion selection approach to lower intermediate programs to partsÐand not the astronomical investment that manufac-
assembly programs, which can be both faster and more ef- turing custom silicon entails.
fective than the complex metaheuristics that existing FPGA Early FPGAs were simple, homogeneous fabrics mostly
toolchains use. We use Reticle to implement linear algebra based on lookup tables (LUTs), and their toolchains could
operators and coroutines and find that Reticle compilation treat them as fluidly reconfigurable circuits. Modern FPGAs,
runs up to 100 times faster than current approaches while however, no longer resemble those simple, homogeneous
producing comparable or better run-time and utilization. architectures. Because real, specialized hardware remains far
 more efficient than reconfigurable emulated circuits, mod-
CCS Concepts: · Software and its engineering → Soft-
 ern FPGAs incorporate an array of heterogeneous, special-
ware notations and tools; Compilers;
 purpose łhardened” units that implement commonplace func-
 tionality: memories, arithmetic units, and complex intercon-
Keywords: compilers, FPGAs
 nects [21, 48]. To make these modern FPGAs perform well,
 it is critical to exploit this fixed-function logic as much as
 possibleÐprograms that underutilize it can consume signifi-
Permission to make digital or hard copies of all or part of this work for cantly more area and power [40].
personal or classroom use is granted without fee provided that copies
 The mainstream approach to program FPGA today is by
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights using behavioral hardware description languages (HDLs), ei-
for components of this work owned by others than the author(s) must ther written by hand or emitted by higher-level languages [4,
be honored. Abstracting with credit is permitted. To copy otherwise, or 12, 22, 34, 35, 49] as shown in Figure 1. These languages, how-
republish, to post on servers or to redistribute to lists, requires prior specific ever, rely on behavioral HDLs as an ad hoc IRs, because pro-
permission and/or a fee. Request permissions from permissions@acm.org. grams can be ported to multiple targets without the burden
PLDI ’21, June 20ś25, 2021, Virtual, Canada
 of directly programming low-level and target-specific primi-
© 2021 Copyright held by the owner/author(s). Publication rights licensed
to ACM.
 tives. Instead, the complex task of compiling traditional hard-
ACM ISBN 978-1-4503-8391-2/21/06. . . $15.00 ware languages to these primitives is normally performed
https://doi.org/10.1145/3453483.3454075 by proprietary vendor toolchains.

 756
PLDI ’21, June 20ś25, 2021, Virtual, Canada Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman, and Luis Ceze

 This work
 FPGAs fast. The representation should offer both a high-
 IR
 Instruction
 selection
 ASM Place level abstraction for portability and a low-level abstraction
 High-level High-level
 to directly address model-specific hardware resourcesÐand
 Languages compilers Traditional approach
 come with an infrastructure that provides analyses and opti-
 Behavioral
 HDLs
 Synthesis Place Route Bitstream
 mizations to target specific devices.
 We describe Reticle, a low-level language for efficiently
 programming modern FPGAs. The goal is to provide an inter-
Figure 1. Overview of the traditional compilation pipeline mediate language and a compilation target that can replace
compared to Reticle’s. Independent of the source language, the use of traditional behavioral HDLs for targeting FPGAs,
FPGA programs today are funneled down to the same ab- as shown in Figure 1. Reticle is a compilation target for
straction (hardware description languages). This abstraction higher-level languages that are substantially different from
is not expressive enough to capture high-performance oper- traditional HDLs.
ations available in DSPs, such as SIMD. Reticle has two layers: a portable intermediate language
 that abstracts over specific hardware units while providing
 high-level control over resource binding and placement, and
 Moreover, behavioral HDLs like Verilog and VHDL do not a parameterized assembly language that explicitly addresses
have a way to represent modern FPGA’s primitives. Alterna- a device’s hardware resources.
tively, vendor toolchains use heuristics that attempt to guess The contributions in this paper are:
when a program’s logic can efficiently map to a device’s • We identify and measure the challenges of using be-
available hardened units. For example, a Verilog expression havioral HDLs for programming modern FPGAs in
a + b would need to compile to an adder circuit when gener- terms of result quality and compiler speed.
ating a custom ASIC; in an FPGA toolchain, it might instead • We design Reticle, an intermediate language that can
map onto a specific configuration of an FPGA’s digital signal describe the binding of high-performance hardware
processing slice (DSP) that includes a range of built-in integer primitives in modern FPGAs (i.e., DSPs) as well as
arithmetic units. classic reconfigurable logic (i.e., LUTs).
 Relying on heuristics for performing this mapping results • We implement a compiler for Reticle that optimizes
in unpredictable and poor performance. As authors from and lowers hardware programs and then emits struc-
Xilinx, a major FPGA vendor, observed [31]: tural, device-specific Verilog to bypass the majority of
 The necessity of breadth coverage by commer- traditional vendor FPGA toolchains.
 cial tools often leads to implementations that do • We demonstrate that the Reticle compiler runs up to
 not take full advantage of the underlying hard- 100× faster than an existing vendor FPGA toolchain
 ware. For example, UltraScale+ devices employ while producing comparable or better run-time and
 DSP blocks that are rated at 891MHz for the utilization results on linear algebra benchmarks.
 fastest speed grade. Nonetheless, large designs
 implemented on FPGAs typically achieve system This paper describes a language design aiming to efficiently
 frequencies lower than 400MHz. program modern FPGAs, but it leaves important avenues for
 future work to build on. Currently, the intermediate language
We corroborate this finding in Section 7, which demonstrates focuses on programming DSP and LUT slices; it does not sup-
that HDL-based FPGA programming leads to extremely slow port memory primitives, such as BRAMs. Additionally, the
compilation and unpredictable, suboptimal performance re- paper focuses on designing and implementing an assembly
sults. In practice, programmers must resort to using vendor- language capable of capturing layout semantics that enable
specific Verilog annotations to extract the best performance layout optimizations such as instruction cascading. The com-
from modern FPGAs, or even building ad hoc Verilog post- piler, however, uses only a simple solver-based approach for
processing scripts to insert directives for efficiency [3]. Al- placement, limiting the search of optimal layouts for any
though it is fragile and not portable, this approach has no given program. There is plenty of exploration needed in the
viable alternative: vendors keep lower-level representations layout space i.e., incorporating timing information that is
secret and the accompanying toolchains are closed source. beyond the scope of this work. Lastly, the Reticle compiler,
 This paper’s thesis is that we need a new low-level pro- as of today, relies on routing and bitstream generation from
gramming abstraction for modern FPGAs. Innovation in traditional toolchains as shown in Figure 1.
high-level FPGA programming models is accelerating [15,
18, 28, 34], and these new compilers need a better target than
Verilog. Where traditional HDLs hide the complexity of mod- 2 Background
ern FPGA resources, a better abstraction would explicitly The first major program transformation that FPGA compilers
represent the model-specific capabilities that make modern perform today is hardware synthesis. This transformation

 757
Reticle: A Virtual Machine for Programming Modern FPGAs PLDI ’21, June 20ś25, 2021, Virtual, Canada

1 module bit_and ( input a , input b , output y ); 1 (* use_dsp = " yes " *)
2 assign y = a & b; 2 module dsp_add (...);
3 endmodule 3 genvar i;
 4 for (i =0; i
PLDI ’21, June 20ś25, 2021, Virtual, Canada Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman, and Luis Ceze

 behavioral, scalar
 3 Overview
 350
 structural, vectorized (hand-optimized)
 Reticle is an intermediate representation (IR) and compiler
 300
 for FPGAs that addresses these challenges. Reticle aims to
 250 directly represent and optimize for the heterogeneous pro-
 grammable resources available in modern FPGAs: specifi-
 DSPs used

 200
 cally, LUTs and DSPs. Its goal is to target the efficiency of
 150 structural FPGA implementations while adding abstraction
 and portability. Reticle is an instruction-based IR that decou-
 100
 ples the low-level details of the underlying hardware from a
 50 higher-level instruction set that can generate code for differ-
 0
 ent hardware targets. More importantly, Reticle presents an
 8 16 32 64 128 256 512 1024 alternative approach for programming FPGAs and is not a
 Loop bound (N)
 drop-in replacement for any stage in the traditional compi-
 (a) DSP utilization. lation flow (i.e., the goal is not to support traditional HDLs
 like Verilog). Alternatively, higher-level languages can use
 behavioral, scalar
 5.0K structural, vectorized (hand-optimized)
 Reticle as a compiler target.
 Reticle addresses the challenges from Section 2 by using
 4.0K
 a more expressive type system that supports vector types,
 which enable programs to promote particular hardware re-
 sources over others when they are available. Additionally,
 LUTs used

 3.0K

 the intermediate language makes primitive constraints part
 2.0K of the language semantics, so the Reticle compiler can reject
 programs with unsatisfiable constraints instead of silently
 1.0K ignoring them as in HDL hints. Therefore, programs are
 more predictable in terms of resource usage and performance.
 0.0 We show how to use instruction selection to map a portable
 8 16 32 64 128 256 512 1024 representation onto a device-specific representation, while
 Loop bound (N)
 achieving the same optimization results as target-specific
 (b) LUT utilization. structural implementations.
 The rest of the paper describes the Reticle language design
Figure 4. Resource utilization for multiple loop bounds ( ) and compiler implementation. Section 4 describes the two
of the behavioral program described in Figure 3 versus a forms of Reticle: a portable, high-level intermediate language
hand-optimized and structural version of the same program. and a low-level, device-specific assembly language that can
Even though a compiler hint in the behavioral program re- be parameterized for a specific FPGA device. Section 5 de-
quests the use of DSPs, a more optimal DSP configuration scribes how the Reticle compiler lowers from the intermedi-
exists, leading to under-utilization of the resource potential ate language to an assembly language. We show how to use
(the behavioral program runs out of DSP resources and must standard instruction selection to efficiently and deterministi-
resort to LUTs). cally lower intermediate programs to assembly programsÐa
 sharp departure from traditional FPGA toolchains, which
 must resort to expensive, often randomized metaheuristics
 A second challenge is that, in HDLs, hints are merely to perform similar lowering [33]. Sections 6 and 7 show how
suggestionsÐnot constraints. Behavioral-to-structural syn- our compiler implementation emits structural hardware de-
thesis heuristics make it difficult to deterministically respect scriptions for a specific FPGA target and results comparable
constraints, so toolchains silently ignore hints that they are to or better than a traditional HDL toolchain while running
unable to fulfill. The consequence is that programming with many times faster. Finally, Section 8 discusses the responsibil-
hints is unpredictable in both area and performance. ities and optimization opportunities for front-end languages
 The third challenge is that directly programming at the when compiling to Reticle.
structural level, while necessary for peak efficiency, is im-
practical. It requires understanding the complex, device-
specific semantics of DSP configuration parameters. Struc-
turally programming the DSP in Xilinx UltraScale+ FPGAs, 4 The Language
for example, can entail setting up to 96 parameters. This This section describes the Reticle language. Reticle has two
representation is verbose, brittle, and vendor-specificÐno variants: the high-level intermediate language, where opera-
structural representation is portable across FPGA families. tions are abstract and portable across FPGA devices, and a

 759
Reticle: A Virtual Machine for Programming Modern FPGAs PLDI ’21, June 20ś25, 2021, Virtual, Canada

 t0 : i8 = const [5];
 ∈ F ( : ) ∗ → ( : ) + { + } t1 : i8 = sll [1]( t0 );
 ∈ F | t2 : i8 = add (t0 , t1 ) @ ?? ;

 ∈ F : = ⊗[ ∗ ] ( ∗ )
 Figure 6. Reticle instructions to compute the expression
 ∈ F : = ⊞[ ∗ ] ( + ) @ 
 5 × 2 + 5. The constant 5 and shift-left-logical operation
 (a) The Intermediate Language consume no compute resources (wire operations), while the
 add instruction does (compute operation).
 ∈ F ( : ) ∗ → ( : ) + { + }
 ∈ F | Table 1. The intermediate instruction set.
 ∗ ∗
 ∈ F : = ⊗[ ] ( )
 Instruction Type Operation
 ∈ F : = ⊠[ ∗ ] ( + ) @ 
 Arithmetic add, sub, mul
 (b) The Assembly Language Bitwise not, and, or, xor,
 Compute Comparison eq, neq, lt, gt, le, ge
 ∈ F ?? | Control mux
 ∈ F ( , ) Memory reg
 ∈ F ?? | Shift sll, srl, sra
 Wire
 ∈ F lut | dsp Misc slice, cat, id, const
 ∈ F | | + 
 by a primitive . Therefore, compute instructions are candi-
 ⊗ ∈ ⊞ ∈ ⊠ ∈ 
 date for optimizations. Figure 6 shows an example of wire
 ?? ∈ ∈ ∈ and compute instructions.
 −→ Compute instructions also have an annotation @ that
 ∈ , , ∈ Z
 can optionally control which kind of resource to use on
 Figure 5. The Intermediate and Assembly Languages. the target device: either LUTs or DSPs. The annotation
 may be the wildcard ??, in which case the compiler has the
 freedom to choose which resource to use for the instruction.
 Interestingly, other operations besides simple bit extrac-
low-level assembly language, where operations correspond
 tion and slicing can be implemented as wire instructions i.e.,
to physical primitives available on a specific device.
 static shift instructions. Consider the implementation of the
 Figure 5 lists the syntax for the two languages, which share
 logical left shift instruction sll described in Figure 6, which
a common structure and differ in the kinds of operations that
 consist of taking the lower 7-bit wires of t0 and appending a
are available. We first describe the intermediate language
 1-bit wire assigned to the value zero in order to produce t1.
and then show how the assembly language differs.
 Curiously, single-bit constant values such as zero and one
4.1 The Intermediate Language can be created with electrical ground and voltage available
 throughout the device without consuming any LUT or DSP.
Figure 5a lists the Reticle intermediate language. A program Therefore, we leverage this knowledge to define these and
is a function with a name n, a number of inputs and outputs other operations i.e., constants as wire instructions.
( : ), and a sequence of instructions ins. Function bodies
are in A-normal form (ANF) [41]: they consist of a flat list of Instruction set. Table 1 lists the full set of compute and
instructions whose arguments are always variables . wire instructions in the intermediate language. Most instruc-
 tions are pure, i.e., they have no side effects. The only excep-
 Wire & compute instructions. There are two types of tion is the register instruction, reg. In the absence of register
instructions in the language: wire and compute instructions. instructions, programs can leverage referential transparency.
While both share a common format, compute instructions are An add instruction, for example, takes two arguments and
the ones that consume device resources and therefore con- writes to an output of a given type, such as:
sume area; wire instructions are area-free and only involve
 c: i8 = add (a ,b) @ ?? ;
wiring. Both kinds of instructions support static integer at-
tributes , argument variables , and always produce a single A reg instruction looks similar to add , but it is stateful in
output value ( : ). its operation. Furthermore, the add instruction will write a
 Wire instructions consist of operations ⊗, while compute new value to c each cycle (based on inputs a and b ), whereas
instructions are based on an operation ⊞ that are performed the reg instruction will hold its value until overwritten. For

 760
PLDI ’21, June 20ś25, 2021, Virtual, Canada Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman, and Luis Ceze

 layout
 optimizations

 IR instruction assembly instruction assembly code structural routing and
 bitstream
 (a) selection (c) placement (e) generation verilog bitgen

 target device
 description layout
 (b) (d)

 target-independent family-specific device-specific

 Reticle Traditional tools

Figure 7. Reticle compilation pipeline. (a) The intermediate program (Figure 5a). (b) The target description specification
(Figure 9). (c) The compiled assembly program (Figure 5b). (d) The device layout specification. (e) The placed assembly program
with known locations (Figure 5b).

example, the following register instruction will produce a 0 as multiply-add, with known implementation costs (area
as long as b is False. Similarly, once b is True, then the value and latency.) Although instructions are considered spe-
of a will be bound to c every cycle. cialized instructions, they are still portable within an FPGA
 family. Devices within a family share the same primitives,
c: i8 = reg [0]( a ,b) @ ?? ;
 varying only the total number of primitives available in them.
 The stateful reg instruction is essential for allowing cycles To capture the semantics of these varied operations, each
in programs. Registers łbreak up” combinational cycles by is defined in terms of a sequence of intermediate language
stopping them from looping back within the same cycle. As operations, which are then automatically composed in the
we discuss in Section 6.1, a program with a cycle and without compilation process. (This means that assembly operations
register is considered ill-formed and will be rejected. ⊠ can be composed of one or more intermediate operations
 in a single instruction.) Therefore, the number of in-
 Semantics. The primary goal behind the intermediate structions is far greater than instructions, allowing
language is to capture the semantics of operations available different FPGA architectures to be targeted with a simpler
in modern FPGA, while removing details of the primitives intermediate language. For example, two intermediate oper-
used to implement such instructions. This is accomplished by ations consisting of a multiplication followed by an addition
using dataflow and synchronous semantics [7]. The dataflow can be fused into a muladd (if it supported by the hardware
semantics are used to describe the behavior of pure combi- target) and it can be expressed in assembly as:
national instructions [45], whereas the synchronous model
 y: i8 = muladd (a ,b ,c) @ dsp (?? , ?? );
abstracts away the details about how register instructions
are updated. For example, a synchronous design is defined as Assembly instructions also differ from compute instructions
a hardware program in which all stateful elements (registers), because they support location semantics . A location in-
can only be updated on a single event trigger, i.e., a positive cludes not only a primitive kind (LUTs or DSPs), but also
clock edge. Therefore, the syntax for describing such timing a Cartesian , coordinate describing the physical place-
details is not required for programming FPGAs, resulting in ment of the operation. Each coordinate can be either a con-
a more compact representation. crete expression or a wildcard ??, indicating that the com-
 piler is responsible for determining the placement. While the
4.2 The Assembly Language wildcard gives the compiler the greatest flexibility, placing
The Reticle assembly language resembles the intermediate explicit constraints on coordinates with expressions gives
language, but it replaces high-level, abstract operators like front-end tools greater control over programs (and its ul-
add with target-specific primitives available on a particular timate performance). An expression can refer to variables
FPGA device. Figure 5b lists the syntax for the assembly defined in other coordinate expressions to place constraints
language, in which compute instructions are replaced between the placement of the two instructions. For exam-
with assembly instructions . (Reticle assembly retains ple, an instruction location could be specified using uncon-
the same wire instructions as the intermediate language.) As strained variables like ( x0_loc , y0_loc ) , while another in-
we lower to hardware targets, special-purpose hardware be- struction is constrained to always be adjacent (right after)
comes available that can handle specialized operations, such within the same column: ( x0_loc , y0_loc +1) . Because we

 761
Reticle: A Virtual Machine for Programming Modern FPGAs PLDI ’21, June 20ś25, 2021, Virtual, Canada

 t0 : i8 = mul (a ,b) @ ?? ; ∈ F +
 t1 : i8 = add (t0 ,c) @ ?? ;
 ∈ F [ , , ] ( : ) ∗ → ( : ){ + }
 (a) Intermediate program
 ∈ F : = ⊞ | ⊗ [ ∗ ] ( + )
 t0 : i8 = mul (a ,b) @ dsp (?? , ?? );
 ∈ F lut | dsp
 t1 : i8 = add (t0 ,c) @ dsp (?? , ?? );

 (b) Assembly program, cost=2 ⊗ ∈ ⊞ ∈ 
 ∈ ∈ ∈ ∈Z
 t0 : i8 = muladd (a ,b ,c) @ dsp (?? , ?? );

 (c) Assembly program, cost=1
 Figure 9. The Target Description Language.

Figure 8. Example of an intermediate program (a) and two reg [ lut ,1 ,2]( a: i8 , en : bool ) -> (y: i8 ) {
equivalent assembly programs (b,c) with different costs. y: i8 = reg [0]( a , en );
 }

used the same variables, the two instructions have a place- add [ lut ,1 ,2]( a: i8 ,b: i8 ) -> (y: i8 ) {
 y: i8 = add (a ,b );
ment relationship: they have the same x_loc (column), and
 }
the second instruction is right after the first. Anecdotally,
we were inspired by the Lava hardware language [5], on the add_reg [ lut ,1 ,2]( a: i8 ,b: i8 , en : bool ) -> (y: i8 ) {
benefits of incorporating layout semantics to perform layout t0 : i8 = add (a ,b );
optimizations (discussed further in related work, Section 9). y: i8 = reg [0]( t0 , en );
 }

5 Compilation
Our hardware compiler performs a series of transformations Figure 10. Example of an FPGA target described using the
to convert and optimize a source intermediate program into target description language (Figure 9). This hypothetical
a target structural representation. The transformations in target supports three assembly instructions (reg, add, and
the compiler are described in Figure 7, including instruc- add_reg), which are implemented using LUTs and have area
tion selection, layout optimizations, instruction placement, and latency cost of 1 and 2 respectively.
and code generation. Each of these program transformations
progressively increases the level of detail of the compiler tar-
 the best implementation depends on the target-specific costs
get such as: target-independent, family-specific, and device-
 of these instructions.
specific transformations.
 Reticle’s instruction selector uses a target definition that
5.1 Instruction Lowering describes the instructions available for a specific FPGA fam-
 ily. The target description gives the area and latency costs
The Reticle compiler is responsible for lowering the abstract for each assembly instruction along with its semantics in
intermediate language to the concrete assembly language. terms of intermediate language instructions.
The core problem is instruction selection, i.e., choosing a high-
quality sequence of assembly instructions that have the same Target Description Language. Because the availability
semantics as the original intermediate instructions. A key of different low-level hardware operations (and their costs)
consequence of Reticle’s design is that the problem is similar can vary across FPGA families, the Reticle compiler needs a
to instruction selection in a traditional software compiler [2] mechanism for describing a target platform. We designed a
but applied to the hardware domain. This is not the first target description language that allows succinct specification
time instruction selection has been proposed for hardware of assembly instructions supported by a given FPGA target;
compilation [9, 26, 27]. Whereas today’s RTL toolchains rely it is specified in Figure 9.
on slow, unpredictable metaheuristics to do a similar logical- In FPGA terms, a target is defined as a set of devices that
to-physical mapping [33], the Reticle compiler can leverage support the same kinds of primitives, and it is often referred
the large body of work on efficient, deterministic instruction as an FPGA family or series. Devices within a family can be
selection algorithms to achieve the same effect. programmed with the same set of assembly instructions, and
 Figure 8a shows an example of instruction selection in only differ on the number of instructions that are capable to
Reticle. The intermediate-language program in Figure 8a is accommodate spatially. Moreover, devices only differ on the
semantically equivalent to both assembly programs in Fig- number of DSPs and LUTs supported (columns and rows)
ures 8b and 8c, assuming a target architecture that supports and how their columns are arranged i.e., six columns of LUTs
mul , add , and muladd assembly instructions. The choice of followed by one column of DSPs.

 762
PLDI ’21, June 20ś25, 2021, Virtual, Canada Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman, and Luis Ceze

 Concretely, a target description is defined as a list of as- t0 : i8 = muladd (a ,b , in ) @ dsp (?? , ?? );
 t1 : i8 = muladd (c ,d , t0 ) @ dsp (?? , ?? );
sembly definitions that represents all the assembly in-
structions supported by a specific family. Each definition c a co
has an operation name , a hardware resource that the d b DSP y t1
operation occupies, and area and latency costs as integers c ci

 , and the (typed) inputs and outputs to the operation. The
 a a co
definition also has a body that defines its semantics in terms
 b b DSP y
of intermediate instructions. The body consists of a sequence in c ci
 t0

of instructions that resemble an intermediate language pro-
 (a) Without cascading, regular routing
gram, without cycles (DAG) or information. The in-
struction selection algorithm uses the body and costs to t0 : i8 = muladd_co (a ,b , in ) @ dsp (x ,y );
determine when a fragment of an intermediate language t1 : i8 = muladd_ci (c ,d , t0 ) @ dsp (x ,y +1);
program can be replaced with an equivalent target-specific
 c a co
assembly instruction. d b DSP y t1
 Figure 10 lists an example target in this specification lan- c ci
guage. This hypothetical target supports three assembly in- t0
structions: reg , add , and add_reg . a a co
 b b DSP y
 Instruction Selection. The steps for performing instruc- in c ci

tion selection include: data-flow graph (DFG) generation, (b) With cascading, high-speed routing
tree-partitioning and selection. Initially, the intermediate
program is converted to a DFG, where nodes represent in-
 Figure 11. Example of optimizing the layout of an assembly
structions and inputs, and edges correspond to how data
 program (a) using instruction cascading. In (b), the unknown
flow through the program.
 location specifiers (ł??”) have become parametric layout ex-
 Once the DFG is created, the graph is partitioned into trees
 pressions over and coordinates. They imply the adjacency
of intermediate instructions. The reason behind this parti-
 constraint in , while still being place-able almost anywhere.
tioning is the fact that the DFG might contains cycles, which
 These constraints can be solved later, during the instruction
are not supported by tree-covering algorithms. Because the
 placement step, for a given device.
Reticle definition of well-formed programs excludes combi-
national cycles (see Section 6.1), we know that simply cutting
on register operations is sufficient to make valid trees. The
procedure for partitioning the DFG into trees consists on 5.2 Layout Optimizations
finding the nodes in the graph that are root candidates to After instruction selection, the Reticle compiler can further
make a cut. There are two conditions required to be a root optimize assembly programs by placing them into high-
node, (1) the node must be a compute instruction, and (2) its performance spatial layouts. Layout optimizations can be
outgoing edges must be greater than one or none; compute expressed as constraints in the assembly language, using
nodes without outgoing edges represent outputs, meanwhile coordinate expressions. The relative placement of target-
compute nodes with more than one outgoing edge can con- specific operations can have a large impact on the efficiency
tain cycles and therefore they are considered as root nodes. of a program. For example, by placing DSP-mapped opera-
 After tree-partitioning, the next step is selection, whose tions within the same column, programs can take advantage
goal is to transform and optimize these trees of compute of DSP cascading: leveraging high-speed routing resources
instructions into assembly instructions using the target de- available within DSP columns [42]. Hardware support for
scription specification. Instruction selection is performed DSP cascading is widely available in most architectures to-
using a linear-time tree-covering algorithm originally devel- day, including FPGAs designed by Intel [23], Xilinx [47],
oped for code generation in compilers [2]. The procedure is Lattice [30], and Achronix [1].
based on dynamic programming, using previous solutions Figure 11a shows an example containing a pair of muladd
to create better solutions at every node while traversing the instructions without any layout constraints. There are mul-
tree in a postorder fashion. Then, the solutions (assembly tiple valid layout candidates for this program; however, the
instructions) from every tree are composed to produce a version in Figure 11b, which places the operations vertically
final assembly program. The assembly instructions in this adjacent in the same DSP column, is far faster than one that
program have unknown locations (coordinate holes) that scatters the operations across different columns or more
are further optimized spatially (if necessary), and later re- distant within the same column.
solved by the instruction placement stage in the compiler A Reticle assembly program can express this layout op-
for a specific device. timization as a layout constraint using expressions for the

 763
Reticle: A Virtual Machine for Programming Modern FPGAs PLDI ’21, June 20ś25, 2021, Virtual, Canada

placement coordinates on each instruction. By using x as If Z3 cannot find a valid placement for every instruction,
the column for both operations and y and y +1 as the row placement fails.
expression, the assembly program describes a placement of Once a valid placement is found for each instruction,
neighboring DSPs. Moreover, the semantic of the muladd_co the layout engine optionally performs a series of shrink-
assembly instruction means that not only the DSP is con- ing passes as an optimization. It computes the highest - and
figured to perform the muladd operation but must use the -coordinate for each resource type, takes this as a maximum
cascade port ( co ) instead of the default port ( y ) for the result. area, then uses a binary search to successively re-run place-
Similarly, the muladd_ci instruction uses the cascade input ment with an artificially reduced area. If it succeeds, the next
port ( ci ) instead of the default port ( c ) for the partial sum. iteration shrinks again; if it fails, binary search is repeated in
 Notably, this and other parameterizable layout optimiza- the new interval. The end result is a more compact physical
tions can be ported within an FPGA family using our assem- layout on the FPGA.
bly language, and later solved in the compilation pipeline
i.e., instruction placement, for maximum portability. 5.4 Code Generation
 The goal of code generation is to expand assembly and wire
5.3 Instruction Placement instructions into structural Verilog with layout annotations
 (Figure 2c). Because of the work of our prior compiler passes,
After instruction selection and layout optimizations have
 this step is purely one of generation Ð we simply need to
taken place, all assembly instructions must be placed in a
 create valid Verilog that reflects our accumulated decisions.
valid position on the target device. Therefore, the placement
 While this transformation is more complex than a conven-
procedure consists of converting a family-specific program
 tional assembly to a binary format, it conceptually serves
(unresolved locations) into an equivalent device-specific pro-
 the same role Ð converting to a format that can be given to
gram (resolved locations) as described in Figure 7; finding
 program a hardware target.
a unique value for each coordinate variable used, as well as
 Instructions have been selected, optimized, and placed;
filling in all wildcards (??).
 now, based on the resources previously chosen for each in-
 Deciding a physical layout consists of mapping all assem-
 struction, they are expanded to a set of primitive LUTs or
bly instructions to specific FPGA resources, for a specific
 DSPs. DSP-based instructions are converted into a DSP prim-
target FPGA. Each instruction will already have undergone
 itive with a proper configuration in terms of ports and at-
selection, so the task is reduced to finding a mapping for
 tributes to execute the desired instruction. On the other hand,
each LUT instruction to an available LUT slice and each DSP
 LUT-based instructions require configuring a LUT for every
instruction to an available DSP slice. All modern FPGAs are
 bit of computation. The reason for this is that these primi-
constructed as columns of resources; the layout engine takes
 tives produce a single-bit output and not a full word. (For
as input the layout of the target FPGA Ð specifically, which
 example, one 8-bit integer operation requires 8 LUTs.) Addi-
columns are DSPs and LUTs, and how many entries or slices
 tionally, there are instructions e.g., addition or subtraction
those columns have.
 that require other primitives also present within a LUT slice
 Notably, LUT column slices are different from DSP slices,
 such as carry chains. In any case, each primitive is anno-
due to the fact that LUT slices host more than one pro-
 tated with the coordinate result obtained in the instruction
grammable resource. We formulate the placement problem
 placement step.
in terms of these slices.
 Not every instruction will result in instantiating LUTs or
 To solve and optimize layout, we use the Z3 SAT solver [13].
 DSPs. As we expect, wire operations consume no area to
The layout problem is expressed as a series of constraints for
 execute (they simply require different wiring). These instruc-
each instruction, which are fed to the solver. Z3 quickly finds
 tions are generated as direct structural Verilog expressions
a valid coordinate assignment for each instruction, subject
 without location information.
to the following constraints:

 • The -coordinate must match a column of the appro- 6 Implementation
 priate resource (DSP vs LUT); We implemented a Reticle compiler in 8662 LoC in Rust,
 • The -coordinate must be between 0 and the maximum together with a Verilog AST library (2486 LoC). We used
 number of resources for that type of column; this AST library for code generation. Additionally, a target
 • If there is a relative constraint placed (as described library describing the assembly instructions supported by
 in Section 5.2) such that this instruction must follow the Xilinx UltraScale FPGA is written in 444 lines, using our
 another at 1 , then the -coordinate must be at 1 + 1; target description language (TDL). The instruction placement
 • All instruction resources are unique (this instruction’s is implemented in 117 lines of Python, using the Z3 bindings.
 coordinates cannot match any other instruction’s co- The following two subsections explain details about the
 ordinates). well-formedness criterion and the interpretation of programs.

 764
PLDI ’21, June 20ś25, 2021, Virtual, Canada Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman, and Luis Ceze

 t0 : bool = const [1];
 Algorithm 1: Reticle interpreter.
 t1 : i8 = const [4];
t0 : i8 = const [4]; t2 : i8 = add (t3 , t1 ) @ ?? ; 1 function Interpreter(trace, program)
t1 : i8 = add (t1 , t0 ) @ ?? ; t3 : i8 = reg [0]( t2 , t0 ) @ ?? ; 2 ( , , ) ← WellFormedCheck( );
 3 ( ,outputs) ← GetPortNames( );
 (a) Ill-formed (b) Well-formed
 4 ← Trace();
 5 foreach step_in ∈ trace do
Figure 12. Example of an ill and well formed program. A
 6 ← Update( , _ , );
well-formed program only allows cycles when stateful in-
 7 ← Eval( , );
structions such as reg are present in the path of the cycle.
 8 _ ← Step( , );
 9 ← Push( , _ );
6.1 Well-Formedness 10 ← Eval( , );
In hardware design, programs typically need to avoid combi- 11 end foreach
national loops: register-free cycles in the wiring graph that 12 return T ;
would produce undefined behavior [39]. In Reticle, this con- 13 end function
straint manifests as a well-formedness criterion. The depen-
dency graph for a well-formed program, in both the interme-
diate and assembly language variants, must be acyclic (a dag)
when register instructions ( reg ) are removed. This section gives users a fast, convenient way to debug their programs
describes how we define and check this criterion. without having to actually program an FPGA.
 Figure 12 shows examples of ill-formed and well-formed The first step in the interpreter consists of checking that a
Reticle intermediate language programs. Both programs at- program is well-formed (line 2). This procedure topologically
tempt to increment a stored value by a constant value 4. And sorts the instructions in the function (see Section 6.1) and
both programs contain dependence cycles: in general, cycles returns two sorted instruction queues and an environment
are required for instructions to reuse their own outputs as ar- containing the initial value of every register instruction.
guments later in time. However, Figure 12b’s cycle includes The returned queues include one queue of pure instructions
a reg instruction while Figure 12a’s has a combinational and another queue with register instructions.
(register-free) loop. For every step in the input trace, the interpreter updates all
 The Reticle implementation checks well-formedness by variables in the environment with new values _ 
forming a dependence graph for a given function, where (line 6). Then, pure instructions are evaluated under the
the vertices are instructions and the edges are definitionś current environment (line 7). Next, a new step value is cre-
use relationships. It then sorts nodes in topological order, ated for all variables and pushed into the output
excluding reg instructions. If the sort procedure succeeds, trace (lines 8 - 9). Finally, register instructions are evalu-
the program is well-formed. ated, updating the environment for the step in the .
 Reticle differs from many traditional hardware tools in
rejecting programs with combinational loops. Many inter- 7 Evaluation
preters (a.k.a. simulators) for hardware description languages We evaluated Reticle by generating programs for linear al-
(HDLs) such as Verilog and VHDL silently produce undefined gebra operators and control coroutines (Section 7.1), and
or x-values instead of producing errors [46]. Hardware engi- then compiled them to structural Verilog with layout anno-
neers must therefore carefully avoid creating these cycles or tations (Figure 2c) using the compilation pipeline described
risk obscuring serious bugs. We instead opt to reject these in Section 5. We also compiled these benchmarks to two
programs ahead of time to avoid the need to handle this behavioral Verilog baselines for a standard vendor toolchain,
undesired behavior during compilation and interpretation. and compared their compilation time and the quality of the
 resulting hardware.
6.2 Interpreter Furthermore, the two behavioral Verilog baselines include:
To clearly define the meaning of a Reticle program, we de- (1) one using standard, portable Verilog, and (2) an advanced
fine an interpreter in Algorithm 1 that evaluates a function version using vendor-specific synthesis hints. The latter rep-
program by stepping through a sequence of values defined resents the use of ad hoc and vendor-specific Verilog lan-
in an input trace1 and producing an output trace T. This guage extensions that can tune the toolchain to do a better
 job of mapping the program to the FPGA’s fixed-function
1 A łtrace” is a general term for a map of values present in a hardware circuit.
 resources, and it represents significant implementation effort
Each variable (circuit element) in the trace has an associated value for every
clock cycle in the domain. An input trace gives a complete specification beyond standard RTL design. We generate these baselines by
for a circuit’s inputs, for every cycle, while an output trace does so for the transforming Reticle programs using translation backends
outputs. that emit code resembling standard behavioral Verilog.

 765
Reticle: A Virtual Machine for Programming Modern FPGAs PLDI ’21, June 20ś25, 2021, Virtual, Canada

 2 4000
 10
 Compiler speedup (log)

 3
 300

 Run-time speedup
 3000
 Lang

 DSPs used
 LUTs used
 2 200 base
 1
 10 2000 hint
 reticle
 1 100
 1000
 0
 10
 0 0 0
 64 128 256 512 64 128 256 512 64 128 256 512 64 128 256 512
 Size Size Size Size
 (a) tensoradd

 8000
 Compiler speedup (log)

 2
 10 1.5 150
 Run-time speedup

 6000 Lang

 DSPs used
 LUTs used
 1.0 100 base
 1 4000 hint
 10
 reticle
 0.5 50
 2000
 0
 10
 0.0 0 0
 5x3 5x9 5x18 5x36 5x3 5x9 5x18 5x36 5x3 5x9 5x18 5x36 5x3 5x9 5x18 5x36
 Size Size Size Size
 (b) tensordot
 1.0
 60
 0.04
 Compiler speedup (log)

 2
 10
 Run-time speedup

 0.8 50
 0.02
 Lang

 DSPs used
 LUTs used

 0.6 40
 base
 1 0.00
 10 30 hint
 0.4 reticle
 20 −0.02
 0.2
 10 −0.04
 0
 10
 0.0 0
 3 5 7 9 3 5 7 9 3 5 7 9 3 5 7 9
 Size Size Size Size
 (c) fsm

Figure 13. Compiler, run-time, and utilization results of three benchmarks, tensoradd (a), tensordot (b), and fsm (c), when
using behavioral Verilog (base), behavioral Verilog with DSP hints (hint), and Reticle (reticle).

 We use a Xilinx xczu3eg-sbva484-1 FPGA, with 360 DSPs number of states (3, 5, 7, 9) based on input values. The moti-
and 71K LUTs, as a target device. For the baseline RTL vation is to show that Reticle programs can describe control-
toolchain, we use Xilinx’s Vivado 2020.1. oriented programs normally found in hardware processor
 schedulers and protocols. More importantly, these programs
 can only be implemented on LUTs, not DSPs: conditional
7.1 Benchmark Description branching requires multiplexing (the mux instruction in Reti-
We use three benchmarks, intended to represent three dis- cle), which it is implemented using only LUT-based logic.
tinct facets of Reticle: a tensor addition kernel tensoradd
demonstrates vectorization, a dot product implementation
tensordot demonstrates fused operations and cascading, 7.2 Results Comparison
and a finite state machine fsm demonstrates support for Compiler speed. The leftmost plots in Figure 13 compare
control-oriented programs. Each benchmark is parameter- the compilation time for Vivado (labeled base for standard
ized with four sizes. Verilog and hint for directive-laden Verilog) and our com-
 The tensoradd benchmark consists of an element-wise piler (reticle), when compiling and placing (layout) pro-
summation over four different one-dimensional tensor sizes grams for every benchmark described in Section 7.1. The
(128, 256, 512, 1024). We pipelined the addition operation Reticle compiler is between 10 and 100 times faster than
with register instructions to get the best possible perfor- Vivado. By starting with programs at a lower level of ab-
mance available in DSP primitives. straction, the Reticle compiler is solving a simpler problem
 Next, tensordot consists of five systolic arrays [29] per- than a traditional HDL toolchain like Vivado. The Reticle
forming the dot operation over five pairs of one-dimensional compiler focuses exclusively on selecting and configuring
tensors of four different sizes (3, 9, 18, 36). the FPGA’s coarse-grained heterogeneous resources; an RTL
 Finally, fsm is based on a coroutine, implemented as a toolchain also attempts to perform bit-level logic synthe-
hardware finite state machine (FSM), that ranges over some sis [8] to transform behavioral descriptions into structural

 766
PLDI ’21, June 20ś25, 2021, Virtual, Canada Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman, and Luis Ceze

realizations, which is important for traditional circuit gen- Lastly, the fsm benchmark shows the performance of
eration but does not directly affect the mapping to modern control-oriented programs when mapped to LUTs. This kind
units like DSPs. of control logic is a kind of pathological case for Reticle: there
 The compilation performance gains in linear algebra bench- is no way to use hardened logic resources like DSPs, which
marks (tensoradd, tensordot) monotonically decreases as are Reticle’s main target, and traditional HDL toolchains
the sizes of the tensors grow, which translates into more use complex logic synthesis optimizations to minimize the
DSPs to be placed by Reticle’s SMT-based layout mechanism. number of LUTs they require. Our aim with this benchmark
On the other hand, the speedup obtained when compiling is to show that Reticle can nonetheless support this kind
the fsm benchmark is somewhat average due to the fact the of synthesis and that the performance is not much worse
number of used LUTs are relatively low. from a heavily engineered behavioral HDL toolchain. In this
 case, Reticle produces fsm programs that are slower than
 Run-time performance. The second plots in Figure 13 Verilog’s results. While the Reticle compiler focuses on ex-
show run-time speedup for Reticle over Vivado, which is the tracting peak performance from hardened logic units like
ratio between the running times for the generated FPGA- DSPs, it nonetheless supports LUT-based compilation with
based programs from the different compilers. Here, a running much faster compilation and some performance penalty.
time is the critical path of the hardware circuit, which de-
termines the maximum clock frequency at which hardware
operates. For tensoradd, Reticle-generated programs are Utilization. The final two plots in the rows of Figure 13
faster than the standard Vivado baseline for all tensor sizes compare the FPGA resources used by the generated pro-
because of the performance advantages of using the hard- grams. The aim here is to show how the difference in the
ened units in DSPs compared to LUT-based logic. Vivado’s resource binding policies for Reticle versus Verilog. With
heuristics fail to exploit DSPs at all using a pure behavioral Verilog, Vivado’s job is to search for any implementation that
description ( base ); Reticle, in contrast, maps the program to matches the behavioral descriptionÐany resource-binding
DSP hardware deterministically. hints are łsoft” and the compiler can ignore them. In contrast,
 Surprisingly, even though there is hardware support for Reticle placement and resource annotations are łhard”: the
vectorization in every DSP of Xilinx FPGAs, Vivado fails to compiler predictably allocates exactly the kind of resource
use this feature when using behavioral representation even that the programmer requested.
in the presence of compiler hints. Vivado fails to exploit The benchmarks’ resource utilization reveals this unpre-
vectorization even for this simple, dependency-free parallel dictability in Vivado as the sizes vary. In Reticle, both linear
workload. Reticle successfully selects vectorized DSP config- algebra benchmarks use vector instructions and chained
urations in every case. While vectorized configurations make instructions ( mul followed by an add ) that the compiler de-
more area-efficient use of DSP resources, they are slightly terministically maps to DSPs. Vivado performs a heuristic
slower than scalar operations on DSPs. This phenomenon mapping based on the availability of resources, resulting on
explains why the hint-laden Verilog versions can be slightly unpredictable behavior that, for example, silently replaces
faster than Reticle for some sizes: when sufficient DSP re- DSPs with LUTs in the largest size of tensoradd.
sources exist on a target, Vivado can heuristically select
scalar operations (at tensor sizes 64, 128, and 256). However,
Vivado’s heuristic approach fails when the program grows 8 Discussion
larger, i.e., at a tensor size of 512: a scalar configuration ex-
 This section describes the requirements and optimization
hausts all the DSPs on the target, and the toolchain silently
 opportunities for front-end tools when targeting Reticle.
falls back to using slower LUT-based implementations in-
stead. At this latter configuration, the Reticle-generated vec-
torized program is nearly 3× faster than the Verilog program,
both with and without hints. A differently annotated Reticle 8.1 Requirements
program could express the scalar configuration as well; we When compiling from a higher-level language or tool, the fol-
focus specifically on the vectorized version here to show the lowing compilation steps may be necessary depending on the
differences with a traditional HDL toolchain. features and abstractions of the source tool. Reticle is built
 Next, the tensordot benchmark shows the benefits of around instructions and does not have higher-level features
cascading DSPs (Section 5.2). The latest version of Vivado for control-flow, dynamic scheduling of operations, or am-
(2020.1) is capable of applying this type of cascade optimiza- biguous resource sharing; these features, if present, must be
tion when using hints, similar to our compiler, at the expense compiled down to Reticle instructions. Notably, the absence
of compilation time (up to 100 times slower in the worst case). of these features in Reticle does not exclude any categories
The performance is the same for Reticle and Verilog with of hardware design, but rather requires that higher-level fea-
hints, and both outperform plain Verilog. tures are translated to explicit, łstructural” implementations.

 767
Reticle: A Virtual Machine for Programming Modern FPGAs PLDI ’21, June 20ś25, 2021, Virtual, Canada

 t0 : i8 = mul (a ,b ); t0 : i8 = add (a ,b );
 t1 : i8 = add (t0 ,c ); t1 : i8 = add (c ,d );
 t2 : i8 = reg [0]( t1 , en ); // cycle 0 t2 : i8 = add (e ,f );
 t0 : i8 = add (a ,b ); t3 : i8 = add (g ,h );
 (a) Scheduled in one cycle
 (a) Sequential (b) Parallel
 t0 : i8 = reg [0]( a , en ); // cycle 0
 t1 : i8 = reg [0]( b , en ); // cycle 0 Figure 15. Example of two different resource sharing strate-
 t2 : i8 = mul (t0 , t1 ); gies for a program that computes four additions. (a) The
 t3 : i8 = reg [0]( t2 , en ); // cycle 1 sequential implementation, using one instruction in four
 t4 : i8 = add (t3 ,c ); units of time, and (b) the parallel implementation, using four
 t5 : i8 = reg [0]( t4 , en ); // cycle 2
 instructions in one unit of time.
 (b) Scheduled in three cycles
 t0 : i8 = add (a ,b );
 t1 : i8 = add (c ,d );
Figure 14. Example of two different scheduling solutions for
 t2 : i8 = add (e ,f );
a program that computes the expression ∗ + . t3 : i8 = add (g ,h ); t0 : i8 = add (a ,b );

 (a) Scalar program (b) Vector program

 Figure 16. Example of two equivalent programs computing
 Control flattening. This step involves flattening any con- four additions in parallel: a scalar program (unoptimized)
trol structures available in the high-level languages to Reti- and a vector program (optimized).
cle instructions. For example, if-then-else constructs or phi-
nodes can be lowered to mux instructions as:
 8.2 Optimizations
t0 : i8 = mux ( cond ,a ,b ); The following compilation steps are not required to generate
 a Reticle program; however, they provide important opportu-
 nities for higher-level tools to take advantage of information
 Scheduling. This task consists of choosing when abstract present in the source program and use it for optimization.
operations run by mapping them onto clock cycles and insert-
 Vectorization. The goal of this optimization is to com-
ing registers. Scheduling decisions impact the performance,
 bine independent scalar instructions, that are scheduled at
including throughput and run-time, of programs. For exam-
 the same clock cycle, into vector instructions (Section 4.)
ple, Figure 14 describes two schedules for a program that
 Front-end tools can promote the use of vector instructions
computes the abstract expression × + . The Reticle pro-
 in Reticle by using vector types; alternatively, more complex
gram in Figure 14a produces a result every clock cycle, while
 optimizations can attempt to automatically combine scalar
the program in Figure 14b does so every three clock cycles.
 operations into vector expressions. The benefits of vectoriza-
Depending on the target primitives, certain schedules can be
 tion are twofold: faster programs, due to high-performance
more profitable than others, because the register distribution
 primitives (i.e, DSPs), and better resource utilization, because
within DSPs and LUTs slices varies considerably.
 more operations are mapped to the same primitive. For ex-
 ample, the two programs shown in Figure 16 describe the
 Resource sharing. This step includes assigning abstract result of this optimization. Generating individual instruc-
operations to Reticle instructions, which are eventually low- tions for each arithmetic operation is valid, but vectorization
ered to physical resources by the Reticle compiler. Resource can provide large gains in performance and efficiency.
sharing strategies often involve space-time trade-offs and,
similar to scheduling, affect program performance. For exam- Resource binding. The purpose of this optimization is
ple, Figure 15 shows two different strategies for a program to control how instructions are bound to primitives. The
that computes four additions. The program described in Fig- Reticle IR supports such control via the resource annotations
ure 15a uses one add instruction sequentially to compute the dsp or lut , as described in Figure 17a and 17b respectively.
four additions, taking four units of time, while the program These annotations are not required; if omitted, the Reticle
in 15b uses four instructions for computing all additions compiler will assign resources according to internal metrics
in one unit of time (in parallel). If the input tool does not as described in Section 5. Interestingly, these constraints can
distinguish between these cases, or uses the ambiguity for be exploited by higher-level tools to optimize programs for
optimizations, the final space/time decision must be made metrics the Reticle compiler does not, by default, accom-
before outputting Reticle code. modate. For example, rather than prioritizing performance,

 768
You can also read