Unifying Performance and Security Evaluation for Microarchitecture
                           Design Exploration

                               A Thesis Presented
                                        by

                                  Griffin Knipe

                                        to

            The Department of Electrical and Computer Engineering

                     in partial fulfillment of the requirements
                                 for the degree of

                                Master of Science

                                        in

                             Electrical Engineering

                            Northeastern University
                             Boston, Massachusetts

                                  April 20, 2021
To Krina, for sharing your determination to a life of play.

Contents

List of Figures

List of Tables

Acknowledgments

Abstract of the Thesis

1   Introduction

2   Background
    2.1   Instruction Set Architecture
          2.1.1   Architectural State
          2.1.2   Assembly
          2.1.3   Virtual Memory
    2.2   Microarchitecture
          2.2.1   Pipelining
          2.2.2   Handling Control-flow Instructions
          2.2.3   Data Hazards in the 5-stage Pipeline
          2.2.4   Out-of-order Execution
          2.2.5   Memory Hierarchy
          2.2.6   Transient Instructions
          2.2.7   Side-channels
          2.2.8   Transient Execution Side-channel Attacks
    2.3   Microarchitectural Security Verification

3   Related Work
    3.1   Performance Simulators
          3.1.1   Akita
    3.2   Formal Methods for Security Evaluation
          3.2.1   CheckMate
    3.3   Software Implementation Error Catching

4   Yori
    4.1   VEmu
          4.1.1   Verification and Performance
    4.2   Microarchitecture Simulation
          4.2.1   Instruction Fetch Unit
          4.2.2   Memory Unit
          4.2.3   Execution Backend Overview
          4.2.4   Decode
          4.2.5   Register-rename
          4.2.6   Reorder-Buffer
          4.2.7   Issue
          4.2.8   Register-read
          4.2.9   Execute
          4.2.10  Memory Operation Ordering Exception
          4.2.11  Verification
          4.2.12  Performance Discussion
    4.3   CheckMate Integration
          4.3.1   Communication Challenges
          4.3.2   Undesirable Solutions
          4.3.3   Request Driven Simulation
          4.3.4   3-stage Model Translation to CheckMate
          4.3.5   Yori as a Request Driven Simulator
          4.3.6   Remaining Challenges for Integration

5   Conclusion
    5.1   Contributions of this Thesis
    5.2   Future Work

Bibliography
List of Figures

 2.1   MIPS ISA single-cycle data-path (from Harris and Harris [1]).
 2.2   Simple 5-stage RISC-V pipelined data-path (from Hennessy and Patterson [2]).

 3.1   High-level Akita component topology.
 3.2   A generalized happens-before graph depicting two instructions traversing three stages
       in a microarchitecture.

 4.1   A high-level view of Yori’s simulation components.
 4.2   Communication flow between stages in the Execution Backend.
 4.3   High-level software topology of tools facilitating performance simulation and security
       evaluation.
 4.4   Akita model component topology for the simple 3-stage pipeline.
 4.5   Generic Akita connection employing default message buffering structures.
 4.6   The sequence of operations corresponding to default message handling in the 3-stage
       model.
 4.7   Sequence of operations, following the cycle shown in Figure 4.6, causing back-pressure
       in the 3-stage model.
 4.8   Sequence of operations for the request-driven communication design.
 4.9   Event-message sequence diagram for the 3-stage model Fetch stage.
 4.10  Event-message sequence diagram for the 3-stage model Commit stage.
 4.11  Event-message sequence diagram for the 3-stage model Memory Unit.
 4.12  Messaging overview for the Yori Execution Backend as a request-driven simulation.
List of Tables

 4.1   Transient instruction window lengths, measured in number of instructions, for the
       median micro-benchmark [3] on the Yori and Verilator simulators. Yori’s transient
       instruction window is far shorter than that of the Verilator model, on average. This
       motivates additional work for Yori to faithfully model SonicBOOM transient execution
       attacks.
 4.2   CPI, measured in cycles per instruction, for the Yori simulation running the median
       micro-benchmark [3]. Measurements are categorized by instruction type, as each type
       executes through a different hardware path. ALU and Jmp instructions both use the
       ALU functional unit of the Execute stage. Mem and CSR instructions use the LSU and
       the CSR Unit, respectively.
 4.3   CPI, measured in cycles per instruction, for the Verilator simulation running the median
       micro-benchmark [3]. Measurements are categorized by instruction type, as each type
       executes through a different hardware path.
Acknowledgments

           First, I’d like to thank Dr. David Kaeli, who generously accepted me into NUCAR and
supported me through my first research experience. He gave me the freedom to explore an unfamiliar
field, and with that countless opportunities to grow as an engineer, as a student, and as a person.
           Next, I’d like to thank Dr. Yunsi Fei for her feedback and guidance in this work.
           Thank you to Mike Ridge and Andrew Stone for encouraging me to pursue, and ultimately
secure, a Draper research fellowship. Andrew has been a key mentor in my engineering education
through multiple internships and work opportunities.
           To my peers at NUCAR, I am inspired by your work and determination. Thank you for
welcoming me into your group and sharing your thoughts and ideas. I hope we can work together
again in the future.
           To Mom, Dad, and Bella, I wouldn’t be here if not for your love and support. Thank you.

Abstract of the Thesis

    Unifying Performance and Security Evaluation for Microarchitecture
                                       Design Exploration
                                                   by
                                             Griffin Knipe
                           Master of Science in Electrical Engineering
                              Northeastern University, April 20, 2021
                                      Dr. David Kaeli, Advisor

           Computer architects develop microarchitectural features that boost instruction-level paral-
lelism to improve CPU performance. While performance may be improved, adding new features
increases the CPU’s design complexity. This further compounds the effort required to complete
design verification. Trustworthy design verification is paramount to microarchitecture design, as
silicon chips cannot easily be patched in the field.
           Despite the best efforts for security verification, researchers have created transient execution
side-channel attacks which can exploit microarchitecture performance features to leak data across
ISA-prescribed security boundaries. This motivates the unification of performance evaluation and
security verification techniques to ensure that new microarchitectural features are understood from
multiple design perspectives.
           This thesis presents Yori, a RISC-V microarchitecture simulator that aims to enable
computer architects to evaluate microarchitecture performance and security using a single framework.
As Yori is a work in progress, this thesis presents the work to date, focusing on a detailed model of
the reference microarchitecture and an evaluation of the current model’s accuracy. We describe a
viable methodology for interfacing the Yori simulator with an existing security verification tool. We
conclude the thesis by laying out a plan to complete this marriage of performance and security.

Chapter 1

Introduction

          Computer architects work to improve a CPU’s power efficiency, reliability, form-factor,
security and performance with each CPU design iteration. These design aspects are tightly coupled,
such that microarchitectural features aimed at addressing a subset of the design aspects will surely
affect the others. This observation is supported by the discovery of transient execution side-channel
attacks [4, 5, 6, 7, 8], which exploit microarchitectural performance features that violate security
constraints. Malicious attackers can use these exploits to uncover protected data on common
consumer CPUs.
           Transient execution side-channel attacks come in many varieties. They differ in how
they trigger an attack [4, 5] and in how they achieve data leakage [9, 10]. Designers have
produced new microarchitectural defenses that are resilient to previously known attacks, though
researchers continue to discover new sets of attacks daily on new designs [6]. Such cases highlight
how secure microarchitecture design is always playing catch-up with new vulnerabilities. Researchers
can create software mitigations for existing microarchitectures and hardware mitigations for future
microarchitectures, but these protections incur a performance cost [11]. This overhead is typically
unacceptable for an industry whose consumers expect a device to be secure and performant for its
lifetime. This challenge motivates us to explore a new approach to microarchitecture design.
          Ongoing research for a new approach to microarchitecture design centers around the
RISC-V [12] instruction set architecture (ISA). RISC-V is a customizable, open-source ISA that
is actively being studied across a large and growing research community. By utilizing RISC-V,
research projects may leverage free and sophisticated microarchitecture designs [13, 14, 15], have
access to ever-expanding software support [16], and utilize a rich set of microarchitectural simulation
tools [17, 18, 19].


          This thesis presents Yori, a RISC-V microarchitecture simulator that aims to marry per-
formance characterization with the formal verification of security constraints. Where these types of
evaluations currently require a designer to implement the microarchitecture in separate frameworks,
this tool should support both given a single imperative description of the hardware.
          This thesis is organized as follows. In Chapter 2 we review many of the background
concepts utilized in this work. Chapter 3 describes prior work on the class of problems that Yori is
targeting. Chapter 4 presents Yori, and provides an analysis of its performance. The thesis concludes
with Chapter 5 and includes a discussion of directions for future work.

Chapter 2

Background

2.1     Instruction Set Architecture

          The ISA defines a program’s perspective of hardware state during program execution and
the operations by which the program can update the hardware state [2]. In essence, an ISA describes
the interface between software and the logical data structures on-chip. Programmers and computer
architects must adhere to the ISA to ensure correct program execution. Here, the state of the logical
data structures exposed by the ISA will be referred to as the architectural state.

2.1.1   Architectural State

          Consider the base configuration of the RISC-V ISA known as RV64I. The qualifier RV64I
indicates that the ISA is configured to operate upon 64-bit integer data. The architectural state
exposed in this configuration includes 32 integer registers, 4096 control-status registers (CSRs), and
2^64 bytes of memory [12]. The program stores its execution state in these architectural data structures
and operates upon the data to carry out the behavior described by the programmer.
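This architectural state can be sketched as a minimal Python model (an illustrative toy, not part of any real simulator; the sparse dictionary stands in for the 2^64-byte address space, which cannot be allocated directly):

```python
# Illustrative sketch of RV64I architectural state (hypothetical model,
# not drawn from any specific simulator implementation).
class ArchState:
    def __init__(self):
        self.pc = 0                      # program counter
        self.x = [0] * 32                # 32 integer registers, x0..x31
        self.csr = {}                    # up to 4096 control-status registers
        self.mem = {}                    # sparse model of the 2**64-byte address space

    def write_reg(self, rd, value):
        # x0 is hard-wired to zero in RISC-V; writes to it are discarded.
        if rd != 0:
            self.x[rd] = value & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits

state = ArchState()
state.write_reg(0, 123)   # ignored: x0 stays 0
state.write_reg(5, 42)
```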

2.1.2   Assembly

          The operations for updating architectural state are known as assembly instructions. A
programmer creates a program by writing assembly instructions manually or by describing the
program’s behavior using a high-level programming language. A compiler translates the high-
level programming language to assembly instructions. For instance, programs written in the C
programming language [20] are translated to assembly by a compiler, such as GCC [21].


          Each instruction is encoded as a bit-word which corresponds to the instruction’s operation
and, for some instructions, data operands and destinations. An RV64I instruction can be classified
as an arithmetic, control-flow, or data-transfer instruction [2]. Arithmetic instructions complete a
mathematical operation, given its register operands, and write the result to the destination register
(e.g. add, sub). Control-flow instructions enable non-linear instruction execution by indicating
which instruction should execute following the control-flow instruction (e.g. beq). This class of
instructions also includes instructions for enforcing memory operation order (e.g. fence). Data-
transfer instructions facilitate the movement of data between the data registers and memory (e.g. lw,
sw).
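This three-way classification can be illustrated with a toy lookup over a handful of mnemonics (the sets below are an illustrative subset chosen for this sketch; a real decoder inspects the opcode fields of the 32-bit instruction word):

```python
# Toy classifier for a handful of RV64I mnemonics (illustrative subset only;
# real decoding examines opcode bits, not mnemonic strings).
ARITHMETIC    = {"add", "sub", "addi", "and", "or", "xor"}
CONTROL_FLOW  = {"beq", "bne", "blt", "bge", "jal", "jalr", "fence"}
DATA_TRANSFER = {"lw", "ld", "sw", "sd", "lb", "sb"}

def classify(mnemonic):
    if mnemonic in ARITHMETIC:
        return "arithmetic"
    if mnemonic in CONTROL_FLOW:
        return "control-flow"
    if mnemonic in DATA_TRANSFER:
        return "data-transfer"
    return "unknown"
```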

2.1.3   Virtual Memory

          The RISC-V ISA also provides the means for multi-program execution where multiple
programs may execute concurrently on shared hardware [2]. To account for potentially malicious
software sharing hardware with other programs, the ISA prescribes that one program cannot covertly
access the data of another program. This is one of the motivations behind virtual memory [2, 12]
which gives each program an independent perspective of physical memory [2]. Physical memory
refers to the addressable storage that physically exists in hardware. Virtual memory maps
different portions of physical memory to each program. The virtual memory system provides an
extra layer of address translation to hide the exact location of this data in physical memory. That
is, a program uses a virtual memory address to access memory which the hardware translates to a
physical memory address. The hardware can then access the data at that physical address without the
program knowing where that data lies in hardware.
          Note that there are cases where programs may share portions of virtual memory to facilitate
communication or to reduce memory utilization where multiple programs read the same data [2]. By
sharing certain portions of virtual memory address space, the programs also share portions of the
corresponding physical memory address space.
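This extra layer of translation can be sketched with a toy flat page table (4 KiB pages are assumed for illustration; real RISC-V translation uses multi-level page-table schemes such as Sv39):

```python
PAGE_SIZE = 4096  # 4 KiB pages, assumed for illustration

def translate(page_table, vaddr):
    """Map a virtual address to a physical address via a flat page table
    (toy model; real RISC-V uses multi-level schemes such as Sv39)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise MemoryError("page fault: no mapping for this virtual page")
    return page_table[vpn] * PAGE_SIZE + offset

# Two programs with private mappings plus one shared physical page.
prog_a = {0: 10, 1: 77}   # virtual page -> physical page
prog_b = {0: 42, 1: 77}   # virtual page 1 is shared between the programs

assert translate(prog_a, 0x10) != translate(prog_b, 0x10)                     # private pages differ
assert translate(prog_a, PAGE_SIZE + 8) == translate(prog_b, PAGE_SIZE + 8)   # shared page matches
```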

2.2     Microarchitecture

          The microarchitecture design comprises the physical circuit implemented on-chip. It
exposes the interface described by the ISA while accounting for real-world constraints, including
power, reliability, chip utilization, and financial cost. To balance these constraints with the need


for performance, a microarchitecture may generate and track state, in addition to architectural state.
Here, this will be referred to as microarchitectural state.
          Computer architects can use microarchitectural state to employ clever algorithms to im-
prove a chip design. As execution progresses and an instruction completes, its microarchitectural
state will be made available to the program as architectural state. It is the computer architect’s
responsibility to ensure that microarchitectural state is not revealed as architectural state in such
a way that violates the ISA. Such cases can cause data leakage [9, 10] across boundaries pre-
scribed by the ISA [22]. Here, we’ll review a few fundamental microarchitectural features and the
microarchitectural state that each generates.

2.2.1   Pipelining

          Consider the single-cycle data-path [1] shown in Figure 2.1. The data-path shown here is
intended for the MIPS ISA [23], but is conceptually consistent with a RISC-V single-cycle data-path.
This data-path completes one instruction per cycle; instructions per cycle (IPC) is the measurement
of how many instructions complete in one clock cycle [2]. The clock signal drives the timing of state updates in
the data-path.

              Figure 2.1: MIPS ISA single-cycle data-path (from Harris and Harris [1]).

          Referring to Figure 2.1, the instruction completes its operation as it travels from left-to-
right in a single cycle. The data-path first retrieves the instruction from Instruction Memory. The
instruction operands are retrieved from the Register File or are decoded as an immediate value
from the instruction word. The Arithmetic Logic Unit (ALU) completes the operation for the given
operands. If the instruction is a data-transfer instruction, it can access memory at the Data Memory
module. Last, the instruction may write data to the Register File. In this design, writing to the Data
Memory module or to the Register File reveals new architectural state.


          Since these operations occur in one cycle, the clock speed is restricted by the time it takes
for all hardware modules in the data-path to complete their operations [1]. This time value is found by
calculating the sum of time required by each module. This design leaves much of the hardware idle
during a cycle as only one module is doing work at a given time. These performance flaws motivate
data-path pipelining [2] which improves the maximum clock speed and reduces the amount of idle
hardware. Mitigating these flaws increases instruction throughput and thus data-path performance.
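The clock-period arithmetic behind this comparison can be made concrete; the per-stage latencies below are made-up values for illustration only:

```python
# Hypothetical per-stage latencies in nanoseconds (made-up values).
stage_latency = {"fetch": 0.25, "decode": 0.15, "execute": 0.30,
                 "memory": 0.25, "writeback": 0.10}

single_cycle_period = sum(stage_latency.values())   # every module works in one long cycle
pipelined_period    = max(stage_latency.values())   # the slowest stage sets the clock

# 1.05 ns vs 0.30 ns: the pipelined clock runs 3.5x faster here, even though
# each instruction still passes through five (now shorter) cycles.
speedup = single_cycle_period / pipelined_period
```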

     Figure 2.2: Simple 5-stage RISC-V pipelined data-path (from Hennessy and Patterson [2]).

          The 5-stage pipelined data-path [2] is shown in Figure 2.2. Conceptually, the 5-stage
data-path separates each of the major hardware modules in the single-cycle data-path into separate
stages. These stages are Instruction Fetch, Instruction Decode, Execute, Memory, and Write-back.
Stages are separated by pipeline registers which store the state generated by the preceding stage in
one cycle and transmit that state to the subsequent stage in the following cycle.
          This provides a performance improvement over the single-cycle data-path in a few ways.
The clock speed is no longer restricted by the sum of all hardware latencies in the data-path. Instead,
the maximum clock speed is restricted by the latency of only the single slowest hardware module [2].
Therefore, the clock can run faster for the 5-stage data-path than for the single-cycle data-path.
While the 5-stage data-path does not improve IPC over the single-cycle data-path, the potential for
increasing the clock speed provides an instruction throughput improvement.
          Pipeline registers also mitigate the idle hardware problem present in the single-cycle
data-path, described above. Each stage in the 5-stage data-path can concurrently operate on a unique
instruction in the same clock cycle, rather than remaining idle. Therefore, multiple instructions
in the instruction stream can occupy the pipeline at the same time. An instruction does not create
architectural state until it writes to memory in the Memory stage or to the Register File in the Write-back stage. Until then, the
instruction is considered microarchitectural state which is stored by the pipeline registers.
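The concurrent occupancy of stages can be sketched cycle-by-cycle, assuming an idealized 5-stage pipeline with no stalls or flushes:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    """Return, per cycle, which instruction occupies each stage
    (idealized 5-stage pipeline: no stalls, no flushes)."""
    total_cycles = n_instructions + len(STAGES) - 1
    diagram = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            instr = cycle - s          # instruction i enters IF in cycle i
            if 0 <= instr < n_instructions:
                row[stage] = f"i{instr}"
        diagram.append(row)
    return diagram

# With 3 instructions, cycle 2 has i2 in IF, i1 in ID, and i0 in EX:
# three instructions occupy the pipeline concurrently.
```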


2.2.2   Handling Control-flow Instructions

          A challenge in instruction pipelining arises from control-flow instructions, which enable
non-linear execution. Here, we’ll focus on conditional branch instructions, like the beq instruction
from RISC-V [2].
          The 5-stage data-path resolves branches in the Execute stage. Branch resolution refers to
the decision to continue executing the current instruction stream or jump to a different instruction
stream, specified by the branch instruction. The decision is dictated by a comparison between the
branch instruction data operands. RISC-V provides a variety of conditional branch instructions, each
corresponding to a unique comparison type (e.g. is-equal, is-less-than) [2].
           The Execute stage operates on an instruction two cycles after the Instruction Fetch stage
dispatches the instruction to the pipeline. When a branch instruction resolves in the Execute stage,
two instructions which follow the branch in program order occupy the Instruction Fetch stage and the
Instruction Decode stage. No special handling is required in the case that the branch dictates execution
should continue down the current instruction stream. However, jumping to a new instruction stream
requires a few actions. The Instruction Fetch stage is directed to dispatch from the new instruction
stream. Also, the data-path must clear any microarchitectural state that should not continue to execute,
known as a pipeline flush [2]. In the case of the 5-stage pipeline, the instructions dispatched before
the branch is resolved are cleared. This results in a 2-cycle penalty as the cleared instructions, or
nops (no-operation) [2], do no useful work in the rest of the pipeline stages.
          This sequence of events is not unique to conditional branches and may also result from
microarchitectural interrupts. Interrupts signify a microarchitectural event that may also cause
non-linear execution [2]. Note that microarchitectural interrupts are not supported in the 5-stage
pipeline but are featured in more complex data-paths [14].
          Computer architects employ branch prediction [2] to mitigate the cycle penalty incurred
by control-flow instructions. Branch prediction uses heuristics based on prior execution to guess the
correct instruction path following a branch instruction. The Instruction Fetch stage will dispatch
instructions from the speculated instruction path. Note that the 5-stage data-path does not employ
branch prediction. By guessing the correct instruction path, the execution suffers no cycle penalty.
An incorrect guess forces the pipeline to flush the speculated instructions and incurs a cycle penalty.
The data structures facilitating branch prediction are classified as microarchitectural state.
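The throughput impact of the 2-cycle flush penalty can be estimated with a simple model; all parameters below are illustrative, not measured:

```python
def effective_cpi(branch_fraction, flush_fraction, flush_penalty=2, base_cpi=1.0):
    """Average cycles per instruction when some fraction of branches
    causes a pipeline flush (toy model of the 5-stage design above)."""
    return base_cpi + branch_fraction * flush_fraction * flush_penalty

# Assume 20% of instructions are branches. Treating every branch as paying
# the 2-cycle penalty gives an upper bound; a predictor that avoids the
# flush 90% of the time recovers most of the lost throughput.
no_prediction  = effective_cpi(0.20, 1.0)    # 1.40 CPI
with_predictor = effective_cpi(0.20, 0.10)   # 1.04 CPI
```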


2.2.3   Data Hazards in the 5-stage Pipeline

          In the case that a pipeline stage cannot complete the operation for the given instruction, the
stage will stall [2]. A pipeline stage stalls by holding the instruction for an extra cycle rather than
passing the instruction to the following stage. Instead, the pipeline stage passes a nop which incurs a
1-cycle penalty.
          In the 5-stage data-path, a data hazard [2] occurs when an instruction cannot advance
to the next pipeline stage as a result of a data dependency between two instructions. Consider a
sequence of two instructions where the second instruction uses the destination register of the first
instruction as a data operand. The Instruction Decode stage cannot complete the second instruction
until the result of the first instruction is written to the Register File. Therefore, the second instruction
cannot advance past Instruction Decode until the Write-Back stage completes the first instruction.
This results in a 2-cycle penalty for stalls resulting from the data hazard.
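Detecting this read-after-write dependency amounts to comparing one instruction's destination register against the next instruction's source registers, sketched here with toy instruction tuples:

```python
def has_raw_hazard(older, newer):
    """True if `newer` reads a register that `older` writes (read-after-write).
    Instructions are toy tuples: (mnemonic, dest_reg, src_regs). Register 0
    never creates a hazard, since x0 is hard-wired to zero in RISC-V."""
    dest = older[1]
    return dest != 0 and dest in newer[2]

add_ = ("add", 5, (1, 2))   # x5 = x1 + x2
sub_ = ("sub", 6, (5, 3))   # x6 = x5 - x3, depends on add's result
xor_ = ("xor", 7, (1, 2))   # independent of add

assert has_raw_hazard(add_, sub_)       # stall required
assert not has_raw_hazard(add_, xor_)   # no dependency, no stall
```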

2.2.4   Out-of-order Execution

          A microarchitecture may employ out-of-order execution, by which the order instructions
complete is not restricted to the order prescribed by the program [2]. In contrast, the 5-stage data-path
is an in-order execution design [2]. Out-of-order execution is intended to avoid cycle penalties
incurred by data hazards and structural hazards. Structural hazards occur when an instruction is
unable to advance due to insufficient microarchitectural resources [2]. To gain a better understanding
of the potential performance gain and the additional microarchitectural state generated by out-of-order
execution, consider the SonicBOOM [14] microarchitecture. The stages comprising SonicBOOM
are shown in Figure 4.1 which portrays a model of the microarchitecture.
          SonicBOOM [14] is a 10-stage out-of-order pipeline which utilizes features that are
common on modern mobile devices. As is the case in the 5-stage data-path, instructions are retrieved
from the Fetch stage and subsequently undergo instruction decoding. Instruction decoding is
facilitated by the Decode stage and the Register-rename stage, discussed further in Chapter 4. The
Register-rename stage delivers instructions to the Issue stage and to the Reorder-buffer (ROB) in
program order [24].
          The Issue stage is responsible for enforcing data dependency constraints between instruc-
tions and respecting the structural limits of the Execute stage [2]. The structural limits of the Execute
stage are imposed by the number of instructions, for each instruction type, on which the Execute stage
can operate concurrently. The Issue stage will dispatch an instruction once its data dependencies
have resolved and the Execute stage can accept it. In the case that the oldest instruction is not ready
to be dispatched, the Issue stage can dispatch a newer instruction if its constraints are satisfied. By
dispatching a newer instruction, the Issue stage causes out-of-order execution and avoids the
potential cycle penalty of waiting for the oldest instruction to become ready.
           Following the Issue stage, the Register-Read stage retrieves the potential data operands
from the register file and the Execute stage performs the instruction’s intended operation [2]. The
Execute stage notifies the ROB of an instruction’s completion and writes the instruction’s result to
the register-file.
           Running concurrently with the instruction path described above, the ROB preserves the
order of instructions in a queue and waits for instruction completion notifications from the Execute
stage. When the oldest instruction at the head of the ROB’s queue is complete, the ROB commits its
state by making the operation architecturally visible [2]. The head of the queue is then moved to the
next oldest instruction.
           Any data generated by an instruction which completes execution before the oldest instruc-
tion commits is microarchitectural state. This data includes results from operations occurring in the
Execute stage and data retrieval from memory.
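The in-order commit discipline described above can be sketched as a queue with per-entry completion flags. This is a simplified model of our own, not SonicBOOM's ROB; the names are hypothetical.

```go
package main

import "fmt"

// ROB models a reorder buffer as an in-order queue of instruction IDs with a
// completion flag per instruction (illustrative sketch only).
type ROB struct {
	ids  []int
	done map[int]bool
}

func NewROB(ids ...int) *ROB {
	return &ROB{ids: ids, done: map[int]bool{}}
}

// Complete records an out-of-order completion notification from Execute.
func (r *ROB) Complete(id int) { r.done[id] = true }

// Commit retires completed instructions from the head only, preserving
// program order, and returns the IDs it committed.
func (r *ROB) Commit() []int {
	var committed []int
	for len(r.ids) > 0 && r.done[r.ids[0]] {
		committed = append(committed, r.ids[0])
		r.ids = r.ids[1:]
	}
	return committed
}

func main() {
	rob := NewROB(1, 2, 3)
	rob.Complete(3)           // instruction 3 finishes first, out of order
	fmt.Println(rob.Commit()) // []: the head (1) is not complete, nothing retires
	rob.Complete(1)
	fmt.Println(rob.Commit()) // [1]: 3 still waits behind uncommitted 2
}
```

Until the head commits, instruction 3's result is exactly the microarchitectural state the surrounding text describes: produced, but not yet architecturally visible.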

2.2.5    Memory Hierarchy

           From the perspective of the RISC-V ISA, memory is a series of addressable registers. For
RV64I, this space is 2^64 bytes long [12]. This is a virtually infinite amount of space compared to the
32 architectural registers provided by RV64I. When those 32 registers are insufficient to track the
data required for program execution, the memory acts as the backing data structure to which register
values can be saved [2]. Conversely, the data-path can retrieve data values from the memory and
populate the registers for further execution.
           Designing fast memory faces performance challenges stemming from physical constraints
including implementation technology and its distance from the data-path on-chip [2]. These chal-
lenges motivate memory hierarchy designs in which memory is implemented in multiple levels [2].
Each level acts as the backing data structure to the levels below. The fastest memory module is
known as the level 1 (L1) cache and this represents the lowest level of the memory hierarchy. The
L1 cache is the fastest memory module because it is the physically closest memory module to the
data-path and uses the fastest memory technology [2]. However, the fastest memory technology is
also the most expensive which makes the L1 cache the smallest memory module in the hierarchy.


          Additional memory modules higher in the memory hierarchy act as the backing data
structure to the L1 cache in the event that its small size is insufficient for the workload. Moving
higher in the memory hierarchy correlates with a few memory characteristics including greater distance
from the data-path, larger size, cheaper technology, and slower access times [2]. Therefore, accessing
data that exists in the lowest level cache is faster than accessing data in higher level memory.
          Memory hierarchies are designed to exploit locality which describes the tendency of
programs to repeatedly access the same data [2]. Typically, the most recently used data is stored in
the L1 cache for fast access. When the program moves on to new operations on other data, the cache
may replace the old data with the new data and store the old data at a higher level in the memory
hierarchy. This is known as eviction [2]. Thus, accessing the new data is now a faster operation and
accessing the old data will take more time to retrieve from a higher level of memory.
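The eviction behavior just described can be sketched with a tiny least-recently-used (LRU) cache. This is our own fully-associative toy model, not a real L1 design (which would be set-associative with per-set replacement state).

```go
package main

import "fmt"

// LRUCache holds at most `capacity` addresses and evicts the least recently
// used one on a miss when full, mirroring the eviction described above.
type LRUCache struct {
	capacity int
	order    []int // most recently used last
}

// Access returns true on a hit. On a miss it inserts the address, evicting
// the least recently used address if the cache is full.
func (c *LRUCache) Access(addr int) bool {
	for i, a := range c.order {
		if a == addr { // hit: move to the most-recently-used position
			c.order = append(append(c.order[:i:i], c.order[i+1:]...), addr)
			return true
		}
	}
	if len(c.order) == c.capacity { // miss while full: evict LRU (front)
		c.order = c.order[1:]
	}
	c.order = append(c.order, addr)
	return false
}

func main() {
	c := &LRUCache{capacity: 2}
	c.Access(0x100)
	c.Access(0x200)
	fmt.Println(c.Access(0x100)) // true: still cached
	c.Access(0x300)              // evicts 0x200, the least recently used
	fmt.Println(c.Access(0x200)) // false: must be re-fetched from a higher level
}
```

The hit/miss distinction here is precisely what the timing side-channels discussed below measure: a hit is fast, a miss pays the higher-level access latency.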

2.2.6   Transient Instructions

          By combining out-of-order execution, instruction pipelining, and non-linear execution
instructions, a microarchitecture can facilitate transient execution [4]. A potentially transient
instruction is one that follows a non-linear execution instruction in Instruction Fetch order and is
completed by the Execute stage before that preceding instruction. Note that Instruction Fetch order
is not strictly equal to program order, as the branch predictor may incorrectly guess the instruction
path. Non-linear execution instructions include conditional branches (e.g. beq) and instructions
causing microarchitectural interrupts, like data-transfer instructions (e.g. lw) [24]. A potentially
transient instruction becomes transient when the non-linear execution instruction resolves and
indicates that it is not part of the correct instruction stream. The microarchitectural state of transient
instructions must be flushed from the pipeline and cannot be observable by the program.
          Here, we’ll discuss some of the known security vulnerabilities arising from the microarchi-
tectural performance features discussed above. Such security vulnerabilities are characterized by
covert data leakage between programs.

2.2.7   Side-channels

          A program may use a side-channel to infer microarchitectural behavior from architecturally
visible data [25]. For instance, timing side-channels [25] extract data by measuring the time for
architectural operations to complete. From this information, microarchitectural behavior can be
inferred. Researchers have developed attacks utilizing this mechanism which allow an attacker
program to infer victim program behavior [9, 10].
          Using Flush+Reload [9], an attacker program can extract keys from the RSA encryption
algorithm [26] running in a victim program. In one variant, the attacker loads the victim’s instructions
into its own memory. In this case, the victim binary is configured such that the attacker and victim
share the virtual memory storing the victim instructions. In one attack round, the attacker flushes the
virtual memory corresponding to the victim instructions of interest from the L1 cache. Specifically,
the instructions of interest are the functions for the RSA computation. The victim calls one of the
functions in this round which causes the L1 cache to re-fetch the instruction memory for that single
function. The attacker then times its own accesses to the functions of interest. All of the functions
except for the one accessed by the victim are stored beyond the L1 cache. Therefore, the fastest
access time corresponds to the function called by the victim. By tracking the RSA operations called
over multiple attack rounds, the attacker can extrapolate the encryption key.
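The final inference step of one attack round can be sketched as picking the candidate with the fastest reload. The Go snippet below is our own illustration of that comparison only; the function names and latency values are made up, and no actual flushing or timing is performed.

```go
package main

import "fmt"

// InferVictimAccess models the last step of a Flush+Reload round: given the
// attacker's measured reload latency (in cycles) for each candidate function,
// the candidate with the fastest reload is the one the victim executed,
// because only that function's instructions are back in the L1 cache.
func InferVictimAccess(latencies map[string]int) string {
	best, bestLat := "", int(^uint(0)>>1) // start at max int
	for name, lat := range latencies {
		if lat < bestLat {
			best, bestLat = name, lat
		}
	}
	return best
}

func main() {
	// Hypothetical measurements after one round: "square" reloads quickly
	// (L1 hit); the others come from higher levels of the memory hierarchy.
	round := map[string]int{"square": 40, "multiply": 220, "reduce": 210}
	fmt.Println(InferVictimAccess(round)) // square
}
```

Repeating this inference across rounds yields the sequence of RSA operations from which the key is extrapolated, as the text above describes.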

2.2.8   Transient Execution Side-channel Attacks

          Transient execution side-channel attacks exploit microarchitectural state generated by
transient instructions to extract data from victim programs [4, 5]. The microarchitectural state is
made architecturally visible using side-channels like Flush+Reload [9].
          One type of transient execution side-channel attack, known as a Spectre attack [4], exploits
transient instructions following conditional branches. In one Spectre variant, the attacker uses
transient load instructions to access protected victim data. Similar to Flush+Reload, this requires
the attacker and victim programs to share the virtual memory storing the victim’s protected data.
By using Flush+Reload, the attacker encodes the victim data as an index into a buffer in its own
memory space. This is done using a transient store instruction which attempts to store the victim
data to the attacker's local buffer. When the conditional branch is resolved, the transient instructions
are flushed and the attacker cannot read the victim data from its architectural registers or its local
buffer in memory. However, the attacker can time accesses to each element of the buffer and calculate the
victim data. Thus, the attacker has successfully read data from the victim program.


2.3    Microarchitectural Security Verification

          Side-channel vulnerabilities are difficult to recognize from the hardware description lan-
guage (HDL) [27] implementation of a microarchitecture. HDL refers to the declarative description
of the circuit comprising the microarchitecture. In some HDL implementations, each stage of
the microarchitecture is implemented as an independent module [24, 28]. Therefore, the complex
interactions between multiple modules exploited by side-channel attacks are not apparent in the
HDL.
          Design verification, utilizing both formal methods and unit testing, is a standard practice in
the microarchitecture design cycle for identifying and addressing hardware implementation bugs [29,
30]. For any new CPU design, a large part of the total engineering resources spent on developing
the CPU are devoted to design verification [29]. Deploying a functionally-correct microarchitecture
is critical for computer architects since, unlike software, hardware designs cannot be updated in
the field. This motivates a variety of formal verification approaches to microarchitecture security,
differing by abstraction level, user interface, and targeted security vulnerability class [31, 32, 33, 34].
          A formal approach to verification provides a mathematical proof of design constraints [35].
By using algorithms like satisfiability solvers (SAT-solvers) [36], all possible system states can be
enumerated and verified automatically. The exhaustive nature of this search makes formal methods
an appealing approach to security verification.

Chapter 3

Related Work

3.1     Performance Simulators

          Microarchitectural simulation in software provides a flexible environment for performance
evaluation and design iteration. Prior work presents many different simulation frameworks, targeting
a variety of simulated systems and simulation approaches. Of note, simulators provide runtime
performance optimizations by utilizing different host system features [37, 38, 39], the ability to
characterize interactions between multiple types of guest devices [40, 41], and novel user interfaces
for specification flexibility [18].
          ZSim [38] is intended to make many-core system simulation tractable. The researchers
note that simulating interactions between microarchitectural components, executed in parallel, is
expensive due to synchronization. They introduce a novel parallel execution algorithm to reduce lock
contention and use dynamic binary execution to improve simulation speed. Akita [19] addresses the
issue of parallel simulation synchronization through its inter-component communication protocol,
which limits the interactions between components to packet-based messaging. This reduces the shared
data structures for which components must contend.
          The promise of heterogeneous computing necessitates simulators capable of modeling such
systems. Multi2Sim [40] is a heterogeneous computing simulation framework, capable of modeling
interactions between CPUs and GPUs. Recent work [42] building on the highly flexible Gem5
simulator [18] also provides the ability to simulate CPU-GPU systems.
          EMSim [43] is a microarchitecture simulation framework that, in addition to microarchi-
tectural state, simulates electromagnetic leakage for each cycle. This allows researchers to analyze a
design for possible power side-channel attacks at design time.


3.1.1   Akita

          Akita is an event driven simulation framework, proven to be accurate and performant for
multi-GPU simulations [19]. The most basic simulated system, depicted in Figure 3.1, includes the
event engine, simulated modules, and a message connection.

                         Figure 3.1: High-level Akita component topology.

          Each event corresponds to a state update at a specified time, in a single component. The
event engine drives simulation by dispatching events according to the simulation time. Event
dispatching can be done sequentially, where events complete one at a time, or in parallel, where
events in the same time slot execute concurrently.
          During event execution, a component can update its internal state, schedule new events to
be executed later, or send messages to other components through a port. The port delivers a message
to a connection, which is a simulation component responsible for relaying messages to the intended
destination component. Upon receiving a message, the destination component schedules a future
event or immediately updates its internal state. Simulation execution continues until the event queue
is exhausted of future events to execute.
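The dispatch loop just described can be sketched as follows. This is our own minimal sequential engine, illustrating the concept only; Akita's actual API, types, and parallel dispatching differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Event carries a timestamp and a handler, which may schedule further events.
type Event struct {
	Time    float64
	Handler func(e *Engine)
}

// Engine dispatches queued events in time order until none remain.
type Engine struct {
	queue []Event
	Now   float64
	Trace []string // record of handler activity, for illustration
}

func (e *Engine) Schedule(ev Event) { e.queue = append(e.queue, ev) }

// Run repeatedly pops the earliest event and executes its handler.
func (e *Engine) Run() {
	for len(e.queue) > 0 {
		sort.Slice(e.queue, func(i, j int) bool { return e.queue[i].Time < e.queue[j].Time })
		ev := e.queue[0]
		e.queue = e.queue[1:]
		e.Now = ev.Time
		ev.Handler(e)
	}
}

func main() {
	engine := &Engine{}
	engine.Schedule(Event{Time: 2.0, Handler: func(e *Engine) {
		e.Trace = append(e.Trace, "update state")
	}})
	engine.Schedule(Event{Time: 1.0, Handler: func(e *Engine) {
		e.Trace = append(e.Trace, "send message")
		// A handler can schedule a follow-up event, e.g. message delivery
		// through a connection at a later simulated time.
		e.Schedule(Event{Time: 1.5, Handler: func(e *Engine) {
			e.Trace = append(e.Trace, "deliver message")
		}})
	}})
	engine.Run()
	fmt.Println(engine.Trace) // [send message deliver message update state]
}
```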

3.2     Formal Methods for Security Evaluation

          Coppelia [31] is a hardware security verification tool that analyzes design constraints at the
HDL level. The HDL implementation of a microarchitecture is translated to C++ using Verilator [44].
This translated code can be analyzed using symbolic execution, which identifies conditions that
violate design constraints. Coppelia will generate a series of instructions comprising the exploit that
violated a design constraint, which researchers can use to assess the microarchitecture’s security.
Coppelia aims to shorten the verification gap [45] by enabling exhaustive constraint verification
utilizing a common hardware description, originally not intended for that purpose.
          Information flow tracking [46] combines each hardware signal with a trust bit that indicates
whether the signal’s origin is trusted. If one of the input signals to a gate is untrusted, the output
of that gate is also untrusted. Signal trust propagates through all gates in the data path. This is an
exhaustive approach to security that guarantees the trustworthiness of all data. However, this comes at the
cost of additional hardware logic which hinders chip utilization and performance. VeriSketch [33]
uses information flow tracking and program sketching [47] to automatically synthesize HDL modules
according to a set of security constraints. The information flow tracking signals are used only to
verify security constraints during module synthesis and are not present in the final implementation.
As such, VeriSketch hardware implementations do not suffer the same performance penalties as
information flow tracking implementations.
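The trust-propagation rule described above — a gate's output is trusted only if every input is trusted — can be sketched for a single two-input gate. This is our own illustration; the type and field names are not from any of the cited tools.

```go
package main

import "fmt"

// Signal pairs a logic value with the trust bit used by information flow
// tracking: taint on any input taints the output.
type Signal struct {
	Value   bool
	Trusted bool
}

// And computes the gate's logic output and propagates trust conservatively.
func And(a, b Signal) Signal {
	return Signal{
		Value:   a.Value && b.Value,
		Trusted: a.Trusted && b.Trusted,
	}
}

func main() {
	trusted := Signal{Value: true, Trusted: true}
	tainted := Signal{Value: true, Trusted: false} // e.g. attacker-controlled input
	out := And(trusted, tainted)
	fmt.Println(out.Trusted) // false: taint propagates through the gate
}
```

Applying this rule at every gate is what makes the analysis exhaustive, and also what costs the extra hardware logic the text mentions.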

3.2.1   CheckMate

          This section will discuss the collection of work culminating in CheckMate [32], a formal
tool for detecting transient execution attacks at the microarchitectural level. The first work in
this collection, PipeCheck [48], is a formal verification tool for evaluating memory consistency
model implementations at the microarchitectural level. This functionality requires the adaptation of
happens-before graphs, from architectural-level description to microarchitectural-level description.
          A microarchitecture happens-before graph, shown in Figure 3.2, represents instruction
execution as a set of nodes. Each node is a pair consisting of an instruction and an execution event
which the instruction is completing. The graph is organized by columns of instructions, where the
program order executes from left to right, and rows of execution events, temporally ordered from top
to bottom. The directed graph edges describe inter-instruction and intra-instruction happens-before
relationships, where the source node occurs before the destination node.
          The user describes the temporal relationships of a happens-before graph as a set of first-
order relational constraints. In the case of PipeCheck, the user specifies constraints for the ISA’s
memory consistency model and for all instructions’ valid traversal through the data path. The SAT
solving engine can enumerate all possible executions for a combination of instructions and ensure that
no possible execution violates the constraints of the ISA. Subsequent work [49, 50, 51, 52] extends
this functionality to verify cache coherence models and memory operation ordering by software.
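A core realizability check in this style of analysis can be sketched as cycle detection: in PipeCheck-style reasoning, a happens-before graph with a cycle corresponds to an impossible execution, since some event would have to occur before itself. The snippet below is our own generic DFS sketch, not PipeCheck's implementation.

```go
package main

import "fmt"

// HasCycle reports whether a happens-before graph (adjacency list keyed by
// node name) contains a cycle. A cyclic graph corresponds to a contradictory
// ordering, i.e. an execution that cannot occur on the modeled hardware.
func HasCycle(edges map[string][]string) bool {
	const (
		unvisited = 0
		inStack   = 1
		done      = 2
	)
	state := map[string]int{}
	var visit func(n string) bool
	visit = func(n string) bool {
		state[n] = inStack
		for _, m := range edges[n] {
			if state[m] == inStack || (state[m] == unvisited && visit(m)) {
				return true
			}
		}
		state[n] = done
		return false
	}
	for n := range edges {
		if state[n] == unvisited && visit(n) {
			return true
		}
	}
	return false
}

func main() {
	// i1/i2 are instructions, A/B are execution events, as in Figure 3.2.
	acyclic := map[string][]string{
		"i1.A": {"i1.B", "i2.A"}, // intra- and inter-instruction edges
		"i2.A": {"i2.B"},
		"i1.B": {"i2.B"},
	}
	fmt.Println(HasCycle(acyclic)) // false: this ordering is realizable
	cyclic := map[string][]string{"i1.A": {"i2.A"}, "i2.A": {"i1.A"}}
	fmt.Println(HasCycle(cyclic)) // true: contradictory ordering
}
```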

Figure 3.2: A generalized happens-before graph depicting two instructions traversing three stages in
a microarchitecture.

          CheckMate [32] applies the same formal verification methods to side-channel attack
analysis. It provides a similar user interface to PipeCheck and asks the user to describe valid
instruction traversal and an exploit execution pattern as a set of model constraints. Like in PipeCheck,
the SAT solving engine enumerates all possible execution sequences and searches each sequence for
the exploit pattern. By finding the exploit pattern within a candidate execution sequence, the tool
indicates that the microarchitecture is vulnerable to the exploit.
          CheckMate will produce a security litmus test which is the sequence of instructions that
should induce the vulnerability on hardware. The work presenting CheckMate lists several exploits
to which a five-stage out-of-order pipeline is vulnerable, including Spectre, Meltdown, SpectrePrime,
and MeltdownPrime [53]. The SAT solver execution times for these exploits are shown to be tractable,
as the longest search required about 215 minutes to produce twelve litmus tests for Spectre.

3.3    Software Implementation Error Catching

          Modern programming languages feature tools for mitigating implementation errors be-
fore software deployment. Such tools include runtime environments [54] or compile-time static
analysis [55] that reduce the responsibility of the programmer to manage dynamically allocated
memory. Memory management errors cause the majority of security vulnerabilities in some common
software [56].
          Prior work has studied the implications of implementing a Unix-like kernel in the Go
programming language [54] and compared this implementation to the Linux kernel, written in the
C programming language [57]. Go provides a runtime environment that handles dynamic memory
operations. Researchers found that 40 security bugs in Linux do not exist in the Go implementation
for one of two reasons. Either the Go runtime environment would detect an out-of-bounds memory
access, resulting in a runtime error, or the automated garbage collector in the runtime environment
removes the possibility of use-after-free errors. Therefore, these security bugs could not be exploited
in the hypothetically deployed OS.
          Static analysis is a common software practice for identifying implementation errors at
compile time [58, 55]. Extensible static analyzers allow users to implement checkers that enforce
custom design constraints, such as security [59].
          The micro-kernel, seL4 [60], is an OS kernel designed and formally verified for information
flow security. The verification assumes the correctness of the host hardware and of the software
that facilitates the operation of the OS (e.g. compiler, boot loader). This highlights the importance
of verifiable secure hardware design, since insecure hardware may compromise the security of
higher-level applications.

Chapter 4

Yori

          This chapter presents the collection of tools developed to support Yori, our RISC-V
microarchitecture simulator. As with many simulator projects, development on these tools is ongoing.
This chapter focuses on the current state of the tools.

4.1     VEmu

          VEmu [61] is a RISC-V functional emulator capable of ISA feature testing and binary
debugging. VEmu supports execution for 64-bit RISC-V binaries containing instructions from the
integer and multiply instruction groupings (known as RV64IM). As a functional emulator, VEmu
tracks and updates only architectural state, including 32 architectural registers, control and status
registers (CSRs), and memory.
          VEmu provides user-mode emulation, where the guest program must assume it runs at
user-level hardware privilege. To support this, the emulator catches user-level interrupt instructions
and handles the OS functionality on the host system. This allows the emulator to catch the Linux
exit system call and end program emulation.
          VEmu leverages existing data structures developed for prior emulated memory implemen-
tations [62]. This is done as a development convenience and to facilitate planned simulation augmen-
tation features. Such features include simulation fast-forwarding [63] and check-pointing [63].

4.1.1   Verification and Performance

          Emulator correctness is verified using the ISA unit tests implemented by the RISC-V
community [3]. Each unit test is a RISC-V binary workload intended to stress test one instruction.
This is done by implementing code-blocks that execute the target instruction with typical operand
combinations and corner-case operand combinations. VEmu successfully executes the unit tests
for RV64IM instructions, which assume physically addressed memory. Note that VEmu does not
currently support the fence instruction.
          The RISC-V community also provides RISC-V micro-benchmarks that can be used to
characterize VEmu [3]. One such benchmark, quicksort, is compiled for RV64IM and sorts 2^20
integers. This workload completes after retiring 325,370,695 instructions. On a host system featuring
an AMD Ryzen 5 3600 CPU and 16 GB of RAM, the quicksort completes in 61.8 seconds. This
yields a simulated execution speed of over 5.2 MIPS.

4.2    Microarchitecture Simulation

          Transient execution side-channel attacks are the target vulnerability class for this research.
The reference microarchitecture that Yori implements must be susceptible to such attacks. As dis-
cussed in Chapter 2, SonicBOOM [14] implements performance features which enable transient
execution. This has allowed researchers to exploit Spectre vulnerabilities on simulated SonicBOOM
hardware [64]. Note that this research utilized a predecessor microarchitectural design to Sonic-
BOOM. This prior work, the public documentation [24], and open-source HDL implementation [65]
make SonicBOOM a suitable target for Yori and provide the information needed to implement a
simulated model of the microarchitecture.
          SonicBOOM allows users to configure microarchitectural parameters when synthesizing
the HDL. Yori targets the provided SmallBoomConfig [13] configuration. This configuration
prescribes a scalar pipeline, in which a single instruction per cycle traverses a pipeline stage. This
configuration also limits the size of hardware data structures to account for scalar execution.
          Yori uses the Akita framework [19] to drive simulation. Prior work has shown Akita to be
performant and scalable for multi-GPU simulations [19]. Akita’s event-driven execution provides a
logical interface to security verification. This will be discussed further in Section 4.3. Yori’s current
SonicBOOM implementation in the Akita framework is depicted in Figure 4.1.
          Yori implements the microarchitecture in three simulated components, including the
Instruction Fetch Unit (IFU), the Memory Unit, and the Execution Backend. Note that work to this
point has focused on a detailed implementation of the Execution Backend. The pipeline stages
comprising that component facilitate out-of-order execution and, therefore, transient execution.
The IFU and Memory Unit are currently abstract models of the reference systems in SonicBOOM.


                   Figure 4.1: A high-level view of Yori’s simulation components.

Structural features and the communication between components are presented in the following
sections.

4.2.1   Instruction Fetch Unit

            The IFU is responsible for dispatching instruction packets to the execution backend.
An instruction packet contains sequential instructions, in fetch order, and their corresponding
microarchitectural control signals. As Yori implements a scalar microarchitecture, one instruction
packet contains one instruction. The microarchitectural control signals are assigned by pipeline
stages to inform execution decisions made by succeeding stages.
            As an abstract model, the Yori IFU facilitates a subset of the functionality provided by
SonicBOOM’s IFU. The Yori IFU manages the program counter (PC) and the instruction cache
(ICache). The PC is a single memory address that corresponds to the address of the next instruction
to be dispatched by the IFU. The ICache is a 4-way set-associative cache.
            Each cycle, the IFU attempts to dispatch a new instruction packet corresponding to the
current PC. It first polls the ICache for the desired instruction. Each instruction in the ICache is a
32-bit value, in which architectural information for execution is encoded. The bit encoding for each
instruction is prescribed by the ISA.
            If the instruction is available in the ICache, the IFU generates a new instruction packet,
assigns the instruction to the packet, and dispatches the packet to the Execution Backend. Otherwise,
the IFU generates a read request for the Memory Unit, which will deliver a cache line of instructions,
beginning at the requested PC, at a later cycle. Upon receiving the read response from the Memory
Unit, the IFU can dispatch the instruction packet corresponding to the PC.
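The per-cycle decision just described can be sketched as follows. The structure and names below are ours for illustration, not Yori's actual implementation.

```go
package main

import "fmt"

// IFU holds the fetch state described above: the current PC and an ICache
// mapping addresses to 32-bit encoded instructions (toy model).
type IFU struct {
	PC     uint64
	ICache map[uint64]uint32
}

// Cycle returns the dispatched instruction and true on an ICache hit, or
// false when a read request to the Memory Unit must be generated instead.
func (f *IFU) Cycle() (uint32, bool) {
	inst, hit := f.ICache[f.PC]
	if !hit {
		return 0, false // miss: request the cache line from the Memory Unit
	}
	f.PC += 4 // scalar pipeline: one 32-bit instruction dispatched per cycle
	return inst, true
}

func main() {
	// 0x00000013 encodes addi x0, x0, 0, the canonical RISC-V nop.
	ifu := &IFU{PC: 0x80000000, ICache: map[uint64]uint32{0x80000000: 0x00000013}}
	inst, hit := ifu.Cycle()
	fmt.Printf("hit=%v inst=%#x next PC=%#x\n", hit, inst, ifu.PC)
	_, hit = ifu.Cycle() // next PC not cached: must fetch from the Memory Unit
	fmt.Println("hit =", hit)
}
```

On the miss path, a fuller model would block further dispatch until the Memory Unit's read response arrives, matching the stalling behavior described in the text.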
          The IFU may encounter back-pressure from the Execution Backend, where a succeeding
pipeline stage cannot complete an operation at the expected cycle. Thus, instruction packets cannot
continue to flow through the pipeline stages. This causes the IFU to stall until back-pressure is
relieved.
            The IFU may also receive branch resolution updates from the Execution Backend which
will set the PC to an address on a new instruction path. The IFU will then dispatch instructions from
the new PC.
            As an abstract model, the Yori IFU implements a subset of the features and data structures
employed by the SonicBOOM IFU. Importantly, the Yori IFU does not currently implement branch
prediction and instead assumes that the branch is not taken. This results in a large performance
penalty to workload execution.

4.2.2     Memory Unit

            The Memory Unit is responsible for facilitating memory read and write operations on
architecturally-visible memory. The abstract model is implemented by previous work [62]. The
Memory Unit may receive requests from the IFU or from the Execution Backend. It responds to
memory requests in one cycle.

4.2.3     Execution Backend Overview

          The Execution Backend Akita component comprises several pipeline stages. Each
stage is implemented as an independent sub-module of the Akita component. During execution, an
instruction, carried in an instruction packet, traverses all of the stages. The order of pipeline traversal is Decode,
Rename, ROB, Issue, Register-Read, and Execute. A pipeline stage may stall upon reaching a hazard
during execution. This will cause back-pressure on the instruction traversal path, which may cause
preceding stages to also stall.
            In addition to forward instruction traversal, Figure 4.2 highlights the additional communi-
cation between stages. The ROB and Execute stages must broadcast messages upon reaching a state
that requires microarchitectural state roll-back. Here, microarchitectural roll-back includes pipeline
flushing. The Execute stage also broadcasts the availability of stage-local hardware resources to
ensure the preceding stages dispatch instructions during the correct cycle. These communication
paths are necessary to ensure valid execution, but lead to software challenges that are discussed
further in Section 4.3. The following sections describe the Execution Backend stages in greater
detail.


   [Figure: stages Decode, Rename, ROB, Issue, Reg-Read, and Execute, connected by forward-progress,
   progress back-pressure, microarchitectural roll-back, and broadcast-availability paths]

            Figure 4.2: Communication flow between stages in the Execution Backend.

4.2.4   Decode

          Upon receiving an instruction packet from the IFU, the Decode stage extracts architectural
information from the 32-bit instruction and assigns additional microarchitectural control signals to
the instruction packet. Such architectural information includes logical (architectural) register indices
and static immediate data values. The microarchitectural control signals, referenced here, may be
used by later pipeline stages to guide execution.
          Microarchitectural control signals of note include the branch mask and branch tag. The
Decode stage assigns each unresolved in-flight conditional branch instruction a unique one-hot
encoded branch tag. The Decode stage accumulates the current branch tags in a stage-local branch
mask. In essence, the branch mask enumerates all branch instructions leading to the current instruction
path.
          The Decode stage assigns the current branch mask to each instruction it processes. This
allows succeeding stages to flush the instruction if a branch in the branch mask is resolved as
mispredicted.
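The branch-tag and branch-mask bookkeeping just described can be sketched with one-hot bit operations. This is our own simplified model (field and function names are hypothetical, and tag reclamation on resolution is only hinted at), not SonicBOOM's or Yori's implementation.

```go
package main

import "fmt"

// Decode models the stage-local branch bookkeeping described above: each
// in-flight conditional branch gets a unique one-hot tag, and the branch
// mask accumulates the tags of all unresolved branches on the current path.
type Decode struct {
	nextTag    uint32 // next free one-hot bit
	branchMask uint32 // OR of all outstanding branch tags
}

func NewDecode() *Decode { return &Decode{nextTag: 1} }

// AllocateBranchTag assigns a one-hot tag to a new conditional branch and
// folds it into the mask attached to younger instructions.
func (d *Decode) AllocateBranchTag() uint32 {
	tag := d.nextTag
	d.nextTag <<= 1
	d.branchMask |= tag
	return tag
}

// Resolve clears a branch's tag from the mask once the branch resolves.
func (d *Decode) Resolve(tag uint32) { d.branchMask &^= tag }

// MustFlush reports whether an instruction carrying `mask` depends on the
// mispredicted branch `tag` and must therefore be flushed.
func MustFlush(mask, tag uint32) bool { return mask&tag != 0 }

func main() {
	d := NewDecode()
	b1 := d.AllocateBranchTag() // first in-flight branch: tag 0b01
	b2 := d.AllocateBranchTag() // second in-flight branch: tag 0b10
	instMask := d.branchMask    // instruction decoded under both branches
	fmt.Println(MustFlush(instMask, b1), MustFlush(instMask, b2)) // true true
	d.Resolve(b1)                           // b1 resolves correctly predicted
	fmt.Printf("mask=%02b\n", d.branchMask) // mask=10
}
```

The one-hot encoding is what makes the misprediction check a single AND: any later stage can decide whether to flush an instruction without consulting a central structure.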

4.2.5   Register-rename

          The Register-rename stage manages the mapping between logical and physical registers.
Physical registers are microarchitectural registers that the pipeline uses in place of the architectural
registers. The Register-rename stage allocates physical registers to mitigate data hazards. Physical
registers are stored in the Register-read stage. The functionality of this stage is facilitated by three
hardware data structures. An instruction packet traverses the structures in the following order.


   1. Map-table - This structure tracks the current physical register to which each logical register
      corresponds. To each processed instruction packet, the Map-table adds the physical register
      indices corresponding to the logical register operands. The Map-table also attaches the stale
      physical destination register index, which is the current register mapping for the instruction’s
      destination register. This stale physical destination register was allocated to the last instruction
      that wrote to the same logical register.

   2. Free-list - This structure tracks the allocation state of all physical registers. Here, allocation
      state refers to the state of being allocated or unallocated. An allocated physical register
      corresponds to a logical register, while an unallocated physical register corresponds to no
      valid logical register. For each instruction, the Free-list allocates a physical register for the
      instruction’s logical destination register. The Free-list marks the physical register as allocated
      and attaches the register index to the instruction packet. Note that the instruction packet
      now contains a physical destination register index that is different from the stale destination
       register index. This assignment scheme avoids write-after-read (WAR) and write-after-write
       (WAW) [2] hazards between subsequent instructions. The Map-table is updated
      each cycle to reflect the register allocation. When the ROB notifies the Register-rename stage
      of a newly-committed instruction, the Free-list deallocates the corresponding stale physical
      destination register.

   3. Busy-table - A physical register is considered busy until the instruction writing to that register is
      completed by the Execute stage. The Busy-table tracks the busy state of each physical register.
      When processing an instruction, the Busy-table marks the destination register allocated by the
      Free-list as busy and attaches the busy state for the instruction’s operands to the instruction
      packet.
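The interplay of the three structures can be sketched behaviorally. In the following Python model, the register counts, method names, and packet fields are illustrative assumptions rather than Yori's implementation; it renames one instruction at a time and shows how allocating a fresh destination register removes WAW hazards:

```python
NUM_LREGS = 8    # assumed logical register count (RV32I has 32; shrunk here)
NUM_PREGS = 12   # assumed physical register count


class RenameStage:
    """Behavioral sketch of the Map-table, Free-list, and Busy-table."""

    def __init__(self):
        # Map-table: logical register -> current physical register
        self.map_table = list(range(NUM_LREGS))
        # Free-list: physical registers mapped to no logical register
        self.free_list = list(range(NUM_LREGS, NUM_PREGS))
        # Busy-table: True while the producing instruction is in flight
        self.busy = [False] * NUM_PREGS

    def rename(self, lrs1, lrs2, lrd):
        prs1, prs2 = self.map_table[lrs1], self.map_table[lrs2]
        stale_prd = self.map_table[lrd]  # old mapping, freed at commit
        prd = self.free_list.pop(0)      # allocate a fresh destination
        self.map_table[lrd] = prd
        self.busy[prd] = True
        return {"prs1": prs1, "prs2": prs2, "prd": prd,
                "stale_prd": stale_prd,
                "busy": (self.busy[prs1], self.busy[prs2])}

    def complete(self, prd):
        self.busy[prd] = False           # Execute wrote the register

    def commit(self, stale_prd):
        self.free_list.append(stale_prd)  # stale register becomes free


rs = RenameStage()
pkt1 = rs.rename(lrs1=1, lrs2=2, lrd=3)  # x3 <- x1 op x2
pkt2 = rs.rename(lrs1=3, lrs2=4, lrd=3)  # x3 <- x3 op x4 (WAW on x3)
# The two writes to x3 land in different physical registers: no WAW hazard.
# pkt2 also reads pkt1's destination, and its busy bit flags the RAW wait.
```

Note how `pkt2` sees `pkt1`'s freshly allocated destination as its source mapping, while the stale register is only returned to the Free-list at commit, which is what makes unwinding possible.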

          When the ROB attempts to commit an instruction that results in an exception, it triggers
rename state unwinding [24] in the Register-rename stage. During unwinding, the states of the
Map-table and Free-list are undone, one instruction at a time, in reverse Fetch order. Through this
process, the Register-rename stage reverts to a state which has only been modified by committed
instructions. From this state, the pipeline can successfully re-execute the instruction that encountered
an exception.
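State unwinding can be sketched as a reverse walk over the uncommitted instructions. In this hypothetical Python fragment, the packet fields `lrd`, `prd`, and `stale_prd` are illustrative names; each undo step restores the previous mapping and returns the allocated register to the Free-list:

```python
def unwind(map_table, free_list, inflight_packets):
    """Undo uncommitted renames in reverse Fetch order (illustrative)."""
    for pkt in reversed(inflight_packets):
        # Restore the mapping the instruction overwrote...
        map_table[pkt["lrd"]] = pkt["stale_prd"]
        # ...and return its allocated register to the head of the Free-list.
        free_list.insert(0, pkt["prd"])
    return map_table, free_list


# Suppose two uncommitted instructions renamed x2: first 2 -> 4, then 4 -> 5.
map_table = [0, 1, 5, 3]
free_list = []
mt, fl = unwind(map_table, free_list,
                [{"lrd": 2, "prd": 4, "stale_prd": 2},
                 {"lrd": 2, "prd": 5, "stale_prd": 4}])
# mt == [0, 1, 2, 3] and fl == [4, 5]: the rename state is as if neither
# instruction had been renamed.
```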


4.2.6   Reorder-Buffer

          The ROB preserves the Fetch order of instructions, as delivered by the Register-rename stage. The
data structure facilitating this functionality is a queue. Each queue slot stores the instruction packet,
its busy state, and its exception state. Once the instruction is stored, the ROB delivers the instruction
packet to the Issue stage.
          When the Execute stage completes an instruction, it notifies the ROB, which marks the
instruction as not-busy. The ROB commits an instruction once it reaches the queue head and is
marked not-busy, dequeuing it from the ROB. Because the queue head always holds the oldest
in-flight instruction, committing only from the head preserves program order.
          If the ROB tries to commit an instruction which generates an exception, the pipeline is
flushed and the Register-rename state is unwound. Execution is then restarted, beginning with the
instruction generating the exception.
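The commit discipline described above can be sketched with a simple queue model. In this Python sketch, the entry fields and method names are illustrative assumptions; an instruction commits only from the head, and only once completed, even when a younger instruction finishes first:

```python
from collections import deque


class ReorderBuffer:
    """Sketch of in-order commit from the ROB queue head (illustrative)."""

    def __init__(self):
        self.queue = deque()  # entry: packet, busy state, exception state

    def dispatch(self, packet):
        self.queue.append({"pkt": packet, "busy": True, "exception": False})

    def complete(self, packet, exception=False):
        for entry in self.queue:
            if entry["pkt"] == packet:
                entry["busy"] = False
                entry["exception"] = exception

    def try_commit(self):
        """Commit only the oldest entry, and only once it is not-busy."""
        if self.queue and not self.queue[0]["busy"]:
            entry = self.queue.popleft()
            if entry["exception"]:
                # exception at commit: flush the pipeline, unwind rename state
                return ("flush", entry["pkt"])
            return ("commit", entry["pkt"])
        return ("stall", None)


rob = ReorderBuffer()
rob.dispatch("i0")
rob.dispatch("i1")
rob.complete("i1")        # the younger instruction finishes first
res1 = rob.try_commit()   # i0 at the head is still busy: ('stall', None)
rob.complete("i0")
res2 = rob.try_commit()   # now the head commits: ('commit', 'i0')
```

Deferring the exception check to commit time is what guarantees that only architecturally executed instructions can raise an exception, since a flushed speculative path never reaches the head.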

4.2.7   Issue

          The Issue stage buffers instructions in a priority queue before dispatching them to the
Register-read stage. An instruction can be dispatched once two conditions are met. First, the
instruction's operands must not be busy, as dictated by the Busy-table of the Register-rename stage;
the Issue stage updates this operand state when the Execute stage completes an instruction.
          The second condition to be met before instruction dispatch is that the Execute stage must
have the hardware resources available to accept the instruction. The Execute stage broadcasts its
hardware availability to the Issue stage to facilitate the checking of this condition.
          Once these conditions are met, the Issue stage can dispatch the instruction. The Issue stage
favors program order by dispatching the oldest queued instruction that meets the conditions. Dispatch
can nonetheless occur out of program order when a younger instruction satisfies the conditions before
the oldest instruction does.
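The oldest-ready-first selection can be sketched as a linear scan of a program-ordered queue. In this Python fragment, the field names `srcs` and `fu`, along with the dictionary shapes, are illustrative assumptions; the first entry whose operands are ready and whose functional unit is free wins, even if older entries must keep waiting:

```python
def select_to_dispatch(issue_queue, busy, exec_ready):
    """Pick the oldest ready instruction (queue is in program order)."""
    for idx, inst in enumerate(issue_queue):
        operands_ready = not any(busy[r] for r in inst["srcs"])
        if operands_ready and exec_ready[inst["fu"]]:
            return issue_queue.pop(idx)
    return None  # nothing ready: stall dispatch this cycle


busy = {1: True, 2: False, 3: False}        # Busy-table snapshot
exec_ready = {"alu": True, "mul": True}     # Execute-stage availability
queue = [{"op": "add", "srcs": [1, 2], "fu": "alu"},  # oldest, operand busy
         {"op": "mul", "srcs": [2, 3], "fu": "mul"}]  # younger but ready
picked = select_to_dispatch(queue, busy, exec_ready)
# the younger mul dispatches first: out-of-order issue, while the add
# stays queued until physical register 1 becomes not-busy
```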

4.2.8   Register-read

          The Register-read stage contains the physical register file (PRF), composed of the
microarchitectural physical registers. An instruction reads its register operands from the PRF before
being processed by the Execute stage.
          Instructions completed by the Execute stage write their results back to the PRF.
