XBT: FPGA Accelerated Binary Translation

                              KE CHAI

  Submitted in Partial Fulfillment of the Requirements for the Degree of

                         Master of Science

             Thesis Advisor: Dr. Christos A. Papachristou

    Department of Electrical, Computer and Systems Engineering

           CASE WESTERN RESERVE UNIVERSITY

                              August, 2021
XBT: FPGA Accelerated Binary Translation
                              Case Western Reserve University
                              Case School of Graduate Studies

                              We hereby approve the thesis1 of

                                           Ke Chai

                                       for the degree of

                                     Master of Science

Christos A. Papachristou

Committee Chair, Advisor                                                            07/16/2021
Department of Electrical, Computer and Systems Engineering

Daniel G. Saab

Committee Member                                                                    07/16/2021
Department of Electrical, Computer and Systems Engineering

Seyed Hossein Miri Lavasani

Committee Member                                                                    07/16/2021
Department of Electrical, Computer and Systems Engineering

1 We certify that written approval has been obtained for any proprietary material contained therein.
Table of Contents

List of Tables

List of Figures

Acknowledgements

ABSTRACT

Chapter 1.   Introduction
  Background
  Motivation
  Contribution
  Outline

Chapter 2.   Literature Review
  Binary Translation
  Dynamic Binary Translation
  Hardware-Accelerated Binary Translation

Chapter 3.   Methodology
  Configuration blocks
  Translation Blocks
  Reallocation Registers
  Branch Offset Issues
  Unrecognized Instructions

Chapter 4.   Prototype Design
  Instruction Set
  System Design
  Microcode
  Translation Process
  Architecture Implementation

Chapter 5.   Results
  Design Reports
  Benchmark Technique
  Measurement of Speedup
  Results

Chapter 6.   Conclusions

Chapter 7.   Future Work

References
List of Tables

4.1   MIPS32 User App. Instructions
4.2   Description of XBT Blocks
4.3   IMB/AMB Microcode
4.4   Register Reallocation Example
4.5   Unfolding I-type Example
4.6   Unfolding Load/Store Example
4.7   Reordering Example
4.8   Complex Instruction Example
4.9   XBT Configuration Registers

5.1   Translation Time: BT vs XBT
List of Figures

3.1   XBT System Block Diagram
3.2   Address Mapping Flow

4.1   An XBT Configuration Instance
4.2   Zynq 7000 SoC [24]
4.3   Block Design of XBT in Vivado

5.1   Power Report
5.2   Utilization Report
5.3   Timing Report
Acknowledgements

   First, I want to thank my advisor Dr. Papachristou and Dr. Wolff. They have
generously provided me with their knowledge, experience and help. Without them,
this thesis would never have been finished.
   I also want to thank the committee members, who took the time to read this
thesis.
   Thanks to my parents, who gave me their constant support, both emotional and
financial, to pursue my degree.
   Last but not least, I want to thank my wife, who gave up her well-paying job and
followed me to America to take care of me. I really enjoyed her company and will
never forget how much she has sacrificed for me.
ABSTRACT

               XBT: FPGA Accelerated Binary Translation
                                     Ke Chai

Binary translation (BT) is the process of converting an executable binary from one
instruction set architecture (ISA) to another. Accelerated binary translation (XBT)
refers to BT that uses an FPGA for hardware acceleration and feeds the target
processor at-speed. This work proposes a reconfigurable pipelined structure built
on an FPGA that performs XBT between different ISAs. An XBT system that translates
MIPS to RISC-V is implemented and tested on the Xilinx Zynq platform. Results from
several benchmarks show a speedup of approximately 48 times compared to an
equivalent software approach.

1 Introduction

1.1 Background

Binary translation (BT) is the process of converting an executable binary from one
instruction set architecture (ISA) to another [19]. BT makes it possible to migrate
applications between two ISAs without the need for source code or recompilation
[8,9,26]. For example, a legacy MIPS program can be translated into an equivalent
RISC-V program using BT and run on a RISC-V processor. BT also serves as an
emulation method that performs better than conventional software-based
interpretation. Emulators such as QEMU use BT techniques for better performance [5].
BT is one way to achieve Architecture-Independent Computing (AIC), that is, the
ability to execute code of different ISAs on any machine [3].
   There are two main BT approaches: static binary translation (SBT) and dynamic
binary translation (DBT). SBT translates the whole binary before execution, while
DBT translates at runtime. Software DBT is more widely used for emulation because
it copes better with problems such as self-modifying code, but it usually performs
worse than SBT.

1.2 Motivation

Unlike a program originally built for the target ISA, a program binary-translated
from another ISA suffers a performance loss due to the differences between the
ISAs [25]. Since DBT systems translate code on the fly, the translation overhead is
also a key factor that affects performance [6,19]. Accelerating the translation
process is therefore an important part of improving the overall speed of DBT.
   FPGAs are widely used in applications that need flexible hardware acceleration,
such as AI and neural networks. FPGA fabrics are even embedded into systems-on-chip
(SoCs), where they have high-speed, high-bandwidth connections to the processors.
Pipelining on an FPGA provides overlapped parallelism for problems that process
large amounts of sequential data. Although less efficient than ASICs [17], FPGAs
offer flexibility that ASICs cannot provide. The FPGA's reprogrammability enables
the system to switch between different configurations at runtime.

1.3 Contribution

This work proposes a pipelined structure built on an FPGA that performs accelerated
binary translation (XBT). Using an FPGA makes better use of parallelism, which
improves performance. With the speedup brought by the FPGA fabric, the method can
efficiently generate semantically equivalent target code (i.e., the binary generated
by translation) from source code (i.e., the binary to be translated). In addition to
increasing translation speed, it also provides more flexibility at runtime.
   An XBT prototype that translates MIPS to RISC-V is presented in this work.
Several benchmarks are run on a Xilinx Zynq chip using both the XBT approach and a
software-based BT approach. A comparison of their translation speeds shows that XBT
achieves a substantial performance gain in the BT process.

1.4 Outline

Section 2 reviews related work and background on relevant BT topics.
   Section 3 describes the methodology of XBT and how XBT solves the key problems
that occur in the BT process.
   Section 4 presents a specific XBT prototype that translates MIPS to RISC-V,
together with details of the design.
   Section 5 gives the design reports, the benchmark method and the results of the
MIPS to RISC-V XBT on the Xilinx Zynq platform.
   Section 6 draws conclusions from the results.
   Section 7 discusses the shortcomings and future work.
2 Literature Review

2.1 Binary Translation

Sites et al. [22] described the concept of BT in a 1993 paper, which also presents
two binary translators targeting Alpha AXP computers. Altman et al. [2] introduced
BT as an effective way of porting code automatically without recompilation.
Cifuentes et al. [11] developed a reusable, component-based BT framework called
UQBT, which can be adapted easily and inexpensively to different source and target
machines. Further works [4,13,18,23] address the optimization of the BT process.
   To migrate legacy x86 applications to the newly designed M1 processor with the
ARM architecture, Apple developed a BT system named Rosetta 2 [16]. It uses a
static BT approach that translates before execution. However, it cannot translate
kernel extensions or virtual machine apps.

2.2 Dynamic Binary Translation

The concept of DBT dates back to a 1996 paper by Cifuentes et al. [10]. This paper
argues that dynamic binary translators can reach performance equal to static ones
while requiring a less complex runtime environment. It also presents a new
technique as a complement to a retargetable binary translator.
   Probst [19] gave the definition and usage of DBT in his 2002 paper. It presents
solutions to problems that occur in the DBT process, such as jump/branch offset
issues, register mapping and conditional bits. It also mentions the existence of a
translation cache.
   There are also works that use DBT for architectural emulation. Chapman et al. [7]
combined DBT and virtualization for cross-platform emulation. Their prototype,
named “MagiXen”, is an implementation of a Xen virtual machine monitor that can run
IA-32 virtual machines on Itanium platforms.
   DBT targeting VLIW machines has also been designed for static scheduling, which
can handle the trade-offs between performance and hardware complexity. Ebcioglu et
al. [1,12] proposed an architecture called DAISY (Dynamically Architected
Instruction Set from Yorktown), which uses DBT and VLIW machines to gain high
instruction-level parallelism with simpler hardware designs.

2.3 Hardware-Accelerated Binary Translation

There are existing works that involve hardware acceleration in the DBT process. Yao
et al. [25] propose an FPGA-based hardware-software co-designed DBT system from x86
to MIPS. A “CCflag” register and several user-defined instructions are added to the
MIPS processor core to resolve the problems caused by x86 conditional flags and
different byte order, i.e., endianness. To speed up translation, a jump address
look-up table (JLUT) is also implemented as part of the translator. Although it
involves an FPGA, this work does not exploit the FPGA's reconfigurability.
   Rokicki et al. [20] proposed a hardware-accelerated DBT that operates on MIPS
binaries and targets a custom VLIW core. A small single-issue processor is
dedicated to the DBT process, along with blocks designed with high-level synthesis
(HLS) technology. A more recent paper by Rokicki [21] extends this approach to
heterogeneous multi-core architectures to lower power consumption while maintaining
considerable performance.
3 Methodology

   The XBT system is implemented on the FPGA fabric as shown in Figure 3.1. The
green blocks are the configuration blocks, which manage and monitor the current
FPGA configuration. The blue blocks are the translation blocks, the main components
where XBT is performed. The processor can access the blocks in XBT at-speed through
AXI interfaces.
   The following sections discuss the functionality of each block and how the
blocks resolve problems in the BT process.

3.1 Configuration blocks

The profile monitor and configuration manager in Figure 3.1 are implemented for
flexible reconfiguration. The profile monitor collects and analyzes statistics of
the program currently being translated, and the configuration manager switches
between different configurations according to the program context. The FPGA uses
alternative configurations for common instruction flows. For example, an
application that does a lot of string processing will load the FPGA string flow
configuration. If the user application uses a lot of integer math, the FPGA will
load the integer math configuration, which is optimized for integer flow. Using
configurations optimized for different kinds of instruction flows can lower the
translation latency, which is essential for at-speed execution. Since the resources
on the FPGA are limited, it is not realistic to put all the configurations on the
FPGA at once. Furthermore, the circuit delay gets worse as the FPGA blocks grow
bigger. The configuration blocks can also manage the FPGA configurations and offer
choices among different trade-offs.

                             Figure 3.1. XBT System Block Diagram

3.2 Translation Blocks

The translation work is mainly performed by the instruction mapping block (IMB) in
Figure 3.1, which is divided into several pipeline stages. The pipeline stages may
vary depending on the source and target ISAs. The source code to be translated is
stored in the source buffer, and the target code is written into the target buffer
after the translation process. The address mapping block (AMB) and the address
mapping table (AMT) work together to derive and store the address mappings between
the source program counter (SPC) addresses and the target program counter (TPC)
addresses. As the number of instructions usually changes during translation, the
targets of branch instructions need to be adjusted. Deriving the mapping
information in the AMB and storing it in the AMT speeds up the branch offset
look-up during translation.
   The translation cache is designed to lower latency. During at-speed execution,
the processor fetches the same instructions repeatedly in contexts such as loops.
The translation results for these instructions can be kept so that they do not need
to be translated more than once. If a source instruction is in the translation
cache, the translated target instructions can be read directly from the cache
without going through the translation process again. The translated instructions
read from the translation cache can be executed directly at-speed on the target
processor.
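
   The behaviour of the translation cache can be sketched in software as follows.
This is only an illustrative model, not the hardware design: the class name, the
use of the SPC as the cache key and the placeholder translate function are all
assumptions made for the example.

```python
from typing import Callable, Dict, List


class TranslationCache:
    """Software model: maps a source PC (SPC) to already-translated target words."""

    def __init__(self, translate: Callable[[int, int], List[int]]) -> None:
        self._translate = translate          # slow path: the full translation
        self._entries: Dict[int, List[int]] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, spc: int, source_word: int) -> List[int]:
        cached = self._entries.get(spc)
        if cached is not None:               # already translated: reuse it
            self.hits += 1
            return cached
        self.misses += 1
        target_words = self._translate(spc, source_word)
        self._entries[spc] = target_words
        return target_words


def fake_translate(spc: int, source_word: int) -> List[int]:
    # Hypothetical placeholder for the IMB pipeline; NOT a real translation.
    return [source_word ^ 0xFFFFFFFF]


cache = TranslationCache(fake_translate)
loop_body = [(0xBFC00000, 0x00221820), (0xBFC00004, 0x01324020)]
for _ in range(3):                           # three iterations of a two-instruction loop
    for spc, word in loop_body:
        cache.lookup(spc, word)
print(cache.hits, cache.misses)
```

   In this model only the first loop iteration pays for translation. The sketch
ignores self-modifying code, which would require invalidating cache entries.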

3.3 Reallocation Registers

Due to the architectural differences between the two ISAs, one source instruction
may be translated into two or more target instructions. As a result, registers are
needed to pass intermediate results between these instructions. If these registers
are already occupied, their original values must be preserved and restored when
other instructions need them again. A small memory region called scratch pad
memory (SPM) is allocated to solve this register reallocation problem. The original
values of the registers can be written into specific locations in the SPM and
loaded back afterwards. The SPM can be a memory region allocated on the FPGA or in
the main memory, which is configured through the configuration blocks.
   Register reallocation is performed during the translation process by a
reallocation module in the IMB. It tracks the usage of each register and the
register values in the SPM. When reallocation is needed, it inserts load/store
instructions and changes the source and destination registers in the instructions.
The set of registers available for reallocation is configurable.
   In most RISC architectures, such as RISC-V and MIPS, memory is accessed by
load/store instructions, which form the effective address by adding an immediate
offset to a register value [15]. Therefore, another special register called the
scratch base register (SBR) is needed to access the SPM. The SBR stores the base
address of the SPM so that the SPM can be accessed by load/store instructions using
the SBR and an offset value. The scratch base register cannot serve as a
reallocation register.
   Since the SPM is needed at runtime, it should be initialized by software before
the program executes. The memory region for the SPM should be properly allocated,
and its base address should be stored in the scratch base register.
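
   The load/store insertion described above can be sketched as a small function
that wraps one three-operand instruction. The register numbers and SPM byte offsets
are read off Table 4.4 of this thesis; the function name and the string-based
output are assumptions made only for illustration, and the real module operates on
microcode rather than on assembly text.

```python
SBR = 27                                   # scratch base register used in Table 4.4
SPM_SLOT = {24: 4, 25: 8, 26: 12}          # SPM byte offsets as they appear in Table 4.4


def reallocate(op, rd, rs1, rs2):
    """Wrap `op rd, rs1, rs2` with SPM loads/stores for reallocated registers."""
    out = []
    for src in dict.fromkeys((rs1, rs2)):  # each source register once, in order
        if src in SPM_SLOT:                # original value lives in the SPM
            out.append(f"LW x{src}, {SPM_SLOT[src]}(x{SBR})")
    out.append(f"{op} x{rd}, x{rs1}, x{rs2}")
    if rd in SPM_SLOT:                     # write the new value back to the SPM
        out.append(f"SW x{rd}, {SPM_SLOT[rd]}(x{SBR})")
    return out


result = reallocate("ADD", 24, 25, 26)
for line in result:
    print(line)
```

   For the MIPS instruction ADD $24, $25, $26 this produces the same four-line
sequence as the RISC-V column of Table 4.4.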

3.4 Branch Offset Issues

As one source instruction may be translated into several target instructions, the
address offsets in the target branch instructions differ from the ones in the
source instructions [25]. To solve this issue, the AMB derives how many target
instructions will be generated from each source instruction and calculates the
corresponding TPC address of every SPC address. The TPC addresses are then stored
in the AMT.
   As shown in Figure 3.2, the AMT can be implemented as a block memory with a
write port connected to the AMB and a read port connected to the IMB. The SPC
values are truncated and used as the memory address to access the AMT, and the TPC
values are the data stored in this memory. At runtime, the IMB needs the TPC of the
branch target when it recognizes a branch instruction. The SPC of the branch target
is sent to the AMT, which returns the corresponding TPC. The new branch offset is
then calculated from the TPC of the branch target and the TPC of the branch
instruction itself.

                               Figure 3.2. Address Mapping Flow

   The AMB works simultaneously with the IMB and is expected to run faster than the
IMB. If the IMB queries an entry that the AMB has not yet written to the AMT, the
IMB stalls its pipeline and waits until the data are ready.
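
   The address mapping can be illustrated with a short software model. It assumes
4-byte instructions on both sides and uses a dictionary keyed by the full SPC in
place of the truncated-address block memory; the function names and the example
instruction counts are hypothetical.

```python
def build_amt(spc_base, tpc_base, counts):
    """Model of the AMB: counts[i] is the number of target instructions
    generated for the i-th source instruction."""
    amt = {}
    tpc = tpc_base
    for i, n in enumerate(counts):
        amt[spc_base + 4 * i] = tpc        # first target word of source instruction i
        tpc += 4 * n                       # running sum of emitted target words
    return amt


def branch_offset(amt, branch_tpc, target_spc):
    """New offset = TPC of the branch target minus TPC of the branch itself."""
    return amt[target_spc] - branch_tpc


# Four source instructions that expand to 1, 3, 1 and 2 target instructions.
amt = build_amt(0xBFC00000, 0xC0000000, [1, 3, 1, 2])
# A branch whose translation sits at the TPC mapped from SPC 0xBFC0000C
# and whose target is the source instruction at SPC 0xBFC00004.
offset = branch_offset(amt, amt[0xBFC0000C], 0xBFC00004)
print(hex(amt[0xBFC00008]), offset)
```

   The model shows why source offsets cannot be copied: the source branch spans two
instructions backwards (8 bytes), while the translated branch spans 16 bytes.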

3.5 Unrecognized Instructions

In practice, some instructions cannot be translated simply, or cannot even be
recognized. Depending on the source and target ISAs, these may include reserved
instructions, privileged instructions and some instructions with complicated
operations. Such instructions are translated into a system call instruction or a
software interrupt. Parameters, including the original instruction binary, are
passed in specific registers so that a software handler can decide whether the
instruction is valid and perform a software translation on the fly.
4 Prototype Design

   A prototype of XBT from MIPS to RISC-V has been designed and implemented. Both
belong to the RISC family, so MIPS and RISC-V have similar instruction set
compositions, while differing in encoding formats and data representation. Using
them as the source and target of XBT illustrates how the key problems of BT are
solved on an FPGA while keeping the complexity of the design low.

4.1 Instruction Set

For user applications, only the user application instruction set needs to be
handled by XBT. It is not necessary to implement the operating system or privileged
supervisor instructions. The user instructions of the MIPS32 Release 1 ISA [14] are
selected as the source to be translated, as shown in Table 4.1. They mainly
comprise the arithmetic/logic instructions, the memory access instructions, the
unconditional jump instructions and the conditional branch instructions. These
instructions are translated to the RISC-V base instruction set, i.e., “RV32I”.

4.2 System Design

The detailed design is shown in Figure 4.1. The IMB and AMB are implemented in a
pipelined fashion. Some blocks with the same name in the IMB and AMB, for example
“Fetch” and “Decode”, are different hardware instances of the same design. The
detailed function of each pipeline stage is listed in Table 4.2.

                            Table 4.1. MIPS32 User App. Instructions

          Category     Instructions
          I-type A/L   ADDI ADDIU SLTI SLTIU ANDI ORI XORI LUI
          R-type A/L   SLL SRL SRA SLLV SRLV SRAV ADD ADDU SUB
                       SUBU AND OR XOR NOR SLT SLTU
          Load/Store   LB LBU LH LHU LW SB SH SW
          Jump         J JAL JR JALR
          Branch       BEQ BNE BLTZ BGEZ BLEZ BGTZ

                           Figure 4.1. An XBT Configuration Instance

                             Table 4.2. Description of XBT Blocks

    Pipeline Stage            Description
    Fetch                     Fetch instruction from MIPS code buffer
    Decode                    Decode MIPS instructions into microcode
    Instruction Reorder       Exchange every branch instruction with its delay slot
                              instruction
    Register Reallocation     Insert load/store instructions to gain/recover the
                              original value of the reallocated registers when needed
    Unfold Values             Extend the I-type instructions with immediate value
                              wider than 12 bits into more instructions
    Address Mapping           Map the branch / jump target SPC to TPC according to
                              the AMT
    Instruction Equivalency   Translate MIPS microcode into one or more equivalent
                              RISC-V microcode
    Code Mapping              Generate RISC-V instructions from microcode
    Address Reorder           Derives the number of RISC-V instructions a MIPS
                              instruction will be translated into, and writes the AMT

   As shown in Figure 4.1, the IMB and AMB fetch MIPS instructions simultaneously
from the source buffer. Instructions are passed on to the next stage every clock
cycle, so when one of the IMB stages produces more than one instruction, the stages
before it are stalled. In contrast, the AMB has no stalling stages, so it fetches
more instructions than the IMB in the same period of time.

4.3 Microcode

Instead of passing on the raw MIPS instruction code, the pipeline stages use
microcode to represent MIPS or RISC-V instructions. Each MIPS instruction is
decoded into MIPS microcode in the “Decode” stage and mapped to one or more RISC-V
microcode words in the “Instruction Equivalency” stage. Transforming the
instructions into microcode represents the data more clearly and provides more
information. Pipeline stages can decide how a microcode word should be processed
according to its “family” and “type” fields, which simplifies the circuit and
accelerates the process. Table 4.3 shows how some microcode fields are represented.
For example, a MIPS “ADD $3, $1, $2” instruction is classified into family “R-type”
with code 0x3 and type “ADD” with code 0x6. Assuming its PC value is 0xbfc00000,
its microcode word is the concatenation {0xbfc00000, 0x3, 0x6, 0x1, 0x2, 0x3, 0x0}.
                                 Table 4.3. IMB/AMB Microcode
                        Variable                Width (bits)
                        Source PC               32
                        Instruction Family      4
                        Instruction Type        4
                        Source Register 1       5
                        Source Register 2       5
                        Destination Register    5
                        Extended Immediate      32
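
   The field widths in Table 4.3 add up to 87 bits. A possible software view of
such a microcode word is sketched below. The field order follows the example
{0xbfc00000, 0x3, 0x6, 0x1, 0x2, 0x3, 0x0}; placing the first field in the most
significant bits is an assumption, as are the short field names.

```python
FIELDS = [                 # (name, width in bits), most significant field first
    ("spc", 32), ("family", 4), ("type", 4),
    ("rs1", 5), ("rs2", 5), ("rd", 5), ("imm", 32),
]
TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(values):
    """Concatenate the fields into one integer microcode word."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | v
    return word


def unpack(word):
    """Split a microcode word back into its fields."""
    values = {}
    for name, width in reversed(FIELDS):   # peel fields off the low end
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


# MIPS "ADD $3, $1, $2" at PC 0xbfc00000, using the codes given in the text.
add_uc = {"spc": 0xBFC00000, "family": 0x3, "type": 0x6,
          "rs1": 1, "rs2": 2, "rd": 3, "imm": 0}
packed = pack(add_uc)
print(TOTAL_BITS, unpack(packed) == add_uc)
```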

4.4 Translation Process

Some examples are shown to illustrate how MIPS instructions are translated to
RISC-V instructions. Both ISAs belong to the RISC family and are very similar, so
in most cases MIPS instructions can be translated one-to-one to RISC-V instructions
[22]. However, there are special cases in which one source instruction translates
to multiple target instructions. Examples of these special cases are shown below.

4.4.1 Register Reallocation

An example is shown in Table 4.4. In this example, register 27 is used as the SPM
base register, and registers 24, 25 and 26 are reallocation registers. When their
original values are needed, they are loaded from the SPM, and they are written back
after the operation.

                           Table 4.4. Register Reallocation Example

       MIPS Code MIPS Assembly      RISC-V Code          RISC-V Assembly
       0x033ac020 ADD $24, $25, $26 0x008dac83           LW x25, 8(x27)
                                    0x00cdad03           LW x26, 12(x27)
                                    0x01ac8c33           ADD x24, x25, x26
                                    0x018da223           SW x24, 4(x27)

4.4.2 Unfolding Instructions with Large Offset

MIPS I-type instructions have a 16-bit immediate field, while RISC-V I-type
instructions have only 12 bits. As a result, immediate values wider than 12 bits
need to be stored in a reallocated register, and the I-type instruction is
translated to the corresponding R-type instruction. Note that in MIPS the immediate
field is zero-extended for logic instructions and sign-extended for arithmetic
instructions, while in RISC-V it is always sign-extended. Some examples are shown
in Table 4.5.
                             Table 4.5. Unfolding I-type Example

      MIPS Code MIPS Assembly        RISC-V Code          RISC-V Assembly
      0x20217fff ADDI $1, $1, 0x7FFF 0x00008c37           LUI x24, 0x8
                                     0xfffc0c13           ADDI x24,x24, -1
                                     0x018080b3           ADD x1, x1, x24
      0x28628000 SLTI $2, $3, 0x8000 0xffff8c37           LUI x24, 0xffff8
                                     0x0181a133           SLT x2, x3, x24
      0x30217fff ANDI $1, $1, 0x7fff 0x00008c37           LUI x24, 0x8
                                     0xfffc0c13           ADDI x24,x24, -1
                                     0x0180f0b3           AND x1, x1, x24
      0x34628000 ORI $2, $3, 0x8000  0x00008c37           LUI x24, 0x8
                                     0x0181e133           OR x2, x3, x24

   Load and store instructions have the same problem as the I-type arithmetic
instructions, but they are adjusted slightly differently because a load/store
instruction has a base register field. Examples are shown in Table 4.6.
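
   The unfolding of I-type instructions in Table 4.5 can be reproduced with the
sketch below. It uses the standard RV32I encodings for LUI, ADDI and the R-type
format and assumes x24 as the temporary register, as in the table. The function
names are hypothetical, and for simplicity the sketch always unfolds, even when the
immediate would fit in 12 bits.

```python
TMP = 24                                    # temporary register used in Table 4.5


def r_type(funct3, rd, rs1, rs2, funct7=0):
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | 0x33


def addi(rd, rs1, imm12):
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13


def lui(rd, imm20):
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | 0x37


def load_const(value32):
    """LUI (plus ADDI if needed) that places a 32-bit constant in TMP."""
    value32 &= 0xFFFFFFFF
    lo = value32 & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                        # ADDI sign-extends its 12-bit immediate
    hi = ((value32 - lo) >> 12) & 0xFFFFF   # compensate for a negative low part
    words = [lui(TMP, hi)]
    if lo:
        words.append(addi(TMP, TMP, lo))
    return words


def unfold(funct3, rd, rs1, imm16, sign_extend):
    """Translate a MIPS I-type operation into LUI/ADDI plus an R-type operation."""
    imm = imm16 & 0xFFFF
    if sign_extend and imm & 0x8000:        # arithmetic: sign-extend; logic: zero-extend
        imm |= 0xFFFF0000
    return load_const(imm) + [r_type(funct3, rd, rs1, TMP)]


ADD, SLT, OR, AND = 0b000, 0b010, 0b110, 0b111   # RV32I funct3 codes
for words in (unfold(ADD, 1, 1, 0x7FFF, True), unfold(SLT, 2, 3, 0x8000, True),
              unfold(AND, 1, 1, 0x7FFF, False), unfold(OR, 2, 3, 0x8000, False)):
    print([f"0x{w:08x}" for w in words])
```

   The printed words agree with the RISC-V code column of Table 4.5. The
0x7FFF case needs LUI 0x8 followed by ADDI -1 because the low 12 bits (0xFFF) would
otherwise be sign-extended by ADDI.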

                            Table 4.6. Unfolding Load/Store Example

      MIPS Code MIPS Assembly       RISC-V Code            RISC-V Assembly
      0x8c627ffc LW $2, 0x7ffc($3)  0x00008c37             LUI x24, 0x8
                                    0x003c0c33             ADD x24, x24, x3
                                    0xffcc2103             LW x2, -4(x24)
      0xac248000 SW $4, -0x8000($1) 0xffff8c37             LUI x24, 0xffff8
                                    0x001c0c33             ADD x24, x24, x1
                                    0x004c2023             SW x4, 0(x24)

4.4.3 Reordering

The MIPS architecture uses branch delay slots, so an adjustment is needed when
translating to an ISA without branch delay slots, such as RISC-V. Assuming that
there is no data hazard between the branch instruction and the delay slot
instruction, an effective way to solve this problem is to exchange the order of the
two instructions, as shown in Table 4.7. Note that the offset in a MIPS branch
instruction is added to the PC of the delay slot instruction to form the target,
which must also be taken into account.
                                 Table 4.7. Reordering Example

       MIPS Code  MIPS Assembly       RISC-V Code  RISC-V Assembly
       0x1022ffff BEQ $1, $2, 0xffff  0x01248433   ADD x8, x9, x18
       0x01324020 ADD $8, $9, $18     0x00208063   BEQ x1, x2, 0x0
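
   The reordering step and the MIPS branch target rule can be sketched as follows.
Instructions are modelled as (mnemonic, operands) tuples purely for illustration.
The sketch swaps only the conditional branches of Table 4.1 and, like the text,
assumes there is no data hazard between a branch and its delay slot.

```python
BRANCH_OPS = {"BEQ", "BNE", "BLTZ", "BGEZ", "BLEZ", "BGTZ"}


def reorder(instrs):
    """Swap each branch with the delay-slot instruction that follows it."""
    out = list(instrs)
    i = 0
    while i + 1 < len(out):
        if out[i][0] in BRANCH_OPS:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2                          # skip the pair that was just handled
        else:
            i += 1
    return out


def mips_branch_target(branch_pc, offset16):
    """Target = PC of the delay slot + sign-extended 16-bit offset * 4."""
    off = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    return (branch_pc + 4 + (off << 2)) & 0xFFFFFFFF


mips = [("BEQ", "$1, $2, 0xffff"), ("ADD", "$8, $9, $18")]
swapped = reorder(mips)
target = mips_branch_target(0xBFC00000, 0xFFFF)
print(swapped, hex(target))
```

   With an offset field of 0xffff (that is, -1), the target is the address of the
branch instruction itself, because the offset is applied to the delay slot PC.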

4.4.4 Instruction Equivalency

As part of the architectural heterogeneity, not every instruction in the source ISA
has an equivalent in the target ISA. This problem is mostly solved by using two or
more instructions to perform the equivalent operation. Instructions that cannot be
translated simply are taken over by the system software. To do this, these
instructions are translated into a series of instructions that performs a system
call with parameters. The system call invokes the system software to translate and
execute them. In the case shown in Table 4.8, the “JR $31” instruction jumps to the
address held in register 31. It is usually used to return from a subprogram. Since
the jump target is known only at runtime, the source binary code of the instruction
is passed to the system software as a parameter in register 24.
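
   The fallback sequence of Table 4.8 can be generated with a few lines of code.
The sketch below loads the 32-bit source instruction word into x24 with LUI and
ADDI and then issues ECALL, using the standard RV32I encodings. The function name
is hypothetical, and the sketch always emits the ADDI, even when the low 12 bits
are zero.

```python
ECALL = 0x00000073


def fallback_sequence(mips_word, tmp=24):
    """LUI/ADDI that place the source instruction word in `tmp`, then ECALL."""
    lo = mips_word & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                        # ADDI sign-extends its immediate
    hi = ((mips_word - lo) >> 12) & 0xFFFFF
    lui = (hi << 12) | (tmp << 7) | 0x37
    addi = ((lo & 0xFFF) << 20) | (tmp << 15) | (tmp << 7) | 0x13
    return [lui, addi, ECALL]


jr_ra = fallback_sequence(0x03E00008)       # MIPS "JR $31"
print([f"0x{w:08x}" for w in jr_ra])
```

   The output matches the three RISC-V words listed for “JR $31” in Table 4.8.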

                           Table 4.8. Complex Instruction Example

       MIPS Code MIPS Assembly RISC-V Code             RISC-V Assembly
       0x03e00008 JR $31       0x03e00c37              LUI x24, 0x3e00
                               0x008c0c13              ADDI x24, x24, 0x8
                               0x00000073              ECALL

4.5 Architecture Implementation

4.5.1 Target Platform

The XBT from MIPS to RISC-V is implemented on an FPGA development board named
PYNQ-Z2. PYNQ, which stands for “Python Productivity for Zynq”, is an easy-to-use
FPGA board with a Xilinx Zynq-7000 series system-on-chip (SoC) [24]. As shown in
Figure 4.2, a Zynq chip contains a dual-core ARM Cortex-A9 processor and an FPGA
fabric. They are connected through high-speed AXI interfaces.

                                 Figure 4.2. Zynq 7000 SoC [24]

4.5.2 Detailed Design

The tool used for simulation, synthesis and implementation is Xilinx Vivado. The
block design diagram produced by Vivado is shown in Figure 4.3. The source code is
written into the XBT by the Zynq processing system via AXI buses and AXI
SmartConnect, and the translated target code is read out the same way. To maximize
performance while avoiding timing violations, the clock frequency is set to 100 MHz
according to the post-implementation timing report.

                           Figure 4.3. Block Design of XBT in Vivado

   The address mapping and configuration registers (CR) are shown in Table 4.9. The
instructions to be translated are first written into the Source Code Buffer area.
The SPC, the TPC and the source code length must also be set by writing the
corresponding CR fields. After that, writing a nonzero value into the Start/Finish
Register (SFR) starts the XBT process. The SFR reads 1 if the translation process
is done, otherwise 0. After translation, the target code can be read from the
Target Code Buffer memory space, and the TPC corresponding to each SPC can be read
from the AMT area. This completes a whole translation process.
   An example is given as follows. A MIPS bubble sort program has 1159 instructions
and starts at address 0xbfc00000. To translate it into a RISC-V program at address
0xc0000000, the program is first copied into the address space starting at offset
0x0. To configure the XBT, the value 0xbfc00000 is written to register 0x10004, the
value 1159 to register 0x10008, and the value 0xc0000000 to register 0x1000C.
Writing 0x1 into the SFR at offset 0x10000 then starts the translation. The SFR
reads 1 after the translation has finished, and 0 otherwise. The target code can
then be read, or executed at speed, from the Target Code Buffer starting at
offset 0x8000, and the AMT is available for look-up at offset 0x4000.

                          Table 4.9. XBT Configuration Registers

        Offset (Range)    Size    R/W   Description
        0x0000 - 0x3FFF   16KB    RW    Source Code Buffer (MIPS)
        0x4000 - 0x7FFF   16KB    RW    Address Mapping Table (AMT)
        0x8000 - 0xFFFF   32KB    RW    Target Code Buffer (RISC-V)
        0x10000           32bit   RW    Start/Finish Register (SFR)
        0x10004           32bit   RW    Source PC base
        0x10008           32bit   RW    Source code length
        0x1000C           32bit   RW    Target PC base
        0x10010           32bit   R     Target code length
        0x10014           32bit   RW    Scratch base register config.
        0x10018           32bit   RW    Reallocation register 1 config.
        0x1001C           32bit   RW    Reallocation register 2 config.
        0x10020           32bit   RW    Reallocation register 3 config.
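
The configuration sequence for the bubble sort example can be sketched on the host side as follows. The offsets come from Table 4.9; everything else is an assumption for illustration: `FakeXbt` is a hypothetical stand-in that models only the register interface (it performs no translation and reports completion immediately), and `start_translation` is a made-up helper name.

```python
# Offsets from Table 4.9, relative to the XBT base address.
SRC_BUF = 0x0000       # Source Code Buffer (MIPS)
SRC_BUF_SIZE = 0x4000  # 16 KB
SFR = 0x10000          # Start/Finish Register
SPC_BASE = 0x10004     # Source PC base
SRC_LEN = 0x10008      # Source code length (an instruction count in the example)
TPC_BASE = 0x1000C     # Target PC base


class FakeXbt:
    """Hypothetical stand-in for the memory-mapped XBT; it translates nothing."""

    def __init__(self):
        self.regs = {}

    def write(self, offset, value):
        self.regs[offset] = value & 0xFFFFFFFF
        if offset == SFR and value != 0:
            # Assumed model: translation completes at once, so SFR reads 1.
            self.regs[SFR] = 1

    def read(self, offset):
        return self.regs.get(offset, 0)


def start_translation(dev, mips_words, spc, tpc, max_polls=1_000_000):
    """Load the source code, set the CR fields, start the XBT, wait for SFR == 1."""
    if 4 * len(mips_words) > SRC_BUF_SIZE:
        raise ValueError("program does not fit in the 16 KB source buffer")
    for i, word in enumerate(mips_words):
        dev.write(SRC_BUF + 4 * i, word)
    dev.write(SPC_BASE, spc)
    dev.write(SRC_LEN, len(mips_words))
    dev.write(TPC_BASE, tpc)
    dev.write(SFR, 1)  # any nonzero value starts the translation
    for _ in range(max_polls):
        if dev.read(SFR) == 1:
            return
    raise TimeoutError("SFR never read 1")


dev = FakeXbt()
# Placeholder program: 1159 copies of "JR $31", not the real bubble sort binary.
start_translation(dev, [0x03E00008] * 1159, spc=0xBFC00000, tpc=0xC0000000)
```

On the board itself, any object exposing the same `read(offset)` and `write(offset, value)` methods could take the place of `FakeXbt`; the base address would depend on the Vivado address map, which is not given here.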
5 Results

5.1 Design Reports

The design reports of the whole implemented design on the Zynq platform are
shown in Figures 5.1, 5.2 and 5.3. The reports show that the design meets all
constraints.

                                   Figure 5.1. Power Report


                                Figure 5.2. Utilization Report

                                 Figure 5.3. Timing Report

5.2 Benchmark Technique

To benchmark the XBT translation time, a software binary translator is used for
comparison. Both translators run on the Xilinx Zynq platform with the same ARM
processor. Several testbenches compiled into MIPS binary format are used to test
their performance. The goal of the benchmark is to measure how many times faster
the XBT system is than a software BT system.

5.3 Measurement of Speedup

The terms $T_{BT}$ and $T_{XBT}$ denote the translation times of the software
BT and the XBT, respectively. The translation speeds $S_{BT}$ and $S_{XBT}$ can
be calculated by (5.1) and (5.2):

$$S_{BT} = \frac{N_{src}}{T_{BT}} \tag{5.1}$$

$$S_{XBT} = \frac{N_{src}}{T_{XBT}} \tag{5.2}$$

   where $N_{src}$ is the number of instructions to be translated. Using (5.1)
and (5.2), the speedup can be calculated by (5.3):

$$\mathrm{Speedup} = \frac{S_{XBT}}{S_{BT}} = \frac{T_{BT}}{T_{XBT}} \tag{5.3}$$
   Different approaches are used to measure $T_{BT}$ and $T_{XBT}$, which are
calculated by (5.4) and (5.5):

$$T_{BT} = \frac{T_{end} - T_{begin}}{1000} \tag{5.4}$$

$$T_{XBT} = \frac{N_{cycle}}{f_{clk}} \tag{5.5}$$

   In (5.4), $T_{begin}$ and $T_{end}$ are time values in milliseconds obtained
from the clock() function declared in the C header time.h. $T_{begin}$ is the
start time and $T_{end}$ is the end time. To improve the timing accuracy, the
program was run for 1000 iterations and the execution time averaged, expressed
in microseconds. The time spent on file I/O and initialization is deliberately
excluded.
   In (5.5), $N_{cycle}$ is the number of clock cycles taken by the whole
translation process, and $f_{clk}$ is the clock frequency. In this case,
$f_{clk}$ = 100 MHz.
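
The two measurement approaches in (5.4) and (5.5) can be sketched as follows. This is a Python analogue under stated assumptions, not the C harness used for the measurements: `time.process_time()` stands in for the C `clock()` call, the function names are hypothetical, and the cycle count of 2464 is back-computed from the 24.64 µs Bubble Sort entry in Table 5.1 rather than a figure reported in the text.

```python
import time


def average_time_us(fn, iterations=1000):
    """Average CPU time of fn() over many iterations, in microseconds."""
    start = time.process_time()
    for _ in range(iterations):
        fn()
    end = time.process_time()
    return (end - start) / iterations * 1e6


def xbt_time_us(n_cycle, f_clk_hz=100e6):
    """Equation (5.5): cycle count divided by clock frequency, in microseconds."""
    return n_cycle / f_clk_hz * 1e6


def speedup(t_bt_us, t_xbt_us):
    """Equation (5.3): ratio of software BT time to XBT time."""
    return t_bt_us / t_xbt_us


print(round(xbt_time_us(2464), 2))  # 24.64
```

Only the code inside `fn` is timed, which mirrors the exclusion of file I/O and initialization from the software measurement.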

5.4 Results

The translation time and speedup data are shown in Table 5.1. All measurements
are made on the same Zynq chip. The speedups of the six testbenches vary
slightly, possibly because the programs contain different numbers of each
type of instruction.
   The average speedup is given in the last row. It shows that the XBT approach
is approximately 48 times faster than an equivalent software BT approach.

                        Table 5.1. Translation Time: BT vs XBT
          Bench         Nsrc   TBT (µs)   TXBT (µs)   Speedup
          Dhrystone     1711   1821.53    40.77       44.68
          Bubble Sort   1159   1229.34    24.64       49.89
          Select Sort   1123   1192.28    23.90       49.89
          Quick Sort    2555   2720.59    59.12       46.02
          SHA           3091   3320.32    68.48       48.49
          CRC32         1439   1553.18    31.67       49.04
          Average                                     48.00
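
The Speedup column and the average in Table 5.1 follow directly from (5.3). A minimal sketch that recomputes them from the tabulated times is shown below; the variable and function names are illustrative.

```python
# (benchmark, N_src, T_BT in microseconds, T_XBT in microseconds), from Table 5.1
ROWS = [
    ("Dhrystone", 1711, 1821.53, 40.77),
    ("Bubble Sort", 1159, 1229.34, 24.64),
    ("Select Sort", 1123, 1192.28, 23.90),
    ("Quick Sort", 2555, 2720.59, 59.12),
    ("SHA", 3091, 3320.32, 68.48),
    ("CRC32", 1439, 1553.18, 31.67),
]


def speedups(rows):
    """Equation (5.3) applied to each row: T_BT / T_XBT."""
    return {name: t_bt / t_xbt for name, _, t_bt, t_xbt in rows}


per_bench = speedups(ROWS)
average = sum(per_bench.values()) / len(per_bench)

for name, value in per_bench.items():
    print(f"{name:12s} {value:6.2f}")
print(f"{'Average':12s} {average:6.2f}")
```

Rounded to two decimals, the recomputed values agree with the table, including the average of 48.00.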
6 Conclusions

   This work proposed the concept and methodology of the FPGA-based XBT
approach. A prototype of the XBT was designed, implemented and tested. It shows
an average speedup of 48 times over a traditional software BT approach across
several different testbenches.
   The XBT provides greater performance and better flexibility than traditional
software-based BT. The performance gain comes from the parallelism brought by
the pipeline structure. An FPGA also has an advantage over software on tasks
dominated by bitwise operations, such as instruction decoding.
   In addition to the performance gain, the flexibility of the XBT can also be
very useful. FPGA fabrics are increasingly embedded in SoCs, a technique called
embedded FPGA (eFPGA). eFPGAs can be programmed at runtime, enabling the system
to switch between different translation configurations when it needs to run
applications of different ISAs.
   With XBT and at-speed execution, we can expect less performance loss when an
application is migrated to another ISA and has to be translated. Furthermore, we
can expect greater performance in system emulation with XBT, which can narrow
the architectural gap between ISAs.

7 Future Work

   The XBT methodology and prototype proposed in this work are still far from
practical use. This work only considers statically linked programs as the source
program. However, dynamically linked programs are more common in modern systems
and should be taken into consideration. Adjustments and optimizations are needed
to detect and translate dynamic link libraries. Cache coherency problems could
also occur when the XBT works with multiple target processors, and further
measures are needed to solve them.
   Due to limited time and capability, this work did not involve a real RISC-V
processor to test the validity and efficiency of executing the translated code
at speed. This is truly a pity. It is hoped that future work will bring XBT
into practice and adapt it to different source and target ISAs.

References

 [1] Erik R Altman and Kemal Ebcioglu. Full system binary translation: Risc to vliw.
     IBM, Yorktown Heights, NY, Tech. Rep. RC23262, 2000.

 [2] Erik R Altman, David Kaeli, and Yaron Sheffer. Welcome to the opportunities
     of binary translation. Computer, 33(3):40–45, 2000.

 [3] Marc Angelone. Approaches for Universal Static Binary Translation. PhD the-
     sis, Citeseer, 2006.

 [4] Sorav Bansal and Alex Aiken. Binary translation using peephole superopti-
     mizers. In OSDI, volume 8, pages 177–192, 2008.

 [5] Fabrice Bellard. Qemu, a fast and portable dynamic translator. In USENIX
     annual technical conference, FREENIX Track, volume 41, page 46. California,
     USA, 2005.

 [6] Edson Borin and Youfeng Wu. Characterization of dbt overhead. In 2009 IEEE
     International Symposium on Workload Characterization (IISWC), pages 178–
     187. IEEE, 2009.

 [7] Matthew Chapman, Daniel J Magenheimer, and Parthasarathy Ranganathan.
     Magixen: Combining binary translation and virtualization. HP Enterprise Sys-
     tems and Software Laboratory, pages 1–15, 2007.

 [8] Jiunn-Yeu Chen, Wuu Yang, Tzu-Han Hung, Hong-Men Su, and Wei-Chung
     Hsu. A static binary translator for efficient migration of arm-based applica-
     tions. In Workshop on Optimizations for DSP and Embedded Systems, pages
     36–39. Citeseer, 2008.

 [9] Anton Chernoff, Mark Herdeg, Ray Hookway, Chris Reeve, Norman Rubin,
     Tony Tye, S Bharadwaj Yadavalli, and John Yates. Fx! 32: A profile-directed
     binary translator. IEEE Micro, 18(02):56–64, 1998.

[10] Cristina Cifuentes and Vishv M Malhotra. Binary translation: Static, dynamic,
     retargetable? In icsm, volume 96, pages 340–349, 1996.

[11] Cristina Cifuentes and Mike Van Emmerik. Uqbt: Adaptable binary transla-
     tion at low cost. Computer, 33(3):60–66, 2000.


[12] Kemal Ebcioglu, Erik Altman, Michael Gschwind, and Sumedh Sathaye. Dy-
     namic binary translation and optimization. IEEE Transactions on Computers,
     50(6):529–548, 2001.

[13] Byron Hawkins, Brian Demsky, Derek Bruening, and Qin Zhao. Optimizing
     binary translation of dynamically generated code. In 2015 IEEE/ACM Interna-
     tional Symposium on Code Generation and Optimization (CGO), pages 68–78.
     IEEE, 2015.

[14] MIPS Technologies Inc. MIPS® Architecture For Programmers Volume I-A:
     Introduction to the MIPS32® Architecture, 2011. Revision 3.02.

[15] MIPS Technologies Inc. MIPS® Architecture For Programmers Volume II-A:
     The MIPS32® Instruction Set, 2011. Revision 3.02.

[16] Apple Insider. Rosetta 2. Website. https://appleinsider.com/inside/
     rosetta-2 Accessed Jun 14, 2021.

[17] Ian Kuon and Jonathan Rose. Measuring the gap between fpgas and asics.
     IEEE Transactions on computer-aided design of integrated circuits and systems,
     26(2):203–215, 2007.

[18] Mathias Payer and Thomas Gross. Fast binary translation: Translation effi-
     ciency and runtime efficiency. In 2nd Workshop on Architectural and Microar-
     chitectural Support for Binary Translation (AMAS-BT’09), Austin, Texas, USA,
     2009.

[19] Mark Probst. Dynamic binary translation. In UKUUG Linux Developer’s Con-
     ference, volume 2002, 2002.

[20] Simon Rokicki, Erven Rohou, and Steven Derrien. Hardware-accelerated dy-
     namic binary translation. In Design, Automation & Test in Europe Conference
     & Exhibition (DATE), 2017, pages 1062–1067. IEEE, 2017.

[21] Simon Rokicki, Erven Rohou, and Steven Derrien. Hybrid-dbt: Hard-
     ware/software dynamic binary translation targeting vliw. IEEE Transactions
     on Computer-Aided Design of Integrated Circuits and Systems, 38(10):1872–
     1885, 2018.

[22] Richard L Sites, Anton Chernoff, Matthew B Kirk, Maurice P Marks, and Scott G
     Robinson. Binary translation. Communications of the ACM, 36(2):69–81, 1993.

[23] Matthew Smithson, Khaled ElWazeer, Kapil Anand, Aparna Kotha, and Rajeev
     Barua. Static binary rewriting without supplemental information: Overcom-
     ing the tradeoff between coverage and correctness. In 2013 20th Working Con-
     ference on Reverse Engineering (WCRE), pages 52–61. IEEE, 2013.

[24] Xilinx. Python productivity for Zynq (Pynq) Documentation, 2020. Release 2.5.

[25] Yuan Yao, Zhongyong Lu, Qingsong Shi, and Wenzhi Chen. Fpga based
     hardware-software co-designed dynamic binary translation system. In 2013
     23rd International Conference on Field programmable Logic and Applications,
     pages 1–4. IEEE, 2013.

[26] Cindy Zheng and Carol Thompson. Pa-risc to ia-64: Transparent execution,
     no recompilation. Computer, 33(3):47–52, 2000.