Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset

Appears in Computer Architecture News (CAN), September 2005

Milo M. K. Martin1, Daniel J. Sorin2, Bradford M. Beckmann3, Michael R. Marty3, Min Xu4, Alaa R. Alameldeen3, Kevin E. Moore3, Mark D. Hill3,4, and David A. Wood3,4

http://www.cs.wisc.edu/gems/

1 Computer and Information Sciences Dept., Univ. of Pennsylvania
2 Electrical and Computer Engineering Dept., Duke Univ.
3 Computer Sciences Dept., Univ. of Wisconsin-Madison
4 Electrical and Computer Engineering Dept., Univ. of Wisconsin-Madison

Abstract

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under the GNU GPL [9].

1 Introduction

Simulation is one of the most important techniques used by computer architects to evaluate their innovations. Not only does the target machine need to be simulated in sufficient detail, but it must also be driven with a realistic workload. For example, SimpleScalar [4] has been widely used in the architectural research community to evaluate new ideas. However, it and other similar simulators run only user-mode, single-threaded workloads such as the SPEC CPU benchmarks [24]. Many designers are interested in multiprocessor systems that run more complicated, multithreaded workloads such as databases, web servers, and parallel scientific codes. These workloads depend upon many operating system services (e.g., I/O, synchronization, thread scheduling and migration). Furthermore, as the single-chip microprocessor evolves into a chip multiprocessor, the ability to simulate these machines running realistic multithreaded workloads is paramount to continued innovation in architecture research.

Simulation Challenges. Creating a timing simulator for evaluating multiprocessor systems with workloads that require operating system support is difficult. First, creating even a functional simulator, which provides no modeling of timing, is a substantial task. Providing sufficient functional fidelity to boot an unmodified operating system requires implementing supervisor instructions and interfacing with functional models of many I/O devices. Such simulators are called full-system simulators [14, 20]. Second, creating a detailed timing simulator that executes only user-level code is a substantial undertaking, although the wide availability of such tools reduces redundant effort. Finally, creating a simulation toolset that supports both full-system and timing simulation is substantially more complicated than either endeavor alone.

Our Approach: Decoupled Functionality and Timing Simulation. To address these simulation challenges, we designed a modular simulation infrastructure (GEMS) that decouples simulation functionality and timing. To expedite simulator development, we used Simics [14], a full-system functional simulator, as a foundation on which various timing simulation modules can be dynamically loaded. By decoupling functionality and timing simulation in GEMS, we leverage both the efficiency and the robustness of a functional simulator. Using a modular design provides the flexibility to simulate various system components at different levels of detail.

[Figure 1 diagram: four request generators (a random tester exercising contended locks, microbenchmarks for basic verification, Simics, and the Opal detailed processor model) drive Ruby, the memory system simulator, which comprises the coherence controllers (M/S/I), caches and memory, and the interconnection network.]

Figure 1. A view of the GEMS architecture: Ruby, our memory simulator, can be driven by one of four memory system request generators.

While some researchers have used the approach of adding full-system simulation capabilities to existing user-level-only timing simulators [6, 21], we adopted the approach of leveraging an existing full-system functional simulation environment (an approach also used by other simulators [7, 22]). This strategy enabled us to begin workload development and characterization in parallel with development of the timing modules. It also allowed us to perform initial evaluations of our research ideas using a more approximate processor model while the more detailed processor model was still in development.

We use the timing-first simulation approach [18], in which we decouple the functional and timing aspects of the simulator. Since Simics is robust enough to boot an unmodified OS, we use its functional simulation to avoid implementing rare but effort-consuming instructions in the timing simulator. Our timing modules interact with Simics to determine when Simics should execute an instruction; however, the result of executing the instruction is ultimately dependent on Simics. Such a decoupling allows the timing models to focus on the most common 99.9% of all dynamic instructions. This task is much easier than requiring a monolithic simulator to model the timing and correctness of every function in all aspects of full-system simulation. For example, a small mistake in handling an I/O request or in modeling a special case in floating-point arithmetic is unlikely to cause a significant change in timing fidelity. However, such a mistake will likely affect functional fidelity, which may prevent the simulation from continuing to execute. By allowing Simics to always determine the result of execution, the program always continues to execute correctly.

Our approach is different from trace-driven simulation. Although our approach decouples functional simulation and timing simulation, the functional simulator is still affected by the timing simulator, allowing the system to capture timing-dependent effects. For example, the timing model will determine the winner when two processors try to acquire the same software lock in memory. Since the timing simulator determines when the functional simulator advances, such timing-dependent effects are captured.
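This control relationship, in which the timing model decides when each processor's functional execution advances while the functional simulator alone produces instruction results, can be sketched as follows. The classes below are hypothetical stand-ins, not Simics' or GEMS' actual APIs.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Simics (functional) and a timing model.
// The functional simulator owns architectural state and is the sole
// authority on instruction results.
struct FunctionalSim {
    std::vector<uint64_t> pc;            // one program counter per processor
    explicit FunctionalSim(int ncpus) : pc(ncpus, 0) {}
    void step(int cpu) { pc[cpu] += 4; } // execute exactly one instruction
};

// The timing model never computes results; it only decides *when*
// each processor may advance (here: a trivial per-processor latency).
struct TimingModel {
    std::vector<int> latency;            // cycles between retirements
    explicit TimingModel(std::vector<int> l) : latency(std::move(l)) {}
    // Returns the processor that retires in this cycle, or -1 for none.
    int retiring(int cycle) {
        for (int cpu = 0; cpu < (int)latency.size(); ++cpu)
            if (cycle % latency[cpu] == 0) return cpu;
        return -1;
    }
};

// Drive the functional simulator under timing-model control: the
// interleaving of instructions now follows simulated timing.
void simulate(FunctionalSim& fsim, TimingModel& tmodel, int cycles) {
    for (int cycle = 1; cycle <= cycles; ++cycle) {
        int cpu = tmodel.retiring(cycle);
        if (cpu >= 0) fsim.step(cpu);    // advance only when timing allows
    }
}
```

Because the timing model picks the interleaving, races such as two processors competing for a lock resolve according to simulated timing rather than any fixed functional-simulator ordering.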


In contrast, trace-driven simulation fails to capture these important effects. In the limit, if our timing simulator were 100% functionally correct, it would always agree with the functional simulator, making the functional simulation redundant. Such an approach allows for “correctness tuning” during simulator development.

Design Goals. As the GEMS simulation system has primarily been used to study cache-coherent shared memory systems (both on-chip and off-chip) and related issues, those aspects of GEMS release 1.0 are the most detailed and the most flexible. For example, we model the transient states of cache coherence protocols in great detail. However, the tools have a more approximate timing model for the interconnection network, a simplified DRAM subsystem, and a simple I/O timing model. Although we do include a detailed model of a modern dynamically-scheduled processor, our goal was to provide a realistic driver for evaluating the memory system. Therefore, our processor model may lack some details, and it may lack the flexibility needed for certain detailed microarchitectural experiments.

Availability. The first release of GEMS is available at http://www.cs.wisc.edu/gems/. GEMS is open-source software licensed under the GNU GPL [9]. However, GEMS relies on Virtutech's Simics, a commercial product, for full-system functional simulation. At this time, Virtutech provides evaluation licenses for academic users at no charge. More information about Simics can be found at http://www.virtutech.com/.

The remainder of this paper provides an overview of GEMS (Section 2) and then describes the two main pieces of GEMS: the multiprocessor memory system timing simulator Ruby (Section 3), which includes the SLICC domain-specific language for specifying cache-coherence protocols and systems, and the detailed microarchitectural processor timing model Opal (Section 4). Section 5 discusses some constraints and caveats of GEMS, and we conclude in Section 6.

2 GEMS Overview

The heart of GEMS is the Ruby memory system simulator. As illustrated in Figure 1, GEMS provides multiple drivers that can serve as a source of memory operation requests to Ruby:

1) Random tester module: The simplest driver of Ruby is a random testing module used to stress test the corner cases of the memory system. It uses false sharing and action/check pairs to detect many possible memory system and coherence errors and race conditions [25]. Several features are available in Ruby to help debug the modeled system, including deadlock detection and protocol tracing.

2) Micro-benchmark module: This driver supports various micro-benchmarks through a common interface. The module can be used for basic timing verification, as well as detailed performance analysis of specific conditions (e.g., lock contention or widely-shared data).

3) Simics: This driver uses Simics' functional simulator to approximate a simple in-order processor with no pipeline stalls. Simics passes all load, store, and instruction fetch requests to Ruby, which performs the first-level cache access to determine if the operation hits or misses in the primary cache. On a hit, Simics continues executing instructions, switching between processors in a multiple-processor setting. On a miss, Ruby stalls Simics' request from the issuing processor and then simulates the cache miss. Each processor can have only a single miss outstanding, but contention and other timing effects among the processors will determine when the request completes. By controlling the timing of when Simics advances, Ruby determines the timing-dependent functional simulation in Simics (e.g., which processor next acquires a memory block).

4) Opal: This driver models a dynamically-scheduled SPARC v9 processor and uses Simics to verify its functional correctness. Opal (previously known as TFSim [18]) is described in more detail in Section 4.

The first two drivers are part of a stand-alone executable that is independent of Simics or any actual simulated program. In addition, Ruby is specifically designed to support additional drivers (beyond the four mentioned above) using a well-defined interface.
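Such a driver interface can be illustrated with a toy request/callback contract. The names below and the fixed miss latency are hypothetical; Ruby's real interface and timing model are considerably richer.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical sketch of a Ruby-style driver contract: any request
// generator (random tester, microbenchmark, Simics, Opal) issues
// memory operations and receives a completion callback.
struct MemRequest {
    int proc;          // issuing processor
    uint64_t addr;     // block address
    bool is_store;
};

class MemorySimulator {
public:
    std::function<void(const MemRequest&)> on_complete;  // driver callback
    static constexpr int kMissLatency = 10;              // toy fixed latency

    void issue(const MemRequest& r) {
        pending_.push({cycle_ + kMissLatency, r});
    }

    void tick() {  // advance one cycle; fire completions that are due
        ++cycle_;
        while (!pending_.empty() && pending_.front().first <= cycle_) {
            MemRequest r = pending_.front().second;
            pending_.pop();
            on_complete(r);
        }
    }

private:
    std::queue<std::pair<int, MemRequest>> pending_;  // (finish cycle, request)
    int cycle_ = 0;
};
```

A Simics-style driver would stall the issuing processor between `issue` and the completion callback, which is how the one-outstanding-miss-per-processor rule described above would be enforced.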


GEMS' modular design provides significant simulator configuration flexibility. For example, our memory system simulator is independent of our out-of-order processor simulator. A researcher can obtain preliminary results for a memory system enhancement using the simple in-order processor model provided by Simics, which runs much faster than Opal. Based on these preliminary results, the researcher can then determine whether the accompanying processor enhancement should be implemented and simulated in the detailed out-of-order simulator.

GEMS also provides flexibility in specifying many different cache coherence protocols that can be simulated by our timing simulator. We separated the protocol-dependent details from the protocol-independent system components and mechanisms. To facilitate specifying different protocols and systems, we proposed and implemented the protocol specification language SLICC (Section 3.2). In the next two sections, we describe our two main simulation modules: Ruby and Opal.

3 Multiprocessor Memory System (Ruby)

Ruby is a timing simulator of a multiprocessor memory system that models caches, cache controllers, the system interconnect, memory controllers, and banks of main memory. Ruby combines hard-coded timing simulation for components that are largely independent of the cache coherence protocol (e.g., the interconnection network) with the ability to specify the protocol-dependent components (e.g., cache controllers) in a domain-specific language called SLICC (Specification Language for Implementing Cache Coherence).

Implementation. Ruby is implemented in C++ and uses a queue-driven event model to simulate timing. Components communicate using message buffers of varying latency and bandwidth, and the component at the receiving end of a buffer is scheduled to wake up when the next message will be available to be read from that buffer. Although many buffers are used in a strictly first-in-first-out (FIFO) manner, the buffers are not restricted to FIFO-only behavior. The simulation proceeds by invoking the wakeup method for the next scheduled event on the event queue. Conceptually, the simulation would be identical if all components were woken up each cycle; thus, the event queue can be thought of as an optimization that avoids unnecessary processing during each cycle.

3.1 Protocol-Independent Components

The protocol-independent components of Ruby include the interconnection network, cache arrays, memory arrays, message buffers, and assorted glue logic. The only two components that merit discussion are the caches and the interconnection network.

Caches. Ruby models a hierarchy of caches associated with each processor, as well as shared caches used in chip multiprocessors (CMPs) and other hierarchical coherence systems. Cache characteristics, such as size and associativity, are configuration parameters.

Interconnection Network. The interconnection network is the unified communication substrate used to communicate between cache and memory controllers. A single monolithic interconnection network model simulates all communication, even between controllers that would be on the same chip in a simulated CMP system. As such, all intra-chip and inter-chip communication is handled as part of the interconnect, although each individual link can have different latency and bandwidth parameters. This design provides sufficient flexibility to simulate the timing of almost any kind of system.

A controller communicates by sending messages to other controllers. Ruby's interconnection network models the timing of the messages as they traverse the system. Messages sent to multiple destinations (such as a broadcast) use traffic-efficient multicast-based routing to fan out the request to the various destinations.

Ruby models a point-to-point switched interconnection network that can be configured similarly to the interconnection networks in current high-end multiprocessor systems, including both directory-based and snooping-based systems. For simulating systems based on directory protocols, Ruby release 1.0 supports three non-ordered networks: a simplified fully-connected point-to-point network, a dynamically-routed 2D-torus interconnect inspired by the Alpha 21364 [19], and a flexible user-defined network interface. The first two networks are automatically generated using certain simulator configuration parameters, while the third creates an arbitrary network by reading a user-defined configuration file. This file-specified network can create complicated networks such as a CMP-DNUCA network [5].
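As an illustration of the file-specified network idea, the sketch below reads a hypothetical link-list format (one `a b` switch pair per line; this is not Ruby's actual configuration syntax) and derives routing tables with a breadth-first shortest-path search, the kind of per-execution routing-table derivation that lets a new topology be added by editing only the link list.

```cpp
#include <map>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical file-specified network: the configuration is a list of
// bidirectional links between switches, and routing tables are
// recomputed from it rather than hard-coded per topology.
struct Network {
    std::map<int, std::vector<int>> adj;   // switch -> neighboring switches

    void add_links(const std::string& config) {
        std::istringstream in(config);
        int a, b;
        while (in >> a >> b) {             // one link per "a b" pair
            adj[a].push_back(b);
            adj[b].push_back(a);
        }
    }

    // next_hop[d] = the neighbor of 'src' on a shortest path to switch d,
    // found by breadth-first search over the link graph.
    std::map<int, int> routing_table(int src) {
        std::map<int, int> next_hop, parent;
        std::queue<int> bfs;
        bfs.push(src);
        parent[src] = src;
        while (!bfs.empty()) {
            int u = bfs.front(); bfs.pop();
            for (int v : adj[u]) {
                if (parent.count(v)) continue;   // already visited
                parent[v] = u;
                // The first hop out of src toward v:
                next_hop[v] = (u == src) ? v : next_hop[u];
                bfs.push(v);
            }
        }
        return next_hop;
    }
};
```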


For snooping-based systems, Ruby has two totally-ordered networks: a crossbar network and a hierarchical switch network. Both ordered networks use a hierarchy of one or more switches to create a total order of coherence requests at the network's root. This total order is sufficient for many broadcast-based snooping protocols, but it requires that the cache-coherence protocol not rely on the stronger timing properties provided by a more traditional bus-based interconnect. In addition, mechanisms for synchronous snoop response combining and other aspects of some bus-based protocols are not supported.

The topology of the interconnect is specified by a set of links between switches, and the actual routing tables are re-calculated for each execution, allowing additional topologies to be easily added to the system. The interconnect models virtual networks for different types and classes of messages, and it allows dynamic routing to be enabled or disabled on a per-virtual-network basis (to provide point-to-point order if required). Each link of the interconnect has limited bandwidth, but the interconnect does not model the details of the physical or link-level layer. By default, infinite network buffering is assumed at the switches, but Ruby also supports finite buffering in certain networks. We believe that Ruby's interconnect model is sufficient for coherence protocol and memory hierarchy research, but a more detailed model of the interconnection network may need to be integrated for research focusing on low-level interconnection network issues.

3.2 Specification Language for Implementing Cache Coherence (SLICC)

One of our main motivations for creating GEMS was to evaluate different coherence protocols and coherence-based prediction. As such, flexibility in specifying cache coherence protocols was essential. Building upon our earlier work on table-driven specification of coherence protocols [23], we created SLICC (Specification Language for Implementing Cache Coherence), a domain-specific language that codifies our table-driven methodology.

SLICC is based upon the idea of specifying individual controller state machines that represent system components such as cache controllers and directory controllers. Each controller is conceptually a per-memory-block state machine, which includes:

• States: the set of possible states for each cache block,
• Events: conditions that trigger state transitions, such as message arrivals,
• Transitions: the cross-product of states and events (based on the state and event, a transition performs an atomic sequence of actions and changes the block to a new state), and
• Actions: the specific operations performed during a transition.

For example, the SLICC code might specify a “Shared” state that allows read-only access for a block in a cache. When an external invalidation message arrives at the cache for a block in Shared, it triggers an “Invalidation” event, which causes a “Shared x Invalidation” transition to occur. This transition specifies that the block should change to the “Invalid” state. Before a transition can begin, all required resources must be available; this check prevents mid-transition blocking. Such resource checking covers available cache frames, in-flight transaction buffers, space in an outgoing message queue, etc. The resource check allows the controller to always complete the entire sequence of actions associated with a transition without blocking.

SLICC is syntactically similar to C or C++, but it is intentionally limited to constrain the specification to hardware-like structures. For example, no local variables or loops are allowed in the language. We also added special language constructs for inserting messages into buffers and reading information from the next message in a buffer.

Each controller specified in SLICC consists of protocol-independent components, such as cache memories and directories, as well as all fields in: caches, per-block directory information at the home node, in-flight transaction buffers, messages, and any coherence predictors. These fields consist of primitive types such as addresses, bit-fields, sets, counters, and user-specified enumerations. Messages contain a message type tag (for statistics gathering) and a size field (for simulating contention on the interconnection network).
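The states/events/transitions/actions model above can be sketched as a table-driven controller in plain C++, including the resource check that must pass before a transition begins. This is an illustrative reconstruction, not SLICC source or its generated code.

```cpp
#include <map>
#include <utility>

// Illustrative sketch of SLICC's table-driven controller model: a
// per-memory-block state machine defined by a (state, event) ->
// (actions, next state) transition table.
enum class State { Invalid, Shared, Modified };
enum class Event { Load, Invalidation, Replacement };

struct Transition {
    int num_actions;   // stand-in for an atomic sequence of actions
    State next;
};

class CacheController {
public:
    CacheController() {
        // e.g., the "Shared x Invalidation" transition gives up the block.
        table_[{State::Shared, Event::Invalidation}] = {1, State::Invalid};
        table_[{State::Invalid, Event::Load}]        = {2, State::Shared};
    }

    // Before a transition begins, all required resources (cache frames,
    // transaction buffers, outgoing queue space) must be available; here
    // a single flag stands in for that check. If it passes, the actions
    // and the state change complete atomically, without blocking.
    bool trigger(State& block, Event e, bool resources_available) {
        auto it = table_.find({block, e});
        if (it == table_.end() || !resources_available) return false;
        block = it->second.next;
        return true;
    }

private:
    std::map<std::pair<State, Event>, Transition> table_;
};
```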


A controller uses these messages to communicate with other controllers. Messages travel along the intra-chip and inter-chip interconnection networks. When a message arrives at its destination, it generates a specific type of event determined by the input message control logic of the particular controller (also specified in SLICC).

SLICC allows for the specification of many types of invalidation-based cache coherence protocols and systems. As invalidation-based protocols are ubiquitous in current commercial systems, we constructed SLICC to perform all operations at cache-block granularity (configurable, but canonically 64 bytes). As such, the word-level granularity required for update-based protocols is currently not supported. SLICC is perhaps best suited for specifying directory-based protocols (e.g., the protocols used in the Stanford DASH [13] and the SGI Origin [12]) and other related protocols such as AMD's Opteron protocol [1, 10]. Although SLICC can be used to specify broadcast snooping protocols, SLICC assumes all protocols use an asynchronous point-to-point network, not the simpler (but less scalable) synchronous system bus. The GEMS release 1.0 distribution contains SLICC specifications for an aggressive snooping protocol, a flat directory protocol, a protocol based on the AMD Opteron [1, 10], two hierarchical directory protocols suitable for CMP systems, and a Token Coherence protocol [16] for a hierarchical CMP system [17].

The SLICC compiler translates a SLICC specification into C++ code that links with the protocol-independent portions of the Ruby memory system simulator. In this way, Ruby and SLICC are tightly integrated, to the extent of being inseparable. Beyond generating code for Ruby, the SLICC language is intended to serve a variety of purposes. First, the SLICC compiler generates HTML-based tables as documentation for each controller. This concise and continuously-updated documentation is helpful when developing and debugging protocols. Example SLICC code and the corresponding HTML tables can be found online [15]. Second, the SLICC code has also served as the basis for translating protocols into a model-checkable format such as TLA+ [2, 11] or Murphi [8]. Although such efforts have thus far been manual translations, we are hopeful that the process can be partially or fully automated in the future. Finally, we have restricted SLICC in ways (e.g., no loops) that we believe will allow automatic translation of a SLICC specification directly into a synthesizable hardware description language (such as VHDL or Verilog). Such efforts are future work.

3.3 Ruby's Release 1.0 Limitations

Most of the limitations in Ruby release 1.0 are specific to the implementation and not to the general framework. For example, Ruby release 1.0 supports only physically-indexed caches, although support for indexing the primary caches with virtual addresses could be added. Also, Ruby does not model the memory system traffic due to direct memory access (DMA) operations or memory-mapped I/O loads and stores. Instead of modeling these I/O operations, we simply count the number that occur. For our workloads, these operations are infrequent enough (compared to cache misses) to have negligible relative impact on our simulations. Researchers who wish to study more I/O-intensive workloads may find it necessary to model such effects.

4 Detailed Processor Model (Opal)

Although GEMS can use Simics' functional simulator as a driver that approximates a system with simple in-order processor cores, capturing the timing of today's dynamically-scheduled superscalar processors requires a more detailed timing model. GEMS includes Opal (also known as TFSim [18]) as a detailed timing model using the timing-first approach. Opal runs ahead of Simics' functional simulation by fetching, decoding, predicting branches, dynamically scheduling, executing instructions, and speculatively accessing the memory hierarchy. When Opal determines that the time has come for an instruction to retire, it instructs the functional simulation of the corresponding Simics processor to advance one instruction. Opal then compares its processor state with that of Simics to ensure that it executed the instruction correctly. The vast majority of the time, Opal and Simics agree on the instruction execution; however, when an interrupt, I/O operation, or rare kernel-only instruction not implemented by Opal occurs, Opal will detect the discrepancy and recover as necessary.

                                                           6
Appears in Computer Architecture News (CAN), September 2005

The paper on TFSim/Opal [18] provides more discussion of the timing-first philosophy, implementation, effectiveness, and related work.

Features. Opal models a modern dynamically-scheduled, superscalar, deeply-pipelined processor core. By default, Opal is configured to use a two-level gshare branch predictor, MIPS R10000-style register renaming, dynamic instruction issue, multiple execution units, and a load/store queue that allows out-of-order memory operations and memory bypassing. As Opal simulates the SPARC ISA, condition codes and other architectural state are renamed as necessary to allow highly-concurrent execution. Because Opal runs ahead of the Simics functional processor, it models all wrong-path effects of instructions that are not eventually retired. Opal aggressively implements sequential consistency, allowing memory operations to occur out of order and detecting possible memory-ordering violations as necessary.

Limitations. Opal is sufficient for modeling a processor that generates multiple outstanding misses for the Ruby memory system. However, Opal's microarchitectural model, although detailed, does not model all of the most advanced features of some modern processors (e.g., a trace cache and advanced memory-dependence speculation predictors). Thus, the model may need to be extended and further validated before it can serve as a basis for experiments that focus on the microarchitecture of the system. Also, the first release of Opal does not support hardware multithreading, although at least one of our internal projects has added preliminary support for it. Like most microarchitecture simulators, Opal is heavily tied to its target ISA (in this case, SPARC), and porting it to a different ISA, though possible, would not be trivial.

    Perhaps Opal's biggest constraints concern its dependence on the Simics functional execution mode. For instance, because Simics uses a flat memory image, Opal cannot simulate a non-sequentially-consistent execution. Also, because we developed Opal to simulate SPARC-based systems, it is limited to the specific TLB configurations supported by Simics and the simulated operating system.

5 Constraints and Caveats

    As with any complicated research tool, GEMS is not without constraints and caveats. One caveat is that although we have tested and sanity-checked individual components, and have checked some whole-system timing, we have not performed a full end-to-end validation of GEMS. For instance, through the random tester we have verified that Ruby transfers cache blocks coherently; however, we have not performed exhaustive timing verification or verified that Ruby strictly adheres to the sequential consistency memory model. Nevertheless, the relative performance comparisons generated by GEMS should suffice to provide insight into many types of proposed design enhancements.

    Another caveat is that we are unable to distribute our commercial workload suite [3], as it contains proprietary software that cannot be redistributed. Although commercial workloads can be set up under Simics, the lack of ready-made workloads will increase the effort required to use GEMS to generate timing results for commercial servers and other multiprocessing systems.

6 Conclusions

    In this paper we presented Multifacet's General Execution-driven Multiprocessor Simulator (GEMS), a simulation toolset for evaluating multiprocessor architectures. By using the full-system simulator Simics, we were able to decouple the development of the GEMS timing models from the functional correctness required to run an unmodified operating system and commercial workloads. The multiprocessor timing simulator Ruby allows detailed simulation of cache hierarchies. The domain-specific language SLICC gives GEMS the flexibility to implement different coherence protocols and systems under a single simulation infrastructure. The processor timing simulator Opal can be used to simulate a dynamically-scheduled superscalar processor, relying on Simics for functional correctness. To enable others to more easily perform research on multiprocessors with commercial workloads, we have released GEMS under the GNU GPL; the first release is available at http://www.cs.wisc.edu/gems/.
7 Acknowledgments

    We thank Ross Dickson, Pacia Harper, Carl Mauer, Manoj Plakal, and Luke Yen for their contributions to the GEMS infrastructure. We thank the University of Illinois for original beta testing. This work is supported in part by the National Science Foundation (CCR-0324878, EIA/CNS-0205286, and CCR-0105721) and donations from Compaq, IBM, Intel Corporation and Sun Microsystems. We thank Amir Roth for suggesting the acronym SLICC. We thank Virtutech AB, the Wisconsin Condor group, and the Wisconsin Computer Systems Lab for their help and support.

References

[1] Ardsher Ahmed, Pat Conway, Bill Hughes, and Fred Weber. AMD Opteron Shared Memory MP Systems. In Proceedings of the 14th HotChips Symposium, August 2002.
[2] Homayoon Akhiani, Damien Doligez, Paul Harter, Leslie Lamport, Joshua Scheid, Mark Tuttle, and Yuan Yu. Cache Coherence Verification with TLA+. In FM'99—Formal Methods, Volume II, volume 1709 of Lecture Notes in Computer Science, page 1871. Springer Verlag, 1999.
[3] Alaa R. Alameldeen, Milo M. K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill, and David A. Wood. Simulating a $2M Commercial Server on a $2K PC. IEEE Computer, 36(2):50–57, February 2003.
[4] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, 35(2):59–67, February 2002.
[5] Bradford M. Beckmann and David A. Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, December 2004.
[6] Nathan L. Binkert, Erik G. Hallnor, and Steven K. Reinhardt. Network-Oriented Full-System Simulation using M5. In Proceedings of the Sixth Workshop on Computer Architecture Evaluation Using Commercial Workloads, February 2003.
[7] Harold W. Cain, Kevin M. Lepak, Brandon A. Schwartz, and Mikko H. Lipasti. Precise and Accurate Processor Simulation. In Proceedings of the Fifth Workshop on Computer Architecture Evaluation Using Commercial Workloads, pages 13–22, February 2002.
[8] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol Verification as a Hardware Design Aid. In International Conference on Computer Design. IEEE, October 1992.
[9] Free Software Foundation. GNU General Public License (GPL). http://www.gnu.org/copyleft/gpl.html.
[10] Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, 23(2):66–76, March-April 2003.
[11] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.
[12] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241–251, June 1997.
[13] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148–159, May 1990.
[14] Peter S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.
[15] Milo M. K. Martin et al. Protocol Specifications and Tables for Four Comparable MOESI Coherence Protocols: Token Coherence, Snooping, Directory, and Hammer. http://www.cs.wisc.edu/multifacet/theses/milo_martin_phd/, 2003.
[16] Milo M. K. Martin, Mark D. Hill, and David A. Wood. Token Coherence: A New Framework for Shared-Memory Multiprocessors. IEEE Micro, 23(6), Nov/Dec 2003.
[17] Michael R. Marty, Jesse D. Bingham, Mark D. Hill, Alan J. Hu, Milo M. K. Martin, and David A. Wood. Improving Multiple-CMP Systems Using Token Coherence. In Proceedings of the Eleventh IEEE Symposium on High-Performance Computer Architecture, February 2005.
[18] Carl J. Mauer, Mark D. Hill, and David A. Wood. Full System Timing-First Simulation. In Proceedings of the 2002 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 108–116, June 2002.
[19] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb. The Alpha 21364 Network Architecture. In Proceedings of the 9th Hot Interconnects Symposium, August 2001.
[20] Mendel Rosenblum, Stephen A. Herrod, Emmett Witchel, and Anoop Gupta. Complete Computer System Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology: Systems and Applications, 3(4):34–43, 1995.
[21] Lambert Schaelicke and Mike Parker. ML-RSIM Reference Manual. Technical Report 02-10, Department of Computer Science and Engineering, Univ. of Notre Dame, Notre Dame, IN, 2002.
[22] Jared Smolens, Brian Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, and Andreas G. Nowatzyk. Fingerprinting: Bounding the Soft-Error Detection Latency and Bandwidth. In Proceedings of the Eleventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224–234, October 2004.
[23] Daniel J. Sorin, Manoj Plakal, Mark D. Hill, Anne E. Condon, Milo M. K. Martin, and David A. Wood. Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol. IEEE Transactions on Parallel and Distributed Systems, 13(6):556–578, June 2002.
[24] Standard Performance Evaluation Corporation. SPEC Benchmarks. http://www.spec.org.
[25] David A. Wood, Garth A. Gibson, and Randy H. Katz. Verifying a Multiprocessor Cache Controller Using Random Test Generation. IEEE Design and Test of Computers, pages 13–25, August 1990.