Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset

Appears in Computer Architecture News (CAN), September 2005

Milo M. K. Martin1, Daniel J. Sorin2, Bradford M. Beckmann3, Michael R. Marty3, Min Xu4, Alaa R. Alameldeen3, Kevin E. Moore3, Mark D. Hill3,4, and David A. Wood3,4

http://www.cs.wisc.edu/gems/

1 Computer and Information Sciences Dept., Univ. of Pennsylvania
2 Electrical and Computer Engineering Dept., Duke Univ.
3 Computer Sciences Dept., Univ. of Wisconsin-Madison
4 Electrical and Computer Engineering Dept., Univ. of Wisconsin-Madison

Abstract

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under the GNU GPL [9].

1 Introduction

Simulation is one of the most important techniques used by computer architects to evaluate their innovations. Not only does the target machine need to be simulated in sufficient detail, but it must also be driven with a realistic workload. For example, SimpleScalar [4] has been widely used in the architectural research community to evaluate new ideas. However, it and other similar simulators run only user-mode, single-threaded workloads such as the SPEC CPU benchmarks [24]. Many designers are interested in multiprocessor systems that run more complicated, multithreaded workloads such as databases, web servers, and parallel scientific codes. These workloads depend upon many operating system services (e.g., I/O, synchronization, thread scheduling and migration). Furthermore, as the single-chip microprocessor evolves into a chip multiprocessor, the ability to simulate these machines running realistic multithreaded workloads is paramount to continued innovation in architecture research.

Simulation Challenges. Creating a timing simulator for evaluating multiprocessor systems with workloads that require operating system support is difficult. First, creating even a functional simulator, which provides no modeling of timing, is a substantial task. Providing sufficient functional fidelity to boot an unmodified operating system requires implementing supervisor instructions and interfacing with functional models of many I/O devices. Such simulators are called full-system simulators [14, 20]. Second, creating a detailed timing simulator that executes only user-level code is a substantial undertaking, although the wide availability of such tools reduces redundant effort. Finally, creating a simulation toolset that supports both full-system and timing simulation is substantially more complicated than either endeavor alone.

Our Approach: Decoupled Functionality and Timing Simulation. To address these simulation challenges, we designed a modular simulation infrastructure (GEMS) that decouples simulation functionality and timing. To expedite simulator development, we used Simics [14], a full-system functional simulator, as a foundation on which various timing simulation modules can be dynamically loaded. By decoupling functionality and timing simulation in GEMS, we leverage both the efficiency and the robustness of a functional simulator. Using a modular design provides the flexibility to simulate various system components at different levels of detail.

[Figure 1 diagram: four request generators (a random tester exercising contended locks, microbenchmarks for basic verification, Simics, and the Opal detailed processor model) drive Ruby, the memory system simulator, which comprises the coherence controllers (M/S/I), caches and memory, and the interconnection network.]

Figure 1. A view of the GEMS architecture: Ruby, our memory simulator, can be driven by one of four memory system request generators.

While some researchers have used the approach of adding full-system simulation capabilities to existing user-level-only timing simulators [6, 21], we adopted the approach of leveraging an existing full-system functional simulation environment (an approach also used by other simulators [7, 22]). This strategy enabled us to begin workload development and characterization in parallel with development of the timing modules. It also allowed us to perform initial evaluations of our research ideas using a more approximate processor model while the more detailed processor model was still in development.

We use the timing-first simulation approach [18], in which we decouple the functional and timing aspects of the simulator. Since Simics is robust enough to boot an unmodified OS, we use its functional simulation to avoid implementing rare but effort-consuming instructions in the timing simulator. Our timing modules interact with Simics to determine when Simics should execute an instruction; however, the result of executing the instruction is ultimately dependent on Simics. Such a decoupling allows the timing models to focus on the most common 99.9% of all dynamic instructions. This task is much easier than requiring a monolithic simulator to model the timing and correctness of every function in all aspects of full-system simulation. For example, a small mistake in handling an I/O request or in modeling a special case in floating-point arithmetic is unlikely to cause a significant change in timing fidelity. However, such a mistake will likely affect functional fidelity, which may prevent the simulation from continuing to execute. By allowing Simics to always determine the result of execution, the program always continues to execute correctly.

Our approach is different from trace-driven simulation. Although our approach decouples functional simulation and timing simulation, the functional simulator is still affected by the timing simulator, allowing the system to capture timing-dependent effects. For example, the timing model will determine the winner when two processors try to acquire the same software lock in memory. Since the timing simulator determines when the functional simulator advances, such timing-dependent effects are captured.
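This control relationship, in which the timing model decides when each processor's functional execution advances while the functional simulator alone produces instruction results, can be sketched as follows. The classes below are hypothetical stand-ins, not Simics' or GEMS' actual APIs.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Simics (functional) and a timing model.
// The functional simulator owns architectural state and is the sole
// authority on instruction results.
struct FunctionalSim {
    std::vector<uint64_t> pc;            // one program counter per processor
    explicit FunctionalSim(int ncpus) : pc(ncpus, 0) {}
    void step(int cpu) { pc[cpu] += 4; } // execute exactly one instruction
};

// The timing model never computes results; it only decides *when*
// each processor may advance (here: a trivial per-processor latency).
struct TimingModel {
    std::vector<int> latency;            // cycles between retirements
    explicit TimingModel(std::vector<int> l) : latency(std::move(l)) {}
    // Returns the processor that retires in this cycle, or -1 for none.
    int retiring(int cycle) {
        for (int cpu = 0; cpu < (int)latency.size(); ++cpu)
            if (cycle % latency[cpu] == 0) return cpu;
        return -1;
    }
};

// Drive the functional simulator under timing-model control: the
// interleaving of instructions now follows simulated timing.
void simulate(FunctionalSim& fsim, TimingModel& tmodel, int cycles) {
    for (int cycle = 1; cycle <= cycles; ++cycle) {
        int cpu = tmodel.retiring(cycle);
        if (cpu >= 0) fsim.step(cpu);    // advance only when timing allows
    }
}
```

Because the timing model picks the interleaving, races such as two processors competing for a lock resolve according to simulated timing rather than any fixed functional-simulator ordering.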


In contrast, trace-driven simulation fails to capture these important effects. In the limit, if our timing simulator were 100% functionally correct, it would always agree with the functional simulator, making the functional simulation redundant. Such an approach allows for “correctness tuning” during simulator development.

Design Goals. As the GEMS simulation system has primarily been used to study cache-coherent shared memory systems (both on-chip and off-chip) and related issues, those aspects of GEMS release 1.0 are the most detailed and the most flexible. For example, we model the transient states of cache coherence protocols in great detail. However, the tools have a more approximate timing model for the interconnection network, a simplified DRAM subsystem, and a simple I/O timing model. Although we do include a detailed model of a modern dynamically-scheduled processor, our goal was to provide a realistic driver for evaluating the memory system. Therefore, our processor model may lack some details, and it may lack the flexibility needed for certain detailed microarchitectural experiments.

Availability. The first release of GEMS is available at http://www.cs.wisc.edu/gems/. GEMS is open-source software licensed under the GNU GPL [9]. However, GEMS relies on Virtutech's Simics, a commercial product, for full-system functional simulation. At this time, Virtutech provides evaluation licenses for academic users at no charge. More information about Simics can be found at http://www.virtutech.com/.

The remainder of this paper provides an overview of GEMS (Section 2) and then describes the two main pieces of GEMS: the multiprocessor memory system timing simulator Ruby (Section 3), which includes the SLICC domain-specific language for specifying cache-coherence protocols and systems, and the detailed microarchitectural processor timing model Opal (Section 4). Section 5 discusses some constraints and caveats of GEMS, and we conclude in Section 6.

2 GEMS Overview

The heart of GEMS is the Ruby memory system simulator. As illustrated in Figure 1, GEMS provides multiple drivers that can serve as a source of memory operation requests to Ruby:

1) Random tester module: The simplest driver of Ruby is a random testing module used to stress test the corner cases of the memory system. It uses false sharing and action/check pairs to detect many possible memory system and coherence errors and race conditions [25]. Several features are available in Ruby to help debug the modeled system, including deadlock detection and protocol tracing.

2) Micro-benchmark module: This driver supports various micro-benchmarks through a common interface. The module can be used for basic timing verification, as well as detailed performance analysis of specific conditions (e.g., lock contention or widely-shared data).

3) Simics: This driver uses Simics' functional simulator to approximate a simple in-order processor with no pipeline stalls. Simics passes all load, store, and instruction fetch requests to Ruby, which performs the first-level cache access to determine if the operation hits or misses in the primary cache. On a hit, Simics continues executing instructions, switching between processors in a multiple-processor setting. On a miss, Ruby stalls Simics' request from the issuing processor and then simulates the cache miss. Each processor can have only a single miss outstanding, but contention and other timing effects among the processors will determine when the request completes. By controlling the timing of when Simics advances, Ruby determines the timing-dependent functional simulation in Simics (e.g., which processor next acquires a memory block).

4) Opal: This driver models a dynamically-scheduled SPARC v9 processor and uses Simics to verify its functional correctness. Opal (previously known as TFSim [18]) is described in more detail in Section 4.

The first two drivers are part of a stand-alone executable that is independent of Simics or any actual simulated program. In addition, Ruby is specifically designed to support additional drivers (beyond the four mentioned above) using a well-defined interface.
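Such a driver interface can be illustrated with a toy request/callback contract. The names below and the fixed miss latency are hypothetical; Ruby's real interface and timing model are considerably richer.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical sketch of a Ruby-style driver contract: any request
// generator (random tester, microbenchmark, Simics, Opal) issues
// memory operations and receives a completion callback.
struct MemRequest {
    int proc;          // issuing processor
    uint64_t addr;     // block address
    bool is_store;
};

class MemorySimulator {
public:
    std::function<void(const MemRequest&)> on_complete;  // driver callback
    static constexpr int kMissLatency = 10;              // toy fixed latency

    void issue(const MemRequest& r) {
        pending_.push({cycle_ + kMissLatency, r});
    }

    void tick() {  // advance one cycle; fire completions that are due
        ++cycle_;
        while (!pending_.empty() && pending_.front().first <= cycle_) {
            MemRequest r = pending_.front().second;
            pending_.pop();
            on_complete(r);
        }
    }

private:
    std::queue<std::pair<int, MemRequest>> pending_;  // (finish cycle, request)
    int cycle_ = 0;
};
```

A Simics-style driver would stall the issuing processor between `issue` and the completion callback, which is how the one-outstanding-miss-per-processor rule described above would be enforced.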


GEMS' modular design provides significant simulator configuration flexibility. For example, our memory system simulator is independent of our out-of-order processor simulator. A researcher can obtain preliminary results for a memory system enhancement using the simple in-order processor model provided by Simics, which runs much faster than Opal. Based on these preliminary results, the researcher can then determine whether the accompanying processor enhancement should be implemented and simulated in the detailed out-of-order simulator.

GEMS also provides flexibility in specifying many different cache coherence protocols that can be simulated by our timing simulator. We separated the protocol-dependent details from the protocol-independent system components and mechanisms. To facilitate specifying different protocols and systems, we proposed and implemented the protocol specification language SLICC (Section 3.2). In the next two sections, we describe our two main simulation modules: Ruby and Opal.

3 Multiprocessor Memory System (Ruby)

Ruby is a timing simulator of a multiprocessor memory system that models caches, cache controllers, the system interconnect, memory controllers, and banks of main memory. Ruby combines hard-coded timing simulation for components that are largely independent of the cache coherence protocol (e.g., the interconnection network) with the ability to specify the protocol-dependent components (e.g., cache controllers) in a domain-specific language called SLICC (Specification Language for Implementing Cache Coherence).

Implementation. Ruby is implemented in C++ and uses a queue-driven event model to simulate timing. Components communicate using message buffers of varying latency and bandwidth, and the component at the receiving end of a buffer is scheduled to wake up when the next message will be available to be read from that buffer. Although many buffers are used in a strictly first-in-first-out (FIFO) manner, the buffers are not restricted to FIFO-only behavior. The simulation proceeds by invoking the wakeup method for the next scheduled event on the event queue. Conceptually, the simulation would be identical if all components were woken up each cycle; thus, the event queue can be thought of as an optimization that avoids unnecessary processing during each cycle.

3.1 Protocol-Independent Components

The protocol-independent components of Ruby include the interconnection network, cache arrays, memory arrays, message buffers, and assorted glue logic. The only two components that merit discussion are the caches and the interconnection network.

Caches. Ruby models a hierarchy of caches associated with each processor, as well as shared caches used in chip multiprocessors (CMPs) and other hierarchical coherence systems. Cache characteristics, such as size and associativity, are configuration parameters.

Interconnection Network. The interconnection network is the unified communication substrate used to communicate between cache and memory controllers. A single monolithic interconnection network model simulates all communication, even between controllers that would be on the same chip in a simulated CMP system. As such, all intra-chip and inter-chip communication is handled as part of the interconnect, although each individual link can have different latency and bandwidth parameters. This design provides sufficient flexibility to simulate the timing of almost any kind of system.

A controller communicates by sending messages to other controllers. Ruby's interconnection network models the timing of the messages as they traverse the system. Messages sent to multiple destinations (such as a broadcast) use traffic-efficient multicast-based routing to fan out the request to the various destinations.

Ruby models a point-to-point switched interconnection network that can be configured similarly to the interconnection networks in current high-end multiprocessor systems, including both directory-based and snooping-based systems. For simulating systems based on directory protocols, Ruby release 1.0 supports three non-ordered networks: a simplified fully-connected point-to-point network, a dynamically-routed 2D-torus interconnect inspired by the Alpha 21364 [19], and a flexible user-defined network interface. The first two networks are automatically generated using certain simulator configuration parameters, while the third creates an arbitrary network by reading a user-defined configuration file. This file-specified network can create complicated networks such as a CMP-DNUCA network [5].
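As an illustration of the file-specified network idea, the sketch below reads a hypothetical link-list format (one `a b` switch pair per line; this is not Ruby's actual configuration syntax) and derives routing tables with a breadth-first shortest-path search, the kind of per-execution routing-table derivation that lets a new topology be added by editing only the link list.

```cpp
#include <map>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical file-specified network: the configuration is a list of
// bidirectional links between switches, and routing tables are
// recomputed from it rather than hard-coded per topology.
struct Network {
    std::map<int, std::vector<int>> adj;   // switch -> neighboring switches

    void add_links(const std::string& config) {
        std::istringstream in(config);
        int a, b;
        while (in >> a >> b) {             // one link per "a b" pair
            adj[a].push_back(b);
            adj[b].push_back(a);
        }
    }

    // next_hop[d] = the neighbor of 'src' on a shortest path to switch d,
    // found by breadth-first search over the link graph.
    std::map<int, int> routing_table(int src) {
        std::map<int, int> next_hop, parent;
        std::queue<int> bfs;
        bfs.push(src);
        parent[src] = src;
        while (!bfs.empty()) {
            int u = bfs.front(); bfs.pop();
            for (int v : adj[u]) {
                if (parent.count(v)) continue;   // already visited
                parent[v] = u;
                // The first hop out of src toward v:
                next_hop[v] = (u == src) ? v : next_hop[u];
                bfs.push(v);
            }
        }
        return next_hop;
    }
};
```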


For snooping-based systems, Ruby has two totally-ordered networks: a crossbar network and a hierarchical switch network. Both ordered networks use a hierarchy of one or more switches to create a total order of coherence requests at the network's root. This total order is sufficient for many broadcast-based snooping protocols, but it requires that the cache-coherence protocol not rely on the stronger timing properties provided by a more traditional bus-based interconnect. In addition, mechanisms for synchronous snoop response combining and other aspects of some bus-based protocols are not supported.

The topology of the interconnect is specified by a set of links between switches, and the actual routing tables are re-calculated for each execution, allowing additional topologies to be easily added to the system. The interconnect models virtual networks for different types and classes of messages, and it allows dynamic routing to be enabled or disabled on a per-virtual-network basis (to provide point-to-point order if required). Each link of the interconnect has limited bandwidth, but the interconnect does not model the details of the physical or link-level layer. By default, infinite network buffering is assumed at the switches, but Ruby also supports finite buffering in certain networks. We believe that Ruby's interconnect model is sufficient for coherence protocol and memory hierarchy research, but a more detailed model of the interconnection network may need to be integrated for research focusing on low-level interconnection network issues.

3.2 Specification Language for Implementing Cache Coherence (SLICC)

One of our main motivations for creating GEMS was to evaluate different coherence protocols and coherence-based prediction. As such, flexibility in specifying cache coherence protocols was essential. Building upon our earlier work on table-driven specification of coherence protocols [23], we created SLICC (Specification Language for Implementing Cache Coherence), a domain-specific language that codifies our table-driven methodology.

SLICC is based upon the idea of specifying individual controller state machines that represent system components such as cache controllers and directory controllers. Each controller is conceptually a per-memory-block state machine, which includes:

• States: the set of possible states for each cache block,
• Events: conditions that trigger state transitions, such as message arrivals,
• Transitions: the cross-product of states and events (based on the state and event, a transition performs an atomic sequence of actions and changes the block to a new state), and
• Actions: the specific operations performed during a transition.

For example, the SLICC code might specify a “Shared” state that allows read-only access for a block in a cache. When an external invalidation message arrives at the cache for a block in Shared, it triggers an “Invalidation” event, which causes a “Shared x Invalidation” transition to occur. This transition specifies that the block should change to the “Invalid” state. Before a transition can begin, all required resources must be available; this check prevents mid-transition blocking. Such resource checking covers available cache frames, in-flight transaction buffers, space in an outgoing message queue, etc. The resource check allows the controller to always complete the entire sequence of actions associated with a transition without blocking.

SLICC is syntactically similar to C or C++, but it is intentionally limited to constrain the specification to hardware-like structures. For example, no local variables or loops are allowed in the language. We also added special language constructs for inserting messages into buffers and reading information from the next message in a buffer.

Each controller specified in SLICC consists of protocol-independent components, such as cache memories and directories, as well as all fields in: caches, per-block directory information at the home node, in-flight transaction buffers, messages, and any coherence predictors. These fields consist of primitive types such as addresses, bit-fields, sets, counters, and user-specified enumerations. Messages contain a message type tag (for statistics gathering) and a size field (for simulating contention on the interconnection network).
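The states/events/transitions/actions model above can be sketched as a table-driven controller in plain C++, including the resource check that must pass before a transition begins. This is an illustrative reconstruction, not SLICC source or its generated code.

```cpp
#include <map>
#include <utility>

// Illustrative sketch of SLICC's table-driven controller model: a
// per-memory-block state machine defined by a (state, event) ->
// (actions, next state) transition table.
enum class State { Invalid, Shared, Modified };
enum class Event { Load, Invalidation, Replacement };

struct Transition {
    int num_actions;   // stand-in for an atomic sequence of actions
    State next;
};

class CacheController {
public:
    CacheController() {
        // e.g., the "Shared x Invalidation" transition gives up the block.
        table_[{State::Shared, Event::Invalidation}] = {1, State::Invalid};
        table_[{State::Invalid, Event::Load}]        = {2, State::Shared};
    }

    // Before a transition begins, all required resources (cache frames,
    // transaction buffers, outgoing queue space) must be available; here
    // a single flag stands in for that check. If it passes, the actions
    // and the state change complete atomically, without blocking.
    bool trigger(State& block, Event e, bool resources_available) {
        auto it = table_.find({block, e});
        if (it == table_.end() || !resources_available) return false;
        block = it->second.next;
        return true;
    }

private:
    std::map<std::pair<State, Event>, Transition> table_;
};
```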


A controller uses these messages to communicate with other controllers. Messages travel along the intra-chip and inter-chip interconnection networks. When a message arrives at its destination, it generates a specific type of event determined by the input message control logic of the particular controller (also specified in SLICC).

SLICC allows for the specification of many types of invalidation-based cache coherence protocols and systems. As invalidation-based protocols are ubiquitous in current commercial systems, we constructed SLICC to perform all operations at cache-block granularity (configurable, but canonically 64 bytes). As such, the word-level granularity required for update-based protocols is currently not supported. SLICC is perhaps best suited for specifying directory-based protocols (e.g., the protocols used in the Stanford DASH [13] and the SGI Origin [12]) and other related protocols such as AMD's Opteron protocol [1, 10]. Although SLICC can be used to specify broadcast snooping protocols, SLICC assumes all protocols use an asynchronous point-to-point network, not the simpler (but less scalable) synchronous system bus. The GEMS release 1.0 distribution contains SLICC specifications for an aggressive snooping protocol, a flat directory protocol, a protocol based on the AMD Opteron [1, 10], two hierarchical directory protocols suitable for CMP systems, and a Token Coherence protocol [16] for a hierarchical CMP system [17].

The SLICC compiler translates a SLICC specification into C++ code that links with the protocol-independent portions of the Ruby memory system simulator. In this way, Ruby and SLICC are tightly integrated, to the extent of being inseparable. Beyond generating code for Ruby, the SLICC language is intended to serve a variety of purposes. First, the SLICC compiler generates HTML-based tables as documentation for each controller. This concise and continuously-updated documentation is helpful when developing and debugging protocols. Example SLICC code and the corresponding HTML tables can be found online [15]. Second, the SLICC code has also served as the basis for translating protocols into a model-checkable format such as TLA+ [2, 11] or Murphi [8]. Although such efforts have thus far been manual translations, we are hopeful that the process can be partially or fully automated in the future. Finally, we have restricted SLICC in ways (e.g., no loops) that we believe will allow automatic translation of a SLICC specification directly into a synthesizable hardware description language (such as VHDL or Verilog). Such efforts are future work.

3.3 Ruby's Release 1.0 Limitations

Most of the limitations in Ruby release 1.0 are specific to the implementation and not to the general framework. For example, Ruby release 1.0 supports only physically-indexed caches, although support for indexing the primary caches with virtual addresses could be added. Also, Ruby does not model the memory system traffic due to direct memory access (DMA) operations or memory-mapped I/O loads and stores. Instead of modeling these I/O operations, we simply count the number that occur. For our workloads, these operations are infrequent enough (compared to cache misses) to have negligible relative impact on our simulations. Researchers who wish to study more I/O-intensive workloads may find it necessary to model such effects.

4 Detailed Processor Model (Opal)

Although GEMS can use Simics' functional simulator as a driver that approximates a system with simple in-order processor cores, capturing the timing of today's dynamically-scheduled superscalar processors requires a more detailed timing model. GEMS includes Opal (also known as TFSim [18]) as a detailed timing model using the timing-first approach. Opal runs ahead of Simics' functional simulation by fetching, decoding, predicting branches, dynamically scheduling, executing instructions, and speculatively accessing the memory hierarchy. When Opal determines that the time has come for an instruction to retire, it instructs the functional simulation of the corresponding Simics processor to advance one instruction. Opal then compares its processor state with that of Simics to ensure that it executed the instruction correctly. The vast majority of the time, Opal and Simics agree on the instruction execution; however, when an interrupt, I/O operation, or rare kernel-only instruction not implemented by Opal occurs, Opal will detect the discrepancy and recover as necessary.

                                                           6
Appears in Computer Architecture News (CAN), September 2005

The paper on TFSim/Opal [18] provides more discussion of the timing-first philosophy, implementation, effectiveness, and related work.

Features. Opal models a modern dynamically-scheduled, superscalar, deeply-pipelined processor core. By default, Opal is configured to use a two-level gshare branch predictor, MIPS R10000-style register renaming, dynamic instruction issue, multiple execution units, and a load/store queue that allows out-of-order memory operations and memory bypassing. As Opal simulates the SPARC ISA, condition codes and other architectural state are renamed as necessary to allow highly-concurrent execution. Because Opal runs ahead of the Simics functional processor, it models all wrong-path effects of instructions that are not eventually retired. Opal aggressively implements sequential consistency, allowing memory operations to occur out of order and detecting possible memory-ordering violations as necessary.

Limitations. Opal is sufficient for modeling a processor that generates multiple outstanding misses for the Ruby memory system. However, Opal's microarchitectural model, although detailed, does not model all of the most advanced features of some modern processors (e.g., a trace cache and advanced memory-dependence speculation predictors). Thus, the model may need to be extended and further validated before it can serve as a basis for experiments that focus on the microarchitecture of the system. Also, the first release of Opal does not support hardware multithreading, although at least one of our internal projects has added preliminary support for it. Like most microarchitecture simulators, Opal is heavily tied to its target ISA (in this case, SPARC), and porting it to a different ISA, though possible, would not be trivial.

    Perhaps Opal's biggest constraints concern its dependence on the Simics functional execution mode. For instance, because Simics uses a flat memory image, Opal cannot simulate a non-sequentially-consistent execution. Also, because we developed Opal to simulate SPARC-based systems, it is limited to the specific TLB configurations supported by Simics and the simulated operating system.

5 Constraints and Caveats

    As with any complicated research tool, GEMS is not without constraints and caveats. One caveat is that although we have tested and sanity-checked individual components, and have checked some whole-system timing, we have not performed a full end-to-end validation of GEMS. For instance, through the random tester we have verified that Ruby transfers cache blocks coherently; however, we have not performed exhaustive timing verification or verified that Ruby strictly adheres to the sequential consistency memory model. Nevertheless, the relative performance comparisons generated by GEMS should suffice to provide insight into many types of proposed design enhancements.

    Another caveat is that we are unable to distribute our commercial workload suite [3], as it contains proprietary software that cannot be redistributed. Although commercial workloads can be set up under Simics, the lack of ready-made workloads will increase the effort required to use GEMS to generate timing results for commercial servers and other multiprocessing systems.

6 Conclusions

    In this paper we presented Multifacet's General Execution-driven Multiprocessor Simulator (GEMS), a simulation toolset for evaluating multiprocessor architectures. By using the full-system simulator Simics, we were able to decouple the development of the GEMS timing models from the functional correctness required to run an unmodified operating system and commercial workloads. The multiprocessor timing simulator Ruby allows detailed simulation of cache hierarchies. The domain-specific language SLICC gives GEMS the flexibility to implement different coherence protocols and systems under a single simulation infrastructure. The processor timing simulator Opal can be used to simulate a dynamically-scheduled superscalar processor, relying on Simics for functional correctness. To enable others to more easily perform research on multiprocessors with commercial workloads, we have released GEMS under the GNU GPL; the first release is available at http://www.cs.wisc.edu/gems/.
7 Acknowledgments

    We thank Ross Dickson, Pacia Harper, Carl Mauer, Manoj Plakal, and Luke Yen for their contributions to the GEMS infrastructure. We thank the University of Illinois for original beta testing. This work is supported in part by the National Science Foundation (CCR-0324878, EIA/CNS-0205286, and CCR-0105721) and donations from Compaq, IBM, Intel Corporation and Sun Microsystems. We thank Amir Roth for suggesting the acronym SLICC. We thank Virtutech AB, the Wisconsin Condor group, and the Wisconsin Computer Systems Lab for their help and support.

References

[1] Ardsher Ahmed, Pat Conway, Bill Hughes, and Fred Weber. AMD Opteron Shared Memory MP Systems. In Proceedings of the 14th HotChips Symposium, August 2002.
[2] Homayoon Akhiani, Damien Doligez, Paul Harter, Leslie Lamport, Joshua Scheid, Mark Tuttle, and Yuan Yu. Cache Coherence Verification with TLA+. In FM'99—Formal Methods, Volume II, volume 1709 of Lecture Notes in Computer Science, page 1871. Springer Verlag, 1999.
[3] Alaa R. Alameldeen, Milo M. K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill, and David A. Wood. Simulating a $2M Commercial Server on a $2K PC. IEEE Computer, 36(2):50–57, February 2003.
[4] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, 35(2):59–67, February 2002.
[5] Bradford M. Beckmann and David A. Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, December 2004.
[6] Nathan L. Binkert, Erik G. Hallnor, and Steven K. Reinhardt. Network-Oriented Full-System Simulation using M5. In Proceedings of the Sixth Workshop on Computer Architecture Evaluation Using Commercial Workloads, February 2003.
[7] Harold W. Cain, Kevin M. Lepak, Brandon A. Schwartz, and Mikko H. Lipasti. Precise and Accurate Processor Simulation. In Proceedings of the Fifth Workshop on Computer Architecture Evaluation Using Commercial Workloads, pages 13–22, February 2002.
[8] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol Verification as a Hardware Design Aid. In International Conference on Computer Design. IEEE, October 1992.
[9] Free Software Foundation. GNU General Public License (GPL). http://www.gnu.org/copyleft/gpl.html.
[10] Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, 23(2):66–76, March-April 2003.
[11] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.
[12] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241–251, June 1997.
[13] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148–159, May 1990.
[14] Peter S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.
[15] Milo M. K. Martin et al. Protocol Specifications and Tables for Four Comparable MOESI Coherence Protocols: Token Coherence, Snooping, Directory, and Hammer. http://www.cs.wisc.edu/multifacet/theses/milo_martin_phd/, 2003.
[16] Milo M. K. Martin, Mark D. Hill, and David A. Wood. Token Coherence: A New Framework for Shared-Memory Multiprocessors. IEEE Micro, 23(6), Nov/Dec 2003.
[17] Michael R. Marty, Jesse D. Bingham, Mark D. Hill, Alan J. Hu, Milo M. K. Martin, and David A. Wood. Improving Multiple-CMP Systems Using Token Coherence. In Proceedings of the Eleventh IEEE Symposium on High-Performance Computer Architecture, February 2005.
[18] Carl J. Mauer, Mark D. Hill, and David A. Wood. Full System Timing-First Simulation. In Proceedings of the 2002 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 108–116, June 2002.
[19] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb. The Alpha 21364 Network Architecture. In Proceedings of the 9th Hot Interconnects Symposium, August 2001.
[20] Mendel Rosenblum, Stephen A. Herrod, Emmett Witchel, and Anoop Gupta. Complete Computer System Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology: Systems and Applications, 3(4):34–43, 1995.
[21] Lambert Schaelicke and Mike Parker. ML-RSIM Reference Manual. Technical Report 02-10, Department of Computer Science and Engineering, Univ. of Notre Dame, Notre Dame, IN, 2002.
[22] Jared Smolens, Brian Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, and Andreas G. Nowatzyk. Fingerprinting: Bounding the Soft-Error Detection Latency and Bandwidth. In Proceedings of the Eleventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224–234, October 2004.
[23] Daniel J. Sorin, Manoj Plakal, Mark D. Hill, Anne E. Condon, Milo M. K. Martin, and David A. Wood. Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol. IEEE Transactions on Parallel and Distributed Systems, 13(6):556–578, June 2002.
[24] Standard Performance Evaluation Corporation. SPEC Benchmarks. http://www.spec.org.
[25] David A. Wood, Garth A. Gibson, and Randy H. Katz. Verifying a Multiprocessor Cache Controller Using Random Test Generation. IEEE Design and Test of Computers, pages 13–25, August 1990.