What Can the GC Compute Efficiently? A Language for Heap Assertions at GC Time
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
What Can the GC Compute Efficiently?
A Language for Heap Assertions at GC Time
Christoph Reichenbach Neil Immerman Yannis Smaragdakis
University of Massachusetts University of Massachusetts University of Massachusetts
creichen@gmail.com neil.immerman@gmail.com yannis@cs.umass.edu
Edward E. Aftandilian Samuel Z. Guyer
Tufts University Tufts University
eaftan@cs.tufts.edu sguyer@cs.tufts.edu
Abstract 1. Introduction
We present the DeAL language for heap assertions that are Garbage collection (GC) is a widely popular mechanism in
efficiently evaluated during garbage collection time. DeAL is modern programming languages, employed in representa-
a rich, declarative, logic-based language whose programs are tives of virtually all language classes, from mainstream man-
guaranteed to be executable with good whole-heap locality, aged languages (e.g., Java, C#), to dynamic languages (e.g.,
i.e., within a single traversal over every live object on the Perl, Python), and to the language avant-garde (e.g., Haskell,
heap and a finite neighborhood around each object. As a OCaml). The traditional view on GC considers it a necessary
result, evaluating DeAL programs incurs negligible cost: cost. Our work is based on the observation that GC also rep-
for simple assertion checking at each garbage collection, resents a unique opportunity. GC necessitates an occasional
the end-to-end execution slowdown is below 2%. DeAL is traversal of all live program objects. If all objects are going
integrated into Java as a VM extension and we demonstrate to be traversed anyway, it is natural to consider using this
its efficiency and expressiveness with several applications traversal for more than computing liveness. With minimal
and properties from the past literature. machinery that piggybacks on the GC traversal, the system
Compared to past systems for heap assertions, DeAL is can perform computation over a program’s heap. Such com-
distinguished by its very attractive expressiveness/efficiency putation should be side-effect free, but also similar in struc-
tradeoff: it offers a significantly richer class of assertions ture to a garbage collection traversal, in order to exploit GC
than what past systems could check with a single traversal. well without undue overhead. Furthermore, this computation
Conversely, past systems that can express the same (or more) should not compute results essential to the main program’s
complex assertions as DeAL do so only by suffering orders- control flow, or it would cause much more frequent garbage
of-magnitude higher costs. collection than would be otherwise necessary.
Assertion checking is a good fit for these requirements.
Categories and Subject Descriptors D.2.5 [Testing and Ensuring program properties by way of programmer-defined
Debugging]; D.2.4 [Software/Program Verification]: Asser- assertions is a crucial technique in software development
tion checkers and forms the cornerstone of methodologies such as Design
by Contract [13]. Virtually every programmer with modern
General Terms Algorithms, Reliability training has been well-exposed to the idea that using asser-
tions to trigger program failures is an excellent way to reveal
Keywords heap assertions, garbage collector, reachability, faults. Since assertion checking only queries program data
dominance but does not update them, it requires no computations with
side effects. Additionally assertions should not alter the pro-
gram control flow or compute results that the program will
Permission to make digital or hard copies of all or part of this work for personal or later use.
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation Performing assertion checking during GC time is not
on the first page. To copy otherwise, to republish, to post on servers or to redistribute new. The QVM Java virtual machine [3] allows heap
to lists, requires prior specific permission and/or a fee.
OOPSLA/SPLASH’10, October 17–21, 2010, Reno/Tahoe, Nevada, USA.
probes, which check reachability and dominance properties
Copyright c 2010 ACM 978-1-4503-0203-6/10/10. . . $10.00 at GC time. The GC Assertions system of Aftandilian andGuyer [1] describes a suite of assertions that can be effi- DeAL is an expressive language, as we will demonstrate
ciently checked at GC time. The limitation of these past with many examples. Its generality allows us to write com-
systems is that they either do not offer a general language plex properties (which past systems needed to hard-code)
for writing assertions or they require multiple GC traver- as straightforward logical expressions. For instance, we can
sals to evaluate a single assertion. In this way, they sacri- write the property “all Node objects are dominated by object
fice either efficiency or expressiveness. For instance, the Af- d”, i.e., no path to them exists that does not pass through d:
tandilian and Guyer system includes very specific kinds of
assertions, such as assert-unshared(obj). Recent versions ∀x ∈ Node : ¬Rr0 /d(x)
of the QVM system allow full JML assertions but cannot The special constant r0 refers to the root set of the GC
evaluate those in a single GC traversal, often resulting in nu- traversal. Therefore the property means that no Node object
merous heap traversals in the course of evaluating an asser- is reachable without going through d.
tion. As a consequence, efficiency for QVM assertions relies Using standard boolean connectives we can write inter-
on reducing the cost of every GC via parallelism. Still, per- esting combinations of properties—e.g., the DeAL assertion:
forming multiple GCs, even in parallel, is much more costly
than piggybacking on a single GC (which would need to oc- (∀x ∈ Node : (x.data > threshold) ⇒ Rhotcache(x))∧
cur anyway). (∃y ∈ Node : ¬Rhotcache(y) ∧ y.next = null)
In contrast, we present a Declarative Assertion Language
(DeAL), which fulfills the requirements of computing during (In words, “every Node object with an above-threshold data
GC time: it only allows side-effect-free computations and it value is in the hotcache data structure, and some Node object
can only express programs whose object access patterns are that is not in hotcache has no next link.”) Although non-
a good fit for a GC traversal and incur constant overhead per trivial, this assertion (like any other legal DeAL program, as
object. Our approach is informed by Descriptive Complex- defined in Section 2) is evaluated in a single heap traversal,
ity, where the important complexity classes (LOGSPACE, per DeAL’s locality guarantees.
PTIME, NP, PSPACE, etc.) have been captured by logic- Exploiting the GC allows DeAL assertions to be evalu-
based languages [10]. Although the performance property ated with very low cost. We implemented DeAL by extend-
we are interested in does not correspond to a complexity ing the Jikes RVM and Eclipse Java Compiler and evaluated
class, one can view it as a limited case of linear-time compu- its efficiency extensively. The overhead of evaluating DeAL
tation, where we have the extra guarantee that the program is assertions of course varies with the complexity of the asser-
evaluated in a single traversal of the heap. This is not a prop- tion: if the assertion consists of comparing k field derefer-
erty that is well-expressed in complexity theory, but it is an ence expressions that depend on the current object, it is in-
often-cited property of good locality for garbage collectors. evitable to incur Ω(k) cost per object. For typical small prop-
For instance, Wilson’s GC survey [23] distinguishes algo- erties, however, DeAL imposes at most a few percent over-
rithms based on their locality. A GC algorithm is considered head over the cost of garbage collection, resulting in near-
to have good locality when it only needs to visit each ob- zero (sub-2%) overhead over the end-to-end execution time
ject once. Algorithms that need to perform multiple passes of applications.
over the heap have bad temporal locality, since they always In summary, our paper makes the following contributions:
access the data that have not been accessed the longest. This • We define a declarative language for linear time, single-
behavior is a bad fit for LRU-based cache mechanisms, espe- traversal computations, which results in good locality for
cially for traversals that revisit structures slightly larger than whole-heap traversals. To our knowledge, no past language
the cache. has ever offered this guarantee. The property that all legal
DeAL is a logic-based extension of Java and guarantees programs visit each object exactly once is a general pro-
that any assertion expressible in it runs by traversing each gram property, not limited to the domain of computing dur-
live heap object (and a constant-size neighborhood around ing GC. As such, our language design and overall approach
it) exactly once. From a logic standpoint, DeAL is a single- may find even more applications in the future.
variable first order logic with predicates of the form Rc/d(x), • We designed our language, DeAL, as a natural Java exten-
meaning that object x is reachable from object c without sion for assertion checking and implemented it in a modern
going through object d. For instance, we can express the high-performance JVM. We demonstrate the expressive-
well-formedness property of a doubly-linked list d via a ness of our design with numerous examples, mostly from
logical assertion: the recent research literature.
∀x ∈ Node : Rd(x) ⇒ (x.next.prev = x) • We evaluate the practical effectiveness and efficiency of
DeAL over a suite of assertions and applications. Our
The concrete DeAL syntax uses a textual representation of results demonstrate that the idea of an expressive language
logical operators. For instance, the above property is written: for assertion checking at GC time has excellent properties
forall Node x: reach[d](x) -> x.next.prev == x and promise for wide adoption.primitive-valued expressions can be compared with the stan-
DealAssertion ::= assertion(F) | dard primitive relational operators o |
name without changing the meaning of the formula.
o ≤ o | o ≥ o | o , o | Rt(x) The second important element is that the unary pred-
t ∈ TraversalSrc ::= s/→
−s | /→
−s | t ∨ t icates of the DeAL logic have the form Rs1 /o1 , . . . , oi ∨
s2 /oi+1 , . . . , o j ∨ . . . ∨ sn /ol+1 , . . . , om (x). Note that the ∨s
o ∈ Obj ::= v | o. f | c | x
in the above, although logically designating an “or”, are part
s ∈ SourceObj ::= v | s. f of the predicate name and should not be confused with the
x ∈ LVar is a set of logic variable names top-level logical “or” connective. (For instance, there are
no “and” and “not” counterparts in predicate names.) The
v ∈ JVar is a set of Java variable names
meaning of the above relation is that x is reachable from s1
T ∈ Typ ::= Java types (called a source) via a path that does not include any of the
c ∈ Const ::= Java constants objects o1 , . . . , , oi or x is reachable from s2 via a path that
does include any of the objects o1 , . . . , o j or . . . or x is reach-
f ∈ Field ::= Java field names
able from sn via a path that does not include any of the ob-
jects o1 , . . . , om . (Note that the set of excluded objects keeps
Figure 1. DeAL syntax. Note the distinction between ex- growing monotonically: the objects specified to be excluded
pressions that do and do not contain logic variables (sets for every new source si are in addition to all previously ex-
Obj and SourceObj, respectively). Their only difference is cluded objects.) An empty source si is shorthand for the root
that the former cannot appear as traversal sources in an R set r0 of the garbage collection; thus, we may rewrite our
predicate or be a Java constant. earlier example of ‘all Node objects are dominated by d’ as
∀x ∈ Node : ¬R/d(x).
2. The DeAL Language Together, these design decisions ensure that we can eval-
We next present DeAL’s syntax, semantics, and usage exam- uate queries in a single pass through the heap. First, disal-
ples. lowing nesting ensures that we disallow queries that make
us compare every heap object to every other heap object,
which we cannot do in one pass in general.
2.1 Language Design
Second, the construction of the R ensures that we can find
Syntax. The full abstract syntax of the DeAL language a suitable heap traversal strategy. Intuitively, the key feature
is shown in Figure 1. Since DeAL’s expressions are first- of using a single R predicate to link reachability properties
order logic formulas, we will freely alternate between the only with an or connective is that the predicate value is
logical and the concrete DeAL syntax, as convenient for determined regardless of which path in the object graph was
presentation purposes. The reader should treat the logical used to reach an object. That is, if we reach an object via
notation as a shorthand for the text DeAL/Java syntax, i.e., one path, we do not need to later try to also reach it via other
∀x ∈ T , ∃x ∈ T , Rt, ∧, ∨, ¬, ∈, =, ⇒, ⇔, ,, ≤, ≥ are paths.
synonyms of the actual DeAL operators forall T x, exists We should point out precisely how the language of DeAL
T x, reach[t], &&, ||, !, instanceof, ==, ->, , !=, =, formulas, F, formally fits the mathematical definition of a
respectively. first order logic—e.g., see [10]. From the logic perspective,
The language has three primitives: assertion(F ), Java object expressions are treated as logical constants—
assertDisjoint(F ), and unsafeAssert(F ), where F is a e.g., a Java variable v is a constant from the perspective of the
sentence in first order logic. All three primitives attempt logic, since its meaning does not change for different values
to check whether the formula is true, but they have differ- of the logical variables. Java fields are treated as functions—
ent requirements and soundness guarantees, discussed later. for instance, the expression x.next is equivalent to a logical
All the logical connectives have their usual meaning, with function “next” applied to logical object x.
forall/exists quantification implicitly applying over all live
heap objects. Objects can be compared for equality andType Checking and Assertion Semantics. DeAL asser- specifics on the current DeAL integration in Java. Although
tions can appear as Java statements at any legal point these are largely engineering choices (which could change
in a Java+DeAL program. Each assertion (a member of without altering the essence of the language) they are useful
DealAssertion in the syntax of Figure 1) is a separate DeAL in order for the reader to establish a concrete model of usage.
program, evaluated independently. The DeAL guarantee is
Integration. We have integrated DeAL to Java as a con-
that each assertion requires a single traversal over objects, servative extension. The DeAL language is hidden behind a
and thus can be evaluated during a single garbage collec- library and the user invokes assertions by calling static meth-
tion. (In Section 2.2 we discuss in more detail the various ods assertion, assertDisjoint, and unsafeAssert in the
options on how often assertions are evaluated and how they main DeAL class. The method calls accept DeAL formulas
are integrated in our current implementation.) as regular strings. For instance, the doubly-linked list con-
For a syntactically well-formed DeAL program to also sistency invariant is asserted as:
be valid, it first needs to satisfy Java type constraints on its assertion(
expressions. The types of field reference expressions (either "forall Node x: reach[d](x) -> x.next.prev == x");
j or o in Figure 1) have to be compatible with their uses with
operators =, ≤, etc. That is, a relational expression (r in the This allows unsuspecting Java compilers to process
syntax) should be either a legal Java expression with type DeAL programs by linking to a simple stub library. The
boolean or an instance of predicate R. DeAL compiler, however, fully parses and statically type
More interestingly, DeAL imposes restrictions on how checks the assertion string. (As indicated earlier, most of the
the R predicates are used in different kinds of asser- examples in this paper use the equivalent DeAL syntax of
tions. These restrictions are closely tied to the meaning of Figure 1 instead of the concrete textual syntax. Accordingly,
assertion vs. assertDisjoint vs. unsafeAssert.
we elide the quotes that delimit the assertion string.)
In designing DeAL’s assertion evaluation model, we con-
• In instances of assertion(F ), at most one reacha- sidered two questions: when to evaluate and whether to eval-
bility relation, Rs1 /o1 , . . . , oi ∨ s2 /oi+1 , . . . , o j ∨ . . . ∨ uate. Regarding when to evaluate, we chose to execute asser-
sn /ol+1 , . . . , om (x), may occur in F, although the same re- tion statements at the point they are encountered, rather than
lation may occur multiple times. An assertion primitive delaying until the time when garbage collection would oc-
straightforwardly checks the asserted condition in a single cur anyway. Compared to past systems, this design decision
traversal over all live objects. resembles that of the QVM system [3], as opposed to the
option followed by Aftandilian and Guyer [1], which puts
• In instances of assertDisjoint(F ), multiple reachability off assertion checking until GC time. The latter has the dis-
relations may occur, but, if two such relations occur, then in advantage that the shape of the heap may have changed be-
addition to asserting F we are asserting that the domains of tween the point of assertion statement and the point of its
the two relations are disjoint. If there is a node that satisfies evaluation, resulting in incorrect evaluation (both unsound
both relations then we will detect this and report failure and incomplete).
of the assertion. (If the domains are all disjoint, we can However, the issue with evaluating assertions at the point
evaluate F in a single run through the heap during garbage they are encountered is that if every assertion causes a
collection, and if the domains are not all disjoint then we garbage collection then GC will occur too frequently, with
will detect this during the single traversal of the heap.) high overhead. Therefore, DeAL decides whether to eval-
• In instances of unsafeAssert(F ), multiple R relations uate an assertion by measuring the amount of time since
may occur without a claim of disjointness. In this mode, the last GC. The assertion is evaluated only if a sufficiently
the DeAL system will behave as if the domains of the R long time has elapsed. In this way, assertions can be ignored,
relations mentioned are all disjoint, without checking this resulting in incompleteness.1 By default, we tie our deci-
property. If the relations are not disjoint, then we are not sion of whether an assertion is to be evaluated or not to the
guaranteed to detect this, nor to actually check F. Specifi- same internal metrics that the Jikes RVM GC uses to decide
cally, DeAL will not correctly check formula F if the truth whether to perform a collection. Thus, we evaluate an asser-
value of F depends on objects that simultaneously satisfy tion roughly around the time a GC would be due to happen
multiple R relations. The reason is that DeAL will only anyway. This is what we view as the most likely mode of
traverse such objects once, while processing one of the use of DeAL: assertion checking happens at low frequency
R relations and without knowing whether the object sat- but without imposing heavy overhead. We offer the user the
isfies other Rs. Yet formula F may express different re- option of “throttling” the evaluation frequency of assertions,
quirements for an object depending on which R it satisfies. up to executing every single assertion encountered and suf-
fering the resulting overhead. This allows us to enforce as-
2.2 DeAL Language Usage
1 We choose to view DeAL as a fault detector, hence the meanings of
We next discuss how the user interacts with the DeAL lan- “sound” and “complete” are exactly inverse from what they would be for
guage and offer specific usage scenarios. We begin with a correctness verifier.sertion checking even in pathological cases that our default As mentioned briefly in the Introduction, we can express
heuristic would always skip, such as when specified right af- dominance properties as non-reachability if the dominator is
ter a call to System.gc(). excluded [22]. For instance, we can evaluate a property over
all dominated nodes:
Examples. There are several possibilities for using DeAL
in the course of regular programming. These include local assertion(∀x ∈ Node : ¬R/c(x) ⇒ x.data > 0);
data structure invariants, ownership and dominance prop-
erties, disjointness properties, reachability properties, and Dominance by a single object is an object ownership
combinations of those. We show below several DeAL as- property. Combinations of dominance properties allow more
sertions, most of them taken from the recent literature [6– complex conditions. For instance, we can specify that every
8, 11, 16, 17, 19]. object of type Person has to be dominated by the combina-
One common kind of condition concerns type constraints. tion of data structures males and females. That is, no path
For instance, all objects in a data structure may need to be in the object graph should be able to reach a Person object
Serializable, and no instance of a subtype of Thread may without going through the object called males or the object
be stored in the structure: called females in the program:
assertion(∀x ∈ Node :RHttpSession(x) ⇒ assertion(∀x ∈ Person : ¬R/males,females(x));
x instanceof Serializable ∧ Note that dominance by a set of sources is not equivalent
¬(x instanceof Thread)); to any boolean combination of individual dominance prop-
erties. In particular, “x is dominated by the combined objects
(Note that although class Thread is not Serializable, a males and females” is equivalent to neither “x is dominated
user-defined subclass of it may be.) by males AND x is dominated by females” nor “x is dom-
Local invariants for data structures are quite common and inated by males OR x is dominated by females”. (Consider
useful. We already saw a doubly-linked list invariant, and a objects reachable from both data structures, which are dom-
sortedness invariant is equally easy to express: inated by neither one.)
If we want to state that the two structures are disjoint,
assertion(∀x ∈ Node : Rc(x) ⇒ x.data ≤ x.next.data);
we can do so with an assertDisjoint statement, which will
also allow us to combine other conditions on the objects. For
(Note that we can ignore the case of x.next being null. If
instance, we can have:
a subexpression of a field dereference expression is null
DeAL ignores the corresponding value of the logical vari- assertDisjoint(∀x ∈ Person : Rmales(x)∨Rfemales(x));
able.2 )
It is similarly easy to specify that a data structure rooted This states that any object of type Person is reachable from
at c should have a cycle: one of the two data structures, though there may also be other
paths to it. Since an assertDisjoint allows multiple disjoint
assertion(∀x ∈ Node : Rc(x) ⇒ x.next , null); R predicates in the same assertion, we can easily strengthen
the condition to express “every Person is dominated by
In the same way, we can state data structure invariants for either males or females but not both”, by combining the last
binary and n-ary trees, heaps, and so on. two example formulas:
More interesting properties are those that require the
reachability of all objects that satisfy a condition: assertDisjoint(∀x ∈ Person :
(Rmales(x) ∨ Rfemales(x))∧
assertion(∀x ∈ Node : (x.next = null) ⇒ Rleaves(x));
¬R/males,females(x));
Such properties are not easily expressible in plain Java, be-
Disjointness conditions have a variety of applications, in-
cause they refer to any possible live object (with a given
cluding leak detection (data may be reachable by unexpected
type, in this case) and not just to objects in a specific struc-
data structures) and detection of improper sharing among
ture that could be traversed with a plain for loop.
threads. For instance, we can state that trees from distinct
With an existential quantifier, we can easily assert that a
compilation phases do not share structure:
value satisfying certain properties is found, independently of
which data structure contains it: assertDisjoint(∀x : ¬(RparseTree(x)∧RsyntaxTree(x)));
assertion(∃x ∈ Node : x.data > 3.141∧ x.data < 3.142); The above examples are representative of simple but pow-
erful DeAL assertions. The examples do not include uses of
2 Forexample, if in the quantified expression, “Qx : ϕ,” x.next.data
occurs in ϕ, then the quantifier is restricted to the case that x.next , unsafeAssert, which allows more complex assertions but
null, i.e, (Qx such that x.next , null) : ϕ. with few guarantees, as we describe next.Unsafe Assertions. The unsafeAssert primitive gives the 3.1 Single Traversal Evaluation
user extra flexibility but at the cost of not guaranteeing cor- DeAL assertions are guaranteed to be evaluated in a single
rectness. We do not consider unsafeAssert to be a core part traversal of live heap objects. Although in the rest of the
of the DeAL language, and we have not used it in prac- paper we informally describe this property as “each object is
tice. Nevertheless, it represents an escape clause for the user traversed once”, this assumes an implicit understanding that
to specify richer properties than those that DeAL can stat- an object may be re-examined briefly, only to discover that it
ically guarantee to check in a single traversal. It is the re- was visited earlier. Below we state the desired property more
sponsibility of the user to ensure that such properties make precisely.
sense under DeAL’s single-traversal execution. Therefore, it
is worth elucidating possible usage patterns, since correct- Theorem 1. A valid, safe DeAL assertion (i.e., a sin-
ness depends on understanding them. There are four possi- gle assertion or assertDisjoint statement but not an
bilities for the correctness of evaluating an unsafeAssert: unsafeAssert statement) can be evaluated correctly in a
single traversal that crosses each edge in the object graph
• The formula is evaluated soundly and completely, i.e., exactly once and performs a constant amount of work per
DeAL assertion failures correctly indicate that the formula object. (This constant depends on the size of the assertion.)
is false, and if the DeAL assertion evaluation does not fail,
the formula is true. This is the case when the shape of Proof. Consider first the case of an assertion statement.
the properties or of the heap ensures that the truth value According to Figure 1, an asserted formula is a boolean com-
of the formula does not depend on which objects satisfy
bination of quantified formulas (each starting with ∀ or ∃),
multiple Rs. For a simple example, consider a formula
with each quantified formula being a boolean combination
that treats all objects reachable under two different sets of
of relational expressions, r. Since all quantification is over
conditions the same way: ∀x : (Rs1 /o1 (x) ⇒ x.data >
live objects, it is clear that we only need to evaluate every
0) ∧ (Rs2 /o2 (x) ⇒ x.data > 0). relational expression r once per object. All relational expres-
• The formula is evaluated soundly, i.e., DeAL assertion sions of a form other than Rt(x) can be evaluated correctly
failures correctly indicate that the formula is false. This in constant time for any object. Thus, we can evaluate all
means that some DeAL warnings may be missed, but this such expressions (from the entire asserted formula) at what-
is likely practically acceptable. For example, the formula ever time an object is visited. The difficulty lies in proving
may contain two distinct R predicates with one of them that we can define a traversal (i.e., a loop over live objects)
implying stronger conditions than the other (e.g., ∀x : for which expressions of the form Rt(x) are also correctly
(Rs1 (x) ⇒ x.data > 0) ∧ (Rs2 (x) ⇒ x.data ≥ 0)). DeAL evaluated in constant time. If we also show this, then our
offers no guarantee as to which condition it will check for traversal needs to keep current truth values only for all quan-
objects that satisfy both Rs. tified formulas and at the end of the entire traversal perform
• The formula is evaluated completely: there may be false their boolean combination for the final result. (An optimiza-
assertion failures, but if no assertion fails then the formula tion consists of short-circuiting the evaluation of quantified
was true at the point of evaluation. An example would formulas. That is, if an ∃ or ∀ formula has already had its
be the complement of our previous “unsound” formula: truth value determined before the end of the traversal—this
∃x : Rs1 (x) ∧ x.data ≤ 0 ∨ Rs2 (x) ∧ x.data < 0. can only be true for ∃ and false for ∀—there is no reason to
evaluate for further objects the relational expressions r in the
• The formula is evaluated neither soundly nor
body of this quantified formula.)
completely—e.g., it combines an incomplete and an un- Recall that, per the restrictions of Section 2.1, a valid
sound sub-formula. This scenario may still be valuable if assertion statement can only contain instances of a sin-
it turns out to detect failures often while issuing false re-
gle Rt(x) predicate. The truth value of such a predicate
ports rarely. Practical automatic bug detectors are often
can be computed (in constant time) for a traversal visit-
unsound and incomplete—the authors of ESC/Java have
ing every live object once. The full form of the predicate
argued forcefully that soundness and completeness are
is Rs1 /o1 , . . . , oi ∨ s2 /oi+1 , . . . , o j ∨ . . . ∨ sn /ol+1 , . . . , om (x).
overrated properties compared to practical usefulness in We define the traversal to satisfy the required property as
catching bugs [9]. follows:
3. Single Traversal Property and Language • Mark objects o1 , . . . , oi ignored.
Implementation For example, for query Ra/b ∨ c/d, e(x) in Figure 2, we
We next state and prove DeAL’s single traversal property. mark b as ignored.
This discussion also serves to introduce the way the DeAL • Perform a standard reachability traversal starting at ob-
compiler processes programs and the changes required to the ject s1 . The traversal processes objects that are neither
runtime system so that assertions are evaluated as part of a ignored nor visited and marks objects visited after pro-
GC traversal. cessing them. Predicate Rt(x) is true for each processedroot visited flag and one for ignored).3 From the perspective
of the visited flag, the above traversal is a full replacement
a
❻ c of a GC reachability traversal with only the object visiting
❶ ❷ order modified.
d b e The handling of statement assertDisjoint is quite sim-
❸ ❺ ilar. The difference is that, in addition to the visited and
ignored flags we maintain per object, we need dlog(u)e
bits per object, where u is the number of distinct predi-
Figure 2. Traversals for Ra/b ∨ c/d, e(x), in order. cates Rt(x). These bits form a traversalNum counter. The
traversal proceeds by repeating all but the last step of the
object, since it is reachable from object s1 without going assertion traversal for every Rt(x) predicate, with some
through o1 , . . . , oi . Thus, when processing an object, the small differences. First, while marking an object visited af-
traversal can evaluate (for the current object) all relational ter processing, the traversal sets traversalNum to the current
expressions r in the entire formula. traversal number, which is initially 0 and incremented for ev-
For our example in Figure 2, we execute traversal ¶, ery Rt(x) predicate examined. Second, for each previously
starting from a, ignoring b. visited object encountered, the traversal checks whether the
• Proceed in the same manner for other sources and ex- visited object’s traversalNum is equal to the current traver-
cluded objects. That is, mark objects oi+1 , . . . , o j ignored, sal number (indicating a previously visited object during the
perform a standard reachability traversal starting at object current traversal) or lower (indicating a disjointness error
s2 , etc. In total there are n such steps (one for each object and causing an immediate assertion failure, unless the object
s1 , . . . , sn ). is also ignored). The final traversal (from the root set and
In our running example (Figure 2) we now ignore objects from excluded objects) uses a traversal number of 0, thereby
d and e and then perform traversal ·. avoiding disjointedness errors.
• Remove the ignored flag from all objects o1 , . . . , om . The above assertDisjoint traversal maintains the prop-
In our running example we now remove this flag from b, erties required for the proof. First, the truth value of every R
d, and e. predicate is evident at the time of visit of each object, since
• Perform m reachability traversals, each starting at object the disjointness property guarantees that the only predicate
ok for 1 ≤ k ≤ m, followed by a reachability traversal R that is true for the object is the R predicate whose traversal
starting at the root set (i.e., the default live objects of the reaches the object. Second, the visited flag ensures no edge
program’s execution, such as the stack and the global re- is crossed more than once, by the same argument as for the
gion). Every traversal processes all non-visited objects assertion traversal.
and marks objects visited after processing them. During
each of these m + 1 traversals, predicate Rt(x) is false for Although the above proof is not hard, this is largely due
each processed object, by nature of the traversal: the object to the careful design of the DeAL language, which main-
was not reachable by any si without going through the cor- tains the properties required for a single traversal evaluation
responding excluded objects o1 , . . . , om . When processing of assertions. We note that it is quite easy to be misled with
an object, the traversal can evaluate (for the current object) respect to predicates that can be evaluated in a single traver-
all relational expressions r in the entire formula. sal. For instance, a common mistake is to attempt to integrate
In our running example, we first run traversals ¸ through the inverse of predicate Rc(x) in a single traversal language:
º, though ¹ would be a no-op (since we already visited d It is not possible to add to DeAL a predicate Pc(x) with the
in ¶ before ignoring it). Then we conclude with the root meaning “x can reach c” while maintaining the same single
traversal ». traversal property for arbitrary graphs.
Note that no edge in the object graph (that is, a reference 3.2 DeAL Implementation
field of an object) is processed twice: the standard reachabil- The proof of the single traversal theorem essentially lays out
ity traversal follows edges out of an object exactly once, after the principles of the DeAL implementation. Each traversal
marking the object visited, and the only further operation may need to touch the entire heap, so DeAL must interface
on a visited object is to check whether it is visited. We with a whole-heap garbage collector, such as mark-sweep
invert the meaning (i.e., mapping to a 0 or 1 bit) of visited or the major collection of a generational collector (which
after every collection. (This is a standard technique in mark- performs whole-heap traversals).
sweep garbage collection, avoiding a second visit of nodes
3 This requirement can be reduced to just one bit per object, by using the
just to reset them.) After collection no object is ignored
visited bit to also mean ignored, plus a second bit only for each of the
(since no object is ignored initially.) m ignored objects. The reason for the second bit is that, during the final
The above traversal satisfies all requirements of the proof. traversal from the root set, it is necessary to remember whether an ignored
It also requires space of two bits per object (one for the object has been visited or not.DeAL instructs this garbage collector to perform the Node objects directly referenced by array arr are dominated
traversals described in the proof for assertion and by arr”. This is a property that can be evaluated in a single
assertDisjoint. Statement unsafeAssert is handled identi- traversal, yet it is not expressible in DeAL. One can attempt
cally to assertion with the only difference being that all but to write the property straightforwardly as:
the last step of the assertion traversal are repeated for every
Rt(x) predicate in the formula. Therefore, unsafeAssert
visits every object exactly once but without guaranteeing assertDisjoint(∀x ∈ Node : Rarr(x) ⇒ ¬R/arr(x));
correct evaluation of the assertion, since each object is vis-
ited only as part of a traversal corresponding to a single R Yet this is not the desired property: the Node objects
predicate. directly referenced by arr are only a subset of all the objects
DeAL is implemented on top of Jikes RVM [2], modify- reachable by arr. For instance, the Node objects referenced
ing the included MarkSweep garbage collector. The bulk of by arr may themselves refer to other objects that are not
the implementation consists of changes to the runtime sys- dominated by arr, causing a spurious disjointness error.
tem, and especially interfacing with the collector. To pro- Intuitively, the desired property should be expressible in
duce a clean and general interface, we introduced a small DeAL but is not because of a technicality: Although the
“trace-and-test” language inside the VM. A trace-and-test property is local (it requires traversing just one link to tell
program instructs the garbage collector what to mark recur- that an object is directly referenced by arr), it is not local
sively, what to test for, and what individual objects to ex- from the perspective of the referenced object: when visiting
pressly exclude or include in the traversal. The trace-and- a random object it is not possible to know in constant time
test language has primitives such as exclude-node(o) (to whether it is directly referenced by arr.
exclude an object from the traversal), include-node(o) (to These (as well as many other) expressiveness limitations
unmark an object, so that it is processed when next en- of DeAL can be overcome by combining Java computation
countered), mark-traversal-id(x) (to increase the traver- with DeAL assertion checking. The general programming
sal number), set-expressions(k1 , . . . , kn ) (to set up the sets pattern consists of having Java code compute a property
of relational expressions that need to be evaluated for ev- of objects, summarize the results as a local property (for
ery node, together with information, per set, on whether the instance, by marking a field for all objects satisfying the
expressions are in an existential or universal context), and desired property) and invoke a DeAL assertion that checks
trace(o) (to start a reachability traversal at object o). whether the marked objects have the expected structure.
The DeAL compiler translates an assertion into trace- In our above example, if we add an otherwise unused
and-test language instructions, following closely the traver- field mark in class Node, we can easily express the desired
sal structure described in the proof of Section 3.1. The rela- assertion by combining Java and DeAL evaluation:
tional expressions that need to be evaluated per-object are
translated into plain Java methods that the runtime com-
Object dummy = new Object();
piles with the maximum optimization level allowed by the
Jikes just-in-time compiler. We altered the garbage collec- for (Node n : arr) n.mark = dummy;
tion logic in the virtual machine to execute the trace-and-set assertion(∀x ∈ Node : (x.mark = dummy) ⇒ ¬R/arr(x));
instructions. For marking nodes, we employ 4 bits per object
that are currently unused by the runtime, resulting in a max- This programming pattern is quite powerful. It allows
imum of 8 R predicates per assertDisjoint formula (since nearly arbitrary extension of the properties that can be
one bit is occupied by the ignored flag); in addition we ob- checked, at the expense of performing some of the work
tain the visited flag from a region in the object header re- outside the DeAL traversal. Given the generality of the pat-
served for the existing MarkSweep garbage collector. In the tern, we are considering adding language support for it in fu-
future, the implementation could allocate extra space per ob- ture versions of DeAL. Namely, we can add a labelObject
ject on a side structure, allowing unlimited R predicates. primitive to DeAL and allow the Java program to mark ob-
jects with a finite set of labels, at any point in the execu-
tion. Then, when a DeAL assertion is checked, it can refer to
4. Discussion: Expressiveness Limitations,
whether an object has a certain label or not (labels are unary
Programming Patterns predicates). The markings will be cleared after evaluation of
DeAL is an expressive language but has by design some ex- the assertion. This simple addition to the language is just a
pressiveness limitations to support its single traversal guar- convenience feature that obviates the need for extra fields,
antee. Clearly, DeAL cannot express computations that are such as mark in our earlier example. The markings can be
inherently non-linear, or non-single-traversal. Furthermore, kept on a side structure that the DeAL traversal consults. We
DeAL occasionally cannot support computations that hap- have not added such convenience features in our first version
pen to be computable in a single traversal, if their structure of DeAL, preferring to concentrate on the core language in-
obscures this fact. For instance, consider the property “all stead of on elements that can be easily emulated.5. Experiments 5.2 Controlled Microbenchmarks
We evaluate the implementation of DeAL with several ex- The overhead of DeAL evaluation depends strongly on the
periments: complexity of assertions. For instance, if the user asserts
a property that is a function of the values of k neighbors
• We evaluate the assertions of Section 2.2 in a full heap of every live object, then the runtime cost will necessarily
of objects with no other computation performed in the be Ω(k). Therefore, the interesting question concerns the
application. This is a controlled microbenchmark for the overhead for typical assertions. We used as our evaluation
given assertions, intended to check how much the overhead set the example assertions of Section 2.2 (which mostly
varies with the complexity of typical assertions and with emulate assertions from the recent research literature [6–
the percentage of relevant objects. 8, 11, 16, 17, 19]). The assertions are applied to specially
designed skeletal programs, with appropriate data structures
• We run a wide spectrum of actual applications (includ-
for each assertion.
ing the DaCapo benchmarks [4], SPEC JVM98 [20],
The intention of these microbenchmarks is to show what
and pseudojbb—a fixed-workload version of SPEC
can be expected as a worst-case execution time for repre-
JBB2000 [21]). We measure the overhead of simple
sentative assertions. These programs do nothing aside from
(generic) DeAL assertion checking, when assertions are
building the data structure and running an assertion on it.
evaluated with the same frequency as that of normal
We used three different input sizes, determining the num-
garbage collection in the course of the application. This
ber of objects on the heap (counting only objects with types
is an end-to-end experiment establishing cost under condi-
relevant to the assertion): small (10,000 objects), medium
tions where the user intends to achieve near-zero-overhead
(100,000 objects), and large (1 million objects).4 The com-
execution.
parison is to an unmodified Jikes RVM that runs a regular
• We analyze in more depth the overhead of a custom as- System.gc() every time DeAL would check an assertion.
sertion that reveals a (known) bug in pseudojbb. This is a Figure 3 shows the GC time overhead results for all three
close approximation of a “real use” scenario. input sizes.
The large input size expectedly incurs the most over-
All experiments confirm that the overhead of DeAL is head. The geometric mean of the GC time overhead is 1.25.
small and becomes negligible for assertion evaluation at That is, assertion checking slows down garbage collection by
roughly normal GC frequencies. just 25%. The absolute worst-case GC time slowdowns rise
to 60-70%. Generally, assertions that access fields perform
worse than those that just check disjointedness or reacha-
5.1 Methodology and Setup bility properties. Note that the object type provided in the
assertion does not serve as an efficient filter, since the pro-
Our system is implemented on top of Jikes RVM 3.1.0, so we
gram is specially designed to create only objects matching
compare our results to those of the unmodified Jikes RVM
the type examined in the assertion. Thus, the fields of vir-
3.1.0, both using the full-heap MarkSweep collector.
tually every program-generated object need to be examined.
We use the DaCapo benchmarks versions 2006-10-MR2
(This is unlikely to be the case in a realistic setting: object
and 9.12-bach. For SPEC JVM98, we use the large input size
types will be used to disqualify most objects without needing
(-s100); for DaCapo 2006, DaCapo 9.12, and pseudojbb,
to examine their fields.) The three worst performers are as-
we use the default input size. Many of the DaCapo 9.12
sertions “cyclic”, “doubly linked list”, and “sorted”, repro-
benchmarks do not run on the unmodified Jikes RVM 3.1.0,
duced here:
so we provide results from only those that do. We perform
our experiments on a Core 2 Duo E6750 machine with 2 GB
assertion(∀x ∈ Node : Rc(x) ⇒ x.next , null);
of RAM, running Ubuntu Linux 9.04.
We use the adaptive configuration of Jikes RVM, which
assertion(∀x ∈ Node : Rd(x) ⇒ (x.next.prev = x));
dynamically identifies frequently executed methods and re-
compiles them at higher optimization levels. We iterate each assertion(∀x ∈ Node : Rc(x) ⇒ x.data ≤ x.next.data);
benchmark k times and record the results from the last it-
eration. For microbenchmarks (Section 5.2) k = 10, and As can be seen from Figure 3, even the worst case slow-
for larger programs k = 4. We repeat this twenty times for down is quite low, considering that GC time is a small part
each benchmark. All numbers reported are means of the 20 of total program execution time. Indeed, even though these
runs, and our graphs also include 90% confidence intervals.
4 The choice of sizes follows the sizes of the DaCapo benchmarks. Across
We execute each benchmark with a heap size fixed at two
all DaCapo benchmarks, the geometric mean of the maximum number
times the minimum possible for that benchmark using the of live objects is roughly 100,000. The minimum is about 3,000 and the
MarkSweep collector. We vary the UNIX environment size maximum is 3.2 million [4]. Therefore, our three setups have roughly the
to avoid measurement bias [14]. same spread of sizes as the DaCapo benchmarks.Figure 3. GC time overhead for representative assertions and skeletal programs, for small (top), medium (middle), and large
n n n
ea ea ea
Assert
Assert
Assert
om om o m
Base
Base
Base
ge ge ge
d d d
rte rte rte
so so so
le le le
ab ab ab
liz liz liz
ria ria ria
se se se
pi pi pi
es es es
av t av t av t
le lis le lis le lis
d_ d_ d_
ke ke ke
in in lin
_l _l y_
ly ly bl
ub e ub e u e
do pl do pl do pl
eo eo eo
_p _p _p
ce ce ce
an an an
in in in
m m m
do do do
ce ce ce
an an an
in in in
m ed m ed m ed
do at do at do at
in in in
m m om
do do _d
t_ t_ nt
in in oi
sjo sjo sj
di le
r di le
r di le
r
pi pi pi
m m m
co co _c
o
t_ t_ nt
in in oi
sjo sjo sj
di di di
t t nt
in in oi
sjo sjo sj
di di di
ic ic ic
cl cl cl
cy cy cy
(bottom) input sizes.
200
150
100
50
0
200
150
100
50
0
200
150
100
50
0
Normalized GC time Normalized GC time Normalized GC timeskeletal programs do very little other than assertion check- becomes just ∀x ∈ Order : x.field , specialValue. Since
ing, the total execution time overhead (not shown in Fig- the ∀ iteration is over live objects and we expect our removed
ure 3) has a geometric mean of 1.07 for the small config- objects to be unreachable, no live object should have the spe-
uration, 1.08 for the medium, and 1.13 for the large—i.e., cial value.
assertion checking causes slowdowns of just 7, 8, and 13%, The bug is hard to detect because Order objects are
respectively. removed from the orderTable only at rare program in-
stances. Specifically, the orderTable is occasionally com-
5.3 End-to-end Simple Assertion Cost
pletely emptied and then destroyed. Since the programmer
We ran a large suite of end-to-end benchmarks to quantify might not know how often this happens, we measured the
the overall overhead of our system. Our suite of applica- cost of employing DeAL at high frequency: the assertion is
tions includes the DaCapo benchmarks, SPEC JVM98, and a evaluated once every 5 times it is encountered. The high fre-
fixed-workload version of SPEC JBB2000 called pseudojbb. quency of assertion checking causes much more frequent
Clearly, the challenge of evaluating overheads over a wide GCs than normal program execution without DeAL: the
range of applications is that real assertions are application- mean of the number of GCs (over 20 executions) rises from
specific. To overcome this difficulty, we added a “generic” 2.85 to 12.7, and the total GC cost is 3.37 times higher for the
assertion in our runtime that runs at the same frequency as version employing DeAL. However, even this high GC over-
normally-triggered GC. This assertion checks that there are head is hardly noticeable in terms of total program execution
no objects on the heap that are an instance of a certain un- time. The slowdown is merely 6.1%: the assertion-checking
used class. Thus, the assertion should always return true, but program runs in 3.87s vs. 3.65s for the original program run-
it will cause an instanceof predicate to be tested on every ning on an unmodified VM.
live object. In addition, we allow this dummy assertion to be
combined with a traversal disjointness test, which incurs the
cost of also writing and checking a traversal ID in the header
of every live object. 6. Related work
Figure 4 shows the overhead of DeAL on GC time for Our work is related to a wide range of techniques, both static
the benchmark applications. The figure shows three configu- and dynamic, for checking properties of data structures.
rations: Base (unmodified Jikes RVM), Assert (instanceof Research in static analysis has yielded a significant body
assertion), and AssertDisjoint (instanceof assertion + dis- of sophisticated techniques for modeling the heap and an-
jointness check). alyzing data structures at compile-time. The strength of
Overall, the geometric mean of the slowdown in GC time static analysis is that it explores all possible paths through
for Assert is 9.10% and for AssertDisjoint is 11.96%. These the program: a sound analysis algorithm can prove the ab-
overheads are quite low and become vanishingly small when sence of errors, or even verify the full correctness of a data
computed as a percentage of total execution time, instead structure implementation. The weakness of static analysis is
of just GC time: the geometric mean of the total execution that many properties, particularly those involving pointers
slowdown for both kinds of assertions is below 1.9%. (The and heap-based structures, are undecidable in principle [15],
mean number of garbage collections for these benchmarks is and extremely difficult to approximate precisely in prac-
20.3, with a standard deviation of 25.9.) tice [5, 12, 18, 24]. These problems are particular severe for
languages such as Java, which include many hard-to-analyze
5.4 Representative Example features such as reflection, bytecode rewriting, and dynamic
To evaluate the overhead of a real use of DeAL, we sim- class loading.
ulated, with pessimistic assumptions, the process by which Our work is most closely related to several recent tech-
DeAL would be used to detect a bug in practice. We picked niques for dynamically analyzing heap data structures, in-
the most complex of the bugs previously identified in pseu- cluding two that check properties at garbage collection time.
dojbb using heap assertions [1]. The bug consists of remov- GC Assertions provides a set of heap assertions that, like
ing Order objects from the orderTable data structure. These DeAL, are checked by the garbage collector at runtime [1].
orders should then become unreachable and be garbage This work differs from DeAL in two critical ways. First,
collected. However, the objects are also erroneously refer- it does not provide a full language for heap properties, but
enced from Customer objects, causing a memory leak. (The rather a fixed set of assertions that cannot be composed. Sec-
bug fix consists of unlinking Order objects from Customers ond, it does not check assertions at the time they are encoun-
when they are removed from the orderTable.) Our assertion tered, but instead defers checking until the next regularly
checks whether the objects removed from orderTable can scheduled garbage collection. While this choice improves
be reached. This can be effected with a simple variant of the performance, it limits the kinds of heap properties to those
programming pattern of Section 4. Namely, once an object that can be checked at any time, since they might, for exam-
is removed from the orderTable we mark it by assigning a ple, be deferred to a point at which they are knowingly but
special value to one of its fields. The DeAL assertion then temporarily invalidated.200
Base
Assert
Normalized GC time
150
AssertDisjoint
100
50
0
da
da
da
da
da
da
da
da
da
da
da
da
da 200 lan
da 200 nde
pj
jv
jv
jv _20 mp
jv _20 ss
jv _20 ytra
jv _21
jv _22 vac
jv _22 peg
ge
bb 09
m
m
m
m
m
m
m
m
ca
ca
ca
ca
ca
ca
ca
ca
ca
ca
ca
ca
ca
ca
om
98 jbb
98
98
98
98 9_d ce
98 3_ja
98 2_m
98 7_m aud
20 .xa h
po
po
po
po
po
po
po
po
po
po
po
po
po
po
ea
._
.
.
.
.
.
.
._
0
20
20
20
20
20
20
20
20
20
20
20
2
n
0.
20 000
22 trt
0
06
06
06
06
06
06
06
06
06
06
06
p
1
2
5
8_
9.
9.
_c
_j ress
_r
.an
.b
.ch
.ec
.fo
.h
.jy
.lu
.lu
.p
.x
lu
l
ja
us
lo
sq
m
a
e
a
o
2
b
th
in
se
i
p
lip
tlr
ar
l
ck
e
an
at
d
ld
on
de
ar
ar
t
se
b
ch
c
x
x
io
Figure 4. GC time overhead of our system for end-to-end benchmarks.
The QVM system provides a richer set of heap checks, pair broken data structures [6]. While the checking overhead
called heap probes, which QVM performs using garbage is difficult to evaluate (since it is presented in milliseconds),
collector machinery at the point they are requested [3]. The the paper describes it as “a rather large overhead.”
PHALANX system extends on QVM to add checks for
reachability and dominance [22] while parallelizing heap 7. Conclusions and Future Work
probe execution. Both systems allow the checking of prop- We have presented the DeAL language for assertions that are
erties that can require multiple heap traversals to check, evaluated in a single traversal of live objects, implemented
significantly impacting performance. The authors explore as a variation of a garbage collection traversal. DeAL is a
both sampling (at a user chosen rate) and parallelization to rich language that can express many useful properties as
control the cost. We believe that DeAL represents a sweet logical combinations of reachability conditions and local
spot between GC Assertions and QVM: unlike QVM, DeAL comparisons. At the same time, the overhead of DeAL on a
controls overhead by guaranteeing that all properties can program’s total execution time remains very low—typically
be checked in a single heap traversal, and by only check- well below the 5% level. Our implementation of DeAL is
ing these properties at or near a regularly scheduled collec- available for public download 5 .
tion; but unlike GC Assertions, DeAL offers a full asser- Although the current language forms a nicely closed set
tion language and a clearer semantics regarding the timing of features, there are clear directions of interesting future ex-
of checks. tensions. One possibility is to extend DeAL from an asser-
The Ditto system reduces the cost of data structure check- tion language to a general query language that can produce
ing by incrementalizing the checking code itself [19]. This new data structures. The typical way to define a declarative
system provides a significantly different tradeoff than DeAL: query language is by allowing unbound logical variables. For
it supports arbitrary checks written procedurally in Java it- instance, one can imagine a query primitive, used with first-
self, but even with incrementalization incurs a significant order logic formulas. The matching values for unbound vari-
overhead (in the range of 3X to 10X slowdown). ables become the tuples returned as the result of the query.
A number of systems developed for data structure re- For example, one might produce a new Java set containing
pair include a method for detecting data structure errors. Al- all positive elements of a list by writing:
though these checking subsystems are not the focus of the
work, they highlight the difficulty of detecting such errors.
Set posval =
The STARC system uses programmer-written repOk meth-
ods to detect errors, which incur a very substantial overhead query( x.val, (x ∈ Node ∧ Rd(x)) ⇒ (x.val > 0));
(30X) when run frequently, and even a 20% overhead when The first part, x.val, of the above query determines the
run only at the point of failure [8]. Demsky et al. present a form of the result as a projection of the matched values,
technique based on Daikon to detect heap invariants and re-
5 http://www.cs.umass.edu/
˜creichen/gcassert/index.htmlYou can also read