A Comparison of Bug Finding Tools for Java∗

                     Nick Rutar                       Christian B. Almazan                    Jeffrey S. Foster
                                             University of Maryland, College Park
                                             {rutar, almazan, jfoster}@cs.umd.edu

∗ This research was supported in part by NSF CCF-0346982.

                               Abstract

Bugs in software are costly and difficult to find and fix. In recent years, many tools and techniques have been developed for automatically finding bugs by analyzing source code or intermediate code statically (at compile time). Different tools and techniques have different tradeoffs, but the practical impact of these tradeoffs is not well understood. In this paper, we apply five bug finding tools, specifically Bandera, ESC/Java 2, FindBugs, JLint, and PMD, to a variety of Java programs. By using a variety of tools, we are able to cross-check their bug reports and warnings. Our experimental results show that none of the tools strictly subsumes another, and indeed the tools often find non-overlapping bugs. We discuss the techniques each of the tools is based on, and we suggest how particular techniques affect the output of the tools. Finally, we propose a meta-tool that combines the output of the tools together, looking for particular lines of code, methods, and classes that many tools warn about.

1     Introduction

In recent years, many tools have been developed for automatically finding bugs in program source code, using techniques such as syntactic pattern matching, data flow analysis, type systems, model checking, and theorem proving. Many of these tools check for the same kinds of programming mistakes, yet to date there has been little direct comparison between them. In this paper, we perform one of the first broad comparisons of several Java bug-finding tools over a wide variety of tasks.

In the course of our experiments, we discovered, somewhat surprisingly, that there is clearly no single "best" bug-finding tool. Indeed, we found a wide range in the kinds of bugs found by different tools (Section 2). Even in the cases when different tools purport to find the same kind of bug, we found that in fact they often report different instances of the bug in different places (Section 4.1). We also found that many tools produce a large volume of warnings, which makes it hard to know which to look at first.

Even though the tools do not show much overlap in particular warnings, we initially thought that they might be correlated overall. For example, if one tool issues many warnings for a class, then it might be likely that another tool does as well. However, our results show that this is not true in general. There is no correlation of warning counts between pairs of tools. Additionally, and perhaps surprisingly, warning counts are not strongly correlated with lines of code.

Given these results, we believe there will always be a need for many different bug finding tools, and we propose creating a bug finding meta-tool for automatically combining and correlating their output (Section 3). Using this tool, developers can look for code that yields an unusual number of warnings from many different tools. We explored two different metrics for using warning counts to rank code as suspicious, and we discovered that both are correlated for the highest-ranked code (Section 4.2).

For our study, we selected five well-known, publicly available bug-finding tools (Section 2.2). Our study focuses on PMD [18], FindBugs [13], and JLint [16], which use syntactic bug pattern detection. JLint and FindBugs also include a dataflow component. Our study also includes ESC/Java [10], which uses theorem proving, and Bandera [6], which uses model checking.

We ran the tools on a small suite of variously sized Java programs from various domains. It is a basic undecidability result that no bug finding tool can always report correct results. Thus all of the tools must balance finding true bugs with generating false positives (warnings about correct code) and false negatives (failing to warn about incorrect code). All of the tools make different tradeoffs, and these choices are what cause the tools to produce the wide range of results we observed for our benchmark suite.

The main contributions of this paper are as follows:

  • We present what we believe is the first detailed comparison of several different bug finding tools for Java over a variety of checking tasks.
  • We show that, even for the same checking task, there is little overlap in the warnings generated by the tools. We believe this occurs because all of the tools choose different tradeoffs between generating false positives and false negatives.

  • We also show that the warning counts from different tools are not generally correlated. Given this result, we believe that there will always be a need for multiple separate tools, and we propose a bug finding meta-tool for combining the results of different tools together and cross-referencing their output to prioritize warnings. We show that two different metrics tend to rank code similarly.

1.1   Threats to Validity

There are a number of potential threats to the validity of this study. Foremost is simply the limited scope of the study, both in terms of the test suite size and in terms of the selection of tools. We believe, however, that we have chosen a representative set of Java benchmarks and Java bug finding tools. Additionally, there may be other considerations for tools for languages such as C and C++, which we have not studied. However, since many tools for those languages use the same basic techniques as the tools we studied, we think that the lessons we learned will be applicable to tools for those languages as well.

Another potential threat to validity is that we did not exactly categorize every false positive and false negative from the tools. Doing so would be extremely difficult, given the large number of warnings from the tools and the fact that we ourselves did not write the benchmark programs in our study. Instead, in Section 4.1, we cross-check the results of the tools with each other in order to get a general sense of how accurate the warnings are, and in order to understand how the implementation techniques affect the generated warnings. We leave it as interesting future work to check for false negatives elsewhere, e.g., in CVS revision histories or change logs.

A final threat to validity is that we make no distinction between the severity of one bug versus another. Quantifying the severity of bugs is a difficult problem, and it is not the focus of this paper. For example, consider the following piece of code:

int x = 2, y = 3;
if (x == y)
   if (y == 3)
       x = 3;
else
   x = 4;

In this example, indentation would suggest that the else corresponds to the first if, but the language grammar says otherwise. The result is most likely a logical error, since a programmer might believe this code will result in x=4 when it really results in x=2. Depending on later uses of x, this could be a major error. Used with the right rulesets for ensuring that all if statements use braces around the body, PMD will flag this program as suspicious.

The following more blatant error is detected by JLint, FindBugs, and ESC/Java:

String s = new String("I'm not null...yet");
s = null;
System.out.println(s.length());

This segment of code will obviously cause an exception at runtime, which is not desirable, but it will have the effect of halting the program as soon as the error occurs (assuming the exception is not caught). Moreover, if it is on a common program path, this error will most likely be discovered when the program is run, and the exception will pinpoint the exact location of the error.

When asked which is the more severe bug, many programmers might say that a null dereference is worse than not using braces in an if statement (which is often not an error at all). And yet the logical error caused by the lack of braces might be much more severe, and harder to track down, than the null dereference.

These small examples illustrate that for any particular program bug, the severity of the error cannot be separated from the context in which the program is used. With this in mind, in Section 6 we mention a few ways in which user-specified information about severity might be taken into account.

2     Background

2.1   A Small Example

The code sample in Figure 1 illustrates the variety and typical overlap of bugs found by the tools. It also illustrates the problems associated with false positives and false negatives. The code in Figure 1 compiles with no errors and no warnings, and though it won't win any awards for functionality, it could easily be passed off as fine. However, four of the five tools were each able to find at least one bug in this program. (Bandera wasn't tested against the code for reasons explained later.)

PMD discovers that the variable y on line 8 is never used and generates an "Avoid unused local variables" warning. FindBugs displays a "Method ignores results of InputStream.read()" warning for line 12; this is an error because the result of InputStream.read() is the number of bytes read, and this may be fewer bytes than the programmer is expecting. FindBugs also displays a "Method
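To illustrate why the ignored read() result matters, the following is a minimal sketch of a loop that accounts for the count returned by read(). It is illustrative only, not code from the benchmarks, and it reuses the x, b, and length variables from Figure 1:

// Sketch only: read() may return fewer bytes than requested, so a robust
// caller loops until the buffer is full or the end of the stream is reached.
int off = 0;
while (off < length) {
    int n = x.read(b, off, length - off);
    if (n == -1) break;   // end of stream before the buffer was filled
    off += n;
}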
1   import java.io.*;
2   public class Foo{
3     private byte[] b;
4     private int length;
5     Foo(){ length = 40;
6       b = new byte[length]; }
7     public void bar(){
8       int y;
9       try {
10        FileInputStream x =
11            new FileInputStream("z");
12        x.read(b,0,length);
13        x.close();}
14      catch(Exception e){
15        System.out.println("Oopsie");}
16      for(int i = 1; i

Name       Version         Input      Interfaces          Technology
Bandera    0.3b2 (2003)    Source     CL, GUI             Model checking
ESC/Java   2.0a7 (2004)    Source1    CL, GUI             Theorem proving
FindBugs   0.8.2 (2004)    Bytecode   CL, GUI, IDE, Ant   Syntax, dataflow
JLint      3.0 (2004)      Bytecode   CL                  Syntax, dataflow
PMD        1.9 (2004)      Source     CL, GUI, Ant, IDE   Syntax

CL - Command Line
1 ESC/Java works primarily with source but may require bytecode or specification files for supporting types.
on programming style, PMD includes support for selecting which detectors or groups of detectors should be run. In our experiments, we run PMD with the rulesets recommended by the documentation: unusedcode.xml, basic.xml, import.xml, and favorites.xml. The number of warnings can increase or decrease depending on which rulesets are used. PMD is easily extensible by programmers, who can write new bug pattern detectors using either Java or XPath.

Bandera [6] is a verification tool based on model checking and abstraction. To use Bandera, the programmer annotates their source code with specifications describing what should be checked, or no specifications if the programmer only wants to verify some standard synchronization properties. In particular, with no annotations Bandera verifies the absence of deadlocks. Bandera includes optional slicing and abstraction phases, followed by model checking. Bandera can use a variety of model checkers, including SPIN [12] and the Java PathFinder [11].

We included Bandera in our study because it uses a completely different technique than the other tools we looked at. Unfortunately, Bandera version 0.3b2 does not run on any realistic Java programs, including our benchmark suite. The developers of Bandera acknowledge on their web page that it cannot analyze Java (standard) library calls, and unfortunately the Java library is used extensively by all of our benchmarks. This greatly limits the usability and applicability of Bandera (future successors will address this problem). We were able to successfully run Bandera and the other tools on the small example programs supplied with Bandera. Section 5 discusses the results.

ESC/Java [10], the Extended Static Checking system for Java, based on theorem proving, performs formal verification of properties of Java source code. To use ESC/Java, the programmer adds preconditions, postconditions, and loop invariants to source code in the form of special comments. ESC/Java uses a theorem prover to verify that the program matches the specifications.

ESC/Java is designed so that it can produce some useful output even without any specifications, and this is the way we used it in our study. In this case, ESC/Java looks for errors such as null pointer dereferences, array out-of-bounds errors, and so on; annotations can be used to remove false positives or to add additional specifications to be checked.

For our study, we used ESC/Java 2 [5], a successor to the original ESC/Java project. ESC/Java 2 includes support for Java 1.4, which is critical to analyzing current applications. ESC/Java 2 is being actively developed, and all references to ESC/Java in this paper refer to ESC/Java 2 rather than the original ESC/Java.

We included ESC/Java in our set of tools because its approach to finding bugs is notably different from the other tools. However, as we will discuss in Section 3, without annotations ESC/Java produces a multitude of warnings. Houdini [9] can automatically add ESC/Java annotations to programs, but it does not work with ESC/Java 2 [4]. Daikon [8] can also be used as an annotation assistant to ESC/Java, but doing so would require selecting representative dynamic program executions that sufficiently cover the program paths, which we did not attempt. Since ESC/Java really works best with annotations, in this paper we mostly use it as a point of comparison and do not include it in the meta-tool metrics in Section 4.2.

2.3   Taxonomy of Bugs

We classified all of the bugs the tools find into the groups listed in Figure 3. The first column lists a general class of bugs, and the second column gives one common example from that class. The last columns indicate whether each tool finds bugs in that category, and whether the tools find the specific example we list. We did not put Bandera in this table, since without annotations its checks are limited to synchronization properties.

These classifications are our own, not the ones used in the literature for any of these tools. With this in mind, notice that the largest overlap is between FindBugs and PMD, which share 6 categories in common. The "General" category is a catch-all for checks that do not fit in the other categories, so all tools find something in that category. All of the tools also look for concurrency errors. Overall, there are many common categories among the tools and many categories on which the tools differ.

Other fault classifications that have been developed are not appropriate for our discussion. Two such classifications, the Orthogonal Defect Classification [3] and the IEEE Standard Classification for Software Anomalies [14], focus on the overall software life cycle phases. Both treat faults at a much higher level than we do in this paper. For example, they have a facility for specifying that a fault is a logic problem, but do not provide specifications for what the logic problem leads to or was caused by, such as incorrect synchronization.

Bug Category                    Example                                    ESC/Java   FindBugs   JLint   PMD
General                         Null dereference                          √*         √*         √*      √
Concurrency                     Possible deadlock                         √*         √          √*      √
Exceptions                      Possible unexpected exception             √*
Array                           Length may be less than zero              √                     √*
Mathematics                     Division by zero                          √*                    √
Conditional, loop               Unreachable code due to constant guard               √                  √*
String                          Checking equality using == or !=                     √          √*      √
Object overriding               Equal objects must have equal hashcodes              √*         √*      √*
I/O stream                      Stream not closed on all paths                       √*
Unused or duplicate statement   Unused local variable                                √                  √*
Design                          Should be a static inner class                       √*
Unnecessary statement           Unnecessary return statement                                            √*

√ - tool checks for bugs in this category    * - tool checks for this specific example

Figure 3. The Types of Bugs Each Tool Finds

3     Experiments

To generate the results in this paper, we wrote a series of scripts that combine and coordinate the output from the various tools. Together, these scripts form a preliminary version of the bug finding meta-tool that we mentioned in the introduction. This meta-tool allows a developer to examine the output from all the tools in a common format and find what classes, methods, and lines generate warnings.

As discussed in the introduction, we believe that such a meta-tool can provide much better bug finding ability than
the tools in isolation. As Figure 3 shows, there is a lot of variation even in the kinds of bugs found by the tools. Moreover, as we will discuss in Section 4.1, there are not many cases where multiple tools warn about the same potential problem. Having a meta-tool means that a developer need not rely on the output of a single tool. In particular, the meta-tool can rank classes, methods, and lines by the number of warnings generated by the various tools. In Section 4.2, we will discuss simple metrics for doing so and examine the results.

Of course, rather than having a meta-tool, perhaps the ideal situation would be a single tool with many different analyses built-in, and the different analyses could be combined and correlated in the appropriate fashion. However, as a practical matter, the tools tend to be written by a wide variety of developers, and so at least for now having a separate tool to combine their results seems necessary.

The preliminary meta-tool we built for this paper is fairly simple. Its main tasks are to parse the different textual output of the various tools (ranging from delimited text to XML) and map the warnings, which are typically given by file and line, back to classes and methods. We computed the rankings in a separate pass. Section 6 discusses some possible enhancements to our tool.
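The scripts themselves are not shown in this paper. As a rough illustration of the normalization step they perform, the following Java sketch assumes a hypothetical, already-preprocessed tool:file:line:message format (the real tools' output formats differ) and tallies warnings per class:

// Illustrative sketch only; the actual meta-tool scripts and the tools'
// exact output formats are not reproduced here.
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class Warning {
  String tool, file, message;
  int line;
  Warning(String tool, String file, int line, String message) {
    this.tool = tool; this.file = file; this.line = line; this.message = message;
  }
}

class MetaToolParser {
  // Assumes a hypothetical common "tool:file:line:message" line produced by
  // per-tool preprocessing; each tool's raw output would need its own parser.
  static Warning parseLine(String s) {
    String[] parts = s.split(":", 4);
    return new Warning(parts[0], parts[1], Integer.parseInt(parts[2]), parts[3]);
  }

  // Map file/line warnings back to classes, then tally per-class counts.
  static Map countPerClass(List warnings, Map fileLineToClass) {
    Map counts = new HashMap();
    for (Iterator it = warnings.iterator(); it.hasNext();) {
      Warning w = (Warning) it.next();
      String cls = (String) fileLineToClass.get(w.file + ":" + w.line);
      Integer old = (Integer) counts.get(cls);
      counts.put(cls, new Integer(old == null ? 1 : old.intValue() + 1));
    }
    return counts;
  }
}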
We selected as a testbed five mid-sized programs compiled with Java 1.4. The programs represent a range of applications, with varying functionality, program size, and program maturity. The five programs are:

Apache Tomcat 5.019   Java Servlet and JavaServer Pages implementation, specifically catalina.jar (1)
JBoss 3.2.3   J2EE application server (2)
Art of Illusion 1.7   3D modeling and rendering studio (3)
Azureus 2.0.7   Java BitTorrent client (4)
Megamek 0.29   Online version of the BattleTech game (5)

(1) http://jakarta.apache.org/tomcat
(2) http://www.jboss.org
(3) http://www.artofillusion.org
(4) http://azureus.sourceforge.net
(5) http://megamek.sourceforge.net

Figure 4 lists the size of each benchmark in terms of both Non Commented Source Statements (NCSS), roughly the number of ';' and '{' characters in the program, and the number of class files. The remaining columns of Figure 4 list the running times and total number of warnings generated by each tool. Section 4 discusses the results in depth; here we give some high-level comments. Bandera is not included in this table, since it does not run on any of these examples; see Section 5.

                      NCSS      Class   Time (min:sec.csec)                            Warning Count
Name                  (Lines)   Files   ESC/Java    FindBugs   JLint      PMD          ESC/Java   FindBugs   JLint   PMD
Azureus 2.0.7         35,549    1053    211:09.00   01:26.14   00:06.87   19:39.00     5474       360        1584    1371
Art of Illusion 1.7   55,249    676     361:56.00   02:55.02   00:06.32   20:03.00     12813      481        1637    1992
Tomcat 5.019          34,425    290     90:25.00    01:03.62   00:08.71   14:28.00     1241       245        3247    1236
JBoss 3.2.3           8,354     274     84:01.00    00:17.56   00:03.12   09:11.00     1539       79         317     153
Megamek 0.29          37,255    270     23:39.00    02:27.21   00:06.25   11:12.00     6402       223        4353    536

Figure 4. Running Time and Warnings Generated by Each Tool

To compute the running times, we ran all of the programs from the command line, as the optional GUIs can potentially reduce performance. Execution times were computed with one run, as performance is not the emphasis of this study. The tests were performed on a Mac OS X v10.3.3 system with a 1.25 GHz PowerPC G4 processor and 512 MB RAM. Because PMD accepts only one source file at a time, we used a script to invoke it on every file in each benchmark. Unfortunately, since PMD is written in Java, each invocation launches the Java virtual machine separately, which significantly reduces PMD's performance. We expect that without this overhead, PMD would be approximately 20% faster. Recall that we used ESC/Java without annotations; we do not know if adding annotations would affect ESC/Java's running time, but we suspect it would still run significantly slower than the other tools. Speaking in general terms, ESC/Java takes a few hours to run, FindBugs and PMD take a few minutes, and JLint takes a few seconds.

For each tool, we report the absolute number of warnings generated, with no normalization or attempt to discount repeated warnings about the same error. Thus we are measuring the total volume of information presented to a developer from each tool.

For ESC/Java, the number of generated warnings is sometimes extremely high. Among the other tools, JLint tends to report the largest number of warnings, followed by PMD (though for Art of Illusion, PMD reported more warnings than JLint). FindBugs generally reports fewer warnings than the other tools. In general, we found this makes FindBugs easier to use, because there are fewer results to examine.

Figure 5 shows a histogram of the warning counts per class. (We do not include classes with no warnings.) Clearly, in most cases, when the tools find potential bugs, they only find a few, and the number of classes with multiple warnings drops off rapidly. For PMD and JLint, there are quite a few classes that have 19 or more warnings, while these are rare for FindBugs. For ESC/Java, many classes have 19 or more warnings.

Figure 5. Histogram for number of warnings found per class

4     Analysis

4.1   Overlapping Bug Categories

Clearly the tools generate far too many warnings to review all of them manually. In this section, we examine the effectiveness of the tools on three checking tasks that several of the tools share in common: concurrency, null dereference, and array bounds errors. Even for the same task we found a wide variation in the warnings reported by different tools. Figure 6 contains a breakdown of the warning counts. Even after restricting ourselves to these three categories, there is still a large number of warnings, and so our manual examination is limited to several dozen warnings.

                           ESC/Java   FindBugs   JLint   PMD
Concurrency Warnings       126        122        8883    0
Null Dereferencing         9120       18         449     0
Null Assignment            0          0          0       594
Index out of Bounds        1810       0          264     0
Prefer Zero Length Array   0          36         0       0

Figure 6. Warning Counts for the Categories Discussed in Section 4.1

Concurrency Errors   All of the tools check for at least one kind of concurrency error. ESC/Java includes support for automatically checking for race conditions and potential deadlocks. ESC/Java finds no race conditions, but it issues 126 deadlock warnings for our benchmark suite. After investigating a handful of these warnings, we found that some of them appear to be false positives. Further investigation is difficult, because ESC/Java reports synchronized blocks that are involved in potential deadlocks but not the sets of locks in each particular deadlock.

PMD includes checks for some common bug patterns, such as the well-known double-checked locking bug in Java [2]. However, PMD does not issue any such warnings for our benchmarks. In contrast, both FindBugs and JLint do report warnings. Like PMD, FindBugs also checks for uses of double-checked locking. Interestingly, despite PMD reporting no such cases, FindBugs finds a total of three uses of double-checked locking in the benchmark programs. Manual examination of the code shows that, indeed, those three uses are erroneous. PMD does not report this error because its checker is fooled by some other code mixed in with the bug pattern (such as try/catch blocks).

FindBugs also warns about the presence of other concurrency bug patterns, such as not putting a monitor wait() call in a while loop. Examining the results in detail, we discovered that the warnings FindBugs reports usually correctly indicate the presence of the bug pattern in the code.
What is less clear is how many of the patterns detected correspond to actual errors. For example, since FindBugs does not perform interprocedural analysis (it analyzes a single method at a time), if a method with a wait() is itself called in a loop, FindBugs will still report a warning (though this did not happen in our benchmarks). And, of course, not all uses of wait() outside of a loop are incorrect.

On our test suite, JLint generates many warnings about potential deadlocks. In some cases, JLint produces many warnings for the same underlying bug. For instance, JLint checks for deadlock by producing a lock graph and looking for cycles. In several cases in our experiments, JLint iterates over the lock graph repeatedly, reporting the same cycle many times. In some cases, the same cycle generated several hundred warnings. These duplicates, which make it difficult to use the output of JLint, could be eliminated by reporting a cycle in the lock graph just once. The sheer quantity of output from JLint makes it difficult to judge the rate of false positives for our benchmark suite. In Section 5 we compare finding deadlocks using JLint and Bandera on smaller programs.

Null Dereferences   Among the four tools, ESC/Java, FindBugs, and JLint check for null dereferences. Surprisingly, there is not a lot of overlap between the warnings reported by the various tools.

JLint finds many potential null dereferences. In order to reduce the number of warnings, JLint tries to identify only inconsistent assumptions about null. For example, JLint warns if an object is sometimes compared against null before it is dereferenced and sometimes not. However, we have found that in a fair number of cases, JLint's null dereference warnings are false positives. A common example is when conditional tests imply that an object cannot be null (e.g., because it was not null previously when the condition held). In this case, JLint often does not track enough information about conditionals to suppress the warning. Finally, in some cases there are warnings about null pointer dereferences that cannot happen because of deeper program logic; not many static analyses could handle these cases. Currently, there is no way to stop these warnings from being reported (sometimes multiple times).
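To make these two cases concrete, the following hypothetical sketch (not taken from the benchmarks) shows an inconsistent null assumption and a typical false-positive pattern; a JLint-style checker may warn in both places:

// Illustrative sketch only (not benchmark code).
import java.io.PrintStream;

class NullExamples {
  // Inconsistent assumptions: out is compared against null at one use but
  // dereferenced unconditionally at another, the pattern JLint keys on.
  void log(String msg, PrintStream out) {
    if (out != null)
      out.println(msg);
    out.flush();              // possible null dereference warning here
  }

  // A typical false positive: the flag records that x was non-null, so the
  // dereference below is safe, but that fact may be lost by the analysis.
  void use(Object x) {
    boolean ready = false;
    if (x != null)
      ready = true;
    if (ready)
      System.out.println(x.hashCode());   // may still be warned about
  }
}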
ESC/Java reports the most null pointer dereferences because it often assumes objects might be null, since we did not add any annotations to the contrary. (Interestingly, ESC/Java does not always report null dereference warnings in the same places as JLint.) The net result is that, while potentially those places may be null pointer errors, there are too many warnings to be easily useful by themselves. Instead, to make the most effective use of these checks, it seems the programmer should provide annotations. For example, parameters that are never null can be marked as such in method declarations to avoid spurious warnings.

Interestingly, FindBugs discovers a very small set of potential null dereferences compared to both ESC/Java and JLint. This is because FindBugs uses several heuristics to avoid reporting null-pointer dereference warnings in certain cases when its dataflow analysis loses precision.

PMD does not check for null pointer dereferences, but it does warn about setting certain objects to null. We suspect this check is not useful for many common coding styles. ESC/Java also checks for some other uses of null that violate implicit specifications, e.g., assigning null to a field assumed not to be null. In a few cases, we found that PMD and ESC/Java null warnings coincide with each other. For example, in several cases PMD reported an object being set to null, and just a few lines later ESC/Java issued a warning about assigning null to another object.

Array Bounds Errors   In Java, indexing outside the bounds of an array results in a run-time exception. While a bounds error in Java may not be the catastrophic error that it can be for C and C++ (where bounds errors overwrite unexpected parts of memory), it still indicates a bug in the program. Two of the tools we examined, JLint and ESC/Java, include checks for array bounds errors—either creating an array with a negative size, or accessing an array with an index that is negative or greater than the size of the array.

Like null dereference warnings, JLint and ESC/Java do not always report the same array bounds warnings in the same places. ESC/Java mainly reports warnings because parameters that are later used in array accesses may not be within range (annotations would help with this). JLint has several false positives and some false negatives in this category, apparently because it does not track certain information interprocedurally in its dataflow analysis. For example, code such as this appeared in our benchmarks:

public class Foo {
  static Integer[] ary = new Integer[2];

  public static void assign() {
    Object o0 = ary[ary.length];
    Object o1 = ary[ary.length-1];
  }
}

In this case, JLint signals a warning that the array index might be out of bounds for the access to o1 (because it thinks the length of the array might be 0), but clearly that is not possible here. On the other hand, there are no warnings for the access to o0, even though it will always be out of bounds no matter what size the array is.

FindBugs and PMD do not check for array bounds errors, though FindBugs does warn about returning null from a method that returns an array (it may be better to use a 0-length array).
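As an illustration of the annotations alluded to above, for both null-ness and parameter ranges used in array accesses, preconditions in ESC/Java's comment syntax might look roughly like the following. This is only a sketch; the exact pragma syntax accepted depends on the ESC/Java version in use, and the class and method names are invented:

// Rough sketch of ESC/Java-style (JML-style) annotations; illustrative only.
class Buffer {
  private /*@ non_null @*/ byte[] data = new byte[16];

  //@ requires src != null;
  //@ requires 0 <= off && off + len <= src.length;
  void copyFrom(byte[] src, int off, int len) {
    System.arraycopy(src, off, data, 0, Math.min(len, data.length));
  }
}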
4.2   Cross-Tool Buggy Code Correlations

When initially hypothesizing about the relationship among the tools, we conjectured that warnings among the different tools were correlated, and that the meta-tool would show that more warnings from one tool would correspond to more warnings from other tools. However, we found that this is not necessarily the case. Figure 7 gives the correlation coefficients for the number of warnings found by pairs of tools per class. As these results indicate, the large numbers of warnings reported by some tools are sometimes simply anomalous, and there does not seem to be any general correlation between the total number of warnings one tool generates and the total number of warnings another tool generates for any given class.

Tools               Correlation coefficient
JLint vs PMD        0.15
JLint vs FindBugs   0.33
FindBugs vs PMD     0.31

Figure 7. Correlation among Warnings from Pairs of Tools

We also wanted to check whether the number of warnings reported is simply a function of the number of lines of code. Figure 8 gives correlation coefficients and scatter plots showing, for each Java source file (which may include several inner classes), the NCSS count versus the number of warnings. For JLint, we have removed from the chart five source files that had over 500 warnings each, since adding these makes it hard to see the other data points. As these plots show, there does not seem to be any general correlation between lines of code and number of warnings produced by any of the tools. JLint has the strongest correlation of the three, but it is still weak.

4.2.1   Two Simple Metrics for Isolating Buggy Code

Given that the tools' warnings are not generally correlated, we hypothesize that combining the results of multiple tools can identify potentially troublesome areas in the code that might be missed when using the tools in isolation. Since we do not have exhaustive information about the severity of faults identified by the warnings and rates of false positives and false negatives, we cannot form any strong conclusions about the benefit of our metrics. Thus in this section we perform only a preliminary investigation.

We studied two metrics for ranking code. As mentioned in Section 2.2, we do not include ESC/Java in this discussion.

For the first metric, we started with the number of warnings per class file from each tool. (The same metric can also be used per method, per lexical scope, or per line.) For a particular benchmark and a particular tool, we linearly scaled the per-class warning counts to range between 0 and 1, with 1 being the maximum per-class warning count reported by the tool over all our benchmarks.

Formally, let n be the total number of classes, and let X_i be the number of warnings reported by tool X for class number i, where i ∈ 1..n. Then we computed a normalized warning count

    X'_i = X_i / max_{j=1..n} X_j

Then for class number i, we summed the normalized warning counts from each tool to compute our first metric, the normalized warning total:

    Total_i = FindBugs'_i + JLint'_i + PMD'_i

In order to avoid affecting the scaling for JLint, we reduced its warning count for the class with the highest number of errors from 1979 to 200, and for the next four highest classes to 199 through 196, respectively (to maintain their ranking).

With this first metric, the warning counts could be biased by repeated warnings about the same underlying bug. In order to compensate for this possibility, we developed a second metric, the unique warning total, that counts only the first instance of each type of warning message generated by a tool. For example, no matter how many null pointer dereferences FindBugs reports in a class, we count this as either 0 (if none were found) or 1 (if one or more were found). In this metric, we sum the number of unique warnings from all the tools.
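A rough sketch of how these two totals could be computed from per-class, per-tool warning data follows. This is illustrative Java, not the authors' actual scripts, and the data structures (maps from tool name to per-warning-type counts) are assumptions:

// Illustrative sketch of the two metrics (not the authors' scripts).
// For one class, byTool maps a tool name to a map from warning type to count.
import java.util.Iterator;
import java.util.Map;

class WarningMetrics {
  // Normalized warning total: each tool's count is scaled by the largest
  // per-class count observed for that tool, then the scaled values are summed.
  static double normalizedTotal(Map byTool, Map maxPerTool) {
    double total = 0.0;
    for (Iterator it = byTool.keySet().iterator(); it.hasNext();) {
      String tool = (String) it.next();
      int count = countAll((Map) byTool.get(tool));
      int max = ((Integer) maxPerTool.get(tool)).intValue();
      if (max > 0)
        total += (double) count / max;   // X'_i = X_i / max_j X_j
    }
    return total;
  }

  // Unique warning total: each warning type from each tool counts at most once.
  static int uniqueTotal(Map byTool) {
    int unique = 0;
    for (Iterator it = byTool.values().iterator(); it.hasNext();)
      unique += ((Map) it.next()).size();
    return unique;
  }

  private static int countAll(Map byType) {
    int n = 0;
    for (Iterator it = byType.values().iterator(); it.hasNext();)
      n += ((Integer) it.next()).intValue();
    return n;
  }
}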
4.2.2   Results

We applied these metrics to our benchmark suite, ranking the classes according to their normalized and unique warning totals. As it turns out, these two metrics are fairly well correlated, especially for the classes that are ranked highest by both metrics. Figure 9 shows the relationship between the normalized warning count and the number of unique warnings per class. The correlation coefficient for this relationship is 0.758. Of course, it is not surprising that these metrics are correlated, because they are clearly not independent (in particular, if one is non-zero then the other must be as well). However, a simple examination of certain classes shows that the high correlation coefficient between the two is not obvious. For instance, the class catalina.context has a warning count of 0 for FindBugs and JLint, but PMD generates 132 warnings. (As it turns out, PMD's warnings are uninteresting.)
Figure 8. Comparison of Number of Warnings versus NCSS

Figure 9. Normalized Warnings versus the Unique Warnings per Class

This class ranks 11th in normalized warning total, but 587th in unique warning total (all 132 warnings are the same kind). Thus just because a class generates a large number of warnings does not necessarily mean that it generates a large breadth of warnings.

We manually examined the warnings for the top five classes for both metrics, listed in Figure 10. For these classes, Figure 10 shows the size of the class, in terms of NCSS and number of methods, the normalized warning total and rank, the total number of warnings found by each of the tools, and the number of unique warnings and rank. In this table, T-n denotes a class ranked n that is tied with at least one other class in the ranking.

                                                     Total Warnings          Normalized        Unique Warnings
Name                                NCSS   Mthds   FB     JL      PMD     Total_i   Rank   FB   JL   PMD   Total   Rank
catalina.core.StandardContext       1863   255     34*    791     37      2.25      1      9    10   5     24      1
megamek.server.Server               4363   198     6      1979*   42      1.48      2      6    10   4     20      2
azureus2.ui.swt.MainWindow          1517   87      11     90      30      0.99      9      5    8    4     17      3
catalina.core.StandardWrapper       513    75      10     50      8       0.60      19     6    6    3     15      4
catalina.core.StandardHost          279    55      4      97      3       0.62      17     10   3    1     14      5
catalina.core.ContainerBase         518    70      14     849     3       1.42      3      3    7    3     13      T-8
artofillusion.object.TriangleMesh   2213   59      5      42      140*    1.36      4      3    7    3     13      T-8
megamek.common.Compute              2250   109     0      1076    23      1.16      5      0    7    3     10      T-22

* - Class with highest number of warnings from this tool

Figure 10. Classes Ranked Highly by Metrics

Recall that the goal of our metrics is to identify code that might be missed when using the tools in isolation. In this table, the top two classes in both metrics are the same, catalina.core.StandardContext and megamek.server.Server, and both also have the most warnings of any class from, respectively, FindBugs and JLint. Thus these classes, as well as artofillusion.object.TriangleMesh (with the most warnings from PMD), can be identified as highly suspicious by a single tool.

On the other hand, azureus2.ui.swt.MainWindow could be overlooked when considering only one tool at a time. It is ranked in the top 10 for both of our metrics, but it is 4th for FindBugs in isolation, 13th for JLint, and 30th for PMD. As another example, catalina.core.StandardWrapper (4th for the unique warning metric) is ranked 45th for FindBugs, 11th for JLint, and 349th for PMD—thus if we were only using a single tool, we would be unlikely to examine its warnings immediately.

In general, the normalized warning total measures the number of tools that find an unusually high number of warnings. The metric is still susceptible, however, to cases where a single tool produces a multitude of spurious warnings. For example, megamek.server.Server has hundreds of null dereference warnings from JLint, many of them likely false positives, which is why it is ranked second in this metric. In the case of artofillusion.object.TriangleMesh, 102 out of 140 of the warnings from PMD are for not using brackets in a for statement—which is probably not a mistake at all.

On the other hand, the unique warning total measures the breadth of warnings found by the tools. This metric compensates for cascading warnings of the same kind, but it can be fooled by redundancy among the different tools. For example, if by luck a null dereference error is caught by two separate tools, then the warning for that error will be counted twice. This has a large effect on the unique warning counts, because they are in general small. An improved metric could solve this problem by counting uniqueness of errors across all tools (which requires identifying duplicate messages across tools, a non-obvious task for some warnings that are close but not identical).

We think that both metrics provide a useful gauge that allows programmers to go beyond finding individual bugs with individual tools. Instead, these metrics can be used to find code with an unusually high number and breadth of warnings from many tools—and our results show that both seem to be correlated for the highest-ranked classes.

5     Bandera

Bandera cannot analyze any of our benchmarks from Section 3, because it cannot analyze the Java library. In order to compare Bandera to the other tools, we used the small examples supplied with Bandera as a test suite, since we knew that Bandera could analyze them.

This test suite from Bandera includes 16 programs ranging from 100-300 lines, 8 of which contain a real deadlock. None of the programs include specifications—without specifications, Bandera will automatically check for deadlock.
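As a reminder of what these checkers look for, a minimal lock-ordering deadlock of the kind such test programs contain might look like the following. This is an illustrative sketch, not one of the Bandera examples:

// Illustrative lock-ordering deadlock (not one of the Bandera examples):
// one thread takes a then b, the other takes b then a, so a cycle in the
// lock graph (the structure JLint and Bandera reason about) is possible.
class DeadlockExample {
  private final Object a = new Object();
  private final Object b = new Object();

  void first()  { synchronized (a) { synchronized (b) { } } }
  void second() { synchronized (b) { synchronized (a) { } } }

  public static void main(String[] args) {
    final DeadlockExample d = new DeadlockExample();
    new Thread(new Runnable() { public void run() { while (true) d.first(); } }).start();
    new Thread(new Runnable() { public void run() { while (true) d.second(); } }).start();
  }
}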

In order to compare Bandera to the other tools, we used the small examples supplied with Bandera as a test suite, since we knew that Bandera could analyze them. This test suite includes 16 programs ranging from 100 to 300 lines, 8 of which contain a real deadlock. None of the programs include specifications; without specifications, Bandera automatically checks for deadlock. For this test suite, Bandera finds all 8 deadlocks and produces no messages concerning the other 8 programs.

In comparison, FindBugs and PMD do not issue any warnings that would indicate a deadlock. PMD reports 19 warnings, but only about null assignments and missing braces around loop bodies, which in this case have no effect on synchronization. FindBugs issues 5 warnings, 4 of which concern package protections and one of which warns about using notify() instead of notifyAll() (the use of notify() is correct here).

On the other hand, ESC/Java reports 79 warnings, 30 of which are for potential deadlocks in 9 of the programs; one of those 9 programs does not actually contain a deadlock. JLint finds potential synchronization bugs in 5 of the 8 programs that Bandera verified to have a deadlock error. JLint issues three different kinds of concurrency warnings for these programs: a warning for changing a lock variable that has been used in synchronization, a warning for requesting locks in an order that would lead to a lock cycle, and a warning for improper use of monitor objects. In all, JLint reports 34 potential concurrency bugs over 5 programs.
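To make concrete the kind of defect these warnings target, consider the following small fragment. It is our own illustration, not one of the Bandera test programs: two methods acquire the same pair of locks in opposite orders, so two threads can deadlock, which is the lock-cycle pattern that JLint's warning describes and that Bandera's model checking explores.

    // Illustrative only: locks a and b are acquired in opposite orders,
    // so one thread holding a and another holding b can deadlock.
    public class LockCycle {
        private final Object a = new Object();
        private final Object b = new Object();

        public void first() {
            synchronized (a) {
                synchronized (b) {
                    // ... work while holding both locks ...
                }
            }
        }

        public void second() {
            synchronized (b) {       // opposite order: creates a cycle in the lock graph
                synchronized (a) {
                    // ... work while holding both locks ...
                }
            }
        }
    }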
Compared to JLint, Bandera has the advantage that it can produce counterexamples. Because it is based on model checking technology, when Bandera finds a potential deadlock it can produce a full program trace documenting the sequence of operations leading up to the error, together with a graphical representation of the lock graph containing the deadlock. Non-model-checking tools such as JLint are generally not as well suited to generating counterexample traces.

6   Usability Issues and Improvements to the Meta-Tool

In the course of our experiments, we encountered a number of issues in applying the tools to our benchmark suite. Some of these issues must be dealt with within a tool, and some can be addressed by improving our proposed meta-tool.

In a number of cases, we had difficulty using certain versions of the tools because they were not compatible with the latest version of Java. As mentioned earlier, when we initially experimented with ESC/Java, we downloaded version 0.7 and discovered that it was not compatible with Java 1.4. Fortunately ESC/Java 2, which has new developers, is compatible, so we were able to use that version for our experiments. However, we are still unable to use some important companion tools for ESC/Java, such as Houdini, which is not compatible with ESC/Java 2. We had similar problems with an older version of JLint, which also did not handle Java 1.4. The lesson for users is probably to rely only on tools under active development, and the lesson for tool builders is to keep up with the latest language features lest a tool become unusable. This may be an especially pressing issue with the upcoming Java 1.5, which includes source-level extensions such as generics.

In our opinion, tools that provide graphical user interfaces (GUIs) or plugins for a variety of integrated development environments have a clear advantage over tools that provide only textual output. A well-designed GUI can group classes of bugs together and hyperlink warnings to source code. Although we did not use them directly in our study, we found GUIs invaluable in our initial phase of learning to use the tools. Unfortunately, GUIs conflict somewhat with having a meta-tool, since they make it much more difficult for a meta-tool to extract the analysis results. The best compromise is probably to provide both structured text output (for processing by the meta-tool) and a GUI. We leave the development of a generic, easy-to-use GUI for the meta-tool itself as future work.

Also, while developers want to find as many bugs as possible, it is important not to overwhelm the developer with too much output. In particular, one critical ability is avoiding cascading errors. For example, in some cases JLint repeatedly warns about dereferencing a variable that may be null, even though it would be sufficient to warn only at the first dereference. It may be possible to eliminate these repeats with the meta-tool, or, better yet, the tool itself could be modified so that once a warning about a null pointer is issued, the pointer is subsequently assumed not to be null (or whatever the most optimistic assumption is) to suppress further warnings. Similarly, JLint sometimes produces a large number of potential deadlock warnings, even reporting the same warning multiple times on the same line. In this case, the meta-tool could easily filter the redundant messages and reduce them to a single warning. In general, the meta-tool could allow the user to select between full output from each of the tools and output limited to unique warnings.
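As a small illustration of the cascading-warning problem (our own example, not code from our benchmarks), a checker that does not remember its earlier warning may flag every dereference of the possibly-null parameter below, when a single warning at the first use would suffice:

    // Illustrative only: if 's' may be null, each dereference below can
    // trigger its own warning unless the tool suppresses the repeats.
    class NullWarnings {
        static int firstIndexOfTrimmed(String s) {            // 's' may be null at some call sites
            int length = s.length();                           // warning: possible null dereference
            String trimmed = s.trim();                         // same variable, warning repeated
            return length + trimmed.indexOf(s.charAt(0));      // and repeated again
        }
    }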
As mentioned throughout this paper, false positives are an issue with all of the tools. ESC/Java is the only tool that supports user-supplied annotations to eliminate spurious warnings. We could incorporate a poor man's version of this annotation facility into the meta-tool by allowing the user to suppress certain warnings at particular locations in the source code. This would allow the user to prune the output of the tools to reduce false positives. Such a facility must be used extremely carefully, however, since subsequent code modifications might render the suppression of warnings confusing or even incorrect.
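A minimal sketch of how the meta-tool could combine warning deduplication with this kind of per-line suppression is shown below. The Warning record, the applyFilters entry point, and the marker comment METATOOL:ignore are our own inventions for illustration, not features of any of the tools we studied.

    import java.util.*;

    // Sketch only: one possible filtering pass over warnings that have
    // already been parsed from the tools' textual output.
    class Warning {
        final String file, tool, message;
        final int line;                                   // 1-based source line
        Warning(String file, int line, String tool, String message) {
            this.file = file; this.line = line; this.tool = tool; this.message = message;
        }
        String key() { return file + ":" + line + ":" + tool + ":" + message; }
    }

    class WarningFilter {
        // sources maps a file name to its lines; the hypothetical marker
        // "METATOOL:ignore" on a line suppresses all warnings reported there.
        static List<Warning> applyFilters(List<Warning> raw, Map<String, List<String>> sources) {
            Map<String, Warning> unique = new LinkedHashMap<>();
            for (Warning w : raw) unique.putIfAbsent(w.key(), w);     // drop exact duplicates
            List<Warning> kept = new ArrayList<>();
            for (Warning w : unique.values()) {
                List<String> lines = sources.getOrDefault(w.file, Collections.emptyList());
                String text = (w.line >= 1 && w.line <= lines.size()) ? lines.get(w.line - 1) : "";
                if (!text.contains("METATOOL:ignore")) kept.add(w);   // honor user suppressions
            }
            return kept;
        }
    }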
A meta-tool could also interpret the output of the tools in more complex ways. In particular, it could use a warning from one tool to decide whether another tool's warning has a greater probability of being valid. For example, in one case we encountered, a PMD-generated warning about a null assignment coincided with a JLint warning for the same potential bug. After a manual check, we found that both tools were correct in their assessment.

Finally, as discussed in Section 1.1, it is impossible in general to classify the severity of a warning without knowing the context in which the application is used. However, it might be possible for developers to classify bug severity for their own programs. Initially, all warnings would be weighted evenly, and a developer could then change the weights so that different bugs count for more or less in the meta-tool's rankings. For example, warnings that have led to severe errors in the past might be good candidates for increased weight. Weights could also be used to adjust for false positive rates: if a particular bug checker is known to report many false positives for a particular application, its warnings can be assigned a lower weight.
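One possible realization of this weighting scheme is sketched below; the class and method names are our own, and the weights default to 1.0 so that, with no configuration, the ranking degenerates to a plain warning count.

    import java.util.*;

    // Sketch only: rank classes by weighted warning totals, where a developer
    // can lower the weight of checkers known to produce many false positives.
    class WeightedRanker {
        private final Map<String, Double> toolWeight = new HashMap<>();

        void setWeight(String tool, double weight) { toolWeight.put(tool, weight); }

        // warningsPerClass maps a class name to a map from tool name to warning count.
        List<Map.Entry<String, Double>> rank(Map<String, Map<String, Integer>> warningsPerClass) {
            Map<String, Double> score = new HashMap<>();
            for (Map.Entry<String, Map<String, Integer>> byClass : warningsPerClass.entrySet()) {
                double total = 0.0;
                for (Map.Entry<String, Integer> byTool : byClass.getValue().entrySet())
                    total += toolWeight.getOrDefault(byTool.getKey(), 1.0) * byTool.getValue();
                score.put(byClass.getKey(), total);
            }
            List<Map.Entry<String, Double>> ranking = new ArrayList<>(score.entrySet());
            ranking.sort(Map.Entry.<String, Double>comparingByValue().reversed());
            return ranking;   // most suspicious classes first
        }
    }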
7   Related Work

Artho [1] compares several dynamic and static tools for finding errors in multi-threaded programs. He compares the tools on several small core programs extracted from a variety of Java applications. He then proposes extensions to JLint, included in the version we tested in this paper, that greatly improve its ability to check for multi-threaded programming bugs, and gives results for running JLint on several large applications. The focus of this paper, in contrast, is on looking at a wider variety of bugs across several benchmarks and on proposing a meta-tool to examine the correlations.

Z-ranking [17] is a technique for ranking the output of static analysis tools so that warnings that are more important tend to be ranked more highly. As our results suggest, having such a facility in the tools we studied would be extremely useful. Z-ranking is intended to rank the output of a particular bug checker; in this paper, however, we look at correlating warnings across tools and across different checkers.

In general, since many of these Java bug finding tools have been developed only within the last few years, there has not been much work comparing them. One article on a developer web log by Jelliffe [15] briefly describes experience using JLint, FindBugs, PMD, and CheckStyle (a tool we did not study; it checks adherence to a coding style). In his opinion, JLint and FindBugs find different kinds of bugs and both are very useful on existing code, while PMD and CheckStyle are more useful if their rules are incorporated into a project from the start.
8   Conclusion

We have examined the results of applying five bug-finding tools to a variety of Java programs. Although there is some overlap between the kinds of bugs found by the tools, their warnings are mostly distinct. Our experiments do suggest, however, that when many tools report an unusual number of warnings for a class, that class also tends to exhibit a large breadth of unique warnings, and we propose a meta-tool to allow developers to identify these classes.

As we ran the tools and examined their output, a few things seemed as though they would be beneficial in general. The main difficulty in using the tools is simply the quantity of output. In our opinion, the programmer should have the ability to add an annotation or a special comment to the code to suppress warnings that are false positives, even though this might lead to problems later (due to changes in the underlying assumptions). Such a mechanism seems necessary to help reduce the sheer volume of output from the tools. In Section 6 we proposed adding this as a feature of the meta-tool.

In this paper we have focused on comparing the output of different tools. An interesting area of future work is to gather extensive information about the actual faults in programs, which would enable us to precisely identify false positives and false negatives. This information could be used to determine how accurately each tool predicts faults in our benchmarks. We could also test whether the two metrics we proposed for combining warnings from multiple tools are better or worse predictors of faults than the individual tools.

Finally, recall that all of the tools we used are in some ways unsound. Thus the absence of warnings from a tool does not imply the absence of errors. This is a necessary tradeoff, because, as we just argued, the number of warnings produced by a tool can be daunting and stand in the way of its use. As we saw in Section 3, without user annotations a tool like ESC/Java, which is still unsound yet much closer to verification, produces even more warnings than JLint, PMD, and FindBugs. Ultimately, we believe there is still a wide area of open research in understanding the right tradeoffs to make in bug finding tools.
Acknowledgments

We would like to thank David Cok and Joe Kiniry for helping us get ESC/Java 2 running. We would also like to thank Cyrille Artho for providing us with a beta version of JLint 3.0. Finally, we would like to thank Atif Memon, Bill Pugh, Mike Hicks, and the anonymous referees for their helpful comments on earlier versions of this paper.

References
[1] C. Artho. Finding Faults in Multi-Threaded Programs. Master's thesis, Institute of Computer Systems, Federal Institute of Technology, Zurich/Austin, 2001.
[2] D. Bacon, J. Bloch, J. Bogda, C. Click, P. Haahr, D. Lea, T. May, J.-W. Maessen, J. D. Mitchell, K. Nilsen, B. Pugh, and E. G. Sirer. The "Double-Checked Locking is Broken" Declaration. http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html.
[3] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong. Orthogonal Defect Classification: A Concept for In-Process Measurements. IEEE Transactions on Software Engineering, 18(11):943–956, Nov. 1992.
[4] D. Cok. Personal communication, Apr. 2004.
[5] D. Cok and J. Kiniry. ESC/Java 2, Mar. 2004. http://www.cs.kun.nl/sos/research/escjava/index.html.
[6] J. C. Corbett, M. B. Dwyer, J. Hatcliff, S. Laubach, C. S. Pasareanu, Robby, and H. Zheng. Bandera: Extracting Finite-State Models from Java Source Code. In Proceedings of the 22nd International Conference on Software Engineering, pages 439–448, Limerick, Ireland, June 2000.
[7] R. F. Crew. ASTLOG: A Language for Examining Abstract Syntax Trees. In Proceedings of the Conference on Domain-Specific Languages, Santa Barbara, California, Oct. 1997.
[8] M. D. Ernst, A. Czeisler, W. G. Griswold, and D. Notkin. Quickly Detecting Relevant Program Invariants. In Proceedings of the 22nd International Conference on Software Engineering, pages 449–458, Limerick, Ireland, June 2000.
[9] C. Flanagan and K. R. M. Leino. Houdini, an Annotation Assistant for ESC/Java. In J. N. Oliveira and P. Zave, editors, FME 2001: Formal Methods for Increasing Software Productivity, International Symposium of Formal Methods, number 2021 in Lecture Notes in Computer Science, pages 500–517, Berlin, Germany, Mar. 2001. Springer-Verlag.
[10] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata. Extended Static Checking for Java. In Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 234–245, Berlin, Germany, June 2002.
[11] K. Havelund and T. Pressburger. Model Checking Java Programs Using Java PathFinder. International Journal on Software Tools for Technology Transfer, 2(4):366–381, 2000.
[12] G. J. Holzmann. The Model Checker SPIN. IEEE Transactions on Software Engineering, 23(5):279–295, 1997.
[13] D. Hovemeyer and W. Pugh. Finding Bugs Is Easy. http://www.cs.umd.edu/~pugh/java/bugs/docs/findbugsPaper.pdf, 2003.
[14] IEEE. IEEE Standard Classification for Software Anomalies, Dec. 1993. IEEE Std 1044-1993.
[15] R. Jelliffe. Mini-review of Java Bug Finders. In O'Reilly Developer Weblogs. O'Reilly, Mar. 2004. http://www.oreillynet.com/pub/wlg/4481.
[16] JLint. http://artho.com/jlint.
[17] T. Kremenek and D. Engler. Z-Ranking: Using Statistical Analysis to Counter the Impact of Static Analysis Approximations. In R. Cousot, editor, Static Analysis, 10th International Symposium, volume 2694 of Lecture Notes in Computer Science, pages 295–315, San Diego, CA, USA, June 2003. Springer-Verlag.
[18] PMD/Java. http://pmd.sourceforge.net.