DEPARTMENT OF COMPUTER SCIENCE, TECHNICAL REPORT SERIES
National University of Ireland, Maynooth, Co. Kildare, Ireland
http://www.cs.nuim.ie   Tel: +353 1 7083847   Fax: +353 1 7083848

Generation Strategies for Test-Suites of Grammar-Based Software
Mark Hennessy and James F. Power
NUIM-CS-TR-2005-02

Generation Strategies for Test-Suites of Grammar-Based Software

Mark Hennessy
Computer Science Dept.
National University of Ireland
Maynooth, Co. Kildare, Ireland
markh@cs.nuim.ie

James F. Power*
Computer Science Dept.
National University of Ireland
Maynooth, Co. Kildare, Ireland
jpower@cs.nuim.ie

ABSTRACT
The use of statement coverage has proved to be a useful metric when testing code with a test-suite. Similarly, the coverage of a grammar's rules is an effective metric when testing a parser. However, when testing a whole parser front-end, it is not immediately obvious whether there is a correlation between rule coverage and underlying code coverage. We use a number of generation strategies to generate a series of test-suites. We apply these test-suites to keystone, a parser front-end for ISO C++, and offer empirical evidence to suggest which generation strategy offers the best coverage whilst using the fewest test-cases.

Keywords
Software Testing, Parser Testing, Rule Coverage, Metrics, Purdom's Algorithm.

1. INTRODUCTION
The testing of a program or software system is an essential and integral part of the software process. Testing assures us that a specification of a program is correct or that a system behaves in the intended way. The popularity of grammar-based tools [6] has ensured that testing these systems for correct functioning and robustness is crucial.

There are a number of methods available when testing a grammar-based system [10]. Specification-based testing involves deriving inputs and expected outcomes for each test-case directly from the specification of the system. A drawback of this method is that some parts of the code may remain unexercised, thus lowering confidence in the robustness of the software. With implementation-based testing, input data for a test-case is generated from the implementation, but the expected outcomes cannot be determined from the implementation. Implementation-based test-suites

* On sabbatical at Clemson University, South Carolina, USA.

assume the presence of an oracle against which to check the result. A testing strategy that exploits both methods is preferable to ensure reasonable confidence in the correct functioning of the system, but even then sections of code may go untested.

In testing a parser, we would like to ensure that all valid sentences are accepted while incorrect sentences are rejected; this ensures that the structure of the underlying grammar is adequately tested. As there is no regard for the "meaning" of the sentences, this is known as syntactic coverage. However, when testing a parser front-end we must ensure that the sentences passed as input are semantically correct, so that the underlying code is exercised. We refer to this as semantic coverage. Furthermore, we would like a test-suite to utilise as many of the grammar rules as possible, because a grammar rule represents (through its associated semantic action) the gateway to the underlying code of the parser front-end. In this paper, we test the coverage of a parser front-end, keystone [3], in both the syntactic and semantic dimensions, using not only specification-based and implementation-based test-suites but also a test-suite derived automatically using Purdom's algorithm [9]. keystone aids in the static analysis of C++ programs and consists of a program processor and a symbol table. The program processor is responsible for scanning and parsing, and also for initiating and directing symbol-table construction and name lookup. The symbol table allows name lookup in accordance with Clause 3 of the ISO C++ standard [1].

In Section 2, we outline the test-suite generation strategies and their operation. The methodologies used to determine the coverage achieved are outlined in Section 3. Section 4 presents the coverages achieved by each of the test-suites for keystone in both the syntactic and semantic domains. Furthermore, we show how test-suite generation compares to reduced test-suites with regard to coverage. In Section 5, we conclude the paper.

2. GENERATION STRATEGIES
To conduct our study, a number of different types of test-suite were employed. Two existing test-suites that were used during the development of the current version of keystone were chosen. These test-suites were augmented with two other types of test-suite to ensure that the potential maximum code coverage was achieved. The first of these types was based on the idea of test-suite reduction [5] and involves

taking a large, existing test-suite and reducing it down to a minimum that provides the same rule coverage. The second type involves generating test-cases directly from the grammar specification; to this end, we chose Purdom's seminal algorithm for the generation of sentences from a context-free grammar [9]. A summary of the six test-suites used can be seen in Table 1.

Table 1: Summary of the six test-suites used.
Test-Suite    Summary
g++           C++ test-suite from the g++.dg directory of the gcc distribution.
ISO           C++ test-cases derived directly from the ISO standard.
Min. g++      Minimum number of test-cases from the g++ test-suite.
Min. ISO      Minimum number of test-cases from the ISO test-suite.
Purdom        C++ test-suite generated using Purdom's algorithm.
CDRC Purdom   C++ test-suite generated using Context-Dependent Rule Coverage.

2.1 Existing Test-suites
During the testing of keystone, two large existing test-suites for C++ were used. The first of these was the g++.dg test-suite used to test the C++ compiler that forms part of the GNU Compiler Collection, gcc. The second was a specification-based suite derived from the ISO C++ standard [1], which has been used to measure conformance with the ISO standard [4].

2.2 Test-suite Reduction
The notion behind test-suite reduction [5] is a relatively simple one. Given an existing test-suite, we wish to reduce it to the smallest core of test-cases that still provides the same amount of rule coverage. The algorithm, shown in Figure 1, operates as follows:

1. For each test-case in the test-suite, a vector containing an entry for each rule is output. Within the vector, each rule is marked as covered or not with a one or zero.

2. The vectors are placed together in a 2D array. The rows are indexed by test-case and the columns are indexed by rule number. The columns are then summed.

3. For any column that sums to one, i.e. only one test-case covers the rule, that test-case is deemed essential and added to the minimal test-suite. When a test-case is added to the minimum suite, all of the rules that are covered by this test-case are set to zero. This process is repeated until all the essential test-cases have been added to the minimum suite.

4. The rows are then summed. The test-case that contributes the most coverage, i.e. the row with the largest sum, is now identified. This is added to the minimum set and its coverages are set to zero. This step is repeated until all columns sum to zero, i.e. all coverages have been accounted for in the minimum suite.

It is worth noting that once all the essential test-cases have been removed, the problem of choosing the minimum test-set that covers the remaining rules is equivalent to the minimum-cardinality hitting set, which is an intractable problem [2]. Hence the process will always be heuristic; in our case we choose to always use the test-case that contributes the most coverage, even though it can be proved that this will not guarantee the smallest test-suite. The algorithm is given in Figure 1, and a sketch in code form follows below.

for each test-case tc in test-suite ts do
    add tc coverage vector to array a
end for
minsuite ← ∅
addColumns(a)
for each column that sums to 1 do
    minsuite ← minsuite ∪ {essential tc}
end for
while not all rules covered do
    addRows(a)
    addColumns(a)
    minsuite ← minsuite ∪ {largest-covering tc}
end while
Figure 1: Test-Suite Reduction Algorithm
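For concreteness, the following is a minimal sketch of the reduction algorithm in Python. The paper's implementation was 217 lines of Java; this sketch is illustrative only, and the data layout (a 0/1 coverage vector per test-case, as produced by step 1 above) and all names are our own assumptions.

# Greedy test-suite reduction (Figure 1), as a minimal sketch.
# Assumes each test-case maps to a 0/1 coverage vector with one
# entry per grammar rule; names are illustrative, not those of
# the original Java implementation.

def reduce_suite(coverage):
    """coverage: dict mapping test-case name -> list of 0/1 rule flags."""
    cases = list(coverage)
    num_rules = len(next(iter(coverage.values())))
    # Only rules covered by at least one test-case can be accounted for.
    uncovered = {r for r in range(num_rules)
                 if any(coverage[tc][r] for tc in cases)}
    minsuite = []

    # Essential test-cases: a rule covered by exactly one test-case
    # forces that test-case into the minimum suite.
    for r in sorted(uncovered):
        covering = [tc for tc in cases if coverage[tc][r]]
        if len(covering) == 1 and covering[0] not in minsuite:
            minsuite.append(covering[0])
    for tc in minsuite:
        uncovered -= {r for r in range(num_rules) if coverage[tc][r]}

    # Greedy heuristic: repeatedly take the test-case covering the
    # most remaining rules until every coverable rule is accounted for.
    while uncovered:
        best = max(cases,
                   key=lambda tc: sum(coverage[tc][r] for r in uncovered))
        minsuite.append(best)
        uncovered -= {r for r in range(num_rules) if coverage[best][r]}
    return minsuite

# Example: rules 0 and 3 are each covered by exactly one test-case,
# so t1 and t2 are essential and between them cover every rule.
suite = {"t1": [1, 1, 0, 0], "t2": [0, 1, 1, 1], "t3": [0, 0, 1, 0]}
print(reduce_suite(suite))  # ['t1', 't2']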

2.3 Purdom's Algorithm
Purdom's algorithm [9] and its later interpretation [8] address the issue of automatically generating test-cases from a context-free grammar. The goal of the algorithm is to generate a series of short sentences such that every grammar rule is used at least once. The algorithm proceeds in two distinct phases. The first phase calculates two tables for each non-terminal. The first of these, the SHORT table, records the rule to use to derive the shortest sentence starting from the respective non-terminal. The second table, called PREV, contains the rule to use to introduce non-terminal n into the shortest derivation. The second phase of the algorithm utilises these tables to generate the sentences. A table known as ONCE keeps track of the rules covered, and the algorithm terminates when all the grammar rules have been exercised. A sketch of the SHORT computation is given below.

However, rule coverage discloses a grammar's structure only in a weak sense. For large and complex grammars it is desirable that valid combinations of productions are utilised, so as to generate test-cases that reflect more accurately the rich syntactic structure of the grammar. A generalisation of rule coverage has been proposed [7], such that the context in which a rule is covered is taken into account. This is known as Context-Dependent Rule Coverage (CDRC); in essence, it ensures that every possible valid combination of rule pairs is exercised.
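Before working through the CDRC example, here is a minimal sketch of phase one of Purdom's algorithm, the SHORT-table fixpoint, using the simple grammar defined in Table 2 below. The representation (a rule list, an infinity sentinel) is our own assumption, not the paper's 673-line implementation.

# Phase one of Purdom's algorithm: a fixpoint computation of the
# SHORT table for the simple grammar of Table 2. Illustrative only.

import math

# Grammar rules as (lhs, rhs) pairs; lowercase symbols are terminals.
rules = [
    ("S", ["A", "B", "C"]),   # 1: S -> A B C
    ("A", ["a", "B"]),        # 2: A -> a B
    ("B", ["B", "b"]),        # 3: B -> B b
    ("B", ["C"]),             # 4: B -> C
    ("C", ["c", "C"]),        # 5: C -> c C
    ("C", []),                # 6: C -> epsilon
]
nonterminals = {lhs for lhs, _ in rules}

shortest = {nt: math.inf for nt in nonterminals}  # shortest sentence length
short = {}                                        # SHORT: rule to use per non-terminal
changed = True
while changed:                                    # iterate to a fixpoint
    changed = False
    for i, (lhs, rhs) in enumerate(rules, start=1):
        # Length of the shortest sentence derivable via this rule.
        length = sum(shortest[s] if s in nonterminals else 1 for s in rhs)
        if length < shortest[lhs]:
            shortest[lhs], short[lhs] = length, i
            changed = True

print(short)  # {'C': 6, 'B': 4, 'A': 2, 'S': 1}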

Using the grammar shown in Table 2 as an example, CDRC works as follows. Every non-terminal that appears on the right-hand side of a grammar rule is noted; e.g. non-terminal B occurs on the right-hand side of rule 2, A → a B. This is known as a direct occurrence of B in A. So, for a test-suite to exhibit CDRC for the simple grammar of Table 2, all rules with B on the left-hand side must be exercised for every direct occurrence of B in the grammar. A sample test-suite achieving CDRC for this grammar would be {abbcb, acc, ac}; its derivation trees are shown in Figure 2.

Table 2: Simple grammar.
1  S → A B C
2  A → a B
3  B → B b
4  B → C
5  C → c C
6  C → ε

[Figure 2: Sample test-suite achieving CDRC for the grammar in Table 2. Every rule for each direct occurrence of a non-terminal on the right-hand side of a grammar rule is accounted for.]

In our extension of Purdom's original algorithm, we added another table, called OCCS, which keeps track of all the direct occurrences within a grammar. The table is indexed by non-terminal, with all the direct occurrences for a given non-terminal making up the entries. This table, along with the existing ONCE table, is consulted when choosing the next rule to be used. When all the entries in the OCCS table have been covered, the generation of test-cases ceases. This modification to Purdom's original algorithm, together with the original algorithm itself, added two more test-suites, bringing the total number of test-suites to six. The direct-occurrence bookkeeping is sketched below.
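As an illustration, the following sketch enumerates the direct-occurrence obligations that CDRC imposes for the Table 2 grammar. The triple representation is our own assumption, not the OCCS layout of the actual implementation.

# Enumerate the (occurrence, rule) pairs CDRC must cover for the
# Table 2 grammar. A triple (i, nt, j) records that rule j (with nt
# on its left-hand side) must be exercised at the direct occurrence
# of nt on the right-hand side of rule i. Illustrative only.

rules = [
    ("S", ["A", "B", "C"]),   # 1
    ("A", ["a", "B"]),        # 2
    ("B", ["B", "b"]),        # 3
    ("B", ["C"]),             # 4
    ("C", ["c", "C"]),        # 5
    ("C", []),                # 6
]
nonterminals = {lhs for lhs, _ in rules}

occs = set()
for i, (_, rhs) in enumerate(rules, start=1):
    for nt in rhs:
        if nt not in nonterminals:
            continue  # skip terminals
        for j, (lhs, _) in enumerate(rules, start=1):
            if lhs == nt:
                occs.add((i, nt, j))

# 13 obligations in total for this grammar, e.g. (1, 'B', 3):
# rule 3 must be used for the B occurring on the rhs of rule 1.
for triple in sorted(occs):
    print(triple)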

3. METHODOLOGY
The case study was carried out using six test-suites for the ISO C++ language standard. A number of tools and programs were used to generate the test-cases and to reduce the test-suites. All tests were performed on keystone version 0.30. Two large, existing test-suites were chosen first. The first of these was the g++ test-suite from gcc version 3.4. This implementation-based suite consists of 1183 individual test-cases, partitioned into sections that test the parser front-end and the code generation of the back-end. The second suite, ISO, was used in [4] to measure conformance with the ISO C++ standard and consists of 440 test-cases, sectioned according to the clauses of the ISO C++ standard [1].

To achieve the test-suite reduction, a number of steps were taken. The first was to modify the parser for keystone to output a single file for each test-case containing the rule numbers, one per line, of each rule used during the parse of that test-case. The test-suite reduction algorithm was implemented in the Java programming language using 217 lines of code. The algorithm was applied to both existing test-suites to produce two new test-suites, which we call Min. g++ and Min. ISO.

The syntactic coverage that each test-suite provided was determined by the following method: each file output by the parser was concatenated into a single monolithic file containing all the rule coverages for every test-case in the suite. This file was then sorted using the UNIX tool sort. Finally, the UNIX tool uniq was used to pare down the sorted file, such that only one instance of every covered rule remained in the file. The number of lines in the file output by uniq is the number of rules covered by the test-suite. An equivalent computation is sketched below.
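For illustration, here is the same measurement expressed in a few lines of Python rather than sort and uniq; the directory name and file extension are hypothetical.

# Count distinct grammar rules covered by a test-suite, given one
# rule-number-per-line file per test-case (as described above).
# The "coverage" directory and ".rules" extension are assumptions.

from pathlib import Path

covered = set()
for rule_file in Path("coverage").glob("*.rules"):
    for line in rule_file.read_text().splitlines():
        if line.strip():
            covered.add(int(line))

# keystone's C++ grammar has 536 rules in total (see Section 4).
print(f"{len(covered)} of 536 rules covered")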

Purdom's algorithm was implemented in 673 lines of code in the Python scripting language. The extension to context-dependent rule coverage added an extra 246 lines of code. The number of test-cases output by Purdom's original algorithm was 53, while CDRC Purdom output 71 test-cases.

Finally, keystone itself was profiled with gcov, a profiling tool that is part of gcc. This tool measures the statement coverage for a given file when a test-case is executed, as illustrated in Figure 3.

4. RESULTS
In this section we present the results of our case study to determine which generation strategy is the most effective at achieving maximum coverage in the syntactic and semantic dimensions. The results shown are partitioned according to their domain. It is important to note the distinction between

what we define as the syntactic domain and the semantic domain. Coverage of the grammar rules alone is classed as coverage of the syntactic domain. If all of the grammar rules are exercised by a test-suite, then that test-suite is said to achieve full coverage in the syntactic domain. The coverage of the semantic domain is determined by how much of the underlying parser front-end code is executed when a test-suite is run. We expect to see a close correlation between coverage in the syntactic domain and coverage in the semantic domain, because the only entry point to the underlying code is through the semantic action associated with a grammar rule. The results are presented in Table 3 and are discussed in the rest of this section.

[Figure 3: The steps involved in measuring the code coverage with gcov. keystone is compiled by gcc with extra flags, producing a profiled keystone executable. When one of our test-suites is run by this executable, a statistics file corresponding to each source file is output; gcov can then determine how many lines of code are executed in the corresponding source file.]

4.1 Keystone Structure
It is worth pointing out briefly the structure of keystone and how the results for the semantic domain are interpreted. The underlying code is separated into four distinct sections, as shown in Figure 4. Lexer is the code coverage associated with the file output by the tool Flex. Parser is a directory containing files generated by the tool BTYacc and associated files to deal with semantic actions. Within the symbol table of keystone are modules to determine scope within a program (Scope) and to aid in type checking and allocation (Type).

[Figure 4: Keystone structure: Lexer, Parser, and Symbol Table, the latter comprising the Scopes and Types modules.]

Both of these modules are heavily dependent on the semantics of the test-case in question, and coverage of both can only be achieved by test-cases that are semantically correct. It is also worth noting that the coverage results for these modules are based upon code that is used only for the normal operation of keystone; thus user aids for debugging, such as pretty-print methods, are excluded from the measurement figures.

The statement coverage figures for the Parser files and the modules Scope and Type are presented in Figures 5, 6 and 7 respectively. From these results we can see that the coverage offered by the reduced test-suites is nearly identical to that of their larger counterparts. The poor results of the Purdom approaches are due to the fact that the generated test-cases lack semantic correctness, and thus they never execute the underlying code for the symbol table.

4.2 Existing Test-Suites
The existing test-suites, g++ and ISO, consisted of 1183 and 440 test-cases respectively, and are summarised in Table 4. These were the benchmarks against which the other test-suites were measured. The suite g++ is an implementation-based suite and achieved coverage in the syntactic domain of 491 rules out of 536 total rules. This test-suite is designed to fully test the C++ compiler from gcc; the presence of GNU C++ extensions in some of the test-cases means that full coverage is not achieved, because keystone was developed with only the C++ standard in mind. This suite exhibits the best coverage across the semantic dimension on average, due to larger coverages of the Lexer and Parser code.

Table 4: Rule coverage for existing test-suites.
Test-Suite   Num. Test-Cases   Rules Covered
g++          1183              491 / 536
ISO          440               430 / 536

Suite ISO consists of 440 test-cases derived directly from the clauses of the ISO C++ standard [1]. This is a specification-based suite and covers 430 of 536 rules. It is interesting to note that, despite the concerted effort to ensure every rule has a test-case, it falls well short of the implementation-based suite in the syntactic dimension; nevertheless, the test-cases are well constructed, such that the coverage for module Scope is higher than for g++, and identical for Type.

The coverages for the lexer are slightly lower than the rule coverages because all the test-cases are lexically positive: there are no deliberately mis-spelled tokens, so the error-checking code within the lexer is never covered.

4.3 Reduced Test-Suites
The algorithm outlined in Section 2.2 above was applied to both of the existing test-suites to provide two new suites, called Min. g++ and Min. ISO. The coverages achieved by these new suites are identical in the syntactic domain to those of their larger versions; similarly, they give exactly the same coverage in the semantic domain for Lexer, Scope and Type, with only Parser coverage being lower.

[Figure 5: Code coverage across the six test-suites for the Parser module: (a) g++, (b) Min. g++, (c) ISO, (d) Min. ISO, (e) Purdom, (f) CDRC Purdom.]

[Figure 6: Code coverage across the six test-suites for the Scope module: (a) g++, (b) Min. g++, (c) ISO, (d) Min. ISO, (e) Purdom, (f) CDRC Purdom.]

[Figure 7: Code coverage across the six test-suites for the Type module: (a) g++, (b) Min. g++, (c) ISO, (d) Min. ISO, (e) Purdom, (f) CDRC Purdom.]

Table 3: Summarised coverages in both dimensions for the six test-suites.
Test-Suite    No. Test-Cases   Rules Covered (%)   Lexer (%)   Parser (%)   Scope (Avg. %)   Type (Avg. %)
g++           1183             91.6                77.6        86.1         82.4             84.5
Min. g++      48               91.6                77.6        82.9         82.4             84.5
ISO           440              80.2                68.4        73.9         84.0             84.5
Min. ISO      49               80.2                68.4        72.5         84.0             84.5
Purdom        53               100                 72.5        23.2         34.8             37.9
CDRC Purdom   71               100                 77.6        25.0         26.2             31.0

However, an interesting finding of this study is the size of the reduced test-suites. As stated, they give almost the same coverages across the board as their larger counterparts, yet they are dramatically smaller: Min. g++ has 1135 fewer test-cases, a reduction of 96%, and Min. ISO has 391 fewer test-cases, a saving of 89%. The sizes of the reductions are summarised in Table 5.

Table 5: Percentage reduction achieved by the test-suite reduction algorithm.
Suite   Min. No. Test-Cases   No. Original Test-Cases   % Reduction
g++     48                    1183                      96.0
ISO     49                    440                       88.9

4.4 Generated Test-Suites
By applying the C++ grammar used by keystone to the variants of Purdom's algorithm discussed in Section 2.3, two new test-suites were created, referred to as Purdom and CDRC Purdom. As Purdom's original algorithm only gives consideration to producing sentences that are grammatically correct, i.e. derivable through repeated application of the grammar rules, the resulting test-cases must be touched up by hand to ensure they can be parsed by keystone.

An example of this is the generated test-case "USING NAMESPACE IDENTIFIER SEMI", which would be translated and modified as follows:

    namespace X{};
    using namespace X;

The declaration of namespace X is essential to the successful parse of this test-case. The number of test-cases output by the Purdom suite is 53, and its coverage of the syntactic dimension is 100%, due to the nature of the algorithm. CDRC Purdom outputs 71 test-cases, which also fully cover the syntactic dimension. When the results of the semantic coverage are analysed, however, they are poor in comparison with the previous approaches. The reasons are two-fold:

1. The ISO C++ standard [1] defines a grammar that actually accepts a super-set of C++. Hence it is difficult for any approach that generates test-cases from the grammar alone to produce semantically correct test-cases.

2. The test-cases produced by Purdom's algorithm are for the most part short sentences; however, for the C++ grammar, in tandem with the small test-cases a single large sentence is produced that attempts to cover as many rules as possible. Due to the complexity of this file it is impossible to touch it up by hand, and hence keystone cannot parse this test-case.

The fact that this large test-case cannot be parsed accounts for the low coverage figures observed in modules Parser, Scope and Type. Module Lexer retains a coverage similar to the other test-suites because keystone maintains a token buffer during a parse, so the lexical code is still exercised.

We can see from the dramatic difference in the coverage of keystone's underlying code that test-cases generated from syntactic considerations alone are not sufficient to fully test a parser front-end.

5. CONCLUSIONS
In this paper we have shown the coverages achieved in both the syntactic and semantic dimensions by a number of test-suites for ISO C++. The main findings of our work are:

1. The test-cases produced by Purdom's algorithm give full rule coverage in the syntactic domain. Furthermore, this coverage is achieved using a series of small test-cases. However, the test-cases produced are not semantically correct and thus fail to achieve noteworthy coverage in the semantic domain. It is also worth noting that the test-cases produced by CDRC Purdom offer no extra advantage in terms of the size of the test-suite or coverage of the semantic dimension.

2. Test-suite reduction provides an excellent alternative to the generation of test-cases. Reducing a large, existing test-suite is a simple process. Furthermore, the number of test-cases remaining is comparable to the number generated by the Purdom approach, but with the added advantage of being semantically correct.

3. We have established a correlation between rule coverage in the syntactic domain and code coverage in the semantic domain. Generated test-cases that lacked semantic correctness gave poor coverage of the underlying code in comparison to the other test-suites. Moreover, reduced test-suites, which have identical coverage in the syntactic domain to their larger counterparts, exhibit almost exactly the same code coverage in the semantic dimension as the larger suites.

Our ongoing work includes extending the implementation-based and specification-based suites to achieve full coverage in the syntactic domain. We then hope to see an exact relationship between full coverage in the syntactic dimension and full coverage in the semantic dimension.

6. REFERENCES
[1] ISO/IEC JTC 1. International Standard: Programming Languages - C++. Number 14882:1998(E) in ASC X3. American National Standards Institute, first edition, September 1998.
[2] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[3] T.H. Gibbs, B.A. Malloy, and J.F. Power. Decorating tokens to facilitate recognition of ambiguous language constructs. Software - Practice and Experience, 33(1):19–39, January 2003.

[4] T.H. Gibbs, B.A. Malloy, and J.F. Power. Progression toward conformance of C++ language compilers. Dr. Dobbs Journal, 28(11):54–60, September 2003.
[5] J.A. Jones and M.J. Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering, 29(3):195–210, 2003.
[6] P. Klint, R. Lämmel, and C. Verhoef. Towards an engineering discipline for grammarware. Draft, submitted for journal publication; online since July 2003, 47 pages, February 2005.

[7] R. Lämmel. Grammar testing. In Proc. Fundamental Approaches to Software Engineering (FASE) 2001, volume 2029 of LNCS, pages 201–216. Springer-Verlag, 2001.
[8] B.A. Malloy and J.F. Power. An interpretation of Purdom's algorithm for automatic generation of test cases. In 1st Annual International Conference on Computer and Information Science, Orlando, FL, 2001.
[9] P. Purdom. A sentence generator for testing parsers. BIT, 12(3):366–375, 1972.
[10] M. Roper. Software Testing. McGraw-Hill, 1994.
