National University of Ireland, Maynooth
                          MAYNOOTH, CO. KILDARE, IRELAND.

                        DEPARTMENT OF COMPUTER SCIENCE,
                            TECHNICAL REPORT SERIES

   Generation Strategies for Test-Suites of Grammar-Based
                          Software

                           Mark Hennessy and James F. Power

                              NUIM-CS-TR-2005-02

http://www.cs.nuim.ie             Tel: +353 1 7083847         Fax: +353 1 7083848
Generation Strategies for Test-Suites of Grammar-Based
                             Software

                                Mark Hennessy                             James F. Power∗
                              Computer Science Dept.                    Computer Science Dept.
                            National University of Ireland            National University of Ireland
                            Maynooth, Co. Kildare, Ireland            Maynooth, Co. Kildare, Ireland
                              markh@cs.nuim.ie                          jpower@cs.nuim.ie

∗ On Sabbatical in Clemson University, South Carolina, USA.

ABSTRACT
The use of statement coverage has proved to be a useful metric when
testing code with a test-suite. Similarly, the coverage of a grammar’s
rules is an effective metric when testing a parser. However, when testing
a whole parser front-end, it is not immediately obvious whether there is
a correlation between rule coverage and underlying code coverage. We use
a number of generation strategies to generate a series of test-suites. We
apply these test-suites to keystone, a parser front-end for ISO C++, and
offer empirical evidence to suggest which generation strategy offers the
best coverage whilst using the fewest test-cases.

Keywords
Software Testing, Parser Testing, Rule Coverage, Metrics, Purdom’s
Algorithm.

1.    INTRODUCTION
The testing of a program or software system is an essential and integral
part of the software process. Testing assures us that a specification of
a program is correct or that a system behaves in the intended way. The
popularity of grammar-based tools [6] has ensured that testing these
systems for correct functioning and robustness is crucial.

There are a number of methods available when testing a grammar-based
system [10]. Specification-based testing involves deriving inputs and
expected outcomes for each test-case directly from the specification of
the system. A drawback of this method is that some parts of the code may
remain unexercised, thus lowering confidence in the robustness of the
software. With implementation-based testing, input data for a test-case
is generated from the implementation but the expected outcomes cannot be
determined from the implementation. Implementation-based test-suites
assume the presence of an oracle against which to check the result. A
testing strategy that exploits both methods is preferable to ensure a
reasonable confidence in the correct functioning of the system, but
sections of code may still go untested.

In testing a parser, we would like to ensure that all valid sentences are
accepted while incorrect sentences are rejected. This is to ensure that
the structure of the underlying grammar is adequately tested. As there is
no regard for the “meaning” of the sentences, this is known as syntactic
coverage. However, when testing a parser front-end we must ensure that
the sentences passed as input are semantically correct to ensure that the
underlying code is exercised. We refer to this as semantic coverage.
Furthermore, we would like a test-suite to utilise as many of the grammar
rules as possible because a grammar rule represents (through its
associated semantic action) the gateway to the underlying code of the
parser front-end. In this paper, we test the coverages of a parser
front-end, keystone [3], in both the syntactic and semantic dimensions
using not only specification-based and implementation-based test-suites
but also a test-suite derived automatically using Purdom’s algorithm [9].
keystone aids in the static analysis of C++ programs and consists of a
program processor and a symbol table. The program processor is
responsible for the scanning and parsing and is also responsible for
initiating and directing symbol table construction and name lookup. The
symbol table allows name-lookup in accordance with Clause 3 of the ISO
C++ standard [1].

In Section 2, we outline the test-suite generation strategies and their
operation. The methodologies used to determine the coverage achieved are
outlined in Section 3. Section 4 presents the coverages achieved by each
of the test-suites for keystone in both the syntactic and semantic
domain. Furthermore, we show how test-suite generation compares to
reduced test-suites with regard to coverage. In Section 5, we conclude
the paper.

2.     GENERATION STRATEGIES
To conduct our study, a number of different types of test-suite were
employed. Two existing test-suites that were used during the development
of the current version of keystone were chosen. These test-suites were
augmented with two other types of test-suite to ensure that the potential
maximum code coverage was achieved. The first of these types was based on
the idea of test-suite reduction [5] and involves taking a large,
existing test-suite and reducing it down to a minimum that provides the
same rule coverage. The second type involves generating test-cases
directly from the grammar specification and to this end, we chose
Purdom’s seminal algorithm for the generation of sentences from a
context-free grammar [9]. A summary of the six test-suites used can be
seen in Table 1.

    Test-Suite     Summary
    g++            C++ test-suite from the g++.dg directory of the gcc distribution.
    ISO            C++ test-cases derived directly from the ISO standard.
    Min. g++       Minimum number of test-cases from g++ test-suite.
    Min. ISO       Minimum number of test-cases from ISO test-suite.
    Purdom         C++ test-suite generated using Purdom’s algorithm.
    CDRC Purdom    C++ test-suite generated using Context-Dependent Rule Coverage.

              Table 1: Summary of the six test-suites used.

2.1    Existing Test-suites
During the testing of keystone, two large existing test-suites for C++
were used. The first of these was the g++.dg test-suite used to test the
C++ compiler that forms part of the GNU Compiler Collection, gcc. The
second was a specification-based suite derived from the ISO C++ standard
[1] which has been used to measure conformance with the ISO standard [4].

    for each test-case tc in test-suite ts do
      Add tc coverage vector to array [ ][ ] a
    end for
    minsuite = { }
    addColumns ( a )
    for each column that sums to 1 do
      minsuite = minsuite ∪ essential tc
    end for
    while not all rules covered do
      addRows ( a )
      addColumns ( a )
      minsuite = minsuite ∪ largest-covering tc
    end while

              Figure 1: Test-Suite Reduction Algorithm
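
The reduction procedure of Figure 1, whose steps are walked through in Section 2.2, might be sketched in Python roughly as follows. This is only an illustration and not the authors' implementation (which the paper reports was written in Java): the function name `reduce_suite` and the dictionary-of-sets representation of the coverage vectors are assumptions made here for brevity.

```python
def reduce_suite(coverage):
    """Greedily reduce a test-suite while preserving its rule coverage.

    `coverage` maps a test-case name to the set of rule numbers it covers
    (a set-based stand-in for the 0/1 coverage vectors; hypothetical layout).
    Returns the names of the selected test-cases in selection order.
    """
    # Every rule covered by the full suite must stay covered.
    uncovered = set().union(*coverage.values()) if coverage else set()
    minsuite = []

    # Essential test-cases: a rule covered by exactly one test-case
    # (a column that sums to one) forces that test-case into the result.
    for rule in sorted(uncovered):
        owners = [tc for tc, rules in coverage.items() if rule in rules]
        if len(owners) == 1 and owners[0] not in minsuite:
            minsuite.append(owners[0])
    for tc in minsuite:
        uncovered -= coverage[tc]

    # Heuristic phase: repeatedly take the test-case that covers the most
    # still-uncovered rules (the row with the largest remaining sum).
    while uncovered:
        best = max(coverage, key=lambda tc: len(coverage[tc] & uncovered))
        minsuite.append(best)
        uncovered -= coverage[best]
    return minsuite


# Small made-up example: t1 and t4 are essential, t2 is chosen greedily.
example = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {4}, "t4": {1, 5}}
selected = reduce_suite(example)
```

As the text notes for the real algorithm, the greedy phase keeps the coverage intact but is not guaranteed to find the smallest possible suite.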
2.2    Test-suite Reduction
The notion behind test-suite reduction [5] is a relatively simple one.
Given an existing test-suite, we wish to reduce it to the smallest core
of test-cases that still provides the same amount of rule coverage. The
algorithm, shown in Figure 1, operates as follows:

  1. For each test-case in the test-suite, a vector containing an entry
     for each rule is output. Within the vector, the rule is marked as
     covered or not with a one or zero.

  2. The vectors are placed together in a 2D array. The rows are indexed
     by test-case and the columns are indexed by rule number. The columns
     are then summed.

  3. For any column that sums to one, i.e. only one test-case covers the
     rule, this test-case is deemed essential and added to the minimal
     test-suite. When a test-case is added to the minimum suite, all of
     the rules that are covered by this test-case are set to zero. This
     process is repeated until all the essential test-cases are added to
     the min. suite.

  4. The rows are then summed. The test-case that contributes the most
     coverage is now identified, i.e. the row with the largest sum. This
     is added to the minimum set and its coverages are set to zero. This
     step is repeated until all columns sum to zero, i.e. all coverages
     have been accounted for in the min. suite.

It is worth noting that once all the essential test-cases have been
removed, the problem of choosing the minimum test-set that covers the
remaining rules is equivalent to the minimum cardinality hitting set
problem, which is intractable [2]. Hence the process will always be
heuristic and in our case we choose to always use the test-case that
contributes the most coverage even though it can be proved that this will
not guarantee the smallest test-suite.

2.3    Purdom’s Algorithm
Purdom’s algorithm [9] and its later interpretation [8] address the issue
of automatically generating test cases from a context-free grammar. The
goal of the algorithm is to generate a series of short sentences, such
that every grammar rule is used at least once. The algorithm proceeds in
two distinct phases.

The first phase calculates two tables for each non-terminal. The first of
these, the SHORT table, records the rule to use to derive the shortest
sentence starting with the respective non-terminal. The second table,
called PREV, contains the rule to use to introduce non-terminal n into
the shortest derivation. The second phase of the algorithm utilises these
tables to generate the sentences. A table known as ONCE keeps track of
the rules covered. The algorithm terminates when all the grammar rules
have been exercised.

However, rule coverage discloses a grammar’s structure only in a weak
sense. For large and complex grammars it is desirable that valid
combinations of productions are utilised to generate test cases that
reflect more accurately the rich syntactic structure of the grammar. A
generalisation of rule coverage has been proposed [7], such that the
context in which a rule is covered is taken into account. This is known
as Context-Dependent Rule Coverage (CDRC) and in essence it ensures that
every possible valid combination of rule pairs is exercised.

Using the grammar shown in Table 2 as an example, CDRC
Purdom works as follows: Every non-terminal that appears on the right
hand side of a grammar rule is noted, e.g. non-terminal B occurs on the
right hand side of rule 2, A → a B. This is known as a direct occurrence
of B in A. So for a test-suite to exhibit CDRC for the simple grammar of
Table 2, all rules with B on the left-hand side of the rule must be
exercised for every direct occurrence of B in the grammar. A sample
test-suite achieving CDRC for this grammar would be: {abbcb, acc, ac} and
is shown in Figure 2.

                      1       S       →   ABC
                      2       A       →   aB
                      3       B       →   Bb
                      4       B       →   C
                      5       C       →   cC
                      6       C       →   ε

               Table 2: Simple grammar.

[Figure 2: three derivation trees rooted at S; diagram not recoverable from the extracted text.]

Figure 2: Sample test-suite achieving CDRC for the Grammar in Table 2.
Every rule for each direct occurrence of a non-terminal on the right-hand
side of a grammar rule is accounted for.

In our extension of Purdom’s original algorithm, we added another table
called OCCS to the algorithm which keeps track of all the direct
occurrences within a grammar. The table is indexed by non-terminals with
all the direct occurrences for a given non-terminal making up the
entries. This table along with the existing ONCE table is consulted when
choosing the next rule to be used. When all the entries in the OCCS table
have been covered, the generation of test-cases ceases. This modification
to Purdom’s original algorithm along with the original Purdom algorithm
added two more test-suites thus bringing the total number of test-suites
to six.

3.   METHODOLOGY
The case study was carried out using six test-suites for the ISO C++
language standard. A number of tools and programs were used to generate
the test-cases and to reduce the test-suites. All tests were performed on
keystone version 0.30. Two large, existing test-suites were first picked.
The first of these was the g++ test-suite from gcc version 3.4. This
implementation-based suite consists of 1183 individual test-cases
partitioned into sections that test the parser front-end and the code
generation of the back-end. The second suite, ISO, was used in [4] to
measure conformance with the ISO C++ standard and consists of 440
test-cases sectioned according to the clauses found in the ISO C++
standard [1].

To achieve the test-suite reduction, a number of steps were taken. The
first was to modify the parser for keystone to output a single file for
each test-case containing a rule number, one per line, of each rule used
during the parse of that test-case. The test-suite reduction algorithm
was implemented in the Java programming language using 217 lines of code.
The algorithm was applied to both existing test-suites to produce two new
test-suites which we call Min. g++ and Min. ISO.

The syntactic coverage that each test-suite provided was determined by
the following method: the files output by the parser were concatenated
into a single monolithic file containing all the rule coverages for every
test-case in the suite. This was then sorted using the UNIX tool sort.
Finally the UNIX tool uniq was used to pare down the sorted file, such
that only one instance of every covered rule remained in the file. The
number of lines in the file output by uniq is the number of rules covered
by a test-suite.

Purdom’s algorithm was implemented in 673 lines of code in the Python
scripting language. The extension to context-dependent rule coverage
added an extra 246 lines of code. The number of test-cases output by
Purdom’s original algorithm was 53 while CDRC Purdom output 71
test-cases.

Finally, keystone itself was profiled with the tool gcov, a profiling
tool that is part of gcc. This tool measures the statement coverage for a
given file when a test-case is executed. This is illustrated in Figure 3.

4.   RESULTS
In this section we present the results of our case study to determine
which generation strategy is the most effective at achieving maximum
coverage in the syntactic and semantic dimension. The results shown are
partitioned according to their domain. It is important to note the
distinction between
what we define as the syntactic domain and the semantic domain. Coverage
of the grammar rules alone is classed as coverage of the syntactic
domain. If all of the grammar rules are exercised by a test-suite, then
that test-suite is said to achieve full coverage in the syntactic domain.
The coverage of the semantic domain is determined by how much of the
underlying parser front-end code is executed when a test-suite is run. We
expect to see a close correlation between coverage in the syntactic
domain and coverage in the semantic domain due to the fact that the only
entry point to the underlying code is through the associated semantic
action for a grammar rule. The results are presented in Table 3 and are
discussed in the rest of this section.

Figure 3: The steps involved in measuring the code coverage with gcov.
keystone is compiled by gcc with extra flags. This outputs a profiled
keystone executable. When one of our test-suites is run by this
executable, a statistics file corresponding to a source file is output.
gcov can then determine how many lines of code are executed in this
corresponding source file.

4.1   Keystone Structure

                         Keystone
                       /    |     \
                  Lexer  Parser  Symboltable
                                   /     \
                              Scopes    Types

            Figure 4: Keystone Structure.

It is worth pointing out briefly the structure of keystone and how the
results for the semantic domain are interpreted. The underlying code is
separated into four distinct sections as shown in Figure 4. Lexer is the
code coverage associated with the file output by the tool Flex. Parser is
a directory containing files generated by the tool BTYacc and associated
files to deal with semantic actions. Within the symbol-table of keystone
are modules to determine scope within a program and to aid in type
checking and allocation. Both of these modules are heavily dependent on
the semantics of the test-case in question and coverages of both can only
be achieved by test-cases that are semantically correct. It is also worth
noting that the coverage results of these modules are based upon code
that is used only for the normal operation of keystone. Thus user aids
for debugging such as pretty print methods etc. are excluded from the
measurement figures.

The statement coverage figures for the Parser files and the modules Scope
and Type are presented in Figures 5, 6 and 7 respectively. From these
results we can see that the coverage offered by the reduced test-suites
is nearly identical to that of their larger counterparts. The poor
results offered by the Purdom approaches are due to the fact that the
test-cases generated lack semantic correctness and thus they never get to
execute underlying code for the symbol table.

4.2   Existing Test-Suites
The existing test-suites, g++ and ISO, consisted of 1183 and 440
test-cases respectively and are summarised in Table 4. These were the
benchmarks against which the other test-suites were measured. The suite
g++ is an implementation-based suite and achieved coverage in the
syntactic domain of 491 rules out of 536 total rules. This test-suite is
designed to fully test the C++ compiler from gcc. The presence of GNU C++
extensions in some of the test-cases means that full coverage is not
achieved due to the fact that keystone was developed with only the C++
standard in mind. This suite exhibits the best coverage across the
semantic dimension on average due to larger coverages of the Lexer and
Parser code.

          Test-Suite    Num. Test-Cases    Rules covered
             g++             1183            491 / 536
             ISO              440            430 / 536

          Table 4: Rule coverage for existing test-suites.

Suite ISO consists of 440 test-cases derived directly from the clauses of
the ISO C++ standard [1]. This is a specification-based suite and covers
430 of 536 rules. It is interesting to note that despite the concerted
effort to ensure every rule has a test-case, it falls well short of the
implementation-based suite in the syntactic dimension but the test-cases
are still well constructed such that the coverage for module Scope is
higher than g++ and identical for Type.

The fact that the coverages for the lexer are slightly lower than the
rule coverages is due to the fact that all the test-cases are lexically
positive. Thus there are no deliberately mis-spelled tokens, hence error
checking code within the lexer is never covered.

4.3   Reduced Test-Suites
The algorithm outlined in Section 2.2 above was applied to both the
existing test-suites to provide two new suites called Min. g++ and Min.
ISO. The coverages achieved by these new suites are identical in the
syntactic domain to those of their larger versions and similarly they
give exactly the same coverage in the semantic domain for Lexer, Scope
and Type with Parser coverage being lower.
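
The claim that a reduced suite matches its parent in the syntactic domain can be checked mechanically from the per-test-case rule files described in Section 3 (one rule number per line). The following is a minimal Python sketch of that check, assuming exactly that file layout; the function names `rules_covered` and `same_rule_coverage` are hypothetical and are not part of keystone or of the authors' tooling. Collecting the rule numbers into a set plays the role of the `sort` and `uniq` steps.

```python
from pathlib import Path


def rules_covered(trace_files):
    """Return the set of distinct rule numbers found in the given files.

    Each file is assumed to hold one rule number per line. The size of the
    returned set corresponds to the line count left after sort and uniq.
    """
    covered = set()
    for path in trace_files:
        # split() tolerates blank lines and trailing newlines
        for token in Path(path).read_text().split():
            covered.add(int(token))
    return covered


def same_rule_coverage(full_suite_files, reduced_suite_files):
    """True if the reduced suite covers exactly the rules of the full suite."""
    return rules_covered(full_suite_files) == rules_covered(reduced_suite_files)
```

Dividing `len(rules_covered(files))` by the total number of grammar rules gives the rule-coverage percentage for a suite.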
[Figures 5, 6 and 7 each consist of six panels: (a) g++, (b) Min. g++,
(c) ISO, (d) Min. ISO, (e) Purdom, (f) CDRC Purdom.]

   Figure 5: Code coverage across the six test-suites for the Parser module.

   Figure 6: Code coverage across the six test-suites for the Scope module.

   Figure 7: Code coverage across the six test-suites for the Type module.
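
The percentages plotted in Figures 5 to 7 are statement coverages obtained with gcov. As a rough illustration of what such a figure means for a single source file, the sketch below derives a percentage from the text of a gcov annotated listing, in which each line has the form `count:lineno:source`, a count of `-` marks a line with no executable code, and `#####` marks an executable line that was never run. The function name `statement_coverage` is hypothetical, and this is not a description of how the authors post-processed their data.

```python
def statement_coverage(gcov_text):
    """Return (executed, executable, percent) for one annotated gcov listing."""
    executed = 0
    executable = 0
    for line in gcov_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) < 3:
            continue  # not a "count:lineno:source" line
        count = parts[0].strip()
        if count == "-":
            continue  # no executable code on this line
        executable += 1
        if count not in ("#####", "====="):
            # A numeric execution count; newer gcov versions may append '*'.
            if int(count.rstrip("*")) > 0:
                executed += 1
    percent = 100.0 * executed / executable if executable else 0.0
    return executed, executable, percent


# Made-up listing: three executable lines, two of them executed.
sample = (
    "        -:    0:Source:demo.c\n"
    "        -:    1:#include <stdio.h>\n"
    "        2:    3:int f(int x) {\n"
    "        2:    4:  if (x) return 1;\n"
    "    #####:    5:  return 0;\n"
    "        -:    6:}\n"
)
result = statement_coverage(sample)
```

A per-module figure such as those for Scope or Type would then be an aggregate over the source files belonging to that module.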
                  Syntactic Coverages                         Semantic Coverages
 Test Suite     No. Test Cases  Rules Covered (%)   Lexer (%)  Parser (%)  Scope (Avg. %)  Type (Avg. %)
 g++                 1183             91.6            77.6        86.1          82.4            84.5
 Min. g++              48             91.6            77.6        82.9          82.4            84.5
 ISO                  440             80.2            68.4        73.9          84.0            84.5
 Min. ISO              49             80.2            68.4        72.5          84.0            84.5
 Purdom                53             100             72.5        23.2          34.8            37.9
 CDRC Purdom           71             100             77.6        25.0          26.2            31.0

      Table 3: Summarised coverages in both dimensions for the six test-suites.

However an interesting finding of this study is the size of the reduced
test-suites. As stated they give almost the same coverages across the
board as their larger counterparts yet the size of the new test-suites is
dramatically smaller. We find that for Min. g++ there are 1135 fewer
test-cases, a reduction of 96%. For Min. ISO, there are 391 fewer
test-cases, a saving of 89%. The size of the reductions is summarised in
Table 5.

      Min.        No.       No. original      %
      Suite    Test-Cases   Test-Cases     Reduction
      g++          48          1183          96.0
      ISO          49           440          88.9

Table 5: Percentage reduction achieved by the Test-Suite reduction
algorithm.

4.4   Generated Test-Suites
By applying the variants of Purdom’s algorithm discussed in Section 2.3
to the C++ grammar used by keystone, two new test-suites are created.
These are referred to as Purdom and CDRC Purdom. As Purdom’s original
algorithm only gives consideration to producing sentences that are
grammatically correct, i.e. the derivation of a sentence through repeated
application of the grammar rules, the resulting test-cases must be
touched up by hand to ensure they can be parsed by keystone. For example,
the generated test-case:

      “USING NAMESPACE IDENTIFIER SEMI”

would be translated and modified as follows:

      namespace X{};
      using namespace X;

  1. The ISO C++ standard [1] defines a grammar that actually accepts a
     super-set of C++. Hence any approach that generates test-cases
     automatically from the grammar alone will have difficulty producing
     semantically correct test-cases.

  2. The test-cases produced by Purdom’s algorithm are for the most part
     short sentences. However, for the C++ grammar, in tandem with the
     small test-cases, a single large sentence is produced that attempts
     to cover as many rules as possible. Due to the complexity of the
     file it is impossible to touch it up by hand, hence keystone cannot
     parse this test-case.

The fact that this large test-case cannot be parsed accounts for the low
coverage figures that can be observed in modules Parser, Scope and Type.
Module Lexer has the same coverage due to the fact that keystone
maintains a token buffer during a parse, so the lexical code maintains a
coverage similar to the other test-suites.

We can see from the dramatic difference in underlying code coverage of
keystone that test-cases generated from syntactic considerations alone
are not sufficient to fully test a parser front-end.

5.     CONCLUSIONS
In this paper we have shown the coverages achieved in both the syntactic
and semantic dimensions exhibited by a number of test-suites for ISO C++.
The main findings of our work are:

  1. The test-cases produced by Purdom’s algorithm give full rule
     coverage in the syntactic domain. Furthermore the coverage is
     achieved by using a series of small test-cases. However, the
     test-cases produced are not semantically correct and thus fail to
     achieve noteworthy coverage in the semantic domain. It is also worth
The declaration of namespace X is essential to the successful              noting that the test-cases produced by CDRC Pur-
parse of this test-case.                                                   dom offer no extra advantage in terms of the size of
                                                                           the test-suite or coverage of the semantic dimension.
The number of test-cases output by the test-suite Purdom is
53 and the coverage of the syntactic dimension is 100% (due             2. Test-suite reduction provides an excellent alternative
to the nature of the algorithm). CDRC Purdom outputs 71                    to the generation of test-cases. Reducing a large, ex-
test-cases that also fully cover the syntactic dimension.                  isting test-suite is a simple process. Furthermore the
                                                                           number of test-cases remaining is comparable to amount
When the results of the semantic coverage are analysed, the                of test-cases generated by the Purdom approach, but
results are poor in comparison with the previous approaches.               with the added advantage of being semantically cor-
The reasons are two-fold:                                                  rect.
     3. We have established a correlation between rule coverage in the
        syntactic domain and code coverage in the semantic domain.
        Generated test-cases that lacked semantic correctness gave poor
        coverage of the underlying code in comparison to the other
        test-suites. In addition, reduced test-suites, which have
        identical coverage in the syntactic domain to their larger
        counterparts, exhibit almost the same code coverage in the
        semantic dimension as the larger suites.
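
The reduction step behind these findings is not reproduced in this
section, so the following is only a minimal sketch of one common
approach: a greedy set-cover heuristic that keeps the test-case adding
the most not-yet-covered grammar rules until the reduced suite covers
every rule the original suite covers. It may differ from the algorithm
actually used in this report. The names TestCase, reduceSuite and
coverageOf are hypothetical, as is the assumption that each test-case is
described by the set of rule names it exercises.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: a test-case is described only by the set of
// grammar rules (identified by name) that it exercises when parsed.
struct TestCase {
    std::string name;
    std::set<std::string> rulesCovered;
};

// Union of the rules covered by every test-case in a suite.
std::set<std::string> coverageOf(const std::vector<TestCase>& suite) {
    std::set<std::string> covered;
    for (const TestCase& tc : suite)
        covered.insert(tc.rulesCovered.begin(), tc.rulesCovered.end());
    return covered;
}

// Greedy reduction: repeatedly keep the unused test-case that covers the most
// still-uncovered rules. This is a heuristic, so the result preserves rule
// coverage but is not guaranteed to be the smallest possible suite.
std::vector<TestCase> reduceSuite(const std::vector<TestCase>& suite) {
    std::set<std::string> uncovered = coverageOf(suite);
    std::vector<bool> used(suite.size(), false);
    std::vector<TestCase> reduced;

    while (!uncovered.empty()) {
        std::size_t best = suite.size();
        std::size_t bestGain = 0;
        for (std::size_t i = 0; i < suite.size(); ++i) {
            if (used[i]) continue;
            std::size_t gain = 0;
            for (const std::string& rule : suite[i].rulesCovered)
                gain += uncovered.count(rule);
            if (gain > bestGain) {
                bestGain = gain;
                best = i;
            }
        }
        // Each uncovered rule belongs to some unused test-case, so a
        // candidate with positive gain always exists here.
        assert(best != suite.size());
        used[best] = true;
        for (const std::string& rule : suite[best].rulesCovered)
            uncovered.erase(rule);
        reduced.push_back(suite[best]);
    }
    return reduced;
}
```

Calling reduceSuite on a suite and comparing coverageOf for the input and
the result checks the property stated in finding 3 for the syntactic
dimension: the reduced suite covers the same rules as the original. The
sketch says nothing about code coverage, which would have to be measured
separately.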

Our ongoing work includes extending the implementation-based and
specification-based suites to achieve full coverage in the syntactic
domain. We then hope to determine whether full coverage in the
syntactic dimension maps to full coverage in the semantic dimension.

6.     REFERENCES
 [1] ISO/IEC JTC 1. International Standard:
     Programming Languages - C++. Number
     14882:1998(E) in ASC X3. American National
     Standards Institute, first edition, September 1998.
 [2] M.R. Garey and D.S. Johnson. Computers and
     Intractability: A guide to the Theory of
     NP-Completeness. W.H. Freeman, 1979.
 [3] T.H. Gibbs, B.A. Malloy, and J.F. Power. Decorating
     tokens to facilitate recognition of ambiguous language
     constructs. Software - Practice and Experience,
     33(1):19–39, January 2003.
 [4] T.H. Gibbs, B.A. Malloy, and J.F. Power. Progression
     toward conformance of C++ language compilers. Dr.
     Dobb’s Journal, 28(11):54–60, September 2003.
 [5] J.A. Jones and M.J. Harrold. Test-suite reduction and
     prioritization for modified condition/decision
     coverage. IEEE Transactions on Software Engineering,
     29(3):195–210, 2003.
 [6] P. Klint, R. Lämmel, and C. Verhoef. Towards an
     engineering discipline for grammarware. Draft,
     Submitted for journal publication; Online since July
     2003, 47 pages, February 2005.
 [7] R. Lämmel. Grammar Testing. In Proc. of
     Fundamental Approaches to Software Engineering
     (FASE) 2001, volume 2029 of LNCS, pages 201–216.
     Springer-Verlag, 2001.
 [8] B.A. Malloy and J.F. Power. An interpretation of
     Purdom’s algorithm for automatic generation of test
     cases. In 1st Annual International Conference on
     Computer and Information Science, Orlando, FL.,
     2001.
 [9] P. Purdom. A sentence generator for testing parsers.
     BIT, 12(3):366–375, 1972.
[10] M. Roper. Software Testing. McGraw-Hill, 1994.