DEPARTMENT OF COMPUTER SCIENCE, TECHNICAL REPORT SERIES
National University of Ireland, Maynooth, Co. Kildare, Ireland
http://www.cs.nuim.ie   Tel: +353 1 7083847   Fax: +353 1 7083848

Generation Strategies for Test-Suites of Grammar-Based Software
Mark Hennessy and James F. Power
NUIM-CS-TR-2005-02

Generation Strategies for Test-Suites of Grammar-Based Software

Mark Hennessy
Computer Science Dept.
National University of Ireland
Maynooth, Co. Kildare, Ireland
markh@cs.nuim.ie

James F. Power*
Computer Science Dept.
National University of Ireland
Maynooth, Co. Kildare, Ireland
jpower@cs.nuim.ie

ABSTRACT
The use of statement coverage has proved to be a useful metric when testing code with a test-suite. Similarly, the coverage of a grammar's rules is an effective metric when testing a parser. However, when testing a whole parser front-end, it is not immediately obvious whether there is a correlation between rule coverage and underlying code coverage. We use a number of generation strategies to generate a series of test-suites. We apply these test-suites to keystone, a parser front-end for ISO C++, and offer empirical evidence to suggest which generation strategy offers the best coverage whilst using the fewest test-cases.

Keywords
Software Testing, Parser Testing, Rule Coverage, Metrics, Purdom's Algorithm.

1. INTRODUCTION
The testing of a program or software system is an essential and integral part of the software process. Testing assures us that a specification of a program is correct or that a system behaves in the intended way. The popularity of grammar-based tools [6] has ensured that testing these systems for correct functioning and robustness is crucial.

There are a number of methods available when testing a grammar-based system [10]. Specification-based testing involves deriving inputs and expected outcomes for each test-case directly from the specification of the system. A drawback of this method is that some parts of the code may remain unexercised, thus lowering confidence in the robustness of the software. With implementation-based testing, input data for a test-case is generated from the implementation, but the expected outcomes cannot be determined from the implementation. Implementation-based test-suites

* On sabbatical at Clemson University, South Carolina, USA.

assume the presence of an oracle against which to check the result. A testing strategy that exploits both methods is preferable to ensure reasonable confidence in the correct functioning of the system, but even then sections of code may go untested.

In testing a parser, we would like to ensure that all valid sentences are accepted while incorrect sentences are rejected; this ensures that the structure of the underlying grammar is adequately tested. As there is no regard for the "meaning" of the sentences, this is known as syntactic coverage. However, when testing a parser front-end we must ensure that the sentences passed as input are semantically correct, so that the underlying code is exercised. We refer to this as semantic coverage. Furthermore, we would like a test-suite to utilise as many of the grammar rules as possible, because a grammar rule represents (through its associated semantic action) the gateway to the underlying code of the parser front-end. In this paper, we test the coverage of a parser front-end, keystone [3], in both the syntactic and semantic dimensions, using not only specification-based and implementation-based test-suites but also a test-suite derived automatically using Purdom's algorithm [9]. keystone aids in the static analysis of C++ programs and consists of a program processor and a symbol table. The program processor is responsible for scanning and parsing, and also for initiating and directing symbol-table construction and name lookup. The symbol table allows name lookup in accordance with Clause 3 of the ISO C++ standard [1].

In Section 2, we outline the test-suite generation strategies and their operation. The methodologies used to determine the coverage achieved are outlined in Section 3. Section 4 presents the coverages achieved by each of the test-suites for keystone in both the syntactic and semantic domains. Furthermore, we show how test-suite generation compares to reduced test-suites with regard to coverage. In Section 5, we conclude the paper.

2. GENERATION STRATEGIES
To conduct our study, a number of different types of test-suite were employed. Two existing test-suites that were used during the development of the current version of keystone were chosen. These test-suites were augmented with two other types of test-suite to ensure that the potential maximum code coverage was achieved. The first of these types was based on the idea of test-suite reduction [5] and involves

taking a large, existing test-suite and reducing it down to a minimum that provides the same rule coverage. The second type involves generating test-cases directly from the grammar specification; to this end, we chose Purdom's seminal algorithm for the generation of sentences from a context-free grammar [9]. A summary of the six test-suites used can be seen in Table 1.

Table 1: Summary of the six test-suites used.
Test-Suite    Summary
g++           C++ test-suite from the g++.dg directory of the gcc distribution.
ISO           C++ test-cases derived directly from the ISO standard.
Min. g++      Minimum number of test-cases from the g++ test-suite.
Min. ISO      Minimum number of test-cases from the ISO test-suite.
Purdom        C++ test-suite generated using Purdom's algorithm.
CDRC Purdom   C++ test-suite generated using Context-Dependent Rule Coverage.

2.1 Existing Test-suites
During the testing of keystone, two large existing test-suites for C++ were used. The first of these was the g++.dg test-suite used to test the C++ compiler that forms part of the GNU Compiler Collection, gcc. The second was a specification-based suite derived from the ISO C++ standard [1], which has been used to measure conformance with the ISO standard [4].

2.2 Test-suite Reduction
The notion behind test-suite reduction [5] is a relatively simple one. Given an existing test-suite, we wish to reduce it to the smallest core of test-cases that still provides the same amount of rule coverage. The algorithm, shown in Figure 1, operates as follows:

1. For each test-case in the test-suite, a vector containing an entry for each rule is output. Within the vector, each rule is marked as covered or not with a one or zero.

2. The vectors are placed together in a 2D array. The rows are indexed by test-case and the columns are indexed by rule number. The columns are then summed.

3. For any column that sums to one, i.e. only one test-case covers the rule, that test-case is deemed essential and added to the minimal test-suite. When a test-case is added to the minimum suite, all of the rules that are covered by this test-case are set to zero. This process is repeated until all the essential test-cases have been added to the minimum suite.

4. The rows are then summed. The test-case that contributes the most coverage, i.e. the row with the largest sum, is now identified. This is added to the minimum set and its coverages are set to zero. This step is repeated until all columns sum to zero, i.e. all coverages have been accounted for in the minimum suite.

It is worth noting that once all the essential test-cases have been removed, the problem of choosing the minimum test-set that covers the remaining rules is equivalent to the minimum-cardinality hitting set, which is an intractable problem [2]. Hence the process will always be heuristic; in our case we choose to always use the test-case that contributes the most coverage, even though it can be proved that this will not guarantee the smallest test-suite. The algorithm is given in Figure 1, and a sketch in code form follows below.

for each test-case tc in test-suite ts do
    add tc coverage vector to array a
end for
minsuite ← ∅
addColumns(a)
for each column that sums to 1 do
    minsuite ← minsuite ∪ {essential tc}
end for
while not all rules covered do
    addRows(a)
    addColumns(a)
    minsuite ← minsuite ∪ {largest-covering tc}
end while
Figure 1: Test-Suite Reduction Algorithm
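For concreteness, the following is a minimal sketch of the reduction algorithm in Python. The paper's implementation was 217 lines of Java; this sketch is illustrative only, and the data layout (a 0/1 coverage vector per test-case, as produced by step 1 above) and all names are our own assumptions.

# Greedy test-suite reduction (Figure 1), as a minimal sketch.
# Assumes each test-case maps to a 0/1 coverage vector with one
# entry per grammar rule; names are illustrative, not those of
# the original Java implementation.

def reduce_suite(coverage):
    """coverage: dict mapping test-case name -> list of 0/1 rule flags."""
    cases = list(coverage)
    num_rules = len(next(iter(coverage.values())))
    # Only rules covered by at least one test-case can be accounted for.
    uncovered = {r for r in range(num_rules)
                 if any(coverage[tc][r] for tc in cases)}
    minsuite = []

    # Essential test-cases: a rule covered by exactly one test-case
    # forces that test-case into the minimum suite.
    for r in sorted(uncovered):
        covering = [tc for tc in cases if coverage[tc][r]]
        if len(covering) == 1 and covering[0] not in minsuite:
            minsuite.append(covering[0])
    for tc in minsuite:
        uncovered -= {r for r in range(num_rules) if coverage[tc][r]}

    # Greedy heuristic: repeatedly take the test-case covering the
    # most remaining rules until every coverable rule is accounted for.
    while uncovered:
        best = max(cases,
                   key=lambda tc: sum(coverage[tc][r] for r in uncovered))
        minsuite.append(best)
        uncovered -= {r for r in range(num_rules) if coverage[best][r]}
    return minsuite

# Example: rules 0 and 3 are each covered by exactly one test-case,
# so t1 and t2 are essential and between them cover every rule.
suite = {"t1": [1, 1, 0, 0], "t2": [0, 1, 1, 1], "t3": [0, 0, 1, 0]}
print(reduce_suite(suite))  # ['t1', 't2']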

2.3 Purdom's Algorithm
Purdom's algorithm [9] and its later interpretation [8] address the issue of automatically generating test-cases from a context-free grammar. The goal of the algorithm is to generate a series of short sentences such that every grammar rule is used at least once. The algorithm proceeds in two distinct phases. The first phase calculates two tables for each non-terminal. The first of these, the SHORT table, records the rule to use to derive the shortest sentence starting from the respective non-terminal. The second table, called PREV, contains the rule to use to introduce non-terminal n into the shortest derivation. The second phase of the algorithm utilises these tables to generate the sentences. A table known as ONCE keeps track of the rules covered, and the algorithm terminates when all the grammar rules have been exercised. A sketch of the SHORT computation is given below.

However, rule coverage discloses a grammar's structure only in a weak sense. For large and complex grammars it is desirable that valid combinations of productions are utilised, so as to generate test-cases that reflect more accurately the rich syntactic structure of the grammar. A generalisation of rule coverage has been proposed [7], such that the context in which a rule is covered is taken into account. This is known as Context-Dependent Rule Coverage (CDRC); in essence, it ensures that every possible valid combination of rule pairs is exercised.
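Before working through the CDRC example, here is a minimal sketch of phase one of Purdom's algorithm, the SHORT-table fixpoint, using the simple grammar defined in Table 2 below. The representation (a rule list, an infinity sentinel) is our own assumption, not the paper's 673-line implementation.

# Phase one of Purdom's algorithm: a fixpoint computation of the
# SHORT table for the simple grammar of Table 2. Illustrative only.

import math

# Grammar rules as (lhs, rhs) pairs; lowercase symbols are terminals.
rules = [
    ("S", ["A", "B", "C"]),   # 1: S -> A B C
    ("A", ["a", "B"]),        # 2: A -> a B
    ("B", ["B", "b"]),        # 3: B -> B b
    ("B", ["C"]),             # 4: B -> C
    ("C", ["c", "C"]),        # 5: C -> c C
    ("C", []),                # 6: C -> epsilon
]
nonterminals = {lhs for lhs, _ in rules}

shortest = {nt: math.inf for nt in nonterminals}  # shortest sentence length
short = {}                                        # SHORT: rule to use per non-terminal
changed = True
while changed:                                    # iterate to a fixpoint
    changed = False
    for i, (lhs, rhs) in enumerate(rules, start=1):
        # Length of the shortest sentence derivable via this rule.
        length = sum(shortest[s] if s in nonterminals else 1 for s in rhs)
        if length < shortest[lhs]:
            shortest[lhs], short[lhs] = length, i
            changed = True

print(short)  # {'C': 6, 'B': 4, 'A': 2, 'S': 1}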

Using the grammar shown in Table 2 as an example, CDRC works as follows. Every non-terminal that appears on the right-hand side of a grammar rule is noted; e.g. non-terminal B occurs on the right-hand side of rule 2, A → a B. This is known as a direct occurrence of B in A. So, for a test-suite to exhibit CDRC for the simple grammar of Table 2, all rules with B on the left-hand side must be exercised for every direct occurrence of B in the grammar. A sample test-suite achieving CDRC for this grammar would be {abbcb, acc, ac}; its derivation trees are shown in Figure 2.

Table 2: Simple grammar.
1  S → A B C
2  A → a B
3  B → B b
4  B → C
5  C → c C
6  C → ε

[Figure 2: Sample test-suite achieving CDRC for the grammar in Table 2. Every rule for each direct occurrence of a non-terminal on the right-hand side of a grammar rule is accounted for.]

In our extension of Purdom's original algorithm, we added another table, called OCCS, which keeps track of all the direct occurrences within a grammar. The table is indexed by non-terminal, with all the direct occurrences for a given non-terminal making up the entries. This table, along with the existing ONCE table, is consulted when choosing the next rule to be used. When all the entries in the OCCS table have been covered, the generation of test-cases ceases. This modification to Purdom's original algorithm, together with the original algorithm itself, added two more test-suites, bringing the total number of test-suites to six. The direct-occurrence bookkeeping is sketched below.
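As an illustration, the following sketch enumerates the direct-occurrence obligations that CDRC imposes for the Table 2 grammar. The triple representation is our own assumption, not the OCCS layout of the actual implementation.

# Enumerate the (occurrence, rule) pairs CDRC must cover for the
# Table 2 grammar. A triple (i, nt, j) records that rule j (with nt
# on its left-hand side) must be exercised at the direct occurrence
# of nt on the right-hand side of rule i. Illustrative only.

rules = [
    ("S", ["A", "B", "C"]),   # 1
    ("A", ["a", "B"]),        # 2
    ("B", ["B", "b"]),        # 3
    ("B", ["C"]),             # 4
    ("C", ["c", "C"]),        # 5
    ("C", []),                # 6
]
nonterminals = {lhs for lhs, _ in rules}

occs = set()
for i, (_, rhs) in enumerate(rules, start=1):
    for nt in rhs:
        if nt not in nonterminals:
            continue  # skip terminals
        for j, (lhs, _) in enumerate(rules, start=1):
            if lhs == nt:
                occs.add((i, nt, j))

# 13 obligations in total for this grammar, e.g. (1, 'B', 3):
# rule 3 must be used for the B occurring on the rhs of rule 1.
for triple in sorted(occs):
    print(triple)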

3. METHODOLOGY
The case study was carried out using six test-suites for the ISO C++ language standard. A number of tools and programs were used to generate the test-cases and to reduce the test-suites. All tests were performed on keystone version 0.30. Two large, existing test-suites were chosen first. The first of these was the g++ test-suite from gcc version 3.4. This implementation-based suite consists of 1183 individual test-cases, partitioned into sections that test the parser front-end and the code generation of the back-end. The second suite, ISO, was used in [4] to measure conformance with the ISO C++ standard and consists of 440 test-cases, sectioned according to the clauses of the ISO C++ standard [1].

To achieve the test-suite reduction, a number of steps were taken. The first was to modify the parser for keystone to output a single file for each test-case containing the rule numbers, one per line, of each rule used during the parse of that test-case. The test-suite reduction algorithm was implemented in the Java programming language using 217 lines of code. The algorithm was applied to both existing test-suites to produce two new test-suites, which we call Min. g++ and Min. ISO.

The syntactic coverage that each test-suite provided was determined by the following method: each file output by the parser was concatenated into a single monolithic file containing all the rule coverages for every test-case in the suite. This file was then sorted using the UNIX tool sort. Finally, the UNIX tool uniq was used to pare down the sorted file, such that only one instance of every covered rule remained in the file. The number of lines in the file output by uniq is the number of rules covered by the test-suite. An equivalent computation is sketched below.
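For illustration, here is the same measurement expressed in a few lines of Python rather than sort and uniq; the directory name and file extension are hypothetical.

# Count distinct grammar rules covered by a test-suite, given one
# rule-number-per-line file per test-case (as described above).
# The "coverage" directory and ".rules" extension are assumptions.

from pathlib import Path

covered = set()
for rule_file in Path("coverage").glob("*.rules"):
    for line in rule_file.read_text().splitlines():
        if line.strip():
            covered.add(int(line))

# keystone's C++ grammar has 536 rules in total (see Section 4).
print(f"{len(covered)} of 536 rules covered")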

Purdom's algorithm was implemented in 673 lines of code in the Python scripting language. The extension to context-dependent rule coverage added an extra 246 lines of code. The number of test-cases output by Purdom's original algorithm was 53, while CDRC Purdom output 71 test-cases.

Finally, keystone itself was profiled with gcov, a profiling tool that is part of gcc. This tool measures the statement coverage for a given file when a test-case is executed, as illustrated in Figure 3.

4. RESULTS
In this section we present the results of our case study to determine which generation strategy is the most effective at achieving maximum coverage in the syntactic and semantic dimensions. The results shown are partitioned according to their domain. It is important to note the distinction between

what we define as the syntactic domain and the semantic domain. Coverage of the grammar rules alone is classed as coverage of the syntactic domain. If all of the grammar rules are exercised by a test-suite, then that test-suite is said to achieve full coverage in the syntactic domain. The coverage of the semantic domain is determined by how much of the underlying parser front-end code is executed when a test-suite is run. We expect to see a close correlation between coverage in the syntactic domain and coverage in the semantic domain, because the only entry point to the underlying code is through the semantic action associated with a grammar rule. The results are presented in Table 3 and are discussed in the rest of this section.

[Figure 3: The steps involved in measuring the code coverage with gcov. keystone is compiled by gcc with extra flags, producing a profiled keystone executable. When one of our test-suites is run by this executable, a statistics file corresponding to each source file is output; gcov can then determine how many lines of code are executed in the corresponding source file.]

4.1 Keystone Structure
It is worth pointing out briefly the structure of keystone and how the results for the semantic domain are interpreted. The underlying code is separated into four distinct sections, as shown in Figure 4. Lexer is the code coverage associated with the file output by the tool Flex. Parser is a directory containing files generated by the tool BTYacc and associated files to deal with semantic actions. Within the symbol table of keystone are modules to determine scope within a program (Scope) and to aid in type checking and allocation (Type).

[Figure 4: Keystone structure: Lexer, Parser, and Symbol Table, the latter comprising the Scopes and Types modules.]

Both of these modules are heavily dependent on the semantics of the test-case in question, and coverage of both can only be achieved by test-cases that are semantically correct. It is also worth noting that the coverage results for these modules are based upon code that is used only for the normal operation of keystone; thus user aids for debugging, such as pretty-print methods, are excluded from the measurement figures.

The statement coverage figures for the Parser files and the modules Scope and Type are presented in Figures 5, 6 and 7 respectively. From these results we can see that the coverage offered by the reduced test-suites is nearly identical to that of their larger counterparts. The poor results of the Purdom approaches are due to the fact that the generated test-cases lack semantic correctness, and thus they never execute the underlying code for the symbol table.

4.2 Existing Test-Suites
The existing test-suites, g++ and ISO, consisted of 1183 and 440 test-cases respectively, and are summarised in Table 4. These were the benchmarks against which the other test-suites were measured. The suite g++ is an implementation-based suite and achieved coverage in the syntactic domain of 491 rules out of 536 total rules. This test-suite is designed to fully test the C++ compiler from gcc; the presence of GNU C++ extensions in some of the test-cases means that full coverage is not achieved, because keystone was developed with only the C++ standard in mind. This suite exhibits the best coverage across the semantic dimension on average, due to larger coverages of the Lexer and Parser code.

Table 4: Rule coverage for existing test-suites.
Test-Suite   Num. Test-Cases   Rules Covered
g++          1183              491 / 536
ISO          440               430 / 536

Suite ISO consists of 440 test-cases derived directly from the clauses of the ISO C++ standard [1]. This is a specification-based suite and covers 430 of 536 rules. It is interesting to note that, despite the concerted effort to ensure every rule has a test-case, it falls well short of the implementation-based suite in the syntactic dimension; nevertheless, the test-cases are well constructed, such that the coverage for module Scope is higher than for g++, and identical for Type.

The coverages for the lexer are slightly lower than the rule coverages because all the test-cases are lexically positive: there are no deliberately mis-spelled tokens, so the error-checking code within the lexer is never covered.

4.3 Reduced Test-Suites
The algorithm outlined in Section 2.2 above was applied to both of the existing test-suites to provide two new suites, called Min. g++ and Min. ISO. The coverages achieved by these new suites are identical in the syntactic domain to those of their larger versions; similarly, they give exactly the same coverage in the semantic domain for Lexer, Scope and Type, with only Parser coverage being lower.

[Figure 5: Code coverage across the six test-suites for the Parser module: (a) g++, (b) Min. g++, (c) ISO, (d) Min. ISO, (e) Purdom, (f) CDRC Purdom.]

[Figure 6: Code coverage across the six test-suites for the Scope module: (a) g++, (b) Min. g++, (c) ISO, (d) Min. ISO, (e) Purdom, (f) CDRC Purdom.]

[Figure 7: Code coverage across the six test-suites for the Type module: (a) g++, (b) Min. g++, (c) ISO, (d) Min. ISO, (e) Purdom, (f) CDRC Purdom.]

Table 3: Summarised coverages in both dimensions for the six test-suites.
Test-Suite    No. Test-Cases   Rules Covered (%)   Lexer (%)   Parser (%)   Scope (Avg. %)   Type (Avg. %)
g++           1183             91.6                77.6        86.1         82.4             84.5
Min. g++      48               91.6                77.6        82.9         82.4             84.5
ISO           440              80.2                68.4        73.9         84.0             84.5
Min. ISO      49               80.2                68.4        72.5         84.0             84.5
Purdom        53               100                 72.5        23.2         34.8             37.9
CDRC Purdom   71               100                 77.6        25.0         26.2             31.0

However, an interesting finding of this study is the size of the reduced test-suites. As stated, they give almost the same coverages across the board as their larger counterparts, yet they are dramatically smaller: Min. g++ has 1135 fewer test-cases, a reduction of 96%, and Min. ISO has 391 fewer test-cases, a saving of 89%. The sizes of the reductions are summarised in Table 5.

Table 5: Percentage reduction achieved by the test-suite reduction algorithm.
Suite   Min. No. Test-Cases   No. Original Test-Cases   % Reduction
g++     48                    1183                      96.0
ISO     49                    440                       88.9

4.4 Generated Test-Suites
By applying the C++ grammar used by keystone to the variants of Purdom's algorithm discussed in Section 2.3, two new test-suites were created, referred to as Purdom and CDRC Purdom. As Purdom's original algorithm only gives consideration to producing sentences that are grammatically correct, i.e. derivable through repeated application of the grammar rules, the resulting test-cases must be touched up by hand to ensure they can be parsed by keystone.

An example of this is the generated test-case "USING NAMESPACE IDENTIFIER SEMI", which would be translated and modified as follows:

    namespace X{};
    using namespace X;

The declaration of namespace X is essential to the successful parse of this test-case. The number of test-cases output by the Purdom suite is 53, and its coverage of the syntactic dimension is 100%, due to the nature of the algorithm. CDRC Purdom outputs 71 test-cases, which also fully cover the syntactic dimension. When the results of the semantic coverage are analysed, however, they are poor in comparison with the previous approaches. The reasons are two-fold:

1. The ISO C++ standard [1] defines a grammar that actually accepts a super-set of C++. Hence it is difficult for any approach that generates test-cases from the grammar alone to produce semantically correct test-cases.

2. The test-cases produced by Purdom's algorithm are for the most part short sentences; however, for the C++ grammar, in tandem with the small test-cases a single large sentence is produced that attempts to cover as many rules as possible. Due to the complexity of this file it is impossible to touch it up by hand, and hence keystone cannot parse this test-case.

The fact that this large test-case cannot be parsed accounts for the low coverage figures observed in modules Parser, Scope and Type. Module Lexer retains a coverage similar to the other test-suites because keystone maintains a token buffer during a parse, so the lexical code is still exercised.

We can see from the dramatic difference in the coverage of keystone's underlying code that test-cases generated from syntactic considerations alone are not sufficient to fully test a parser front-end.

5. CONCLUSIONS
In this paper we have shown the coverages achieved in both the syntactic and semantic dimensions by a number of test-suites for ISO C++. The main findings of our work are:

1. The test-cases produced by Purdom's algorithm give full rule coverage in the syntactic domain. Furthermore, this coverage is achieved using a series of small test-cases. However, the test-cases produced are not semantically correct and thus fail to achieve noteworthy coverage in the semantic domain. It is also worth noting that the test-cases produced by CDRC Purdom offer no extra advantage in terms of the size of the test-suite or coverage of the semantic dimension.

2. Test-suite reduction provides an excellent alternative to the generation of test-cases. Reducing a large, existing test-suite is a simple process. Furthermore, the number of test-cases remaining is comparable to the number generated by the Purdom approach, but with the added advantage of being semantically correct.

3. We have established a correlation between rule coverage in the syntactic domain and code coverage in the semantic domain. Generated test-cases that lacked semantic correctness gave poor coverage of the underlying code in comparison to the other test-suites. Moreover, reduced test-suites, which have identical coverage in the syntactic domain to their larger counterparts, exhibit almost exactly the same code coverage in the semantic dimension as the larger suites.

Our ongoing work includes extending the implementation-based and specification-based suites to achieve full coverage in the syntactic domain. We then hope to see an exact relationship between full coverage in the syntactic dimension and full coverage in the semantic dimension.

6. REFERENCES
[1] ISO/IEC JTC 1. International Standard: Programming Languages - C++. Number 14882:1998(E) in ASC X3. American National Standards Institute, first edition, September 1998.
[2] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[3] T.H. Gibbs, B.A. Malloy, and J.F. Power. Decorating tokens to facilitate recognition of ambiguous language constructs. Software - Practice and Experience, 33(1):19–39, January 2003.

[4] T.H. Gibbs, B.A. Malloy, and J.F. Power. Progression toward conformance of C++ language compilers. Dr. Dobbs Journal, 28(11):54–60, September 2003.
[5] J.A. Jones and M.J. Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering, 29(3):195–210, 2003.
[6] P. Klint, R. Lämmel, and C. Verhoef. Towards an engineering discipline for grammarware. Draft, submitted for journal publication; online since July 2003, 47 pages, February 2005.

[7] R. Lämmel. Grammar testing. In Proc. Fundamental Approaches to Software Engineering (FASE) 2001, volume 2029 of LNCS, pages 201–216. Springer-Verlag, 2001.
[8] B.A. Malloy and J.F. Power. An interpretation of Purdom's algorithm for automatic generation of test cases. In 1st Annual International Conference on Computer and Information Science, Orlando, FL, 2001.
[9] P. Purdom. A sentence generator for testing parsers. BIT, 12(3):366–375, 1972.
[10] M. Roper. Software Testing. McGraw-Hill, 1994.
