An Experiment Measuring the Effects of Personal Software Process (PSP) Training
To appear (as of March 30, 2000) in IEEE Transactions on Software Engineering

An Experiment Measuring the Effects of Personal Software Process (PSP) Training

Lutz Prechelt (prechelt@computer.org)
Barbara Unger (unger@ira.uka.de)
Fakultät für Informatik, Universität Karlsruhe
D-76128 Karlsruhe, Germany
+49/721/608-3934, Fax: +49/721/608-7343

March 30, 2000

Abstract

The Personal Software Process is a process improvement methodology aiming at individual software engineers. It claims to improve software quality (in particular defect content), effort estimation capability, and process adaptation and improvement capabilities. We have tested some of these claims in an experiment comparing the performance of participants who had just previously received a PSP course to a different group of participants who had received other technical training instead. Each participant of both groups performed the same task.

We found the following positive effects: The PSP group estimated their productivity (though not their effort) more accurately, made fewer trivial mistakes, and their programs performed more careful error-checking; further, the performance variability was smaller in the PSP group in various respects. However, the improvements are smaller than the PSP proponents usually assume, possibly due to the low actual usage of PSP techniques in the PSP group. We conjecture that PSP training alone does not automatically realize the PSP's potential benefits (as seen in some industrial PSP success stories) when programmers are left alone with motivating themselves to actually use the PSP techniques.

Keywords: process improvement, quality management, effort estimation, reliability, productivity, experiment

1 The Personal Software Process (PSP) methodology

The Personal Software Process (PSP) methodology for improving the software process was introduced in 1995 by Watts Humphrey [6]. PSP is an application of the principles of the Capability Maturity Model (CMM, [5]) on the level of an individual software engineer. In contrast to the CMM, however, which allows only for assessment of process quality, the PSP makes concrete methodological and learning suggestions, down to the level of a 15-week course with rather specific procedural content. The goals of the PSP are that an individual software engineer learns

- how to accurately estimate, plan, track, and re-plan the time required for individual software development efforts,
- how to work according to a well-defined process,
- how to define and refine the process,
- how to use reviews effectively and efficiently for improving software quality and productivity (by finding defects early),
- how to avoid defects,
- how to analyze measurement data for improving estimation, defect removal, and defect prevention,
- how to identify and tackle other kinds of process deficiencies.

The main basic techniques used are gathering objective measurement data on many aspects of the process (for obtaining a solid basis for process change decisions), spelling out the process into forms and scripts (for precise control of such changes and for making the process repeatable), and analyzing the collected data (for deciding where and how to change the process).

1.1 Previous evidence for PSP effectiveness

The PSP methodology relies a lot on objective measurement data, and so does the argumentation of the PSP proponents. Watts Humphrey has published data from several PSP courses in 1996 [7], and several experience reports from other PSP practitioners report similar results. Since no other information was available, these results rely on the very data collected by the course participants during their course exercises 1 through 10, regarding real and estimated development time as well as inserted and removed defects, both for each of the various development phases. These data show, for instance, that the estimation accuracy increases considerably, the number of defects introduced per 1000 lines of code (KLOC) decreases by a factor of two, the number of defects per KLOC to be found late during development (i.e., in test) decreases by a factor of three or more, and productivity is not reduced despite the substantial bookkeeping overhead involved when the tasks are as small as they are. See [4] for a detailed analysis of these effects based on data from 23 courses.

Additional evidence comes from industrial success reports on PSP usage, in which the expected PSP benefits were actually found in several small industrial projects whose engineers used PSP [3].

Unfortunately, both kinds of evidence have severe drawbacks. Observations directly from the course are distorted in several ways. First, the process definition underlying the data changes from exercise to exercise, making exact interpretation difficult. Second, the PSP course is the Hawthorne effect [9] at its best: the participants constantly monitor their performance and their central objective is improving this performance, in contrast to industrial software practice, where individual performance is only a means to an end. Third, individual participants may consciously or subconsciously manipulate their time measurement and defect recording towards better results. Fourth, exercises 4 through 10 are not only quite simple in terms of design and implementation complexity, but also have unusually clear and easy-to-understand requirements, which may make the results overly optimistic. Finally, and most importantly, there is no control: nobody knows how performance would change during these 10 programming assignments if no PSP training and self-monitoring occurred at all. A validation of the PSP by direct comparison to non-PSP-trained persons is missing.

The industrial case studies are difficult to interpret (due to their complex context) and also lack control. Either comparisons to non-PSP data are missing, or they are based on a before/after comparison of the same persons (not controlling for maturation) or an A/B comparison to different projects. Besides many other relevant differences, such comparison projects might have engineers that are less capable than the vanguard that chose to learn PSP first. Again, a study with a higher degree of control is called for.

1.2 Experiment motivation and overview

When we learned the PSP ourselves and then started teaching it to our graduate students, we soon obtained the impression that the methodology has a lot of benefits, but is not without problems, too. In particular, many programmers appear to be unable to keep up the discipline required for the data gathering and
for following a spelled-out process script; this may also underlie some of the PSP data quality problems observed by Johnson and Disney [8]. Furthermore, many course participants misunderstand the concrete techniques and procedures taught in the PSP course as dogmas, although they are meant only as starting points for one's own process development. They do not understand that (and how) they should adapt the proposed techniques to their own preferences and needs.

We estimated that a majority of our course participants would probably use little or nothing of the PSP techniques in their later daily work, and we wondered whether, under these circumstances, PSP training would be more effective than any other technical training with respect to the PSP goals.

We hence conducted the experiment described in Section 2, where we compared a group of students right after a PSP course to a group of other students right after a different (technical rather than methodological) course; see Section 2.2. Each participant had to solve the same programming task (see Section 2.3), which was quite different from all of the assignments of either course.

We observed several aspects of each subject's development process and software product to analyze the effects of the PSP course in comparison to a more technical course, in particular the accuracy of the subjects' effort estimation and the reliability of the programs they produced. The results are discussed in Section 3.

The experiment and its results are described in more detail in a technical report [11], which also includes the actual experiment materials. The report also contains various less important additional measurement results not discussed in this article. The materials and raw result data are also available in electronic form from http://wwwipd.ira.uka.de/EIR/.

2 Description of the experiment

2.1 Experiment design

The experiment uses a single-factor, posttest-only, inter-subject design [1]; see Table 1 for an overview. The independent variable was whether the experimental subjects had just previously participated in a PSP course (experiment group, subsequently called "P") or in an alternative course (comparison group, subsequently called "N" for "non-PSP"). Each subject of either group solved the same task and worked under the same conditions. The assignment of subjects to groups could not be randomized; this threat is discussed in Section 2.6. The observed dependent variables for each subject were a variety of measures of personal experience, various estimations of development time for the task (in particular estimated total time), various measurements of the development process (in particular total time), and various measurements of the delivered product (in particular program reliability).

            group P       group N
treatment   PSP course    KOJAK/other
task        phoneword     phoneword
observed    work time, estim. time, reliability, etc.

Table 1: The experiment design. For details see the following subsections.

2.2 Subjects

Overall, 48 persons participated in the experiment, 29 in the PSP group P and 19 in the non-PSP group N. All of them were male Computer Science master students. The P group had previously participated in a 15-week graduate lab course introducing the PSP methodology, involving ten small programming assignments and five process improvement assignments, both as described in Humphrey's book [6, Appendix D]. All but eight members of the
N group had participated in a 6-week graduate compact lab course on component software in Java (KOJAK), involving five larger programming assignments. This course covered technical topics such as Swing, Beans, and RMI. It was shorter than the PSP course, but much more intensive, so that the overall amount of practical programming experience gained was similar. The other eight of the N participants came from other lab courses of similar magnitude. In contrast to all others, who were obliged to participate (but not to succeed) in the experiment for passing their course, these eight were volunteers. The subjects participated in the experiment between 0 and 3 months after their respective course was finished. There were two instances of the PSP course (1997 and 1998), both run by the same teacher and with the same content.

On average, these 50 students were in their 8th semester at the university; they had a median programming experience of 8 years total, estimated they had a median of 600 hours of programming practice beyond their assignments from the university education, and had written a median of 20,000 LOC total. None of these measures was significantly different between the two groups. During the experiment, 24 of the participants used Java (JDK), 13 used C++ (g++), 9 used C (gcc), 1 used Modula-2 (mocka), and 1 used Sather-K (sak).

8 participants dropped out of the experiment and will be ignored in the subsequent data analysis. They chose to give up after zero to three unsuccessful attempts at passing the acceptance test that forms the end of the experiment (see Section 2.4). All dropouts said they were frustrated by the difficulty of the task; we could not find any connection to group membership. The fraction of dropouts is the same in the P group (5 out of 29, 17 percent) as in the N group (3 out of 19, 16 percent), so that ignoring the dropouts does not introduce a bias into the experiment. This leaves 40 participants for the evaluation.

2.3 Experiment task

The task to be solved in this experiment is called phoneword. It consists of writing a program that encodes the digits of long telephone numbers (program input) into corresponding sequences of words (program output; the words come from a large dictionary also provided as input) according to the fixed, given letter-to-digit mapping shown in Table 2. A single digit is allowed to encode itself in the output iff no other digit precedes it and no word from the dictionary can represent the digits starting from that point. All possible complete encodings must be found and printed. Many phone numbers have no complete encoding at all, even with a large dictionary. Dashes and quotes in the words as well as dashes and slashes in the phone numbers must be ignored for the encoding but still be printed in the result. Any phone number and any word in the dictionary was known to be at most 50 characters long. The dictionary was known to be at most 75000 words long. Dictionary and phone numbers are read from two text files containing one word or phone number per line.

Here is an example program output for the input "3586-75", using a German dictionary:

3586-75: Dali um
3586-75: Sao 6 um
3586-75: da Pik 5

The requirements for this program were described thoroughly in natural language, and remaining ambiguities were resolved by examples of correct and incorrect encodings.

The requirements description stated program reliability as the single top objective for the participants. Productivity, program efficiency, etc. were to be less important. For the complete text of the requirements specification see [11, pp. 66-68].

2.4 Experimental procedure

The experiment was run between February 1997 and October 1998, mostly during the semester breaks. Most subjects started at about 9:30 in the morning.
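For concreteness, the encoding rule of the phoneword task (Section 2.3) can be sketched as a small recursive search. The following Python sketch is our own illustration, not any participant's program: the names CHARMAP, word_digits, and encodings are invented here, we read "no other digit precedes it" as "the immediately preceding token is not a digit", and we assume plain-ASCII dictionary words (the real German dictionary also contained dashes and umlaut quotes, which a full solution must ignore for the encoding but reproduce in the output).

```python
# Letter-to-digit mapping of Table 2: E->0, JNQ->1, ..., GHZ->9.
CHARMAP = {c: d for d, cs in enumerate(
    ["e", "jnq", "rwx", "dsy", "ft", "am", "civ", "bku", "lop", "ghz"])
    for c in cs}

def word_digits(word):
    """Translate a dictionary word into its digit string (letters only)."""
    return "".join(str(CHARMAP[c]) for c in word.lower() if c.isalpha())

def encodings(number, dictionary):
    """Yield all complete encodings of a phone number as word sequences."""
    digits = [c for c in number if c.isdigit()]  # dashes/slashes ignored

    def search(pos, prefix, last_was_digit):
        if pos == len(digits):
            yield " ".join(prefix)  # empty encoding for a digit-free number
            return
        rest = "".join(digits[pos:])
        found_word = False
        for word in dictionary:
            wd = word_digits(word)
            if wd and rest.startswith(wd):
                found_word = True
                yield from search(pos + len(wd), prefix + [word], False)
        # A digit may encode itself only if no word fits at this point
        # and the previous token was not already a digit.
        if not found_word and not last_was_digit:
            yield from search(pos + 1, prefix + [digits[pos]], True)

    yield from search(0, [], False)
```

With the five words appearing in the example above ("Dali", "um", "Sao", "da", "Pik"), this sketch reproduces exactly the three encodings listed there, apart from the "3586-75: " prefix of the required output format.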
E   JNQ  RWX  DSY  FT  AM  CIV  BKU  LOP  GHZ
0    1    2    3    4   5    6    7    8    9

Table 2: Prescribed letter-to-digit mapping for the phoneword task. The mapping was constructed such as to balance the letter frequency for each digit across the German dictionary used.

Both groups were handled exactly alike. In particular, the PSP subjects were not specifically asked to use PSP techniques. The experiment materials were printed on paper and consisted of two parts. Part one was issued at the start of the experiment and contained a personal information questionnaire, the task description, and an effort estimation questionnaire. After filling in these questionnaires and reading the task description, the subjects worked on the task using a specific Unix account that provided the automatic monitoring infrastructure, which non-intrusively protocoled login/logout times, all compiled source versions with timestamps, etc. The subjects could modify the setup of the account as necessary, make work pauses whenever required, and use any methods and approaches for solving the problem they deemed appropriate. In particular, a few subjects imported source code of reusable procedures (for file handling etc.) from other accounts, and a few subjects imported and installed small personal tools. The input, output, and dictionary data used in the example in the task requirements description was provided to the subjects; the large 73113-word dictionary later used for evaluating all programs was also available.

When a subject thought his program worked correctly, he could call for an acceptance test. This was based on randomly generating a set N of 500 phone numbers, computing the multiset of corresponding outputs S(N) using the subject's program, and computing the correct outputs C(N) using a reference implementation ("gold" program). The gold program had been written by the authors using stepwise refinement with semi-formal verification based on preconditions and postconditions. The gold program had run correctly right from its first test and no defect was ever found in it (despite harsh protests from several participants and thorough investigations of their validity).

The entire expected output C(N) and actual output S(N) was shown, but the acceptance test used a 20946-word dictionary not available to the subjects. The output reliability r was defined as the fraction of correct outputs within all actual outputs (whether expected or incorrect), i.e., r = |S(N) ∩ C(N)| / |S(N) ∪ C(N)|. To pass the acceptance test, r had to be at least 95 percent.

Misuse of the acceptance test as a convenient automatic testing facility was avoided by the following reward scheme: Each participant received a payment of DM 50 (approx. 30 US dollars) for successful participation, i.e., passing the acceptance test, but for each failed acceptance test, DM 10 were deducted.

12 of the successful subjects finished the day they started, 10 others required a second day, and the other 18 took between three and eleven days. Similarly, 15 subjects passed the first acceptance test, 24 the second to fifth, and only 1 required six. Both of these measures showed no significant differences between the two groups. After passing the acceptance test, the subjects were given part 2 of the experiment materials, a short postmortem questionnaire. After filling that in, the subjects were paid and their participation was complete.

2.5 Hypotheses

The experiment investigated the following hypotheses (plus a few less important ones not discussed here, see [11]).

Reliability: PSP-trained programmers produce a more reliable program for the phoneword task than non-PSP-trained programmers.

Estimation accuracy: PSP-trained programmers estimate the time they need for solving the phoneword task more accurately than non-PSP-trained programmers.
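The reliability measure r from Section 2.4 operates on multisets of output lines. As an illustration (our own sketch, not the experiment's evaluation tooling), it can be computed with Python's collections.Counter, whose & and | operators implement multiset intersection and union:

```python
from collections import Counter

def output_reliability(actual_lines, expected_lines):
    """r = |S(N) & C(N)| / |S(N) | C(N)| over multisets of output lines,
    where S(N) is the subject program's output and C(N) the gold output."""
    s, c = Counter(actual_lines), Counter(expected_lines)
    union = s | c            # element-wise maximum of multiplicities
    if not union:            # both outputs empty: define r = 1
        return 1.0
    inter = s & c            # element-wise minimum of multiplicities
    return sum(inter.values()) / sum(union.values())
```

For example, a program that prints one correct and one incorrect line while one expected line is missing scores r = 1/3 on that number's output.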
We also expect, though less firmly, that PSP-trained programmers may solve the phoneword task faster than non-PSP-trained programmers. Productivity improvement is not an explicit claim or goal of the PSP, but for the given task it might be a side effect of improved quality, because locating a problem detected in the acceptance test is relatively difficult.

2.6 Threats to internal validity

The control of the independent variable is threatened by the fact that group assignment was not done by randomization but rather was due to earlier self-selection of the course taken. In principle, there might be systematic differences between the types of programmers taking either decision. However, we do not believe that such differences, if any, are substantial. For instance, several of the participants of the KOJAK course and several of the volunteers from other courses later chose the PSP course, and vice versa.

The choice of programming language might also influence our results, because the different languages have different frequency in the two groups. However, the structure of our task is such that the language used has only modest impact. In particular, neither object-oriented language features nor particular memory management mechanisms are very relevant for the given task. (See, however, the discussion of program crashes at the end of Section 3.1.)

2.7 Threats to external validity

There are several important threats to the external validity (generalizability) of our experiment.

First, and most importantly, different work conditions than found in the experiment may positively or negatively influence the effectiveness of the PSP training. This is discussed in Section 4.

Second, the PSP education of our subjects was only a short time ago. Long-term effects would be more interesting to see.

Third, our task was unusual in several respects (small size, precise requirements, acceptance test indicates expected outputs). It is unknown how these properties might influence the comparison.

Finally, professional software engineers may have different levels of skill than our participants. A higher skill and experience level may leave less room for improvement, but may also sharpen the eye as to where improvements are most desirable or most easy to achieve with PSP techniques. Conversely, lower skill (which will occur, because our students are more skilled than most of the non-computer-scientists that frequently start working as programmers today) may leave more room for improvement, but may also impede applying PSP techniques correctly or at all.

3 Results and discussion

We will now describe and discuss the results for estimation accuracy, reliability, and productivity with respect to the hypotheses.

The data will be presented using boxplots (see the figures below) indicating the individual data points, the 10% and 90% quantiles (as whiskers), the 25% and 75% quantiles (by the edges of the box), the median (50% quantile, by a fat dot), the mean (by a capital M), and one standard error of the mean (by a dashed line). (For instance, the 10% quantile of a set of values is an interpolated x such that 10% of the values are smaller or equal to x and 90% are larger or equal to x.) The two distributions of the groups N and P are shown side by side.

Formal tests of the hypotheses are performed by one-sided statistical hypothesis tests for differences of the mean or the median. We use a Wilcoxon rank sum test for comparing medians and a bootstrap resampling test [2] for comparing the means without relying on the assumption of a normal distribution. When we report for instance "mean test p = 0.07", this means that the test comparing the means of the two groups indicates that the observed difference has a 7% probability of being purely accidental (i.e., no real difference exists).
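A bootstrap resampling test for a difference of means can be sketched as follows. This is our own minimal illustration of the general technique described in [2], not the authors' actual analysis code; the function and parameter names are invented here:

```python
import random

def bootstrap_mean_test(a, b, n_resamples=10_000, seed=1):
    """One-sided p-value for the observed difference mean(a) - mean(b):
    the fraction of resamples drawn from the pooled data (the null
    hypothesis of no group difference) whose difference is at least
    as large as the observed one."""
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = a + b
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        ra = [rng.choice(pooled) for _ in a]   # resample with replacement
        rb = [rng.choice(pooled) for _ in b]
        if sum(ra) / len(ra) - sum(rb) / len(rb) >= observed:
            hits += 1
    return hits / n_resamples
```

Unlike a t-test, such a test makes no normality assumption, which matters for skewed distributions such as working times.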
At values at or below 5% we will call such p-values "significant" and believe that the differences are real. See [11, pp. 20-25] for a more detailed description of boxplots and the tests.

3.1 Reliability

We measured the reliability (as defined in Section 2.4) of the delivered programs on eight different randomly generated input data sets, in which each of the possible characters /, -, 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 had the same probability at each position of each number in each file (with one exception mentioned below). The data sets contained either 100, 1000, 10000, or 100000 telephone numbers. Since some of the programs were extremely slow (over 15 seconds per input), we could not exercise the largest tests on all of the programs and will hence report the results for the test sets of size 1000 only (but see [11, pp. 31-39] for the other results).

There were two different 1000-number data sets, a standard one and a surprising one. The surprising one used exactly the distribution of telephone numbers mentioned above with uniformly random lengths between 1 and 50 characters. This set is surprising because, when thinking of these inputs as telephone numbers, intuitively one would not expect to see something not containing any digit at all, such as "/" or "-/-/", even though the requirements description of the task had defined "A telephone number is an arbitrary(!) string of dashes, slashes, and digits." (emphasis is in the original). An empty encoding must be output for a phone number without digits. In contrast, the "standard" input set suppressed numbers not containing any digit and generated a new one until at least one digit was present.

Note that both of these tests are harder than the acceptance tests: the dictionary is larger (73113 words versus 20946 words), there are twice as many phone numbers, and the critical phone numbers of length 1 or 50 had artificially been made less frequent in the acceptance test.

The results for the standard data set are shown in the rather degenerated box plots of Figure 1. As we see, the reliability of the programs is generally high in both groups; both medians are at 100%. We do not find any evidence for superior reliability in the P group. Instead, there even is a slight advantage for the N group, but the difference is not significant (median test p = 0.32, mean test p = 0.33).

Figure 1: Output reliability (as defined in Section 2.4) for the data set with at least one digit per phone number.

Figure 2: Output reliability for the data set with possibly no digit in a phone number.

The results for the surprising data set are different (Figure 2). Again, some programs work perfectly, but many other programs from the N group and also a few from the P group crash at the first telephone number that contained no digit, which resulted in a reliability of 10.7% for the given data set. (Our measurement ignores the crash and just records its consequences: no further output arrives.) As a result, the reliability is significantly lower in the N group (median test p = 0.00, mean test p = 0.01).

Note that faulty Java programs were more likely to crash than faulty C or C++ programs, due to the Java run time checks of array indices etc. Since the fraction of Java programs
is higher in the N group, the above result may thus be biased in favor of the P group. Therefore, we also compared the Java programs alone (see Figure 3) and found that the above group difference indeed becomes less significant, but does not disappear (median test p = 0.04, mean test p = 0.07). For the other languages there are not enough data points for a meaningful comparison.

Figure 3: Output reliability for the data set with possibly no digit in a phone number, measured only for the Java programs.

These results provide some evidence for higher reliability in the P group. Specifically, although P group program reliability is not generally higher, the P group programs perform better error checking and handling of unexpected situations. See also the discussion of program length in Section 3.3. We conclude that the reliability hypothesis is supported, though not clearly confirmed, by the experiment.

3.2 Estimation accuracy

For comparing the effort estimation capabilities of the groups, we consider the estimation of the total working time the subjects made after reading the task description. We compute the mis-estimation in percent from the quotient of actual and estimated time; see Figure 4. The median mis-estimations are essentially identical (median test p = 0.48). The worst estimations of the N group are worse than in the P group, so that the mean mis-estimation tends to be larger in the N group; but the difference is not significant (mean test p = 0.18). This is no convincing evidence that the P group produces better estimates.

Figure 4: t_work/t_estim - 1: Amount of mis-estimation of the total working time for the task. Most estimations were too optimistic; only four in each group were too pessimistic.

PSP estimation is based on program size estimation and historical data on personal productivity (measured in lines of code produced per hour). If we compare the subjects' expected productivity to the actual productivity (Figure 5), we find the estimations of the N group are clearly worse than in the P group (median test p = 0.00, mean test p = 0.00). This shows that the PSP group knows their historical data, but did not produce a size estimation that was good enough for converting this knowledge into an estimation advantage. We conclude that, overall, the hypothesis of better time estimation in the PSP group is not supported by the experiment.

Figure 5: prod_estim/prod_actual: Quotient of estimated and actual productivity, each measured in lines of code per hour.

3.3 Productivity

Since all subjects worked on exactly the same task, we might take the view that their productivity should best be expressed simply as the "number of tasks solved per time unit", which is just the inverse of the total working time and is shown in Figure 6.

From this point of view, in contrast to our expectations, the productivity is not larger, but
rather tends to be smaller in the P group (median test p = 0.10, mean test p = 0.06). We do not know the degree to which the PSP bookkeeping overhead accounts for this tendency, but probably the degree is low, because few subjects actually had any significant PSP overhead (see Section 3.5).

Figure 6: 1/t_work: Productivity measured as the inverse of total working time (number of programs per hour, one could say).

However, the P group generally wrote longer programs, partly because of more careful error checking. For instance, in the Java programs the average number of 'catch' statements with non-empty exception handlers is significantly larger in the P group than in the N group (4.4 versus 2.5, p = 0.02). As we saw above, this additional effort tended to result in better reliability. So maybe the more conventional view of productivity as the number of lines of code written per hour is more appropriate. This quotient is shown in Figure 7; the productivity difference has essentially disappeared (median test p = 0.42, mean test p = 0.30).

Figure 7: LOC/t_work: Productivity measured as the number of statement LOC written per hour.

Curiously, if we consider the working time directly (instead of its inverse), as shown in Figure 8, the N group has a lower median (median test p = 0.10), but a slightly higher mean due to some very slow subjects (mean test p = 0.28).

Figure 8: t_work: Total number of working hours.

As we see, the notion of productivity is slippery for software and must be handled with care. However, the hope that the productivity of PSP-trained persons would be higher is not supported in the experiment (and is also not claimed by the PSP in general).

3.4 Variance within each group

One rather unexpected insight from this experiment was that, even where no improvement of the average occurred, the variability was usually smaller in the P group than in the N group. The effect occurred for most of the measures we have investigated. We can quantify this by providing a bootstrap resampling test for differences of the length of the box ("interquartile range", iqr test), which is a rather robust measure of variability.

For instance, for the productivity and estimation comparisons shown in the figures above, smaller P group variability is visible as a tendency for the reliability on surprising inputs (Figure 2, iqr test p = 0.12), the work time mis-estimation (Figure 4, iqr test p = 0.24), and the total work time (Figure 8, iqr test p = 0.39), and is statistically significant for the productivity estimation (Figure 5, iqr test p = 0.01), the productivity in terms of working time (Figure 6, iqr test p = 0.05), and the productivity in LOC per hour (Figure 7, iqr test p = 0.04).

The reduction of variability in the PSP group may be a benefit. Teams with lower interpersonal performance variability can be more flexible, because any member could take over a task without changing the schedule. Schedule risk may also be an issue, as is outlined in [12].
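The interquartile-range comparison just described can be sketched in the same bootstrap style. Again this is our own illustrative code, not the authors' analysis scripts; the names are invented, and the quantile interpolation is one common convention consistent with the informal definition given for the boxplots:

```python
import random

def quantile(values, q):
    """Linearly interpolated q-quantile of a list of numbers."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def iqr(values):
    """Interquartile range: the length of a boxplot's box."""
    return quantile(values, 0.75) - quantile(values, 0.25)

def bootstrap_iqr_test(a, b, n_resamples=10_000, seed=1):
    """One-sided p-value for the observed variability difference
    iqr(b) - iqr(a), resampling from the pooled data (null hypothesis:
    both groups share one distribution)."""
    observed = iqr(b) - iqr(a)
    pooled = list(a) + list(b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        ra = [rng.choice(pooled) for _ in a]
        rb = [rng.choice(pooled) for _ in b]
        if iqr(rb) - iqr(ra) >= observed:
            hits += 1
    return hits / n_resamples
```

Using the interquartile range rather than the variance keeps the comparison robust against the few extreme outliers visible in the plots.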
3.5 PSP usage

For only 6 of the 24 PSP participants (25 percent) did we find evidence that they had actually used PSP techniques. (In all cases this evidence included a time and defect log, in some cases also a PSP estimation form.) This low percentage may imply that the size of the differences found above reflects the degree of actual PSP use more than the effect of PSP use.

Furthermore, we found a surprising correlation. Among those 5 PSP participants who had given up in the experiment, as many as 4 (or 80 percent) showed evidence of actual usage of PSP techniques, a significantly higher fraction than among the successful participants (Fisher exact p = 0.036). Apparently the least capable subjects had a much higher inclination to use PSP techniques, presumably because they feel more clearly that these techniques help them.

3.6 Other results

We have collected other measures (not directly connected to our hypotheses) as well and found several areas where some advantage was visible for the P group. For instance, they estimated the average time required for fixing a defect or the reliability of their program more accurately than the N group. Furthermore, they made fewer trivial mistakes that led to compilation errors and tended to write more comments into their programs. See [11] for details.

In the informal postmortem interview, many of the P subjects (but none of the N subjects) said something along the lines of "I really should have performed design and code reviews. Damn that I didn't."

4 Conclusion

Our experiment comparing a group of PSP-trained programmers (P group) to a similar group of programmers who received other training (N group) produced the following major findings:

- The programs produced by members of the P group are slightly more reliable than those of the N group as far as robustness against unusual (but legal) inputs is concerned. For more standard types of inputs we did not find a reliability difference.
- The members of the P group estimated their productivity (in lines of code per hour) better than the N group, but did not produce better total effort estimates.
- The total time for finishing the task tended to be longer in the P group than in the N group. However, at least to some degree this additional time is invested in the error checking that leads to the improved robustness. The productivity in lines of code per hour was hardly different in both groups.
- For many performance metrics the variability within the P group was substantially smaller than the variability within the N group.
- Apparently a majority of the P group participants did not use PSP techniques at all.

When one compares these results to some of the improvements known from the PSP course, one may be disappointed; for instance, participants of a PSP course on average achieve an at least fourfold reduction of the number of defects to be found in test during their course exercises 1 through 10. There were no such dramatic differences in this experiment. We see two major reasons. First, the PSP course with its constant measurement and feedback is a typical Hawthorne effect situation [9], so that results from the course overestimate the improvements available in the long run. Second, too few of our subjects from the PSP group actually used PSP techniques during the experiment; most of them did not keep up the necessary self-discipline.

We offer three explanations for the low degree of actual PSP usage: First, it may be a result of different temperaments of the programmers. In our experience, a few pick up and use
the PSP techniques quite easily and enthusiastically, a majority can adapt them only with effort and will later use them to a modest degree at best, and some appear to be completely unable to maintain the discipline required for applying the PSP techniques. The proportions may be culture-dependent, so subjects from other countries may turn out different in this respect. Our course has won several awards for best teaching (as evaluated by the students) in our department, so we rule out low quality of teaching as a reason.

Second, when asked shortly before the end of the course, a majority of the PSP course participants claimed they would use PSP techniques for "larger" tasks, but not for small ones. If one is willing to believe this statement, it may be that we would have found larger PSP benefits if our experiment had used a larger task.

Third, and maybe most importantly, Watts Humphrey suggests that a working environment which actively encourages PSP usage is a key ingredient for PSP success. Our subjects were working alone and hence had no such environment.

Summing up, we conjecture that PSP training alone provides only a fraction of the expected benefits. Later encouragement towards actual PSP use appears to be necessary before large improvements are realized. Nevertheless, we believe that, in principle, the PSP is worthwhile and that our experiment provides support for this opinion.

Given the low degree of use of the actual PSP techniques despite a 15-week course, it appears a worthwhile research topic whether and how the PSP benefits can be obtained with a much smaller training program than the standard PSP course. We are pursuing such research and have already obtained first results [10].

Further, the experiment shows that we need to understand better how to make people use methods: What technical, social, and organizational means improve the level of actual use of a method, as opposed to just the level of training or infrastructure provided?

Acknowledgements

We thank Oliver Gramberg for his contributions to our PSP course and Michael Philippsen for multiple thorough reviews of this paper and very helpful criticism. We thank the anonymous reviewers for their detailed and helpful comments. Most of all we thank the more than fifty patient individuals who guinea-pigged the experimental setup or participated in the experiment itself.
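The Fisher exact test reported in Section 3.5 (p = 0.036) can be reproduced from the counts given there: 6 of the 24 successful participants versus 4 of the 5 dropped-out participants showed evidence of PSP use. A stdlib-only sketch (the arrangement of those counts into a 2x2 table is our reading of the text, and the function below is a generic textbook implementation, not code from the study):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):  # probability that the top-left cell equals k
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # the small tolerance guards against floating-point ties
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# 6 of 24 successful vs. 4 of 5 dropped-out participants used PSP techniques
p = fisher_exact_2x2(6, 18, 4, 1)
print(round(p, 3))  # -> 0.036
```

This agrees with the p = 0.036 stated in the text, which supports that reading of the contingency table.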
Appendix: Raw result data

Data of the 24 successful participants of the PSP-trained group:

time  estim  A   reli   reliS  LOC  lang
 3.8    5.0  2  100.0   98.4   258  Java
 4.0    3.0  2   76.5   92.1   325  C
 6.9    8.3  1  100.0  100.0   274  C++
 7.3    5.0  1  100.0  100.0   174  Java
 7.3    6.0  2   99.2  100.0   303  C
 7.8    5.0  1  100.0  100.0   427  Java
 8.2    3.5  2    0.0   89.5   199  C
 9.7    6.0  1  100.0  100.0   433  Java
 9.8   10.0  2  100.0   10.2   211  Java
10.1   18.0  1  100.0   98.4   226  C++
10.9   10.0  1  100.0   98.4   365  C
11.2    6.3  1  100.0  100.0   505  C++
11.9    5.2  2  100.0  100.0   287  C++
13.3   10.0  3    0.0    0.0   245  Sather-K
13.8    4.3  2  100.0   10.2   437  Java
15.2    4.0  5  100.0   98.4   274  C++
16.1    6.7  2   98.9   96.8   391  C
17.8   10.0  3  100.0   10.2   729  Java
19.1   12.6  2  100.0   98.4   334  C++
19.6    4.0  4   98.5  100.0   583  C++
19.7   12.0  3  100.0  100.0   467  Java
20.5    7.2  2  100.0   98.4   386  Modula-2
25.3   11.2  1   98.1  100.0   605  C++
27.6   11.4  4   98.9   96.3   323  Java

time is the actual work time in hours, estim the estimated work time. A is the number of acceptance tests needed. reli is the reliability on the standard data set, reliS is the reliability on the "surprising" data set. LOC is the length of the delivered program (excluding only empty lines), lang is the programming language used.

Data of the 16 successful participants of the non-PSP-trained group:

time  estim  A   reli   reliS  LOC  lang
 3.0    3.0  1   99.2   98.4   168  C++
 3.5    2.0  1  100.0   98.4   152  C++
 3.8    8.0  1  100.0   10.2   249  Java
 4.8    6.0  1  100.0   10.2   199  Java
 5.0    4.0  1  100.0   98.4   267  Java
 6.2    3.5  2  100.0   10.1   204  Java
 7.1    5.0  1   99.2   98.4   396  Java
 7.4    8.0  1  100.0   10.2   165  Java
 7.5   10.0  2    0.0    1.1   309  C++
 8.7    5.0  2   98.9   90.9   185  Java
13.0    5.0  3  100.0   10.2   422  Java
15.1    6.0  3  100.0   10.2   143  Java
18.1    5.0  3  100.0  100.0   355  Java
39.7   16.0  6  100.0   97.9   640  Java
48.9    6.7  2  100.0   10.2   667  Java
63.2    6.7  3   99.6   10.2   360  Java

Author Biographies

Lutz Prechelt worked as senior researcher at the School of Informatics, University of Karlsruhe, where he also received his diploma (1990) and his Ph.D. (1995) in Informatics. His research interests include software engineering (in particular using an empirical research approach), compiler construction for parallel machines, measurement and benchmarking issues, and research methodology. Since April 2000 he works for abaXX Technology, Stuttgart. Prechelt is a member of IEEE CS, ACM, and GI.

Barbara Unger graduated with a diploma degree in 1995 and is currently working as a PhD candidate in the Institute for Program Structures and Data Organization at the University of Karlsruhe. Her main research interests are in empirical software engineering with a focus on design patterns. She is a member of the IEEE and the IEEE Computer Society.

References

[1] Larry B. Christensen. Experimental Methodology. Allyn and Bacon, Needham Heights, MA, 6th edition, 1994.

[2] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. Chapman and Hall, New York, London, 1993.

[3] Pat Ferguson, Watts S. Humphrey, Soheil Khajenoori, Susan Macke, and Annette Matvya. Introducing the Personal Software Process: Three industry case studies. IEEE Computer, 30(5):24-31, May 1997.

[4] W. Hayes and J.W. Over. The Personal Software Process (PSP): An empirical study of the impact of PSP on individual engineers. Technical Report CMU/SEI-97-TR-001, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 1997.
[5] Watts S. Humphrey. Managing the Software Process. SEI Series in Software Engineering. Addison-Wesley, 1989.

[6] Watts S. Humphrey. A Discipline for Software Engineering. SEI Series in Software Engineering. Addison-Wesley, Reading, MA, 1995.

[7] Watts S. Humphrey. Using a defined and measured personal software process. IEEE Software, 13(3):77-88, May 1996.

[8] Philip M. Johnson and Anne M. Disney. The personal software process: A cautionary case study. IEEE Software, 15(6), November 1998.

[9] H.M. Parsons. What happened at Hawthorne? Science, 183(8):922-932, March 1974.

[10] Lutz Prechelt and Georg Grütter. Accelerating learning from experience: Avoiding defects faster. IEEE Software, 2000. Submitted April 1999.

[11] Lutz Prechelt and Barbara Unger. A controlled experiment on the effects of PSP training: Detailed description and evaluation. Technical Report 1/1999, Fakultät für Informatik, Universität Karlsruhe, Germany, March 1999. ftp.ira.uka.de.

[12] Lutz Prechelt and Barbara Unger. How does individual variability influence schedule risk? A small simulation with experiment data. Technical Report 1999-12, Fakultät für Informatik, Universität Karlsruhe, Germany, September 1999. ftp.ira.uka.de.