THE SWEDISH DRIVING-LICENSE TEST

A Summary of Studies from the Department of
 Educational Measurement, Umeå University

              Widar Henriksson
              Anna Sundström
               Marie Wiberg

                Em No 45, 2004

                  ISSN 1103-2685
             ISRN UM-PED-EM--45--SE
INTRODUCTION...................................................................................................... 1
THE DRIVER EDUCATION IN SWEDEN.............................................................. 1
HISTORY OF THE SWEDISH DRIVER EDUCATION AND DRIVING-
LICENSE TESTS....................................................................................................... 2
CRITERION-REFERENCED AND NORM-REFERENCED TESTS...................... 7
IMPORTANT ISSUES IN TEST DEVELOPMENT................................................. 8
       Test specification ................................................................................................................................. 8

       Item specifications............................................................................................................................. 10

       Item format ........................................................................................................................................ 11

       Evaluation of items............................................................................................................................ 12

       Try-out ............................................................................................................................................... 13

       Validity............................................................................................................................................... 14

       Reliability .......................................................................................................................................... 16

       Parallel test versions ......................................................................................................................... 18

       Standard setting................................................................................................................................. 19

       Test administration............................................................................................................................ 21

       Item bank ........................................................................................................................................... 22

EMPIRICAL STUDIES OF THE THEORY TEST................................................. 24

    A new curriculum and a new theory test in 1990................................................ 24

       Judgement of items – difficulty.......................................................................................................... 25

       Parallel test versions ......................................................................................................................... 25

    A theoretical description of the test .................................................................... 27

       Test specifications ............................................................................................................................. 27

       Item format ........................................................................................................................................ 28

       Try-out ............................................................................................................................................... 29

       Standard setting in the theory test..................................................................................................... 29

    Traffic education in upper secondary school – an experiment............................ 30

    Analysis of the structure of the curriculum and the theory test........................... 31
       Judgement of items – the relation between the curricula and the content of the items .................. 31
    Aspects of assessment in the practical driving-licence test................................. 34

      A detailed curriculum........................................................................................................................ 34

      A model for judgement of competencies ........................................................................................... 35

    The computerisation of the theory test................................................................ 37

    Methods for standard setting............................................................................... 38

      Standard setting for the theory test used between 1990-1999.......................................................... 38

      Standard setting for the new theory test introduced in 1999............................................................ 38

    Item bank for the theory test ............................................................................... 39

    A sequential approach to the theory test ............................................................. 40

    Results of the Swedish driving-license test......................................................... 41

      Parallel test versions and the relationship between the tests ........................................................... 42

      Private or professional education..................................................................................................... 42

      Validating the results......................................................................................................................... 43

    Driver education’s effect on test performance .................................................... 44

    Driver education in the Nordic countries ............................................................ 45

    Curriculum, driver education and driver testing ................................................. 46

      Assessing the quality of the tests – reliability and validity ............................................................... 47

      Assessment of attitudes and motives ................................................................................................. 49

FURTHER RESEARCH .......................................................................................... 50
Introduction
Since 1990, the Department of Educational Measurement at Umeå
University has been commissioned to study the Swedish driving-
license test by the Swedish National Road Administration, SNRA.
Over the past few years several studies have been conducted in order
to develop and improve the Swedish driving-license test. The focus of
the majority of the studies has been the theory test.

The aims of this paper were threefold: firstly to describe the develop-
ment of the driver education and the driving-license test in Sweden
during the past century; secondly, to summarize the findings of our
research, which is related to important issues in test development; and
finally, to make some suggestions for further research.

              The driver education in Sweden
The present driver education consists of a theory part and a practical
part. Since the driver education is voluntary, the learner-drivers have
the choice of professional and/or private education. Driver instruction
refers to professional education at a driving school and driving prac-
tice refers to lay instructed driver training. In order to engage in driver
instruction or driving practice the learner-driver needs a Learner’s
Permit. In September 1993 the age limit for driving practice was low-
ered from 17 ½ years to 16 years (SFS 1992:1765). It is common for
learner-drivers in Sweden to combine driver instruction with driving
practice (Sundström, 2004). The learner-drivers receive intensive driver
instruction at the driving school and practice the exercises at home, for
example under the supervision of their parents. There are certain criteria
that a person has to meet in order to be approved as a lay instructor for
a learner-driver, for example the person has to be at least 24 years old
and have held a driving license for a minimum of five years (SFS
1998:488).

The driver education reflects the curriculum which consists of nine
main parts (VVFS 1996:168). To determine if the student has gained
enough competence according to the curriculum, a driving-license test
is taken. The test consists of two examinations, a theory test and a
practical test. Five of the nine parts of the curriculum are tested in the

theory test and the remaining four parts are tested in the practical test
(see Table 1). Test-takers have to be 18 years old and pass the theory
test before they are allowed to take the practical test.
Table 1. The nine content areas of the driving-license test.
Theory test                                       Practical test
Vehicle-related knowledge                         Vehicle-related knowledge
Traffic regulations                               Manoeuvring the vehicle
Risky situations in traffic                       Driving in traffic
Limitation of driver abilities                    Driving under special conditions
Special applications and other regulations

History of the Swedish driver education and driving-
                   license tests
During the past century, the number of vehicles in traffic has increased
rapidly, which has been reflected in many new statutes and regulations.
Franke, Larsson and Mårdsjö (1995) described the development of the
Swedish driver-education system. The growth of motoring created a need
to assess drivers' knowledge and abilities. The content of the theory
education and the practical driver education, and the knowledge and
abilities required to pass the driving-license test, have grown over time.
The general trend in driver education and the driving-license test is that
the focus has shifted from the construction and manoeuvring of the
vehicle to risk judgement and the driver's behaviour in traffic.

The first regulation for motor traffic was introduced in 1906. In order
to drive a car a person needed a certificate. To obtain the certificate,
the person had to demonstrate his or her theoretical knowledge and
practical ability to a driving examiner. In 1916 the knowledge required
for obtaining a driving license became more extensive. The driver now
had to demonstrate his or her knowledge of the construction and
management of the vehicle and the most necessary traffic regulations.
The requirements for obtaining a driving license became even stricter
in 1923, when the driving examiner was required to judge whether the
person was suitable as a driver. In 1927, the opinion was that the
practical test should be the main part of the driving-license test. The
practical test was to be conducted in different traffic situations so that
the driving examiner could assess the driving skill, presence of mind and
judgement of the test-taker (Molander, 1997).
In 1948 the education in driving theory was supplemented with some
new parts that dealt with the responsibilities of the driver and accidents
in traffic; these new areas were also reflected in the driving-license
test. At the time the driving-license test consisted of three examinations:
a written test, an oral examination and a practical test. The written test
consisted of twenty-five items to be completed in fifteen minutes. The
purpose of the oral examination was to check the test-takers'
understanding of traffic-related problems. The practical test
involved at least ten minutes of actual driving where either the test-
taker or the driving examiner decided the route. The results of the
three tests were considered in the final judgement of the test-taker.
There were clear directives on what knowledge was required in order
to pass the test. Later, it was stated that there were some problems
with the practical test. It was found that the difficulty of the practical
test varied a great deal depending on when the test was taken (Franke
et al., 1995).

In the 1950s the responsibilities of the driver were emphasised to a
greater extent than before. This change was based on the opinion that
the personality of the drivers affected their behaviour in traffic. The
purpose of the theory test was to check that the learner-driver had
knowledge that improved his or her judgement in traffic. For a long
time, the focus of the education in driving theory had been how the
vehicle was constructed. Now, the consideration of other road users
and the judgement in traffic were considered the most important parts
of the theory test. Even though it was important to improve the
judgement of the learner-driver in traffic, the focus was still the prac-
tical education. At the end of the 1960s the theory education and the
practical education were integrated. It became important that the
learner-driver understood the content of the education in driving the-
ory, rather than just learning it (Franke et al., 1995).

In 1971 a new curriculum was introduced and two years later a new
differentiated theory test was employed. The theory test was com-
posed of a basic test and one or more supplementary tests. The basic
test had to be taken by all test-takers, irrespective of the type of cer-
tificate applied for. The supplementary tests were selected according
to the type of certificate (motorcycle, car/light truck, heavy truck etc.)
applied for. The theory test was a written test that contained 80 multi-
ple-choice items for AB-applicants (car/light truck). The basic test
comprised 60 items and the cut-off score was set to 51 (85%). The

supplementary test for AB-applicants consisted of 20 items and the
cut-off score was 15 (75%). The scoring model was conjunctive,
which means that the test-taker had to pass both theory tests. The
items consisted of a question and three options. Only one of the op-
tions was correct. The content of the test was not changed very often,
so eventually test-takers came to know many of the items before the
test (Franke et al., 1995; Spolander, 1974).

In 1989 the curriculum was changed (Trafiksäkerhetsverket, 1988)
and both the practical and the theory test were altered. The practical
test was meant to cover the content of the curriculum to a greater ex-
tent than before. Five areas of competence (speed, manoeuvring,
placement, traffic behaviour and attentiveness) were introduced. The
judgement of traffic situations in the practical test should be related to
these competences. The judgement of the practical test was changed
from an assessment where the test-taker obtained a grade on a scale
from one to five, to an assessment where the test-taker either passed or
failed the test.

In the field of the theory education, two new content areas were intro-
duced (Trafiksäkerhetsverket, 1988). These areas focused on risky
situations in traffic and the limitations of driver abilities. The driver
education was extended and the theory education was planned to be
more effective. The new objectives of the curriculum had to be cov-
ered by the new test, so at the same time as the new curriculum was
introduced a new theory test was constructed. When the new curricu-
lum was introduced, it was decided that the test-takers had to pass the
theory test before they were allowed to take the practical test (Matts-
son, 1990).

The new theory test was introduced in January 1990, nine months af-
ter the introduction of the curriculum (Mattsson, 1990). The test was
administered in six versions and each version consisted of forty items.
All items, except for one, were multiple-choice items. The one item
that had a different item format consisted of four descriptions of the
meaning of four traffic signs that should be paired with four out of
eight pictures of traffic signs. The number of options in the new test
was increased from three to four and the test-takers did not know how
many options were correct. In order to get one point on an item the
test-taker had to identify all the correct options. The test-taker did not
get a point if he or she answered three out of four options correctly.

The item format used in the test is rarely used in other countries,
partly because of its inconsistency: the number of correct options is
sometimes known and sometimes unknown to the test-taker.

The content areas of the theory test were given different weights ac-
cording to the curriculum. The weights of the different parts were
regulated through the number of items. The content area that con-
tained most items was “traffic regulations”. In order to pass the theory
test, most of the criteria in the curriculum had to be met, and the test-
taker was not allowed to show a lack of knowledge in any content area. The
scoring of the test was both compensatory and conjunctive, which
means that the test-takers could pass the test in two ways. One way to
pass the test was if the test-taker’s score was 36 out of 40 (90% of the
total score) or higher. Another way to pass the test was if the test-taker
reached the specific cut-off score for each content area and had a score
of 30 or more (Mattsson, 1993).
Table 2. Number of items and cut-off score for the different content areas of the
         theory test (1990-1999).
  Content area                                   Number of items    Cut-off score
  Traffic regulations                                  14            11 (79 %)
  Risky situations in traffic                           8             5 (63 %)
  Limitation of driver abilities                        8             5 (63 %)
  Vehicle-related knowledge                             3             1 (33 %)
  Special applications and other regulations            7             4 (57 %)
  Total                                                40            30 (75 %) (area cut-offs
                                                                     + 4 items correct)
                                                                     or 36/40 (90 %)
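
The two routes to a pass can be made concrete with a short sketch of the decision rule described above and in Table 2. This is only an illustration of the logic; the function and the variable names are invented for the example and are not part of the SNRA's scoring system.

```python
# Illustrative sketch of the 1990-1999 pass rule (not operational SNRA code).
# A test-taker passed with 36/40 or more in total, or by reaching every
# content-area cut-off in Table 2 together with a total score of at least 30.

AREA_CUTOFFS = {                                  # cut-off scores from Table 2
    "Traffic regulations": 11,
    "Risky situations in traffic": 5,
    "Limitation of driver abilities": 5,
    "Vehicle-related knowledge": 1,
    "Special applications and other regulations": 4,
}

def passed_theory_test(area_scores):
    """area_scores maps each content area to the number of correct items."""
    total = sum(area_scores.values())
    if total >= 36:                               # compensatory route: 90 % of 40
        return True
    cutoffs_met = all(area_scores[a] >= c for a, c in AREA_CUTOFFS.items())
    return cutoffs_met and total >= 30            # conjunctive route: 75 % in total

# Example: all area cut-offs met and 32 items correct in total -> pass
print(passed_theory_test({
    "Traffic regulations": 12,
    "Risky situations in traffic": 6,
    "Limitation of driver abilities": 6,
    "Vehicle-related knowledge": 2,
    "Special applications and other regulations": 6,
}))
```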

The curriculum introduced in 1989 was used until 1996, when the
curriculum was revised to include more environmental aspects (VVFS
1996:168). In June 1999 a new theory test was introduced. The new
test had the same content areas as the old test but a different item-
format (VVFS 1999:32). The new test consists of sixty-five multiple-
choice items with only one option correct for each item. Mainly, the
items have four options and the items are proportionally distributed
over the five content areas with the old theory test as a model, i.e. the
relation between the content areas is the same in the new test as in the
old theory test. “Traffic regulations” is still the area that contains most

items. Five try-out items that do not count towards the score are also
included in each test. The cut-off score is set at 52 out of 65 (80 %);
the basis for this decision was that the level of difficulty should not
change between the old and the new theory test. The scoring model is
compensatory: a lack of knowledge in one area can be compensated for
by greater knowledge in other areas (Wolming, 2000b). There are
various methods available for standard setting, but the
decision to set the cut-off score at 52 was not based on any of these
methods (Wiberg & Henriksson, 2000). Instead a statistical model,
which was based on data for the same test-takers taking the old and
new theory test, was used.

A practical test is used to examine the four main parts of the curricu-
lum that relate to practical driving. The performance of the test-
taker is assessed with respect to five competences (VVFS 1996:168)
that are related to the driver’s awareness of risks in traffic. The first
competence is the driver’s speed adaptation in different situations in
traffic. The second competence is the driver’s ability to manoeuvre the
vehicle. The third area of competence is the driver’s placement of the
vehicle in traffic. The fourth area is the driver’s traffic behaviour and
the fifth competence is the driver’s attentiveness to various situations
in traffic.

During the practical test different traffic situations are observed. These
traffic situations are divided into five types: handling the vehicle,
driving in a built-up area, driving in a non built-up area, a combination
of driving in a built-up and a non built-up area, and driving in special
conditions, e.g. darkness and slippery roads. The performance in these
situations is related to the five competences mentioned earlier. If the
test-taker fails in any competence the driving examiner notes in which
traffic situation the error occurred. One error is sufficient to fail the
test-taker.

In the following sections the process of test construction and impor-
tant issues in test development will be considered. When constructing
a test it is important to consider if the performance of the test-taker is
to be compared with the performance of other test-takers or with some
external criterion, i.e. if the test is a criterion- or norm-referenced test.

Criterion-referenced and norm-referenced tests
In general, tests can provide information, that aids individual and in-
stitutional decision-making. Tests can also be used to describe indi-
vidual differences and the extent of mastery of basic knowledge and
skills. These two general areas of test application lead to two ap-
proaches to measurement and, as a consequence, also two kinds of
test; norm-referenced tests (NRT) and criterion-referenced tests
(CRT). This formal differentiation of two general approaches to test
construction and interpretation has its origin in an article by Glaser
(1963). This article outlines these two formal approaches to test con-
struction and interpretation.

The main difference between these approaches is that a CRT is used to
ascertain a test-taker’s status with respect to a well-defined criterion
and a NRT is used to ascertain a test-taker’s status with respect to the
performance of other test-takers on that test.

From a more detailed perspective Popham (1990) also defines two
major distinctions between CRT and NRT. The first relates to the cri-
terion with CRT focusing mostly on a well-defined criterion and a
well-defined content domain. The specification and description of the
domain, and the concept of the domain, are given in terms of learner
behaviours. The specification of the instructional objectives associated
with these behaviours is central in CRT. The criterion is performance
with regard to the domain. NRT focuses more on general content
or process domains such as vocabulary and reading comprehension.
Thus, the difference rests on a tighter, more complete definition of the
domain for CRT, as compared to NRT. In some cases CRT also in-
cludes specification of the performance standard and this performance
standard may for example take the form of specifying number of items
to be answered correctly or number of objectives to be mastered.

The other major distinction relates to the interpretation of a test-
taker’s score. CRT describes the score with respect to a criterion and
NRT the score with respect to the score of other test-takers. An exam-
ple of a NRT in Sweden is the SweSAT which is used for selection to
higher education (Andersson, 1999) and an example of a CRT is the
national tests that are used as an aid in the grading procedure for
teachers in upper secondary school (Lindström, Nyström & Palm,
1996).

A closer look at the theory test, the instructional objectives of the
driver education (as they are defined in the curriculum issued by the
SNRA) and the interpretation of test scores, leads to the conclusion
that the theory test can be characterised as a CRT. The curriculum
represents the criterion and the theory test consists of five different
parts (Table 1) that are connected to the curriculum. The purpose of
the test is to determine if a test-taker has acquired a certain level of
knowledge compared with the defined criterion and standard setting is
used to define this level of knowledge (Mattsson, 1993).

           Important issues in test development
The first and most important step in test development is to define the
purpose of the test or the nature of the inferences intended from test
scores. The measurement literature is filled with breakdowns and clas-
sifications of purposes of tests and in most cases the focus is on the
decision, i.e., the decision that is made on the basis of the test infor-
mation (see for example Bloom, Hastings and Madaus, 1971; Mehrens
& Lehman, 1991; Gronlund, 1998; Thissen & Wainer, 2001).

The setting for the theory test is that this test is used to make decisions
about test-taker performance with reference to an identified curricular
domain. Curricular domain is defined here as the skills and knowledge
intended or developed as a result of formal, or non-formal, instruction
on identifiable curricular content.

Test specification
When the purpose of the test is clarified the next logical step in test
development is to specify important attributes of the test. Test content
is, in most cases, the main attribute. Other important attributes include
for example test and item specification, item format and design of the
whole test, as well as psychometric characteristics, evaluation and
selection of items and standard setting procedures. These attributes are
also dependent on external factors, such as how much testing time is
available and how the test can be administered. Millman & Green
(1989), for example, distinguished between external contextual factors
(for instance who will be taking the test and how the test will be ad-
ministered) and internal test attributes (for instance, desired dimen-
sionality of the content and distribution among content components,

item formats, evaluation of items and desirable psychometric charac-
teristics of both individual items and the whole test). With reference to
internal attributes Henriksson (1996a) also made a distinction between
two kinds of models, a theoretical model and an empirical model. The
theoretical model is based mainly on judgements but also on state-
ments about, for example, the number of items in the test and item
type, and the empirical model is based on empirical data describing
psychometric characteristics of items as well as the whole test. The
theoretical and empirical models are summarised in the test specifications.

One effective way to ensure adequate representation of items in a test
is to develop a two-way grid called a test blueprint or a table of speci-
fication (Nitko, 1996; Haladyna, 1997). In most cases the two-way
grid includes content and the types of mental behaviours required of
the test-taker when responding to each item. Haladyna (1999), for
example, suggested that all content can be classified as representing
one of four categories: fact, concept, principle, or procedure. He also
defined five cognitive operations: recall, understanding, prediction,
evaluation and problem solving. Another well-known hierarchical
system is the taxonomy by Bloom (1956) consisting of six major cate-
gories. This hierarchical system has also been elaborated (Andersson
et al, 2001). However, the behaviour dimensions should not be too
complex and it can be claimed that the Bloom taxonomy has never
achieved any greater success as a tool for test construction, maybe
because it is too complex. Perhaps the revised model will be a step
forward in that respect?
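
As a small illustration of such a two-way grid, the sketch below crosses Haladyna's (1999) content categories with his cognitive operations and records a planned number of items in each cell. The cell values are invented for the example; a real blueprint would be filled in from the test specifications.

```python
# Illustrative test blueprint: content categories crossed with cognitive
# operations (after Haladyna, 1999). The item counts are invented.
content_categories = ["fact", "concept", "principle", "procedure"]
cognitive_operations = ["recall", "understanding", "prediction",
                        "evaluation", "problem solving"]

blueprint = {
    ("fact", "recall"): 6,
    ("concept", "understanding"): 8,
    ("principle", "prediction"): 5,
    ("procedure", "problem solving"): 6,
}

def items_per_content(grid):
    """Planned number of items for each content category."""
    totals = {c: 0 for c in content_categories}
    for (content, _operation), n in grid.items():
        totals[content] += n
    return totals

print(items_per_content(blueprint))
print("planned test length:", sum(blueprint.values()))
```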

But, as Henriksson (1996a) pointed out, the matrix schemes for the
composition of a test need not be limited to the dimensions of content
and process. More dimensions can be added by considering for exam-
ple the item’s reading level, the amount of text and the formation of
distractors. Other factors that also can be considered are surplus in-
formation, degree of non-verbal information, abstract-concrete and so
forth. The effort to create these theoretical attributes and to establish a
theoretical model for the test is based on judgements by experts. It can
also be added that these added dimensions give more guidance for the
test developer and, at the same time, the model for the whole test be-
comes more exact.

Item specifications
There is dependence between test and item specifications since the
theoretical as well as the empirical model for a certain test are related
to the attributes of the item. Therefore, most of the item specifications
are outlined when the test specifications are defined. An item specifi-
cation includes sources of item format, item content, descriptions of
the problem situations, characteristics of the correct response and in
the case of multiple-choice items: characteristics of the incorrect re-
sponses. The use of item specifications is particularly advantageous
when a large item pool is to be created and when different item
writers will construct the items. If each writer sticks to the item speci-
fication, a large number of parallel items can be generated for an ob-
jective within a relatively short time (Crocker & Algina, 1986).

Different types of information should be stored for each item. First,
information used to access the item from a number of different points
of view should be stored. This information usually consists of key-
words describing the item content, its curricular content, its behav-
ioural classification and any other salient features; for example the
textual and graphical portions of the item. Different kinds of judge-
ments by experts give this theoretical information (Henriksson,
1996b). Second, psychometric data should be stored, such as the item
difficulty and item discrimination indices. Third, and of relevance to
the theory test - the number of times the item has been used in a given
period, the date of the last use of the item, and identification of the last
test-version the item appeared in, i.e. different indices of exposure for
each item.
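
The three kinds of information listed above can be pictured as one record per item, as in the minimal sketch below. All field names are invented for the illustration and do not describe the actual item bank used for the theory test.

```python
# Sketch of the information stored per item: descriptive keywords and
# classifications, psychometric data, and exposure data. Field names are
# illustrative only.
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class ItemRecord:
    item_id: str
    # 1. Information used to access the item from different points of view
    keywords: List[str]
    curricular_content: str
    behavioural_class: str
    # 2. Psychometric data
    difficulty_p: Optional[float] = None           # proportion correct
    discrimination_rpbis: Optional[float] = None   # point-biserial correlation
    # 3. Exposure data, of particular relevance to the theory test
    times_used: int = 0
    last_used: Optional[date] = None
    last_test_version: Optional[str] = None

item = ItemRecord(
    item_id="TR-0101",
    keywords=["right of way", "intersection"],
    curricular_content="Traffic regulations",
    behavioural_class="understanding",
    difficulty_p=0.72,
    discrimination_rpbis=0.35,
)
```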

It should also be noted that the storage of empirical item statistics also
represents a measurement problem. Under classical test theory, item
statistics are group dependent and, therefore, must be interpreted
within the context of the group tested (Linn, 1989). It should also be
mentioned that when using item response theory (IRT) as a basis for
empirical item statistics this disadvantage of group dependence is
eliminated, i.e. it is possible to characterise or describe an item, inde-
pendently of any sample of test-takers who might respond to the item
(see for example Lord, 1980; Hambleton et al, 1991; Thissen & Or-
lando, 2001).

Item format
Generally speaking, the test developer faces the issue of what to
measure and how to measure it. For most large-scale testing pro-
grammes, test blueprints and cognitive demands specify content and
demands in terms of what to measure. Regarding the question of how
to measure, one dilemma facing test developers is the choice of item
format. This issue is, according to Rodriguez (2002), significant in a
number of ways. One factor is that interpretations vary according to
item format and a second factor is that the cost of scoring open-ended
questions can be enormous compared with multiple-choice items. A
third factor is that the consequences of using a certain item format
may affect instruction in ways that foster, or hinder, the development
of cognitive skills measured by tests. The significance of format selec-
tion is also related to validity, either as a unitary construct (Frederik-
sen & Collins, 1989; Messick, 1989) or as an aspect of consequential
validity (Messick, 1994).

In view of the statements mentioned in the previous paragraph the
conclusion is that it is useful to distinguish between what is measured
and how it is measured; between substance and form; between content
and format. The two are not independent, for form affects substance,
and, to some extent, substance dictates form. Nevertheless, the em-
phasis here is on form; on how items are presented. First, a set of at-
tributes of item formats is offered that can serve to classify item types.
Second, the importance of an item’s format is discussed: its relation-
ship to what is measured and its effect on item parameters (Linn,
1989).

The issues surrounding item format selection, and test design more
generally, are also critically tied to the nature of the construct being
measured. In line with this statement Martinez (1999), reviewing the
literature on cognition and the question of item format, concluded that
no single format is appropriate for all educational purposes. Referring
to the driving-license test, we might assert that driving ability can (and
should) be measured via a driving-ability performance test and not a
multiple-choice exam, but knowledge about driving (procedures, regu-
lations and local laws) can be measured by a multiple-choice exam.

The item format is described in the item specifications. For optimal
performance tests (for example the theory test) there is a variety of

item formats that could be considered. The item formats can be di-
vided into two major categories; those that require the test-taker to
generate the response and those that provide two or more possible
responses and require the test-taker to make a selection. Because the
latter can be scored with little subjectivity, they are often called objec-
tive test items (Crocker & Algina, 1986).

It is also worth mentioning that open-ended questions, i.e., questions
for which the test-taker constructs the answer using his or her own
words, are often preferred because of a belief that they may directly
measure some cognitive process more readily, or because of a belief
that they may more readily tap a different aspect of the outcome do-
main. The consequence has been that popular notions of authentic and
direct assessment have politicised the item-writing profession (Rodri-
guez, 2002). This tendency to include less objective formats in tests
give rise to subjectivity and this conclusion is based on the fact that
multiple-choice items can be scored with significant certainty and
with objectivity. But the crucial question is whether multiple-choice
items and open-ended items measure the same cognitive behaviour or
not? Rodriguez (2002, p 214) briefly formulated his standpoint in the
following way: “They do if we write them to do so”.

In line with the arguments for multiple-choice items Ebel (1951,
1972) suggested that every aspect of cognitive educational achieve-
ment is testable through the multiple-choice format (or true-false
items). His conclusion is also that the things measured by these items
are far more determined by their content than by their form. Many of
the recent authors refer to the wise advice in Ebel’s writing regarding
test development and item writing. See for example Carey (1994);
Osterhof (1994); Kubiszyn & Borich (1996); Payne (1997); McDon-
ald (1999).

Evaluation of items
The problem of deciding which items to use in a test is related to the
theoretical and empirical model as well as to the test and item specifi-
cation. The summarised conclusion is that quality items are desired.
Consequently, evaluation and judgement procedures based on theo-
retical and empirical data are important to weed out flawed items.

An often-used procedure in item construction is that external item-
writers deliver items, which then are examined and scrutinised by test
developers. This model is used by the SNRA. In many cases the item
writers' proposals have to be changed in one way or another in order
to meet the requirements for good items. The test developer, who is an
expert in test- and item-
construction, makes these changes and improvements. When this
process is finished, item evaluation is the next step.

The term theoretical evaluation is used for the process when the items
are judged against stated and defined criteria. The procedure requires
that the items are written but not necessarily administered to a
representative sample of test-takers in a try-out. Common to all
methods for theoretical evaluation is that one or more judges evaluate
the items against the criteria. A decision must be made about which
criterion or criteria should be addressed, and the priority between those
criteria. Techniques and methods for evaluating the judgements must be
decided upon as well. This process of judgement can relate to the item
per se as well as to the theoretical and empirical model for the test. Henriksson
(1996b) defined and described accuracy, difficulty, importance, bias
and conformity as assessment criteria. The judgement can also be fo-
cused on the classification of items according to item parameters.
These item parameters are included in the model for the theoretical
component of the total model for the test and in this respect the basic
aim of the judgement and evaluation is to get indications about the
reliability of classification.

To obtain information about certain items, item analysis is used. Item
analysis is the computation of the statistical properties of an item re-
sponse distribution. Item difficulty (p) is the proportion of test-takers
answering the item correctly. Item discrimination is used to assess
how well performance on one item relates to some other criterion, e.g.
the total test score. Two statistical formulas that are commonly used
are point biserial correlation (rpbis) and biserial correlation coefficient
(rbis) (Crocker & Algina, 1986).
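
As a small numerical illustration of these statistics, the sketch below computes the item difficulty p and a point-biserial discrimination index from a matrix of dichotomously scored responses. The data are invented and the formulas are the ordinary textbook ones, not the analysis routines used for the theory test; each item is correlated with the sum of the remaining items so that it does not contribute to its own criterion.

```python
# Classical item analysis of a 0/1 response matrix (rows = test-takers,
# columns = items); invented data, textbook formulas.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

def item_difficulty(resp):
    """Proportion of test-takers answering each item correctly (p)."""
    return resp.mean(axis=0)

def point_biserial(resp):
    """Correlation between each item and the sum of the remaining items."""
    r = []
    for j in range(resp.shape[1]):
        rest = resp.sum(axis=1) - resp[:, j]      # criterion: rest score
        r.append(np.corrcoef(resp[:, j], rest)[0, 1])
    return np.array(r)

print("p:", item_difficulty(responses))
print("r_pbis:", point_biserial(responses).round(2))
```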

Try-out
It’s important to pre-test the items before they are put in the actual test
since it’s difficult to anticipate how an item will work in the actual
test. Before the try-out is carried out it is important to describe what

                                    13
information the try-out should result in. It’s also important to be aware
that some items probably will not be good enough and that several try-
outs are necessary to end up with a collection of good items. If the test
will consist of several parallel test versions, an extensive domain of
pre-tested items is required.

When selecting the group for the try-out it is important to consider if
they are representative of the group that takes the actual test. One
should also consider their motivation to do the test and the size and
the availability of the group. Of course there are many reasons why
the apparent difficulty might be expected to change between item try-
out and actual testing. One might, for example, expect the test-takers
to be more motivated during the actual testing, or one might believe
that there were changes in instruction during the intervening period.

The try-out can be done separately from the actual test, or in combina-
tion with items in the actual test. If the try-out items are a part of the
actual test the test-taker can either be informed that they are working
with try-out items or not. The advantage with this design is that the
try-out is done in the proper group of test-takers and that they are
probably fully motivated.

Validity
The traditional approach classifies validity into three types:
content-related evidence of validity, criterion-related evidence of
validity and construct-related evidence of validity.

Content-related evidence of validity refers to the extent to which the
content of test items represents the entire body of content. This body
of content is often called the content universe or domain. The basic
issue in content validation is representativeness. In other words, how
adequately does the content of the test represent the entire body of
content to which the test user intends to generalise? The word “con-
tent” refers, in this context and according to Anastasi (1988), to both
the subject-matter included in the test and the cognitive processes that
test-takers are expected to apply to the subject matter. Hence, in
collecting content-related evidence of validity it is necessary
to determine what kinds of mental operations are elicited by the prob-
lems presented in the test, as well as what subject-matter topics have

been included or excluded. The key ingredient in securing content-
related evidence of validity is human judgement.

Criterion-related evidence of validity is based on the extent to which
the test score allows inferences about the performance on a criterion
variable. In this context the criterion is the variable of primary inter-
est. If the information about the criterion is available at the same
time as the test information, the validity is called concurrent validity.
Concurrent-related evidence of validity is, for example, frequently
used to establish that a new test is an acceptable substitute for a more
expensive measure. If the criterion information is available after a
certain time, for example a year or more, the validity is called predic-
tive validity. Thus, predictive-related evidence of validity refers to
how well a test predicts or estimates some future performance on a
certain criterion. The degree to which scores on the test being vali-
dated predict successful performance on the criterion is estimated by a
correlation coefficient. This coefficient is called the validity coefficient.

Construct-related evidence of validity refers to the relation between
test score and a theoretical construct, i.e. a measure of a psychological
characteristic of interest. Examples of theoretical constructs are
intelligence, critical thinking, creativity, introversion, self-esteem,
aggressiveness and achievement motivation. Reasoning ability, reading comprehen-
sion, mathematical reasoning ability and scholastic aptitude are other
examples of constructs. Such characteristics are referred to as con-
structs because they are theoretical constructions about the nature of
human behaviour.

Construct validation is the process of collecting evidence to support
the assertion that a test measures the construct that it is supposed to
measure. Construct-related evidence of validity can seldom be in-
ferred from a single empirical study or from one logical analysis of a
measure. Rather, judgements of validity must be based on an accumu-
lation of evidence. Construct-related evidence of validity is investi-
gated through rational, analytical, statistical and experimental proce-
dures. The development or use of theory that relates various elements
of the construct under investigation is central. Hypotheses based on
theory are derived and predictions are made about how the test scores
should relate to specified variables. In a classical article Cronbach &
Meehl (1955) suggested five types of evidence that might be assembled
in support for construct validity. These types were also succinctly

stated by Helmstadter (1964) and Payne (1997). Both evidence of con-
tent-related validity and evidence of criterion-related validity are used
in this process. In that sense, content validation and criterion valida-
tion become part of construct validation.

 This latter conclusion (i.e. that content-related, criterion-related and
construct-related evidence of validity are not separate and independent
types of validity, but rather different categories of evidence that are
each necessary and cumulative) represents the integrated view of va-
lidity. This integrated and unitary view of validity is described, for
example, in Messick’s (1989) treatment of validity. Recent trends in
validation research have also stressed that validity is a unitary concept
(see, for example, Wolming, 2000a; Nyström, 2004). Thus, validity-
related evidence concerns the extent to which test scores lead to ade-
quate and appropriate inferences, decisions and actions. It concerns
evidence for test use and judgement about potential consequences of
score interpretation and use. However, it can also be added that, in a
very real sense, validity is not strictly a characteristic of the instrument
itself but of the inference that is to be made from the test scores de-
rived from the instrument.

Reliability
When a test is administered, the test user would like some assurance
that the test is reliable and that the results could be replicated if the
same individuals were tested again under similar conditions (Crocker
& Algina, 1986). Reliability refers to the degree to which test scores
are free from errors of measurement. There are several procedures to
estimate test score reliability.

The alternate form method requires constructing two similar versions
of a test and administering both versions to the same group of test-
takers. In this case, the errors of measurement that primarily concern
test users are those due to differences in content of the test versions.
The correlation coefficient between the two sets of scores is then
computed (Crocker & Algina, 1986). If two versions of a test measure
exactly the same trait and measure it consistently, the scores of a
group of individuals on the two test versions would show perfect cor-
relation. The lack of perfect correlation between test versions is due to
the errors of measurement. The greater the errors of measurement, the
lower the correlation (Wainer et al., 1990).

The test-retest method is used to control how consistently test-takers
respond to the test at different times. In this situation measurement
errors of primary concern are fluctuations of a test-taker's observed
score around the true score because of temporary changes in the test-
taker's state. To estimate the test-retest reliability the test constructor
administers the test to a group of test-takers, waits, and readministers
the same test to the same group. Then the correlation coefficient be-
tween the two sets of scores is estimated.

Internal consistency is an index of both item homogeneity and item
quality. In most testing situations the examiner is interested in gener-
alizing from the specific items to a larger content domain. One way to
estimate how consistently the performance of the test-takers relates to
the domain of items that might have been asked is to determine how
consistently the test-takers performed across items or subsets of items
on a single test version. The internal consistency estimation proce-
dures estimate the correlation between separately scored halves of a
test. It is reasonable to think that the correlation between subsets of
items provides some information about the extent to which they were
constructed according to the same specifications. If test-takers’ per-
formance is consistent across subsets of items within a test, the exam-
iner can have some confidence that this performance would generalize
to other possible items in the content domain (Crocker & Algina,
1986).
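
A minimal sketch of the split-half procedure described above is given below: the test is divided into two halves, the half scores are correlated, and the half-test correlation is stepped up to full test length with the Spearman-Brown formula (a standard correction not discussed in the text). The response data are invented.

```python
# Split-half estimate of internal consistency with the Spearman-Brown
# step-up; invented 0/1 response data (rows = test-takers, columns = items).
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
])

odd_half = responses[:, 0::2].sum(axis=1)     # items 1, 3, 5
even_half = responses[:, 1::2].sum(axis=1)    # items 2, 4, 6

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)            # Spearman-Brown for double length

print("half-test correlation:", round(r_half, 3))
print("estimated full-test reliability:", round(r_full, 3))
```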

The techniques for estimation of reliability mentioned above have
been developed largely for norm-referenced measurement. Other
techniques have been suggested for criterion-referenced tests. Crocker
and Algina (1986) presented some reliability coefficients for criterion-
referenced measurement. Wiberg (1999a) found that the statistical
techniques used to evaluate the reliability of norm-referenced tests could
also be used to evaluate the reliability of criterion-referenced tests.
However, the usage and interpretation of the results must be handled
with caution. The variation in test scores among test-takers constitutes
an important foundation for the statistical techniques estimating reli-
ability in norm-referenced tests. Only when the items in a criterion-
referenced test fulfil the assumptions underlying classical test theory
would it be recommendable to use these statistical methods.

Parallel test versions
If a test has two or more versions and the test-taker’s score from the
test is used for decisions (which is the case for the theory test) all of
them must be parallel. This means that different versions contain dif-
ferent items but are built to the same test and item specifications and
the same models. From a perspective of a test-taker this means that the
obtained test result should be exactly the same, irrespective of the ver-
sion that is administered. The need for parallel test versions is moti-
vated by the need for test security and for the sake of fairness. It is
also a fundamental requirement if repeated test taking is permitted.
There are formal and theoretical definitions of parallel test forms (see
for example Thissen & Wainer, 2001) and sometimes a distinction is
made between parallel, equivalent and alternate forms. But, it can
also be added that, for example, Hanna (1993) used parallel, equiva-
lent and alternate forms synonymously.
Thus, parallel, equivalent or alternate forms1 have identical weight
allocations among topics and mental processes, but the particular test
questions differ. Ideally, parallel test versions should have equivalent
raw score means, variability, distribution shapes, reliabilities, and cor-
relation with other variables. To estimate the reliability between two
or more versions of the same test the alternate form method is used
(Crocker & Algina, 1986). If the versions are parallel regarding item
difficulty there is a high correlation between them.
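
In practice, the comparison of two versions often starts from just these descriptive statistics together with the alternate-form correlation, as in the sketch below. The scores are invented and stand for the same group of test-takers taking versions A and B.

```python
# Comparing two test versions taken by the same group: raw score means,
# standard deviations and the alternate-form correlation. Invented scores.
import numpy as np

version_a = np.array([31, 28, 35, 22, 30, 27, 33, 29])
version_b = np.array([30, 29, 34, 24, 31, 26, 32, 28])

print("means:", version_a.mean(), version_b.mean())
print("standard deviations:",
      version_a.std(ddof=1).round(2), version_b.std(ddof=1).round(2))
print("alternate-form correlation:",
      np.corrcoef(version_a, version_b)[0, 1].round(3))
```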
When put into practice, however, the construction and evaluation of
parallel test versions give rise to a number of problems and it is neces-
sary to examine the property two versions should have that would
qualify them for use interchangeably. The concept of parallel versions
sets the ground for a discussion of practical problems in constructing
two (or more than two) parallel test versions that we are willing to
regard as interchangeable.

1
    The term parallel test versions is used in this report.

Standard setting
The idea of standard setting is to find a method that minimises the
number of wrong decisions about the test-taker. There are two types of
wrong decision. The first is if a test-taker that does not have the
knowledge passes the test. The other is if a test-taker that has the
knowledge fails the test (Berk, 1996).

The cut-off score in a test represents a line between confirmed knowl-
edge and a lack of knowledge in a certain area. If the test-taker’s total
score is equal to or higher than the cut-off score he or she has the
knowledge that is measured by the test. If the test-taker’s total score is
less than the cut-off score he or she does not have the knowledge
measured by the test (Crocker & Algina, 1986).

There are various methods that can be used in standard setting, and
which method is suitable depends on the format of the test (Berk,
1986). The methods can be categorized according to their definition of
competence. Some methods assume that the test-takers either have the
knowledge or they do not. Other methods view competence as a char-
acteristic that is continuously distributed, and that a test-taker’s
knowledge can be seen as a value within an interval in this distribu-
tion. These latter methods of standard setting can be divided into dif-
ferent groups depending on the amount of judgement in the decision.
Jaeger (1989) proposed two main categories that are based on
performance data of the test-takers: test-centred continuum models,
which are mainly based on judgements, and examinee-centred continuum
models, which are mainly based on the test-takers' performance on the test.
In addition to these models there are judgemental continuum models
that are mainly based on judgement. In the last few years a fourth
category, “multiple models”, has been introduced. This model is used
for standard setting when the test has multiple item formats or multi-
ple cut-off scores.

It can also be added that there are basically three general methods for
applying standards: disjunctive, conjunctive and compensatory (see
for example Gulliksen, 1950; Mehrens, 1990; Haladyna & Hess,
1999). In the disjunctive and conjunctive approaches, performance
standards are set separately for the individual assessment, for example
a subtest. In the compensatory procedure, performance standards are

set for a composite or index that reflects a combination of subtest
scores.

With the disjunctive model, test-takers are classified as an overall pass
if they pass any one of the subtests by which they are assessed. This
approach is applied rather seldom and seems most appropriate when
the subtests involved in a test battery are parallel versions, or in some
other way are believed to measure the same construct. Haladyna &
Hess (1999), for example, pointed out that the disjunctive approach is
employed in assessment programmes that allow a test-taker to retake a
failed test.

With a conjunctive model for decision-making, test-takers are classi-
fied as having passed only if they pass each of the subtests by which
they are assessed. The use of the conjunctive approach seems most
appropriate when the subtests assess different constructs, or aspects of
the same construct, and each aspect of the construct is highly valued.
Failing only one assessment yields an overall fail because the content
standards measured by each assessment are considered essential to
earn an overall pass. The application of a conjunctive strategy to stan-
dard setting results in test-takers being classified into the lowest cate-
gory attained on any one measure employed.

With a compensatory model, test-takers are classified as pass or fail
based on performance standards set on a combination of the separate
subtests employed. Data are combined in a compensatory approach by
means of an additive algorithm that allows high scores on some sub-
tests to compensate for low scores on others. The use of a compensa-
tory strategy seems, according to Ryan (2002), appropriate when the
composite of the separate subtests has important substantial meaning,
a meaning that is not represented by subtests taken separately.

A useful combination of the compensatory and conjunctive model can
also be employed. Such an approach sets minimal standards on each
subtest that is applied in a conjunctive fashion. This means that the
test-taker must reach a minimal pass level on each subtest before a
compensatory approach is applied, and a final rating is determined.
This combined conjunctive-compensatory approach sets minimum
standards that are necessary on each subtest but not sufficient for the
subtests taken together. This approach prevents very low levels of

                                   20
performance on one subtest being balanced by exceptional perform-
ance on other subtests (Mehrens, 1990).
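
The four ways of applying standards can be contrasted in a few lines of code. The subtest names, the cut-off scores, the composite cut-off and the lower minimum levels below are all invented; the point is only the shape of each decision rule.

```python
# Disjunctive, conjunctive, compensatory and combined conjunctive-compensatory
# decision rules with illustrative (invented) standards.

SUBTEST_CUTOFFS = {"A": 10, "B": 12, "C": 8}    # per-subtest standards
COMPOSITE_CUTOFF = 34                           # standard on the composite score
MINIMUMS = {"A": 8, "B": 9, "C": 6}             # necessary-but-not-sufficient levels

def disjunctive(scores):
    return any(scores[s] >= c for s, c in SUBTEST_CUTOFFS.items())

def conjunctive(scores):
    return all(scores[s] >= c for s, c in SUBTEST_CUTOFFS.items())

def compensatory(scores):
    return sum(scores.values()) >= COMPOSITE_CUTOFF

def conjunctive_compensatory(scores):
    # minimal level required on every subtest, then the composite standard
    meets_minimums = all(scores[s] >= m for s, m in MINIMUMS.items())
    return meets_minimums and compensatory(scores)

scores = {"A": 11, "B": 13, "C": 9}
for rule in (disjunctive, conjunctive, compensatory, conjunctive_compensatory):
    print(rule.__name__, rule(scores))
```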

Test administration
There are basically three different ways of presenting items to the test-
takers: by paper-and-pencil-tests, computerised tests or computerised
adaptive tests.

In a paper-and-pencil-test all test-takers get the same number of
items. The items are answered with a pencil on paper. The test is often
administered to a large number of test-takers a limited number of
times because of the item exposure. The item analysis is mainly done
with classical test theory (Wainer et al., 1990).

A computerised test is mainly the same as a paper-and-pencil-test. The
difference is that a computerised test is carried out with a computer,
which makes it possible to randomise the order of the items and the
options for each test-taker. An advantage with computerised tests is
that the administration takes less time since the scoring can be done
during the test. Another advantage is that the security of the test is
increased when there are no paper copies of the test.

With computerised tests there’s the possibility of using new innova-
tive types of items (van der Linden & Glas, 2000). Different types of
items are created from combinations of item format, response actions,
media and interactivity. An example of an item format is a multiple-
choice item. A response action could be that the test-taker answers the
item with a joystick. The items can contain different media, for exam-
ple animations and sound. Media can be used both in the item and in
the options. An example of interactivity is an item where the test-taker
can answer the item by marking a text or a point.

It is important to be aware of new measurement errors that can occur
in computerised testing. For example, a computerised test could imply
problems for test-takers who are not used to working with computers.
Another possible measurement error is that bad graphics on the com-
puter monitor can result in blurred pictures.

In computerised adaptive tests (CAT) the test-taker obtains items of
different difficulty depending on how the person answered the previ-

ous items in the test. CAT makes it possible to give the test-takers a
test that fits their ability (Umar, 1997). Which items are selected for a
test-taker also depends on the content and difficulty of the test and the
item discrimination. For each response the test-taker gives, the com-
puter program estimates the test-taker’s ability and how reliable the
estimate is. When the predetermined reliability is achieved the test is
finished and the test-taker obtains the final estimate of his or her abil-
ity level (Wainer et al., 1990).

Tests based on CAT are often analysed with Item Response Theory,
(IRT). IRT can be used to describe test-takers, items and the relation
between them. IRT takes into account that the items in a test can vary
with respect to item difficulty. There are different models in IRT that
can be used to create scale-points. The one-parameter logistic (or
Rasch) model is the simplest model where only an “item difficulty”
parameter is estimated. The two-parameter logistic model estimates
not only a “difficulty” parameter but also a “discrimination” parame-
ter. The three-parameter logistic model includes a “guessing” parame-
ter as well as “discrimination” and “difficulty” parameters (Birnbaum,
1968).
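
Written out, the three logistic models give the probability that a test-taker with ability θ answers item i correctly. The notation below is the standard one (a scaling constant D is sometimes included in the exponent but is omitted here):

```latex
% Item response functions of the three logistic models:
% b_i = difficulty, a_i = discrimination, c_i = pseudo-guessing parameter.
\begin{align*}
  \text{1PL (Rasch):}\quad & P_i(\theta) = \frac{1}{1 + e^{-(\theta - b_i)}} \\[4pt]
  \text{2PL:}\quad         & P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}} \\[4pt]
  \text{3PL:}\quad         & P_i(\theta) = c_i + (1 - c_i)\,
                             \frac{1}{1 + e^{-a_i(\theta - b_i)}}
\end{align*}
```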

There are three basic assumptions in IRT (Crocker & Algina, 1986).
The test has to be unidimensional, which means that all items measure
the same trait. The assumption of local independence means that the
answer on one item by a randomly picked test-taker is independent of
his or her answers on other items. The third assumption is that the
relationship between the proportion of test-takers that answered an
item correctly and the latent trait can be described with an item-
characteristic curve for each item. With IRT models we can deter-
mine the relationship between a test-taker’s score on the test and the
latent trait, which is assumed to determine the test-taker’s result on the
test. A test-taker with higher ability is more likely to answer an item
correctly than a test-taker with lower ability. If these three conditions
are met, test-takers can be compared even if they did not take parallel
test versions.
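
Drawing the CAT description and the IRT models together, the adaptive procedure can be sketched as a simple loop: select the most informative remaining item at the current ability estimate, record the response, re-estimate the ability, and stop when the estimate is precise enough. The sketch below uses the two-parameter model, a grid-based maximum-likelihood estimate, and invented item parameters and stopping rule; it is a bare illustration of the idea, not the algorithm of any operational test.

```python
# Minimal CAT loop: maximum-information item selection under the 2PL model,
# grid-based ML ability estimation, and a precision-based stopping rule.
# All item parameters and the simulated test-taker are invented.
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, size=30)          # discrimination parameters
b = rng.uniform(-2.0, 2.0, size=30)         # difficulty parameters
true_theta = 0.5                            # simulated ability

def p_correct(theta, a_i, b_i):
    return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))

def estimate_theta(answered, responses, grid=np.linspace(-4, 4, 161)):
    """Grid-based ML estimate of ability and its approximate standard error."""
    loglik = np.zeros_like(grid)
    info = np.zeros_like(grid)
    for j, u in zip(answered, responses):
        p = p_correct(grid, a[j], b[j])
        loglik += u * np.log(p) + (1 - u) * np.log(1 - p)
        info += a[j] ** 2 * p * (1 - p)      # Fisher information of a 2PL item
    best = int(np.argmax(loglik))
    return grid[best], 1.0 / np.sqrt(info[best])

answered, responses, theta_hat = [], [], 0.0
while len(answered) < len(a):
    remaining = [j for j in range(len(a)) if j not in answered]
    # choose the remaining item with maximum information at the current estimate
    j = max(remaining,
            key=lambda k: a[k] ** 2 * p_correct(theta_hat, a[k], b[k])
                          * (1 - p_correct(theta_hat, a[k], b[k])))
    answered.append(j)
    responses.append(int(rng.random() < p_correct(true_theta, a[j], b[j])))
    theta_hat, se = estimate_theta(answered, responses)
    if len(answered) >= 10 and se < 0.4:     # invented precision criterion
        break

print("items administered:", len(answered),
      "estimated ability:", round(theta_hat, 2))
```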

Item bank
An item bank is a collection of items that can be used to construct a
computerised test or a test based on CAT. An item bank should con-
sist of a large number of pre-tested items so that varied tests can be
