Comparison of Asymptotic and Bootstrap Item
Fit Indices in Identifying Misfit to the Rasch Model

National Conference on Measurement in Education
New Orleans, LA

Edward W. Wolfe
Michael T. McGill

April 2011
ASYMPTOTIC & BOOTSTRAP FIT INDICES                                                          1

                                       Author Note

       Edward W. Wolfe, Research Services, Assessment & Information, Pearson.

      Correspondence concerning this article should be addressed to Edward W. Wolfe, 2510
N. Dodge St., Mailstop 125, Iowa City, IA 52245-9945. E-mail:
                                       Abstract


Rule-of-thumb critical values are not suitable for flagging items for misfit to the Rasch model

because the null distributions of mean-square fit indices are not well understood. Bootstrap

procedures may be better suited for identifying appropriate critical values, but the accuracy of

those procedures has not been studied. In this study, data were generated according to the

dichotomous Rasch model, and violations of the lower asymptote and common slope

assumptions were introduced into the simulated data while altering the number of examinees,

number of items, and item difficulty/person ability distribution offset. For each cell of the

experimental design, the proportion of items that satisfied the Rasch model assumptions flagged

for misfit (Type I errors) and the proportion of items modeled to violate assumptions that were

not flagged for misfit (Type II errors) were compared to the analogous flag rates using rule-of-

thumb and distribution corrected critical values. Results suggest that Type II errors are much

lower for critical values based on bootstrap procedures and that distribution offset and type of

misfit influence the accuracy of misfit diagnosis.

                        Keywords: Rasch model, model-data fit, bootstrap

                               Comparison of Asymptotic and Bootstrap Item Fit Indices

                                          in Identifying Misfit to the Rasch Model

      Item fit statistics are commonly used in applications of the Rasch model to aid in selection

of items after field testing or retention of items in operational contexts. Four of those fit statistics

include the weighted and unweighted mean-squared fit statistics and the standardized versions of

these two fit statistics (Smith, 2000). The mean squared fit statistics (Wright & Masters, 1982)

are based on the standardized residual of the observed response for each person and item

combination from the modeled expectation, given the parameter estimates,

      $$z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}, \tag{1}$$

      where $x_{ni}$ = the observed response of person $n$ to item $i$,

        $E_{ni} = \sum_{k=0}^{m} k\,\pi_{nik}$, the expected response of person $n$ to item $i$,

        $W_{ni} = \sum_{k=0}^{m} (k - E_{ni})^2\,\pi_{nik}$, the model variance of that response,

        $k$ = the scored responses, ranging from 0 to $m$, and

        $\pi_{nik}$ = the model-based probability that person $n$ will have an observed response in

                                     category $k$.

Unweighted mean squared fit statistics for items are computed as the average of the squared

standardized residuals across all persons associated with that item,


      $$\mathrm{UMS}_i = \frac{\sum_{n=1}^{N} z_{ni}^2}{N}. \tag{2}$$

Weighted mean squared fit statistics for items are computed as the average of the squared

standardized residuals across all persons associated with that item, each squared standardized

residual weighted by its variance,


      $$\mathrm{WMS}_i = \frac{\sum_{n=1}^{N} z_{ni}^2\,W_{ni}}{\sum_{n=1}^{N} W_{ni}}. \tag{3}$$
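
The unweighted and weighted mean squares defined above can be illustrated with a minimal Python sketch for the dichotomous case, where the expected response reduces to the Rasch probability and the variance to p(1 - p). The function names and the small example data are hypothetical and are not taken from the study; parameter values are treated as known rather than estimated.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_mean_squares(responses, thetas, b):
    """Return (UMS, WMS) for one item.

    responses: 0/1 scores, one per person
    thetas:    person abilities in logits, same order as responses
    b:         item difficulty in logits
    """
    sum_z2 = 0.0        # sum of squared standardized residuals
    sum_sq_resid = 0.0  # sum of z^2 * W, i.e., squared raw residuals
    sum_w = 0.0         # sum of model variances
    for x, theta in zip(responses, thetas):
        e = rasch_prob(theta, b)  # expected response E_ni
        w = e * (1.0 - e)         # model variance W_ni
        z2 = (x - e) ** 2 / w     # squared standardized residual
        sum_z2 += z2
        sum_sq_resid += z2 * w
        sum_w += w
    return sum_z2 / len(responses), sum_sq_resid / sum_w

# Hypothetical example: five persons responding to one item of difficulty 0.
thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]
responses = [0, 0, 1, 1, 1]
ums, wms = item_mean_squares(responses, thetas, b=0.0)
print(round(ums, 3), round(wms, 3))
```

Because each squared residual is multiplied by its variance, responses from persons located far from the item contribute less to WMS than to UMS.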

Each of these statistics can also be standardized via the Wilson-Hilferty cube root transformation

(Wilson & Hilferty, 1931) to obtain the standardized unweighted and weighted mean square fit

statistics (ZUMS and ZWMS) (Wright & Masters, 1982). Analogous person fit statistics are obtained

by averaging the unweighted or weighted squared standardized residuals for a particular person

across all associated items.
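
As a hedged illustration of the standardization step, the sketch below assumes the cube-root form commonly attributed to Wright and Masters (1982), t = (MS^(1/3) - 1)(3/q) + q/3, where q is the model standard deviation of the mean square. The expressions used for q in the dichotomous case, the function names, and the example probabilities are assumptions made for illustration and should be checked against that source before use.

```python
import math

def standardize_mean_square(ms, q):
    """Cube-root (Wilson-Hilferty) transformation of a mean square with model SD q."""
    return (ms ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

def dichotomous_q(probs):
    """Assumed model SDs (unweighted, weighted) of an item mean square.

    probs: model probabilities of success, one per person, for a 0/1 item.
    """
    n = len(probs)
    w = [p * (1 - p) for p in probs]                             # variances
    c = [p * (1 - p) * ((1 - p) ** 3 + p ** 3) for p in probs]   # 4th central moments
    var_unweighted = sum(ci / wi ** 2 for ci, wi in zip(c, w)) / n ** 2 - 1.0 / n
    var_weighted = sum(ci - wi ** 2 for ci, wi in zip(c, w)) / sum(w) ** 2
    return math.sqrt(var_unweighted), math.sqrt(var_weighted)

# Hypothetical probabilities for four persons on one item.
q_u, q_w = dichotomous_q([0.2, 0.4, 0.6, 0.8])
zums = standardize_mean_square(1.30, q_u)
zwms = standardize_mean_square(1.30, q_w)
print(round(zums, 2), round(zwms, 2))
```

Note that q shrinks toward zero when every probability is near 0.50, so the transformation is unstable for items that are perfectly targeted to all persons.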

      Historically, rule-of-thumb upper and lower limits for acceptable mean square fit values

have been established for flagging items, such as 0.70 and 1.30 for multiple-choice items and

0.60 and 1.40 for rating scales (Wright & Linacre, 1994). Unfortunately, simulation studies have

shown that these rule-of-thumb values may be inappropriate for many applied situations (Smith,

1991; Smith, Schumacker, & Bush, 1998; Wang & Chen, 2005). Hence, users are faced with a

quandary. How does one interpret a fit statistic if the distribution of the values of that statistic,

and hence the range of reasonable values, is not known?

      This article compares rule-of-thumb critical values with analogous bootstrap critical

values for identifying model-data misfit. Specifically, we generate simulated data that

contain violations of the common item slope and zero lower asymptote assumptions upon which

the Rasch model is based, apply bootstrap and rule-of-thumb critical values to identify

simulated cases of misfit, and compare the Type I error rates and statistical power of these

two methods for specifying item fit critical values while altering sample size, test length,

and item/person distribution offset.

Rasch Fit Research

     It has long been known that the null distributions of commonly employed fit statistics do

not follow a distribution with a known parametric form (Karabatsos, 2000; Molenaar & Hoijtink,

1990; Smith, 1988, 1991; Wang & Chen, 2005). The variability of the null distributions of the mean

square fit indices varies as a function of the number of observations and the shapes of the

distributions of persons and items. In addition, distributions of the standardized mean square fit

statistics deviate from the assumed mean and variance of 0 and 1, respectively, when their

computation is based on estimated person and item parameters (Smith, 1991). As an example of

the problems associated with interpreting fit indices, in data simulated to fit the Rasch

dichotomous model, appropriate critical values for UMS may vary from 0.75 to 1.30 for lower

and upper critical values, respectively, for relatively small sample sizes (150) and 0.95 to 1.10

for larger sample sizes (1000) (Smith, Schumacker, & Bush, 1998). Similar variable ranges have

been observed for ZUMS. Hence, any interpretation of those statistics cannot rely on a single set of

critical values and must instead take into account characteristics of the dataset (Smith,

Schumacker, & Bush, 1998). Unfortunately, this fact is typically ignored in Rasch measurement.


     Adjustments have been proposed for the deviations of empirical fit statistics from their

assumed distributions. For example, Smith (1991) conducted a simulation study to determine the

distributional properties of the mean-squared item fit statistics, and he determined that an

adjustment of these indices—one that takes into account the number of items, number of persons,

and the offset between the item difficulty and abilities of the persons responding to that item—

produces indices that are closer to hypothesized distributions. A follow up study (Smith,

Schumacker, & Bush, 1998) yielded a simpler recommended correction for cases where the

means of the person ability and item difficulty distributions were comparable. Specifically, they

recommend critical values for WMS equal to

      $$\mathrm{WMS}^{*} = 1 \pm \frac{2}{\sqrt{N}}, \tag{4}$$

and for UMS equal to

      $$\mathrm{UMS}^{*} = 1 \pm \frac{6}{\sqrt{N}}. \tag{5}$$

These adjustments suggest upper critical values for item fit equal to 1.20, 1.09, and 1.06 for

WMS when sample sizes equal 100, 500, and 1000, respectively—values considerably smaller

than the rule-of-thumb.
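
The arithmetic behind these values can be checked with a short sketch. It assumes the corrections take the form 1 + 2/√N for WMS and 1 + 6/√N for UMS at the upper limit, as we understand the Smith, Schumacker, and Bush (1998) recommendation; the function name is hypothetical.

```python
import math

def corrected_upper_critical_value(n, multiplier):
    """Upper critical value of the assumed form 1 + multiplier / sqrt(n)."""
    return 1.0 + multiplier / math.sqrt(n)

for n in (100, 500, 1000):
    wms_upper = corrected_upper_critical_value(n, 2.0)  # weighted mean square
    ums_upper = corrected_upper_critical_value(n, 6.0)  # unweighted mean square
    print(n, round(wms_upper, 2), round(ums_upper, 2))
# 100 1.2 1.6
# 500 1.09 1.27
# 1000 1.06 1.19
```

Under this form, the WMS limits reproduce the 1.20, 1.09, and 1.06 values cited above, while the UMS limits are wider at every sample size.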

        More recently, Dimitrov and Smith (2006) evaluated an adjustment of fit statistics that

replaces the Rasch estimate of the probability of a particular response with a more accurate

estimate proposed by Van den Wollenberg (1982), finding that the adjustment resulted in a small

but consistent improvement in Type II error rates. Similarly, research conducted by Stone (Stone,

2003; Stone & Zhang, 2003) evaluated a Bayesian adjustment for item-response model

parameter estimates in the analysis of fit and found an improvement in both Type I and Type II

error rates, particularly for small sample sizes. An alternative to these approaches, which adjust

the values of the fit statistics or their critical values, is to utilize non-parametric measures of fit.

Karabatsos (2003) found that, of 36 person fit statistics, four of the five indices that were best at

identifying aberrant person responses were non-parametric indices. Unfortunately, neither

adjusted nor non-parametric fit indices are readily available in commercial measurement

packages, and they will likely not make their way into measurement practice until adopted by

software publishers.

     Bootstrap procedures for identifying critical values for fit statistics are easy to implement

and, in fact, have been implemented in three separate programs that interface with commercial

software to identify reasonable fit statistics, given a set of estimated item and person parameters

(Stone, 2007; Su, Sheu, & Wang, 2007; Wolfe, 2008). The remainder of this article explains how

those procedures are implemented and compares decisions based on those procedures to those

based on traditional rule-of-thumb values when evaluating cases of model-data misfit to the

Rasch model.

Bootstrap Procedure

     Efron (1979) described the nonparametric bootstrap as a method for estimating the

sampling distribution of a random variable through empirical resampling methods. Specifically,

the nonparametric bootstrap constructs an empirical estimate of the unknown sampling

distribution by generating a probability distribution of the statistic across a large number of

resamplings of an original sample via sampling with replacement. That is, the discrete and

empirical distribution of the original observed sample is treated as a population from which a

large number (B, typically about 1,000) of resamples of size N are drawn repeatedly. The statistic

of interest is computed for each of these resamples, and the distribution of these statistics serves

as an empirical estimate of the sampling distribution of the statistic. A similar procedure, known

as parametric bootstrapping, can be performed by resampling from a hypothetical distribution

(e.g., a normal distribution) rather than from a single empirical sample. Bootstrap methods such

as these are known to produce sampling distribution estimates that exhibit bias, spread, and

shape similar to that of the parametric sampling distribution, but the empirical methods are not

subject to the assumptions that are required by parametric methods for estimating sampling

distributions (Hesterberg, Moore, Monaghan, Clipson, & Epstein, 2005).
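
The nonparametric bootstrap described above can be sketched in a few lines of Python. This is a generic illustration rather than code from any of the programs cited in this paper: the sample values, the choice of the mean as the statistic, and the linear-interpolation percentile rule are all assumptions.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_resamples=1000, seed=0):
    """Statistic computed on n_resamples resamples of size N drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [
        statistic([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    ]

def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical observed sample of size N = 10.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 3.9, 4.4, 2.5]
boot_means = bootstrap_distribution(sample, statistics.mean)
lower, upper = percentile(boot_means, 2.5), percentile(boot_means, 97.5)
print(round(lower, 2), round(upper, 2))
```

The list `boot_means` plays the role of the empirical estimate of the sampling distribution, and its 2.5th and 97.5th percentiles give a 95% percentile interval for the statistic.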

     In the context of item response models, the bootstrap procedure can be extended to

determine the shapes of null distributions for fit statistics by computing fit statistics from datasets

that are generated to fit the model in question. In this context, the analyst would (1) estimate item

and person parameters based on the original sample, (2) randomly select values of item and

person parameters from those estimated values, (3) for each of the B resamples, generate

simulated datasets that fit the item response model, (4) compute the statistic of interest for each

of the resamples, (5) compute averages of the statistic of interest across the B resamples, and (6)

compare the value of the statistic of interest to the averaged bootstrap values. Such a comparison

depicts the degree to which the values of the statistic in question produced by the original data

deviate from those that would be observed if the data demonstrated expected fit to the item

response model.
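
The steps above can be sketched for the dichotomous Rasch model as follows. This is a simplified illustration under stated assumptions: the "estimates" are hypothetical values drawn from N(0, 1), fit statistics are computed from the resampled generating parameters rather than from parameters re-estimated in each replicate, and only the weighted mean square is computed. All function names are hypothetical.

```python
import math
import random

def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_rasch(thetas, bs, rng):
    """One persons-by-items matrix of 0/1 responses that fits the Rasch model."""
    return [[1 if rng.random() < rasch_prob(t, b) else 0 for b in bs] for t in thetas]

def weighted_mean_squares(data, thetas, bs):
    """WMS for every item, using the supplied parameters in place of re-estimates."""
    result = []
    for i, b in enumerate(bs):
        num = den = 0.0
        for row, t in zip(data, thetas):
            p = rasch_prob(t, b)
            num += (row[i] - p) ** 2   # squared raw residual
            den += p * (1.0 - p)       # model variance
        result.append(num / den)
    return result

def bootstrap_wms(theta_hat, b_hat, n_boot=200, seed=1):
    """Pooled WMS values from n_boot model-fitting datasets."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # Resample parameter values with replacement from the estimates.
        thetas = [rng.choice(theta_hat) for _ in theta_hat]
        bs = [rng.choice(b_hat) for _ in b_hat]
        data = simulate_rasch(thetas, bs, rng)
        draws.extend(weighted_mean_squares(data, thetas, bs))
    return draws

# Hypothetical estimates for 100 persons and 10 items.
rng = random.Random(0)
theta_hat = [rng.gauss(0.0, 1.0) for _ in range(100)]
b_hat = [rng.gauss(0.0, 1.0) for _ in range(10)]
null_wms = sorted(bootstrap_wms(theta_hat, b_hat))
lower = null_wms[int(0.025 * len(null_wms))]
upper = null_wms[int(0.975 * len(null_wms))]
print(round(lower, 2), round(upper, 2))
```

An observed WMS outside the interval from `lower` to `upper` would be more extreme than most values produced by data that fit the model.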


                                       Method

       In this study, data were generated according to the Rasch dichotomous model for non-

studied items and the three-parameter logistic model for studied items. These data were then

scaled to the Rasch dichotomous model, and bootstrap and rule-of-thumb critical values were

applied to the original estimated model fit indices in order to assess the accuracy of model-data

fit diagnosis in the presence and absence of violated model assumptions. In addition, we altered

the number of simulated examinees, number of items, and item difficulty/person ability

distribution mean offset in the simulated data. For each cell of the experimental design, the

proportion of items that satisfy the Rasch model assumptions that were flagged for misfit (Type I

errors) and the proportion of items modeled to violate assumptions that were not flagged for

misfit (Type II errors) were compared using bootstrap and rule-of-thumb critical values.
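
A minimal sketch of this kind of data generation is shown below. It assumes a three-parameter logistic function without a scaling constant, places the studied item in the first column, and uses hypothetical function names; the study itself does not specify these details.

```python
import math
import random

def p_3pl(theta, b, a=1.0, c=0.0):
    """3PL probability of success; a = 1 and c = 0 reduce it to the Rasch model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def generate_data(n_persons, n_items, offset, slope, asymptote, seed=0):
    """Simulate 0/1 responses with one studied (3PL) item and Rasch non-studied items."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]  # abilities ~ N(0, 1)
    bs = [rng.gauss(offset, 1.0) for _ in range(n_items)]     # difficulties ~ N(offset, 1)
    # Item 0 is the studied item; all other items keep slope 1 and asymptote 0.
    params = [(bs[0], slope, asymptote)] + [(b, 1.0, 0.0) for b in bs[1:]]
    data = [
        [1 if rng.random() < p_3pl(t, b, a, c) else 0 for (b, a, c) in params]
        for t in thetas
    ]
    return data, thetas, bs

# One hypothetical data file: 200 examinees, 20 items, offset 0, studied item
# with slope 2.00 and lower asymptote 0.25.
data, thetas, bs = generate_data(200, 20, offset=0.0, slope=2.0, asymptote=0.25)
print(len(data), len(data[0]))  # 200 20
```

The resulting matrix would then be passed to Rasch estimation software, which is outside the scope of this sketch.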


     The independent variables manipulated in this study included the number of simulated

examinees [100, 200, 500, 1000] and the number of simulated items [20, 40, 80, 160]. We also

varied the difference in the means of the simulated examinee ability and item difficulty

distributions [-1.00, 0.00, 1.00]. The nature of misfit for the single studied item in each simulated

data file was controlled by altering the generating item slope and lower asymptote. For non-

studied items, the generating item slope and lower asymptote were set to 1.00 and 0.00,

respectively. For the studied item, slope could take on the values of [0.50, 1.00, 2.00], and the

lower asymptote could take on the values of [0.00, 0.25]. We did not generate data for the

studied item that conformed to the Rasch model assumptions (i.e., slope = 1.00 and lower

asymptote = 0.00). This resulted in an experimental design containing 240 cells: 4 sample sizes ×

4 test lengths × 3 distribution offsets × 5 combinations of model-data misfit. Each cell was

replicated 50 times.
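
The crossing of factors can be enumerated directly, which also confirms the cell count: 4 × 4 × 3 × 5 = 240 cells and, at 50 replications per cell, the 12,000 replications cited in the table notes. The variable names below are illustrative only.

```python
from itertools import product

sample_sizes = [100, 200, 500, 1000]
test_lengths = [20, 40, 80, 160]
offsets = [-1.00, 0.00, 1.00]

# Slope-by-asymptote combinations for the studied item, excluding the
# combination that conforms to the Rasch model (slope 1.00, asymptote 0.00).
misfit_types = [
    (slope, asymptote)
    for slope, asymptote in product([0.50, 1.00, 2.00], [0.00, 0.25])
    if (slope, asymptote) != (1.00, 0.00)
]

cells = list(product(sample_sizes, test_lengths, offsets, misfit_types))
replications_per_cell = 50
print(len(misfit_types), len(cells), len(cells) * replications_per_cell)
# 5 240 12000
```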

Simulation Process

     Examinee ability was generated from a N(0,1) distribution, and item difficulty was

generated from a N(μ, 1) distribution, where μ varied depending on the level of the distribution

offset variable. Once the original data were generated, parameters and fit indices were estimated

for the simulated data file using Winsteps (Linacre, 2009). Based on those parameter estimates,

50 bootstrap samples (i.e., sampling from the estimated item and ability parameter values with

replacement) were generated according to the Rasch dichotomous model. For each bootstrap

sample, parameters and fit indices were estimated using Winsteps, and the 2.5th and 97.5th

percentile values of each fit index (UMS, ZUMS, WMS, and ZWMS) within each bootstrap data

file were determined. Bootstrap critical values were determined by averaging the 2.5th and 97.5th

percentile values for each fit index across the 50 bootstrap samples that were generated for each

original data file.
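
The averaging of percentile values across bootstrap samples can be sketched as follows. The fit values here are hypothetical stand-ins drawn from a normal distribution rather than Winsteps output, and the linear-interpolation percentile rule is an assumption, since the percentile definition used in the study is not stated.

```python
import random

def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def bootstrap_critical_values(fit_by_sample):
    """Average the within-sample 2.5th and 97.5th percentiles across bootstrap samples.

    fit_by_sample: one list of item fit values per bootstrap data file.
    """
    lowers = [percentile(values, 2.5) for values in fit_by_sample]
    uppers = [percentile(values, 97.5) for values in fit_by_sample]
    return sum(lowers) / len(lowers), sum(uppers) / len(uppers)

# Hypothetical stand-in: 50 bootstrap data files, each with 40 item fit values.
rng = random.Random(42)
fit_by_sample = [[rng.gauss(1.0, 0.1) for _ in range(40)] for _ in range(50)]
lcv, ucv = bootstrap_critical_values(fit_by_sample)
print(round(lcv, 2), round(ucv, 2))
```

The same function would be applied separately to each fit index to obtain index-specific lower and upper critical values for an original data file.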

      Rule-of-thumb and bootstrap critical values were applied to the studied and non-studied

item fit statistics produced for each original data file. Each non-studied item was compared to the

rule-of-thumb and bootstrap critical values, and the item was declared to exhibit misfit (a Type I

error) if the value of the estimated item fit index was more extreme than the critical value limits.

Misfit decisions for the non-studied items were coded 0 (not declared to misfit) or 1 (declared to

misfit), and these codes were averaged within each data file (across non-studied items) to

determine the within data file Type I error rate. Type I error rates within cells of the experimental

design were determined by averaging these within data set Type I error rates. Similarly, the

estimated fit value for each studied item was compared to the rule-of-thumb and bootstrap

critical values, and the item was declared to misfit if the estimated value was more extreme than

the critical value limits. Misfit decisions for the studied items were coded 0 (declared to misfit)

or 1 (not declared to misfit). Type II error rates within cells of the experimental design were

determined by averaging these 0/1 codes across data sets within a cell of the experimental design.
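
The coding scheme can be made concrete with a small sketch. The fit values below are hypothetical; the critical values are the rule-of-thumb limits (0.70, 1.30) and the mean bootstrap limits for WMS reported in Table 1 (0.84, 1.21). Function names are illustrative.

```python
def is_flagged(value, lower, upper):
    """An item is declared to misfit when its fit value falls outside the limits."""
    return value < lower or value > upper

def type_i_rate(non_studied_values, lower, upper):
    """Proportion of non-studied (model-fitting) items flagged within one data file."""
    codes = [1 if is_flagged(v, lower, upper) else 0 for v in non_studied_values]
    return sum(codes) / len(codes)

def type_ii_code(studied_value, lower, upper):
    """0 if the studied (misfitting) item is flagged, 1 if it is missed."""
    return 0 if is_flagged(studied_value, lower, upper) else 1

# Hypothetical WMS values for one data file.
non_studied = [0.95, 1.02, 1.25, 0.88, 1.10]
studied = 1.27

print(type_i_rate(non_studied, 0.70, 1.30), type_ii_code(studied, 0.70, 1.30))
# 0.0 1
print(type_i_rate(non_studied, 0.84, 1.21), type_ii_code(studied, 0.84, 1.21))
# 0.2 0
```

Averaging the Type I rates and the Type II codes over the data files in a cell gives the cell-level error rates described above.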



                                       Results

      Table 1 provides the descriptive statistics, collapsed across cells of the experimental

design, for each fit index from the original and bootstrap samples and the upper and lower

critical values generated through the bootstrap process. It is clear that the bootstrap samples

exhibited fit index distributions that were nearly identical to those obtained from the original

samples from which the bootstraps were drawn. It is also interesting to note that the bootstrap

lower and upper critical values are considerably narrower than the rule-of-thumb values that are

typically adopted in practice. For example, the UMS and WMS intervals range from lows around

0.85 to highs around 1.20, compared to rule-of-thumb values of 0.70 and 1.30. Similarly, the

bootstrap critical values for the standardized versions of these fit indices are considerably

narrower than the rule-of-thumb values of ±2.00.

     Table 2 presents the Type I and Type II error rates based on the rule-of-thumb and the

bootstrap critical values. Rule-of-thumb critical values produced Type I error rates that were

considerably less than what is typically adopted as a desired error rate (e.g., 0.05). In all cases,

bootstrap critical values produced higher Type I error rates, although those error rates were

closer to the optimal rate of 0.05 for UMS and ZUMS. The bootstrap critical values for WMS

and ZWMS, on the other hand, produced Type I error rates that were considerably higher than

expected. We conducted general linear modeling analyses to determine whether the independent

variables were related to the WMS Type I error rates produced by the rule-of-thumb and

bootstrap critical values.1 The five-way model for the bootstrap critical values produced an R-

squared value of .04, suggesting that the Type I error rates for those critical values were not

influenced by sample size, test length, distribution offset, guessing, or item discrimination.

     The five-way model for the rule-of-thumb critical values, on the other hand, produced an

R-squared value of .32. In this model, the sample size-by-test length interaction was statistically

significant, although the effect size (based on the Type III sum of squares) was not substantial;

$F_{1,11968} = 10.24$, $p = .001$, $\eta^2 = .0006$. Both of the main effects for these variables were also

statistically significant with small effect sizes. Figure 1 displays the two-way interaction between

sample size and test length as they relate to the Type I error rates produced by the rule-of-thumb

critical values. As test length and sample size increase, Type I error rates decrease, and they

are highest for short tests administered to small samples. When sample size is large, test

length does not have a profound impact on Type I error rate.

      Concerning the Type II error rates shown in Table 2, it is clear that the bootstrap critical

values produced considerably more powerful misfit decisions. In fact, the rule-of-thumb critical

values did not identify a single case of the misfit that we simulated. With the bootstrap critical

values, on the other hand, statistical power ranged from a low of .64 (ZUMS) to a high of .72 (ZWMS).

We conducted a logistic regression to determine whether bootstrap Type II error rate was

associated with the independent variables that we simulated. Those analyses indicated that the

three-way interaction between distribution offset, slope, and lower asymptote was not

statistically significant but did have a large effect size;

$\chi^2_1 = 0.57$, $p = .45$, OR = 2.35. Figure 2 summarizes the relevant mean WMS Type II error rates

based on the bootstrap critical values for the three-way interaction between distribution offset,

studied item slope, and studied item lower asymptote. That figure indicates that when slope

equals 0.50, Type II error rates for WMS range between 0.16 and 0.32, with the error rate being

fairly consistent across levels of distribution offset. Similarly, when slope equals 2.00 and the

asymptote equals 0.00, Type II error rates for WMS are at their lowest, ranging from 0.08 to 0.13,

and are also consistent across levels of item distribution offset. However, when items exhibited

high discrimination (i.e., slope = 2.00) and the lower asymptote equaled 0.25, the Type II error

rate at an item distribution offset of -1.00 (i.e., items that were relatively easy for the

examinees) increased slightly when compared to offsets of 0.00 and 1.00. On the other hand, when

items had moderate discrimination (i.e., slope = 1.00) and the lower asymptote equaled 0.25,

increasing item difficulty (i.e., offset = 1.00) increased the Type II error rate considerably. In

fact, these Type II error rates were the highest observed, ranging from a low of .44 to a high of .70.


                                       Discussion

     Our results indicate that bootstrap critical values allow for greater statistical power in

diagnosing item misfit caused by varying item slopes and lower asymptotes. Rule-of-thumb

critical values were generally wider than those produced by bootstrap procedures, and the

validity of those critical values varied as a function of sample size and test length, which is

consistent with previous research conducted by Smith (1988; 1991). In our simulations, the rule-

of-thumb critical values did not detect any of the simulated item misfit. On the other hand,

bootstrap critical values produced relatively low Type II error rates for all combinations of misfit

except one. Specifically, the average Type II error rate was around .30 for all four fit indices, and

was greater than 0.50 only when the misfitting item was modeled to exhibit moderate

discrimination and guessing.

     Our study is limited because we conducted a relatively small number of iterations per cell

of the experimental design (50) and conducted a relatively small number of bootstraps per

iteration (also 50). Our results are also limited by the fact that we simulated only a single

misfitting item in each data file. While this approach is consistent with the typical methodology

utilized in studies of differential item functioning, it is unlikely that this mimics real world

applications of the Rasch model to dichotomous data. Future studies should consider the

proportion of misfitting items as a potential independent variable.
                                       References


Dimitrov, D.M., & Smith, R.M. (2006). Adjusted Rasch person-fit statistics. Journal of Applied

       Measurement, 7, 170-183.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7,


Hesterberg, T., Moore, D.S., Monaghan, S., Clipson, A., & Epstein, R. (2005). Bootstrap

       methods and permutation tests. 2nd ed., from

Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied

       Measurement, 1, 152-176.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six

       person fit statistics. Applied Measurement in Education, 16, 277-298.

Linacre, J.M. (2009). WINSTEPS Rasch measurement computer program. Chicago:

Molenaar, I.W., & Hoijtink, H. (1990). The many null distributions of person fit indices.

       Psychometrika, 55, 75-106.

Smith, R.M. (1988). The distributional properties of Rasch standardized residuals. Educational

       and Psychological Measurement, 48, 657-667.

Smith, R.M. (1991). The distributional properties of Rasch item fit statistics. Educational and

       Psychological Measurement, 51, 541-565.

Smith, R.M. (2000). Fit analysis in latent trait measurement models. Journal of Applied

       Measurement, 1, 199-218.

Smith, R.M., Schumacker, R.E., & Bush, M.J. (1998). Using item mean squares to evaluate fit to

       the Rasch model. Journal of Outcome Measurement, 2, 66-78.

Stone, C.A. (2007). IRTFIT_RESAMPLE: A computer program for assessing goodness of fit of

       item response theory models based on posterior expectations. Applied Psychological

       Measurement, 28, 143-144.

Stone, C.A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A

       comparison of traditional and alternative procedures. Journal of Educational

       Measurement, 40, 331-352.

Su, Y.H., Sheu, C.F., & Wang, W.C. (2007). Computing Confidence Intervals of Item Fit in the

       Family of Rasch Models Using the Bootstrap Method. Journal of Applied Measurement,

       8, 190-203.

Van den Wollenberg, A.L. (1982). Two tests statistics for the Rasch model. Psychometrika, 47,


Wang, W.C., & Chen, C.T. (2005). Item parameter recovery, standard error estimates, and fit

       statistics of the Winsteps program for the family of Rasch models. Educational and

       Psychological Measurement, 65, 376-404.

Wilson, E.B., & Hilferty, M.M. (1931). The distribution of chi-square. Proceedings of the

       National Academy of Sciences of the United States of America, 17, 684-688.

Wolfe, E.W. (2008). (Rasch Bootstrap Fit): A SAS macro for estimating critical values

       for Rasch model fit statistics. Applied Psychological Measurement, 32, 585-586.

Wright, B.D., & Linacre, M. (1994). Reasonable mean-square fit values. Rasch Measurement

       Transactions, 8, 370.

Wright, B.D., & Masters, G.N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL:

                                       Footnotes


1        In this article, we report only the results of analysis of the WMS index, chosen primarily

because of its high level of statistical power in the bootstrap analyses. However, the results that

we report are consistent with those obtained for the remaining three fit indices considered in this

study.


Table 1

Descriptive statistics for item fit indices

  Index    Statistic    Original    Bootstrap    Bootstrap    Bootstrap
                        Samples     Samples      LCV          UCV

  UMS      Mean           1.00        1.00         0.90         1.12
           SE              --          --          0.02         0.03
           SD             0.05        0.06         0.04         0.05
           Minimum        0.88        0.87         0.75         1.05
           Maximum        1.12        1.17         0.96         1.29

  ZUMS     Mean          -0.03        0.00        -1.45         1.58
           SE              --          --          0.44         0.48
           SD             0.79        0.91         0.14         0.15
           Minimum       -1.97       -2.26        -1.96         1.04
           Maximum        1.90        2.89        -0.93         2.17

  WMS      Mean           1.00        1.00         0.84         1.21
           SE              --          --          0.09         0.17
           SD             0.13        0.14         0.07         0.11
           Minimum        0.72        0.69         0.52         1.06
           Maximum        1.48        1.52         1.02         2.20

  ZWMS     Mean          -0.03        0.00        -1.24         1.41
           SE              --          --          0.57         0.72
           SD             0.88        0.99         0.14         0.16
           Minimum       -1.89       -2.17        -1.72         0.76
           Maximum        2.32        3.20        -0.72         2.06
Note: These represent the averaged values, across 12,000 replications, of the within cell
descriptive statistics. LCV = Lower critical value. UCV = Upper critical value.

Table 2

Error rates

                     Type I                        Type II
  Index     ROT CVs    Bootstrap CVs     ROT CVs    Bootstrap CVs

  UMS        0.00          0.04           1.00          0.34
  ZUMS       0.02          0.07           1.00          0.36
  WMS        0.03          0.13           1.00          0.31
  ZWMS       0.03          0.13           1.00          0.28
Note: These represent the averaged error rates across 12,000 replications. ROT = rule of thumb.
CV = critical value.

                                      Figure Captions

Figure 1. ROT WMS Type I Error Rate Sample Size-by-Test Length Interaction

Figure 2. Bootstrap WMS Type II Error Rate Displacement by Type of Misfit Interaction