Comparison of Asymptotic and Bootstrap Item Fit Indices in Identifying Misfit to the Rasch Model

Edward W. Wolfe
Michael T. McGill

National Council on Measurement in Education
New Orleans, LA
April 2011
Author Note
Edward W. Wolfe, Research Services, Assessment & Information, Pearson.
Correspondence concerning this article should be addressed to Edward W. Wolfe, 2510
N. Dodge St., Mailstop 125, Iowa City, IA 52245-9945. E-mail: ed.wolfe@pearson.com.
Abstract
Rule-of-thumb critical values are not suitable for flagging items for misfit to the Rasch model
because the null distributions of mean-square fit indices are not well understood. Bootstrap
procedures may be better suited for identifying appropriate critical values, but the accuracy of
those procedures has not been studied. In this study, data were generated according to the
dichotomous Rasch model, and violations of the lower asymptote and common slope
assumptions were introduced into the simulated data while altering the number of examinees,
number of items, and item difficulty/person ability distribution offset. For each cell of the
experimental design, the proportion of items that satisfied the Rasch model assumptions but were flagged for misfit (Type I errors) and the proportion of items modeled to violate assumptions that were not flagged for misfit (Type II errors) under bootstrap critical values were compared to the analogous flag rates under rule-of-thumb and distribution-corrected critical values. Results suggest that Type II errors are much
lower for critical values based on bootstrap procedures and that distribution offset and type of
misfit influence the accuracy of misfit diagnosis.
Keywords: Rasch model, model-data fit, bootstrap
Comparison of Asymptotic and Bootstrap Item Fit Indices
in Identifying Misfit to the Rasch Model
Item fit statistics are commonly used in applications of the Rasch model to aid in the selection of items after field testing or the retention of items in operational contexts. Four of those fit statistics are the weighted and unweighted mean-square fit statistics and the standardized versions of these two statistics (Smith, 2000). The mean-square fit statistics (Wright & Masters, 1982) are based on the standardized residual of the observed response for each person-item combination from the modeled expectation, given the parameter estimates,
$$z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}, \qquad (1)$$

where $x_{ni}$ = the observed response of person n to item i,

$E_{ni} = \sum_{k=0}^{m} k\,\pi_{nik}$, the expected response of person n to item i,

$W_{ni} = \sum_{k=0}^{m} (k - E_{ni})^{2}\,\pi_{nik}$, the variance of the observed response,

k = the scored responses, ranging from 0 to m, and

$\pi_{nik}$ = the model-based probability that person n will have an observed response in category k.
Unweighted mean-square fit statistics for items are computed as the average of the squared standardized residuals across all persons associated with that item,

$$UMS_{i} = \frac{\sum_{n=1}^{N} z_{ni}^{2}}{N}. \qquad (2)$$
Weighted mean-square fit statistics for items are computed as the average of the squared standardized residuals across all persons associated with that item, each squared standardized residual weighted by its variance,

$$WMS_{i} = \frac{\sum_{n=1}^{N} z_{ni}^{2} W_{ni}}{\sum_{n=1}^{N} W_{ni}}. \qquad (3)$$
Each of these statistics can also be standardized via the Wilson-Hilferty cube root transformation
(Wilson & Hilferty, 1931) to obtain the standardized unweighted and weighted mean square fit
statistics (ZUMS and ZWMS) (Wright & Masters, 1982). Analogous person fit statistics are obtained
by averaging the unweighted or weighted squared standardized residuals for a particular person
across all associated items.
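To make these definitions concrete, the following sketch computes UMS and WMS for dichotomous responses directly from Equations 1 through 3, along with a generic form of the cube-root standardization. This is our own minimal Python rendering, not code drawn from any of the programs cited below; the function names are illustrative, and the standardization assumes that the modeled standard deviation q of the mean square is supplied by the caller.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(x = 1 | theta, b) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def mean_square_fit(X, theta, b):
    """Return per-item UMS and WMS (Equations 2 and 3).

    X     : N x I matrix of 0/1 responses
    theta : length-N vector of person ability estimates
    b     : length-I vector of item difficulty estimates
    """
    E = rasch_prob(theta, b)                      # expected response E_ni
    W = E * (1.0 - E)                             # variance W_ni of the response
    Z2 = (X - E) ** 2 / W                         # squared standardized residuals (Eq. 1)
    ums = Z2.mean(axis=0)                         # Equation 2: unweighted mean square
    wms = (Z2 * W).sum(axis=0) / W.sum(axis=0)    # Equation 3: weighted mean square
    return ums, wms

def wilson_hilferty_z(ms, q):
    """Cube-root standardization of a mean square with modeled SD q,
    yielding an approximately N(0, 1) statistic (Wilson & Hilferty, 1931)."""
    return (ms ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0
```

Analogous person fit values would follow by averaging across items (axis=1) rather than across persons.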
Historically, rule-of-thumb upper and lower limits for acceptable mean square fit values
have been established for flagging items, such as 0.70 and 1.30 for multiple-choice items and
0.60 and 1.40 for rating scales (Wright & Linacre, 1994). Unfortunately, simulation studies have
shown that these rule-of-thumb values may be inappropriate for many applied situations (Smith,
1991; Smith, Schumacker, & Bush, 1998; Wang & Chen, 2005). Hence, users are faced with a
quandary. How does one interpret a fit statistic if the distribution of the values of that statistic,
and hence the range of reasonable values, is not known?
This article compares the application of these rule-of-thumb critical values with that of analogous bootstrap critical values in the identification of model-data misfit. Specifically, we generate simulated data that contain violations of the common item slope and zero lower asymptote assumptions upon which the Rasch model is based, apply bootstrap and rule-of-thumb critical values to identify simulated cases of misfit, and compare the Type II error rates and statistical
power of these two methods for specifying item fit critical values while altering sample size, test
length, and item/person distribution offset.
Rasch Fit Research
It has long been known that the null distributions of commonly employed fit statistics do
not follow a distribution with a known parametric form (Karabatsos, 2000; Molenaar & Hoijtink,
1990; Smith, 1988, 1991; Wang & Chen, 2005). The variability of the null distributions of the mean-square fit indices varies as a function of the number of observations and the shapes of the
distributions of persons and items. In addition, distributions of the standardized mean square fit
statistics deviate from the assumed mean and variance of 0 and 1, respectively, when their
computation is based on estimated person and item parameters (Smith, 1991). As an example of
the problems associated with interpreting fit indices, in data simulated to fit the Rasch
dichotomous model, appropriate critical values for UMS may vary from 0.75 to 1.30 for lower
and upper critical values, respectively, for relatively small sample sizes (150) and 0.95 to 1.10
for larger sample sizes (1000) (Smith, Schumacker, & Bush, 1998). Similar variable ranges have
been observed for ZUMS. Hence, any interpretation of those statistics cannot rely on a single set of
critical values and must instead take into account characteristics of the dataset (Smith,
Schumacker, & Bush, 1998). Unfortunately, this fact is typically ignored in Rasch measurement
applications.
Adjustments have been proposed for the deviations of empirical fit statistics from their
assumed distributions. For example, Smith (1991) conducted a simulation study to determine the distributional properties of the mean-square item fit statistics, and he determined that an adjustment of these indices (one that takes into account the number of items, the number of persons, and the offset between the difficulty of an item and the abilities of the persons responding to it) produces indices that are closer to their hypothesized distributions. A follow-up study (Smith,
Schumacker, & Bush, 1998) yielded a simpler recommended correction for cases where the
means of the person ability and item difficulty distributions were comparable. Specifically, they
recommend critical values for WMS equal to
$$WMS^{*} = 1 + \frac{2}{\sqrt{N}}, \qquad (4)$$

and for UMS equal to

$$UMS^{*} = 1 + \frac{6}{\sqrt{N}}. \qquad (5)$$
These adjustments suggest upper critical values for item fit equal to 1.20, 1.09, and 1.06 for
WMS when sample sizes equal 100, 500, and 1000, respectively—values considerably smaller
than the rule-of-thumb.
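As a quick check, these corrections can be reproduced in a few lines; this sketch is ours, under the assumption that N denotes the number of persons responding to the item.

```python
import math

def smith_upper_cv(N):
    """Upper critical values from Equations 4 and 5
    (Smith, Schumacker, & Bush, 1998)."""
    return {"WMS": 1 + 2 / math.sqrt(N), "UMS": 1 + 6 / math.sqrt(N)}

for N in (100, 500, 1000):
    print(N, round(smith_upper_cv(N)["WMS"], 2))   # prints 1.2, 1.09, 1.06
```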
More recently, Dimitrov and Smith (2006) evaluated an adjustment of fit statistics that
replaces the Rasch estimate of the probability of a particular response with a more accurate
estimate proposed by Van den Wollenberg (1982), finding that the adjustment resulted in a small
but consistent improvement in Type II error rates. Similarly, research conducted by Stone (Stone,
2003; Stone & Zhang, 2003) evaluated a Bayesian adjustment for item-response model
parameter estimates in the analysis of fit and found an improvement in both Type I and Type II
error rates, particularly for small sample sizes. An alternative to these approaches, which adjust
the values of the fit statistics or their critical values, is to utilize non-parametric measures of fit.
Karabatsos (2003) found that, of 36 person fit statistics, four of the five indices that were best at
identifying aberrant person responses were non-parametric indices. Unfortunately, neither
adjusted nor non-parametric fit indices are readily available in commercial measurement
packages, and they will likely not make their way into measurement practice until adopted by
software publishers.
Bootstrap procedures for identifying critical values for fit statistics are easy to implement
and, in fact, have been implemented in three separate programs that interface with commercial
software to identify reasonable fit statistics, given a set of estimated item and person parameters
(Stone, 2007; Su, Sheu, & Wang, 2007; Wolfe, 2008). The remainder of this article explains how
those procedures are implemented and compares decisions based on those procedures to those
based on traditional rule-of-thumb values when evaluating cases of model-data misfit to the
Rasch model.
Bootstrap Procedure
Efron (1979) described the nonparametric bootstrap as a method for estimating the
sampling distribution of a random variable through empirical resampling methods. Specifically,
the nonparametric bootstrap constructs an empirical estimate of the unknown sampling
distribution by generating a probability distribution of the statistic across a large number of
resamplings of an original sample via sampling with replacement. That is, the discrete and
empirical distribution of the original observed sample is treated as a population from which a
large number (B, typically about 1,000) of resamples of size N are drawn repeatedly. The statistic
of interest is computed for each of these resamples, and the distribution of these statistics serves
as an empirical estimate of the sampling distribution of the statistic. A similar procedure, known
as parametric bootstrapping, can be performed by resampling from a hypothetical distribution
(e.g., a normal distribution) rather than from a single empirical sample. Bootstrap methods such
as these are known to produce sampling distribution estimates that exhibit bias, spread, and
shape similar to that of the parametric sampling distribution, but the empirical methods are not
subject to the assumptions that are required by parametric methods for estimating sampling
distributions (Hesterberg, Moore, Monaghan, Clipson, & Epstein, 2005).
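In code, the core of the nonparametric bootstrap amounts to a short loop. The sketch below is generic and purely illustrative; the sample values, the choice of B, and the statistic of interest are placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def bootstrap_distribution(sample, statistic, B=1000):
    """Approximate the sampling distribution of `statistic` by
    resampling the observed data with replacement B times."""
    n = len(sample)
    return np.array([statistic(rng.choice(sample, size=n, replace=True))
                     for _ in range(B)])

# Example: bootstrap distribution of a sample mean
boot = bootstrap_distribution(np.array([2.1, 3.4, 1.7, 4.0, 2.9]), np.mean)
print(boot.mean(), np.percentile(boot, [2.5, 97.5]))
```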
In the context of item response models, the bootstrap procedure can be extended to
determine the shapes of null distributions for fit statistics by computing fit statistics from datasets
that are generated to fit the model in question. In this context, the analyst would (1) estimate item
and person parameters based on the original sample, (2) randomly select values of item and
person parameters from those estimated values, (3) for each of the B resamples, generate
simulated datasets that fit the item response model, (4) compute the statistic of interest for each
of the resamples, (5) compute averages of the statistic of interest across the B resamples, and (6)
compare the value of the statistic of interest to the averaged bootstrap values. Such a comparison
depicts the degree to which the values of the statistic in question produced by the original data
deviate from those that would be observed if the data demonstrated expected fit to the item
response model.
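A minimal sketch of steps (1) through (6) follows, written for the dichotomous Rasch model and the UMS index. For brevity it computes fit statistics directly from the generating parameters; a full implementation, like the procedure used in this study, would re-estimate parameters for each generated dataset with a Rasch program such as Winsteps.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def bootstrap_ums_limits(theta_hat, b_hat, B=1000, alpha=0.05):
    """Given person/item estimates from the original sample (step 1),
    return averaged lower/upper UMS critical values across B
    Rasch-fitting datasets (steps 2 through 6)."""
    N, I = len(theta_hat), len(b_hat)
    lo, hi = [], []
    for _ in range(B):
        th = rng.choice(theta_hat, size=N, replace=True)      # step 2: resample parameters
        bb = rng.choice(b_hat, size=I, replace=True)
        P = 1.0 / (1.0 + np.exp(-(th[:, None] - bb[None, :])))
        X = (rng.random((N, I)) < P).astype(int)              # step 3: model-fitting data
        ums = ((X - P) ** 2 / (P * (1 - P))).mean(axis=0)     # step 4: UMS per item
        lo.append(np.percentile(ums, 100 * alpha / 2))        # step 5: summarize across B
        hi.append(np.percentile(ums, 100 * (1 - alpha / 2)))
    return np.mean(lo), np.mean(hi)   # step 6: compare observed UMS to these limits
```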
Method
In this study, data were generated according to the Rasch dichotomous model for non-
studied items and the three-parameter logistic model for studied items. These data were then
scaled to the Rasch dichotomous model, and bootstrap and rule-of-thumb critical values were
applied to the original estimated model fit indices in order to assess the accuracy of model-data
fit diagnosis in the presence and absence of violated model assumptions. In addition, we altered
the number of simulated examinees, number of items, and item difficulty/person ability
distribution mean offset in the simulated data. For each cell of the experimental design, the
proportion of items that satisfied the Rasch model assumptions but were flagged for misfit (Type I
errors) and the proportion of items modeled to violate assumptions that were not flagged for
misfit (Type II errors) were compared using bootstrap and rule-of-thumb critical values.
Variables
The independent variables manipulated in this study included the number of simulated
examinees [100, 200, 500, 1000] and the number of simulated items [20, 40, 80, 160]. We also
varied the difference in the means of the simulated examinee ability and item difficulty
distributions [-1.00, 0.00, 1.00]. The nature of misfit for the single studied item in each simulated
data file was controlled by altering the generating item slope and lower asymptote. For non-
studied items, the generating item slope and lower asymptote were set to 1.00 and 0.00,
respectively. For the studied item, slope could take on the values of [0.50, 1.00, 2.00], and the
lower asymptote could take on the values of [0.00, 0.25]. We did not generate data for the
studied item that conformed to the Rasch model assumptions (i.e., slope = 1.00 and lower
asymptote = 0.00). This resulted in an experimental design containing 240 cells: 4 sample sizes ×
4 test lengths × 3 distribution offsets × 5 combinations of model-data misfit. Each cell was
replicated 50 times.
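To make the generating model explicit, the following sketch produces one simulated data file for a single cell of the design (here N = 500, 40 items, offset = 1.00, and a studied item with slope 2.00 and lower asymptote 0.25). The three-parameter logistic form and all names are our own illustrative choices; the ability and difficulty distributions match those described in the next section.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def simulate(theta, b, a=1.0, c=0.0):
    """Dichotomous responses under P(x = 1) = c + (1 - c) / (1 + exp(-a(theta - b))).
    With a = 1.00 and c = 0.00 this reduces to the Rasch model."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

theta = rng.normal(0.0, 1.0, size=500)   # examinee abilities ~ N(0, 1)
b = rng.normal(1.0, 1.0, size=39)        # non-studied difficulties ~ N(mu, 1), offset mu = 1.00
X = np.hstack([simulate(theta, b),                            # 39 Rasch-conforming items
               simulate(theta, np.zeros(1), a=2.0, c=0.25)])  # one misfitting studied item
```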
Simulation Process
Examinee ability was generated from a N(0,1) distribution, and item difficulty was
generated from a N(μ, 1) distribution, where μ varied depending on the level of the distribution
offset variable. Once the original data were generated, parameters and fit indices were estimated
for the simulated data file using Winsteps (Linacre, 2009). Based on those parameter estimates,
50 bootstrap samples (i.e., sampling from the estimated item and ability parameter values with
replacement) were generated according to the Rasch dichotomous model. For each bootstrap
sample, parameters and fit indices were estimated using Winsteps, and the 2.5th and 97.5th
percentile values of each fit index (UMS, ZUMS, WMS, and ZWMS) within each bootstrap data
file were determined. Bootstrap critical values were determined by averaging the 2.5th and 97.5th
percentile values for each fit index across the 50 bootstrap samples that were generated for each
original data file.
Rule-of-thumb and bootstrap critical values were applied to the studied and non-studied
item fit statistics produced for each original data file. Each non-studied item was compared to the
rule-of-thumb and bootstrap critical values, and the item was declared to exhibit misfit (a Type I
error) if the value of the estimated item fit index was more extreme than the critical value limits.
Misfit decisions for the non-studied items were coded 0 (not declared to misfit) or 1 (declared to
misfit), and these codes were averaged within each data file (across non-studied items) to
determine the within data file Type I error rate. Type I error rates within cells of the experimental
design were determined by averaging these within data set Type I error rates. Similarly, the
estimated fit value for each studied item was compared to the rule-of-thumb and bootstrap
critical values, and the item was declared to misfit if the estimated value was more extreme than
the critical value limits. Misfit decisions for the studied items were coded 0 (declared to misfit)
or 1 (not declared to misfit). Type II error rates within cells of the experimental design were
determined by averaging these 0/1 codes across data sets within a cell of the experimental
design.
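The flagging and tallying logic for a single data file and fit index reduces to a comparison against the interval limits, as in this sketch (the names are hypothetical, and one studied item per file is assumed):

```python
import numpy as np

def flag_and_tally(fit_values, lcv, ucv, studied_idx):
    """Return the Type I error rate across non-studied items and the
    Type II code (0/1) for the studied item."""
    flagged = (fit_values < lcv) | (fit_values > ucv)   # more extreme than the limits
    type_i = np.delete(flagged, studied_idx).mean()     # fitting items wrongly flagged
    type_ii = 0 if flagged[studied_idx] else 1          # 1 = misfitting item missed
    return type_i, type_ii
```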
Results
Table 1 provides the descriptive statistics, collapsed across cells of the experimental
design, for each fit index from the original and bootstrap samples and the upper and lower
critical values generated through the bootstrap process. It is clear that the bootstrap samples
exhibited fit index distributions that were nearly identical to those obtained from the original
samples from which the bootstraps were drawn. It is also interesting to note that the bootstrap
lower and upper critical values are considerably narrower than the rule-of-thumb values that are
typically adopted in practice. For example, the UMS and WMS intervals range from lows around
0.85 to highs around 1.20, compared to rule-of-thumb values of 0.70 and 1.30. Similarly, the
bootstrap critical values for the standardized versions of these fit indices are considerably
narrower than the rule-of-thumb values of ±2.00.
Table 2 presents the Type I and Type II error rates based on the rule-of-thumb and the
bootstrap critical values. Rule-of-thumb critical values produced Type I error rates that were
considerably less than what is typically adopted as a desired error rate (e.g., 0.05). In all cases,
bootstrap critical values produced higher Type I error rates, although those error rates were
closer to the optimal rate of 0.05 for UMS and ZUMS. The bootstrap critical values for WMS
and ZWMS, on the other hand, produced Type I error rates that were considerably higher than
expected. We conducted general linear modeling analyses to determine whether the independent
variables were related to the WMS Type I error rates produced by the rule-of-thumb and
bootstrap critical values.1 The five-way model for the bootstrap critical values produced an R-
squared value of .04, suggesting that those critical values were not influenced by sample size, test
length, distribution offset, guessing, or item discrimination.
The five-way model for the rule-of-thumb critical values, on the other hand, produced an
R-squared value of .32. In this model, the sample size-by-test length interaction was statistically
significant, although the effect size (based on the Type III sum of squares) was not substantial;
F(1, 11968) = 10.24, p = .001, η² = .0006. Both of the main effects for these variables were also
statistically significant with small effect sizes. Figure 1 displays the two-way interaction between
sample size and test length as they relate to the Type I error rates produced by the rule-of-thumb
critical values. Clearly, as test length and sample size increase, Type I error rates decrease.
Type I error rates are highest for short tests administered to small samples, and when sample size is large, test length does not have a profound impact on the Type I error rate.
Concerning the Type II error rates shown in Table 2, it is clear that the bootstrap critical
values produced considerably more powerful misfit decisions. In fact, the rule-of-thumb critical
values did not identify a single case of the misfit that we simulated. On the other hand, the
statistical power of the bootstrap-based fit indices ranged from a low of .64 (ZUMS) to a high of .72 (ZWMS).
We conducted a logistic regression to determine whether bootstrap Type II error rate was
associated with the independent variables that we simulated. Those analyses indicated that the
three-way interaction between distribution offset, slope, and lower asymptote produced a result that was not statistically significant but did have a large effect size; χ²(1) = 0.57, p = .45, OR = 2.35. Figure 2 summarizes the relevant mean WMS Type II error rates
based on the bootstrap critical values for the three-way interaction between distribution offset,
studied item slope, and studied item lower asymptote. That figure indicates that when slope
equals 0.50, Type II error rates for WMS range between 0.16 and 0.32, with the error rate being fairly consistent across levels of distribution offset. Similarly, when slope equals 2.00 and the asymptote equals 0.00, Type II error rates for WMS are at their lowest, ranging from 0.08 to 0.13, and are also consistent across levels of item distribution offset. However, when items exhibited high discrimination (i.e., slope = 2.00) and a lower asymptote of 0.25, an item distribution offset of -1.00 (i.e., making items relatively easy for the examinees) increased the Type II error rate slightly when compared to offsets of 0.00 and 1.00. On the other hand, when items had moderate discrimination (i.e., slope = 1.00) and a lower asymptote of 0.25, increasing item difficulty (i.e., offset = 1.00) increased the Type II error rate considerably. In fact, these Type II
error rates were the highest observed, ranging from a low of .44 to a high of .70.
Discussion
Our results indicate that bootstrap critical values allow for greater statistical power in
diagnosing item misfit caused by varying item slopes and lower asymptotes. Rule-of-thumb
critical values were generally wider than those produced by bootstrap procedures, and the
validity of those critical values varied as a function of sample size and test length, which is
consistent with previous research conducted by Smith (1988; 1991). In our simulations, the rule-
of-thumb critical values did not detect any of the simulated item misfit. On the other hand,
bootstrap critical values produced relatively low Type II error rates for all combinations of misfit
except one. Specifically, the average Type II error rate was around .30 for all four fit indices, and
was greater than 0.50 only when the misfitting item was modeled to exhibit moderate
discrimination and guessing.
Our study is limited because we conducted a relatively small number of iterations per cell
of the experimental design (50) and a relatively small number of bootstrap samples per
iteration (also 50). Our results are also limited by the fact that we simulated only a single
misfitting item in each data file. While this approach is consistent with the typical methodology
utilized in studies of differential item functioning, it is unlikely that this mimics real world
applications of the Rasch model to dichotomous data. Future studies should consider the
proportion of misfitting items as a potential independent variable.
References
Dimitrov, D.M., & Smith, R.M. (2006). Adjusted Rasch person-fit statistics. Journal of Applied
Measurement, 7, 170-183.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7,
1-26.
Hesterberg, T., Moore, D.S., Monaghan, S., Clipson, A., & Epstein, R. (2005). Bootstrap methods and permutation tests (2nd ed.). Retrieved from http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf
Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied
Measurement, 1, 152-176.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six
person fit statistics. Applied Measurement in Education, 16, 277-298.
Linacre, J.M. (2009). WINSTEPS Rasch measurement computer program. Chicago:
Winsteps.com.
Molenaar, I.W., & Hoijtink, H. (1990). The many null distributions of person fit indices.
Psychometrika, 55, 75-106.
Smith, R.M. (1988). The distributional properties of Rasch standardized residuals. Educational
and Psychological Measurement, 48, 657-667.
Smith, R.M. (1991). The distributional properties of Rasch item fit statistics. Educational and
Psychological Measurement, 51, 541-565.
Smith, R.M. (2000). Fit analysis in latent trait measurement models. Journal of Applied
Measurement, 1, 199-218.
Smith, R.M., Schumacker, R.E., & Bush, M.J. (1998). Using item mean squares to evaluate fit to
the Rasch model. Journal of Outcome Measurement, 2, 66-78.
Stone, C.A. (2007). IRTFIT_RESAMPLE: A computer program for assessing goodness of fit of
item response theory models based on posterior expectations. Applied Psychological
Measurement, 28, 143-144.
Stone, C.A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A
comparison of traditional and alternative procedures. Journal of Educational
Measurement, 40, 331-352.
Su, Y.H., Sheu, C.F., & Wang, W.C. (2007). Computing confidence intervals of item fit in the family of Rasch models using the bootstrap method. Journal of Applied Measurement,
8, 190-203.
Van den Wollenberg, A.L. (1982). Two new test statistics for the Rasch model. Psychometrika, 47,
123-139.
Wang, W.C., & Chen, C.T. (2005). Item parameter recovery, standard error estimates, and fit
statistics of the Winsteps program for the family of Rasch models. Educational and
Psychological Measurement, 65, 376-404.
Wilson, E.B., & Hilferty, M.M. (1931). The distribution of chi-square. Proceedings of the
National Academy of Sciences of the United States of America, 17, 684-688.
Wolfe, E.W. (2008). RBF.sas (Rasch Bootstrap Fit): A SAS macro for estimating critical values
for Rasch model fit statistics. Applied Psychological Measurement, 32, 585-586.
Wright, B.D., & Linacre, J.M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL:
MESA Press.
Footnotes
1 In this article, we report only the results of the analysis of the WMS index, chosen primarily
because of its high level of statistical power in the bootstrap analyses. However, the results that
we report are consistent with those obtained for the remaining three fit indices considered in this
study.
Table 1
Descriptive statistics for item fit indices
Index   Statistic   Original   Bootstrap   Bootstrap   Bootstrap
                    Samples    Samples     LCV         UCV
-----------------------------------------------------------------
UMS     Mean         1.00       1.00        0.90        1.12
        SE           --         --          0.02        0.03
        SD           0.05       0.06        0.04        0.05
        Minimum      0.88       0.87        0.75        1.05
        Maximum      1.12       1.17        0.96        1.29
ZUMS    Mean        -0.03       0.00       -1.45        1.58
        SE           --         --          0.44        0.48
        SD           0.79       0.91        0.14        0.15
        Minimum     -1.97      -2.26       -1.96        1.04
        Maximum      1.90       2.89       -0.93        2.17
WMS     Mean         1.00       1.00        0.84        1.21
        SE           --         --          0.09        0.17
        SD           0.13       0.14        0.07        0.11
        Minimum      0.72       0.69        0.52        1.06
        Maximum      1.48       1.52        1.02        2.20
ZWMS    Mean        -0.03       0.00       -1.24        1.41
        SE           --         --          0.57        0.72
        SD           0.88       0.99        0.14        0.16
        Minimum     -1.89      -2.17       -1.72        0.76
        Maximum      2.32       3.20       -0.72        2.06

Note: These represent the averaged values, across 12,000 replications, of the within-cell
descriptive statistics. LCV = lower critical value. UCV = upper critical value.
Table 2
Error rates
            Type I                      Type II
Index       ROT CVs   Bootstrap CVs    ROT CVs   Bootstrap CVs
---------------------------------------------------------------
UMS          0.00      0.04             1.00      0.34
ZUMS         0.02      0.07             1.00      0.36
WMS          0.03      0.13             1.00      0.31
ZWMS         0.03      0.13             1.00      0.28

Note: These represent the averaged error rates across 12,000 replications. ROT = rule of thumb.
CV = critical value.
Figure Captions
Figure 1. ROT WMS Type I Error Rate Sample Size-by-Test Length Interaction
Figure 2. Bootstrap WMS Type II Error Rate Displacement-by-Type-of-Misfit Interaction