Comparison of Asymptotic and Bootstrap Item Fit Indices in Identifying Misfit to the Rasch Model National Conference on Measurement in Education ...
Comparison of Asymptotic and Bootstrap Item Fit Indices in Identifying Misfit to the Rasch Model

National Conference on Measurement in Education
New Orleans, LA

Edward W. Wolfe
Michael T. McGill

April 2011
ASYMPTOTIC & BOOTSTRAP FIT INDICES 1 Author Note Edward W. Wolfe, Research Services, Assessment & Information, Pearson. Correspondence concerning this article should be addressed to Edward W. Wolfe, 2510 N. Dodge St., Mailstop 125, Iowa City, IA 52245-9945. E-mail: ed.wolfe@pearson.com.
Abstract

Rule-of-thumb critical values are not suitable for flagging items for misfit to the Rasch model because the null distributions of mean-square fit indices are not well understood. Bootstrap procedures may be better suited for identifying appropriate critical values, but the accuracy of those procedures has not been studied. In this study, data were generated according to the dichotomous Rasch model, and violations of the lower asymptote and common slope assumptions were introduced into the simulated data while altering the number of examinees, number of items, and item difficulty/person ability distribution offset. For each cell of the experimental design, the proportion of items that satisfied the Rasch model assumptions but were flagged for misfit (Type I errors) and the proportion of items modeled to violate assumptions that were not flagged for misfit (Type II errors) were compared to the analogous flag rates using rule-of-thumb and distribution-corrected critical values. Results suggest that Type II error rates are much lower for critical values based on bootstrap procedures and that distribution offset and type of misfit influence the accuracy of misfit diagnosis.

Keywords: Rasch model, model-data fit, bootstrap
Comparison of Asymptotic and Bootstrap Item Fit Indices in Identifying Misfit to the Rasch Model

Item fit statistics are commonly used in applications of the Rasch model to aid in selection of items after field testing or retention of items in operational contexts. Four of those fit statistics include the weighted and unweighted mean-squared fit statistics and the standardized versions of these two fit statistics (Smith, 2000). The mean squared fit statistics (Wright & Masters, 1982) are based on the standardized residual of the observed response for each person and item combination from the modeled expectation, given the parameter estimates,

$$ z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}} \tag{1} $$

where

$x_{ni}$ = the observed response of person n to item i,

$E_{ni} = \sum_{k=0}^{m} k \, \pi_{nik}$, the expected response of person n to item i,

$W_{ni} = \sum_{k=0}^{m} (k - E_{ni})^2 \, \pi_{nik}$, the variance of that response,

k = the scored responses, ranging from 0 to m, and

$\pi_{nik}$ = the model-based probability that person n will have an observed response in category k.

Unweighted mean squared fit statistics for items are computed as the average of the squared standardized residuals across all persons associated with that item,

$$ UMS_i = \frac{\sum_{n=1}^{N} z_{ni}^2}{N} \tag{2} $$
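As a rough illustration of Equations 1 and 2 in the dichotomous case (m = 1, so that the expected response equals the probability of a correct response, p, and the variance equals p(1 - p)), the computation might be sketched as follows. The function names and the simulated data are hypothetical, and the sketch uses the generating parameters directly, whereas in practice estimated person and item parameters would be used.

```python
import math
import random

def rasch_prob(theta, b):
    """Probability of a correct (1) response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standardized_residual(x, theta, b):
    """z_ni from Equation 1; for dichotomous items E = p and W = p(1 - p)."""
    p = rasch_prob(theta, b)
    return (x - p) / math.sqrt(p * (1.0 - p))

def unweighted_mean_square(responses, thetas, b):
    """UMS_i from Equation 2: mean squared standardized residual over persons."""
    z2 = [standardized_residual(x, t, b) ** 2 for x, t in zip(responses, thetas)]
    return sum(z2) / len(z2)

# Illustrative data (an assumption, not from the study):
# 500 simulated persons answering one item of difficulty 0.
rng = random.Random(1)
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]
responses = [1 if rng.random() < rasch_prob(t, 0.0) else 0 for t in thetas]
ums = unweighted_mean_square(responses, thetas, 0.0)
print(round(ums, 2))
```

Because the simulated responses conform to the model, the printed value should fall near the expected value of 1.00.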
Weighted mean squared fit statistics for items are computed as the average of the squared standardized residuals across all persons associated with that item, each squared standardized residual weighted by its variance,

$$ WMS_i = \frac{\sum_{n=1}^{N} z_{ni}^2 W_{ni}}{\sum_{n=1}^{N} W_{ni}} \tag{3} $$

Each of these statistics can also be standardized via the Wilson-Hilferty cube root transformation (Wilson & Hilferty, 1931) to obtain the standardized unweighted and weighted mean square fit statistics (ZUMS and ZWMS) (Wright & Masters, 1982). Analogous person fit statistics are obtained by averaging the unweighted or weighted squared standardized residuals for a particular person across all associated items.

Historically, rule-of-thumb upper and lower limits for acceptable mean square fit values have been established for flagging items, such as 0.70 and 1.30 for multiple-choice items and 0.60 and 1.40 for rating scales (Wright & Linacre, 1994). Unfortunately, simulation studies have shown that these rule-of-thumb values may be inappropriate for many applied situations (Smith, 1991; Smith, Schumacker, & Bush, 1998; Wang & Chen, 2005). Hence, users are faced with a quandary. How does one interpret a fit statistic if the distribution of the values of that statistic, and hence the range of reasonable values, is not known?

This article compares the application of these rule-of-thumb critical values and analogous bootstrap critical values to the identification of model-data misfit. Specifically, we generate simulated data that contain violations of the common item slope and zero lower asymptote assumptions upon which the Rasch model is based, apply bootstrap and rule-of-thumb critical values to identify simulated cases of misfit, and compare the Type I error rates and statistical
power of these two methods for specifying item fit critical values while altering sample size, test length, and item/person distribution offset.

Rasch Fit Research

It has long been known that the null distributions of commonly employed fit statistics do not follow a distribution with a known parametric form (Karabatsos, 2000; Molenaar & Hoijtink, 1990; Smith, 1988, 1991; Wang & Chen, 2005). The variability of the null distributions of the mean square fit indices varies as a function of the number of observations and the shapes of the distributions of persons and items. In addition, distributions of the standardized mean square fit statistics deviate from the assumed mean and variance of 0 and 1, respectively, when their computation is based on estimated person and item parameters (Smith, 1991).

As an example of the problems associated with interpreting fit indices, in data simulated to fit the Rasch dichotomous model, appropriate critical values for UMS may vary from 0.75 to 1.30 for lower and upper critical values, respectively, for relatively small sample sizes (150) and 0.95 to 1.10 for larger sample sizes (1000) (Smith, Schumacker, & Bush, 1998). Similarly variable ranges have been observed for ZUMS. Hence, any interpretation of those statistics cannot rely on a single set of critical values and must instead take into account characteristics of the dataset (Smith, Schumacker, & Bush, 1998). Unfortunately, this fact is typically ignored in Rasch measurement applications.

Adjustments have been proposed for the deviations of empirical fit statistics from their assumed distributions. For example, Smith (1991) conducted a simulation study to determine the distributional properties of the mean-squared item fit statistics, and he determined that an adjustment of these indices, one that takes into account the number of items, number of persons, and the offset between the item difficulty and abilities of the persons responding to that item,
produces indices that are closer to hypothesized distributions. A follow-up study (Smith, Schumacker, & Bush, 1998) yielded a simpler recommended correction for cases where the means of the person ability and item difficulty distributions were comparable. Specifically, they recommend critical values for WMS equal to

$$ WMS^{*} = 1 + \frac{2}{\sqrt{N}}, \tag{4} $$

and for UMS equal to

$$ UMS^{*} = 1 + \frac{6}{\sqrt{N}}. \tag{5} $$

These adjustments suggest upper critical values for item fit equal to 1.20, 1.09, and 1.06 for WMS when sample sizes equal 100, 500, and 1000, respectively. These values are considerably smaller than the rule-of-thumb values.

More recently, Dimitrov and Smith (2006) evaluated an adjustment of fit statistics that replaces the Rasch estimate of the probability of a particular response with a more accurate estimate proposed by Van den Wollenberg (1982), finding that the adjustment resulted in a small but consistent improvement in Type II error rates. Similarly, research conducted by Stone (Stone, 2003; Stone & Zhang, 2003) evaluated a Bayesian adjustment for item-response model parameter estimates in the analysis of fit and found an improvement in both Type I and Type II error rates, particularly for small sample sizes.

An alternative to these approaches, which adjust the values of the fit statistics or their critical values, is to utilize non-parametric measures of fit. Karabatsos (2003) found that, of 36 person fit statistics, four of the five indices that were best at identifying aberrant person responses were non-parametric indices. Unfortunately, neither adjusted nor non-parametric fit indices are readily available in commercial measurement
packages, and they will likely not make their way into measurement practice until adopted by software publishers.

Bootstrap procedures for identifying critical values for fit statistics are easy to implement and, in fact, have been implemented in three separate programs that interface with commercial software to identify reasonable fit statistics, given a set of estimated item and person parameters (Stone, 2007; Su, Sheu, & Wang, 2007; Wolfe, 2008). The remainder of this article explains how those procedures are implemented and compares decisions based on those procedures to those based on traditional rule-of-thumb values when evaluating cases of model-data misfit to the Rasch model.

Bootstrap Procedure

Efron (1979) described the nonparametric bootstrap as a method for estimating the sampling distribution of a random variable through empirical resampling methods. Specifically, the nonparametric bootstrap constructs an empirical estimate of the unknown sampling distribution by generating a probability distribution of the statistic across a large number of resamplings of an original sample via sampling with replacement. That is, the discrete and empirical distribution of the original observed sample is treated as a population from which a large number (B, typically about 1,000) of resamples of size N are drawn repeatedly. The statistic of interest is computed for each of these resamples, and the distribution of these statistics serves as an empirical estimate of the sampling distribution of the statistic. A similar procedure, known as parametric bootstrapping, can be performed by resampling from a hypothetical distribution (e.g., a normal distribution) rather than from a single empirical sample. Bootstrap methods such as these are known to produce sampling distribution estimates that exhibit bias, spread, and shape similar to that of the parametric sampling distribution, but the empirical methods are not
subject to the assumptions that are required by parametric methods for estimating sampling distributions (Hesterberg, Moore, Monaghan, Clipson, & Epstein, 2005).

In the context of item response models, the bootstrap procedure can be extended to determine the shapes of null distributions for fit statistics by computing fit statistics from datasets that are generated to fit the model in question. In this context, the analyst would (1) estimate item and person parameters based on the original sample, (2) randomly select values of item and person parameters from those estimated values, (3) for each of the B resamples, generate simulated datasets that fit the item response model, (4) compute the statistic of interest for each of the resamples, (5) compute averages of the statistic of interest across the B resamples, and (6) compare the value of the statistic of interest to the averaged bootstrap values. Such a comparison depicts the degree to which the values of the statistic in question produced by the original data deviate from those that would be observed if the data demonstrated expected fit to the item response model.

Method

In this study, data were generated according to the Rasch dichotomous model for non-studied items and the three-parameter logistic model for studied items. These data were then scaled to the Rasch dichotomous model, and bootstrap and rule-of-thumb critical values were applied to the original estimated model fit indices in order to assess the accuracy of model-data fit diagnosis in the presence and absence of violated model assumptions. In addition, we altered the number of simulated examinees, number of items, and item difficulty/person ability distribution mean offset in the simulated data. For each cell of the experimental design, the proportion of items that satisfy the Rasch model assumptions that were flagged for misfit (Type I
errors) and the proportion of items modeled to violate assumptions that were not flagged for misfit (Type II errors) were compared using bootstrap and rule-of-thumb critical values.

Variables

The independent variables manipulated in this study included the number of simulated examinees [100, 200, 500, 1000] and the number of simulated items [20, 40, 80, 160]. We also varied the difference in the means of the simulated examinee ability and item difficulty distributions [-1.00, 0.00, 1.00]. The nature of misfit for the single studied item in each simulated data file was controlled by altering the generating item slope and lower asymptote. For non-studied items, the generating item slope and lower asymptote were set to 1.00 and 0.00, respectively. For the studied item, slope could take on the values of [0.50, 1.00, 2.00], and the lower asymptote could take on the values of [0.00, 0.25]. We did not generate data for the studied item that conformed to the Rasch model assumptions (i.e., slope = 1.00 and lower asymptote = 0.00). This resulted in an experimental design containing 240 cells: 4 sample sizes × 4 test lengths × 3 distribution offsets × 5 combinations of model-data misfit. Each cell was replicated 50 times.

Simulation Process

Examinee ability was generated from a N(0, 1) distribution, and item difficulty was generated from a N(μ, 1) distribution, where μ varied depending on the level of the distribution offset variable. Once the original data were generated, parameters and fit indices were estimated for the simulated data file using Winsteps (Linacre, 2009). Based on those parameter estimates, 50 bootstrap samples (i.e., sampling from the estimated item and ability parameter values with replacement) were generated according to the Rasch dichotomous model. For each bootstrap
sample, parameters and fit indices were estimated using Winsteps, and the 2.5th and 97.5th percentile values of each fit index (UMS, ZUMS, WMS, and ZWMS) within each bootstrap data file were determined. Bootstrap critical values were determined by averaging the 2.5th and 97.5th percentile values for each fit index across the 50 bootstrap samples that were generated for each original data file.

Rule-of-thumb and bootstrap critical values were applied to the studied and non-studied item fit statistics produced for each original data file. Each non-studied item was compared to the rule-of-thumb and bootstrap critical values, and the item was declared to exhibit misfit (a Type I error) if the value of the estimated item fit index was more extreme than the critical value limits. Misfit decisions for the non-studied items were coded 0 (not declared to misfit) or 1 (declared to misfit), and these codes were averaged within each data file (across non-studied items) to determine the within-data-file Type I error rate. Type I error rates within cells of the experimental design were determined by averaging these within-data-file Type I error rates.

Similarly, the estimated fit value for each studied item was compared to the rule-of-thumb and bootstrap critical values, and the item was declared to misfit if the estimated value was more extreme than the critical value limits. Misfit decisions for the studied items were coded 0 (declared to misfit) or 1 (not declared to misfit). Type II error rates within cells of the experimental design were determined by averaging these 0/1 codes across data sets within a cell of the experimental design.

Results

Table 1 provides the descriptive statistics, collapsed across cells of the experimental design, for each fit index from the original and bootstrap samples and the upper and lower critical values generated through the bootstrap process. It is clear that the bootstrap samples
exhibited fit index distributions that were nearly identical to those obtained from the original samples from which the bootstraps were drawn. It is also interesting to note that the intervals defined by the bootstrap lower and upper critical values are considerably narrower than the rule-of-thumb values that are typically adopted in practice. For example, the UMS and WMS intervals range from lows around 0.85 to highs around 1.20, compared to rule-of-thumb values of 0.70 and 1.30. Similarly, the bootstrap critical values for the standardized versions of these fit indices are considerably narrower than the rule-of-thumb values of ±2.00.

Table 2 presents the Type I and Type II error rates based on the rule-of-thumb and the bootstrap critical values. Rule-of-thumb critical values produced Type I error rates that were considerably less than what is typically adopted as a desired error rate (e.g., 0.05). In all cases, bootstrap critical values produced higher Type I error rates, although those error rates were closer to the optimal rate of 0.05 for UMS and ZUMS. The bootstrap critical values for WMS and ZWMS, on the other hand, produced Type I error rates that were considerably higher than expected.

We conducted general linear modeling analyses to determine whether the independent variables were related to the WMS Type I error rates produced by the rule-of-thumb and bootstrap critical values.¹ The five-way model for the bootstrap critical values produced an R-squared value of .04, suggesting that the Type I error rates produced by those critical values were not influenced by sample size, test length, distribution offset, guessing, or item discrimination. The five-way model for the rule-of-thumb critical values, on the other hand, produced an R-squared value of .32. In this model, the sample size-by-test length interaction was statistically significant, although the effect size (based on the Type III sum of squares) was not substantial; F(1, 11968) = 10.24, p = .001, η² = .0006.
Both of the main effects for these variables were also statistically significant with small effect sizes. Figure 1 displays the two-way interaction between
sample size and test length as they relate to the Type I error rates produced by the rule-of-thumb critical values. Clearly, as test length and sample size increase, Type I error rates decrease. Specifically, Type I error rates are highest for short tests administered to small samples. When sample size is large, test length does not have a profound impact on Type I error rate.

Concerning the Type II error rates shown in Table 2, it is clear that the bootstrap critical values produced considerably more powerful misfit decisions. In fact, the rule-of-thumb critical values did not identify a single case of the misfit that we simulated. With the bootstrap critical values, on the other hand, the statistical power of the fit indices ranged from a low of .64 (ZUMS) to a high of .72 (ZWMS). We conducted a logistic regression to determine whether bootstrap Type II error rate was associated with the independent variables that we simulated. Those analyses indicated that the three-way interaction between distribution offset, slope, and lower asymptote produced a result that was not statistically significant but did have a large effect size; χ²(1) = 0.57, p = .45, OR = 2.35.

Figure 2 summarizes the relevant mean WMS Type II error rates based on the bootstrap critical values for the three-way interaction between distribution offset, studied item slope, and studied item lower asymptote. That figure indicates that when slope equals 0.50, Type II error rates for WMS range between 0.16 and 0.32, with the error rate being fairly consistent across levels of distribution offset. Similarly, when slope equals 2.00 and the asymptote equals 0.00, Type II error rates for WMS are at their lowest, ranging from 0.08 to 0.13, and are also consistent across levels of item distribution offset.
However, when items exhibited high discrimination (i.e., slope = 2.00) and the lower asymptote equaled 0.25, an item distribution offset of -1.00 (i.e., making items relatively easy for the examinees) increased the Type II error rate slightly when compared to offsets of 0.00 and 1.00. On the other hand, when items had moderate discriminations (i.e., slope = 1.00) and the lower asymptote equaled 0.25, increasing
item difficulty (i.e., offset = 1.00) increased Type II error rate considerably. In fact, these Type II error rates were the highest observed, ranging from a low of .44 to a high of .70.

Discussion

Our results indicate that bootstrap critical values allow for greater statistical power in diagnosing item misfit caused by varying item slopes and lower asymptotes. Rule-of-thumb critical values were generally wider than those produced by bootstrap procedures, and the validity of those critical values varied as a function of sample size and test length, which is consistent with previous research conducted by Smith (1988, 1991). In our simulations, the rule-of-thumb critical values did not detect any of the simulated item misfit. On the other hand, bootstrap critical values produced relatively low Type II error rates for all combinations of misfit except one. Specifically, the average Type II error rate was around .30 for all four fit indices, and was greater than 0.50 only when the misfitting item was modeled to exhibit moderate discrimination and guessing.

Our study is limited because we conducted a relatively small number of iterations per cell of the experimental design (50) and conducted a relatively small number of bootstraps per iteration (also 50). Our results are also limited by the fact that we simulated only a single misfitting item in each data file. While this approach is consistent with the typical methodology utilized in studies of differential item functioning, it is unlikely that this mimics real world applications of the Rasch model to dichotomous data. Future studies should consider the proportion of misfitting items as a potential independent variable.
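For readers who wish to experiment with the bootstrap procedure evaluated in this study, the following sketch outlines one way the critical values might be computed. It is a simplified illustration rather than the implementation used here: all function names are hypothetical, only UMS and WMS are computed, and the fit indices are calculated from the resampled generating parameters instead of re-estimating parameters for each bootstrap sample with Winsteps, as was done in the study.

```python
import math
import random

def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(data, thetas, b, item):
    """Return (UMS, WMS) for one item, following Equations 2 and 3."""
    z2_sum = 0.0   # sum of squared standardized residuals
    r2_sum = 0.0   # sum of squared raw residuals (equals sum of z^2 * W)
    w_sum = 0.0    # sum of response variances
    for n, theta in enumerate(thetas):
        p = rasch_prob(theta, b)
        w = p * (1.0 - p)
        r2 = (data[n][item] - p) ** 2
        z2_sum += r2 / w
        r2_sum += r2
        w_sum += w
    return z2_sum / len(thetas), r2_sum / w_sum

def percentile(values, q):
    """Percentile by linear interpolation, with q between 0 and 100."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def bootstrap_critical_values(theta_hat, b_hat, n_boot=50, seed=0):
    """Average the 2.5th and 97.5th percentiles of each index over n_boot samples."""
    rng = random.Random(seed)
    names = ("UMS", "WMS")
    lows = {name: [] for name in names}
    highs = {name: [] for name in names}
    for _ in range(n_boot):
        # Resample person and item parameters with replacement.
        thetas = rng.choices(theta_hat, k=len(theta_hat))
        bs = rng.choices(b_hat, k=len(b_hat))
        # Generate a data set that fits the Rasch model.
        data = [[1 if rng.random() < rasch_prob(t, b) else 0 for b in bs]
                for t in thetas]
        # Simplification: fit is computed from the generating parameters.
        fits = [item_fit(data, thetas, b, i) for i, b in enumerate(bs)]
        for k, name in enumerate(names):
            vals = [f[k] for f in fits]
            lows[name].append(percentile(vals, 2.5))
            highs[name].append(percentile(vals, 97.5))
    return {name: (sum(lows[name]) / n_boot, sum(highs[name]) / n_boot)
            for name in names}

# Hypothetical "estimated" parameters: 200 persons and 20 items.
rng = random.Random(42)
theta_hat = [rng.gauss(0.0, 1.0) for _ in range(200)]
b_hat = [rng.gauss(0.0, 1.0) for _ in range(20)]
cvs = bootstrap_critical_values(theta_hat, b_hat)
for name, (lcv, ucv) in cvs.items():
    print(name, round(lcv, 2), round(ucv, 2))
```

The default of 50 bootstrap samples mirrors the number used in this study; a larger value of n_boot would be expected to yield more stable critical values at the cost of additional computation.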
References

Dimitrov, D.M., & Smith, R.M. (2006). Adjusted Rasch person-fit statistics. Journal of Applied Measurement, 7, 170-183.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1-26.

Hesterberg, T., Moore, D.S., Monaghan, S., Clipson, A., & Epstein, R. (2005). Bootstrap methods and permutation tests. 2nd ed., from http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf.

Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied Measurement, 1, 152-176.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person fit statistics. Applied Measurement in Education, 16, 277-298.

Linacre, J.M. (2009). WINSTEPS Rasch measurement computer program. Chicago: Winsteps.com.

Molenaar, I.W., & Hoijtink, H. (1990). The many null distributions of person fit indices. Psychometrika, 55, 75-106.

Smith, R.M. (1988). The distributional properties of Rasch standardized residuals. Educational and Psychological Measurement, 48, 657-667.

Smith, R.M. (1991). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.

Smith, R.M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Measurement, 1, 199-218.
Smith, R.M., Schumacker, R.E., & Bush, M.J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66-78.

Stone, C.A. (2007). IRTFIT_RESAMPLE: A computer program for assessing goodness of fit of item response theory models based on posterior expectations. Applied Psychological Measurement, 28, 143-144.

Stone, C.A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40, 331-352.

Su, Y.H., Sheu, C.F., & Wang, W.C. (2007). Computing confidence intervals of item fit in the family of Rasch models using the bootstrap method. Journal of Applied Measurement, 8, 190-203.

Van den Wollenberg, A.L. (1982). Two new test statistics for the Rasch model. Psychometrika, 47, 123-139.

Wang, W.C., & Chen, C.T. (2005). Item parameter recovery, standard error estimates, and fit statistics of the Winsteps program for the family of Rasch models. Educational and Psychological Measurement, 65, 376-404.

Wilson, E.B., & Hilferty, M.M. (1931). The distribution of chi-square. Proceedings of the National Academy of Sciences of the United States of America, 17, 684-688.

Wolfe, E.W. (2008). RBF.sas (Rasch Bootstrap Fit): A SAS macro for estimating critical values for Rasch model fit statistics. Applied Psychological Measurement, 32, 585-586.

Wright, B.D., & Linacre, M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA.
Footnotes

¹ In this article, we report only the results of analysis of the WMS index, chosen primarily because of its high level of statistical power in the bootstrap analyses. However, the results that we report are consistent with those obtained for the remaining three fit indices considered in this study.
Table 1
Descriptive statistics for item fit indices

Index  Statistic  Original  Bootstrap  Bootstrap  Bootstrap
                  Samples   Samples    LCV        UCV
UMS    Mean          1.00      1.00       0.90       1.12
       SE              --        --       0.02       0.03
       SD            0.05      0.06       0.04       0.05
       Minimum       0.88      0.87       0.75       1.05
       Maximum       1.12      1.17       0.96       1.29
ZUMS   Mean         -0.03      0.00      -1.45       1.58
       SE              --        --       0.44       0.48
       SD            0.79      0.91       0.14       0.15
       Minimum      -1.97     -2.26      -1.96       1.04
       Maximum       1.90      2.89      -0.93       2.17
WMS    Mean          1.00      1.00       0.84       1.21
       SE              --        --       0.09       0.17
       SD            0.13      0.14       0.07       0.11
       Minimum       0.72      0.69       0.52       1.06
       Maximum       1.48      1.52       1.02       2.20
ZWMS   Mean         -0.03      0.00      -1.24       1.41
       SE              --        --       0.57       0.72
       SD            0.88      0.99       0.14       0.16
       Minimum      -1.89     -2.17      -1.72       0.76
       Maximum       2.32      3.20      -0.72       2.06

Note: These represent the averaged values, across 12,000 replications, of the within cell descriptive statistics. LCV = Lower critical value. UCV = Upper critical value.
Table 2
Error rates

              Type I                     Type II
Index   ROT CVs  Bootstrap CVs     ROT CVs  Bootstrap CVs
UMS       0.00       0.04            1.00       0.34
ZUMS      0.02       0.07            1.00       0.36
WMS       0.03       0.13            1.00       0.31
ZWMS      0.03       0.13            1.00       0.28

Note: These represent the averaged error rates across 12,000 replications. ROT = rule-of-thumb. CV = critical value.
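The error rates in Table 2 are averages of the 0/1 misfit decisions described in the Method section. The sketch below illustrates that coding for a single data file; the function names and the item fit values are hypothetical and are not taken from the study, while the critical values are the multiple-choice rule-of-thumb limits and the mean bootstrap WMS limits from Table 1. The final lines show that the statistical power values cited in the Results are the complements of the bootstrap Type II error rates in Table 2.

```python
def flag_misfit(fit_value, lower_cv, upper_cv):
    """Return 1 if the fit value lies outside the critical-value limits, else 0."""
    return 1 if (fit_value < lower_cv or fit_value > upper_cv) else 0

def type_i_error_rate(non_studied_fits, lower_cv, upper_cv):
    """Proportion of model-conforming (non-studied) items that are flagged."""
    flags = [flag_misfit(v, lower_cv, upper_cv) for v in non_studied_fits]
    return sum(flags) / len(flags)

def type_ii_error(studied_fit, lower_cv, upper_cv):
    """Return 1 if the misfitting (studied) item is NOT flagged, else 0."""
    return 1 - flag_misfit(studied_fit, lower_cv, upper_cv)

# Hypothetical WMS values for one data file (illustrative only).
non_studied = [0.95, 1.02, 1.08, 0.88, 1.25, 0.99, 1.01, 1.10]
studied = 1.27

rot = (0.70, 1.30)    # rule-of-thumb limits for multiple-choice items
boot = (0.84, 1.21)   # mean bootstrap WMS limits reported in Table 1

print(type_i_error_rate(non_studied, *rot), type_ii_error(studied, *rot))
print(type_i_error_rate(non_studied, *boot), type_ii_error(studied, *boot))

# Power is the complement of the Type II error rate (bootstrap column, Table 2).
type_ii_bootstrap = {"UMS": 0.34, "ZUMS": 0.36, "WMS": 0.31, "ZWMS": 0.28}
power = {index: round(1 - rate, 2) for index, rate in type_ii_bootstrap.items()}
print(power)
```

In this made-up example the wider rule-of-thumb limits flag nothing, including the studied item, whereas the narrower bootstrap limits flag the studied item along with one conforming item, which is the trade-off between the two error types that Table 2 summarizes.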
Figure Captions

Figure 1. ROT WMS Type I Error Rate Sample Size-by-Test Length Interaction

Figure 2. Bootstrap WMS Type II Error Rate Displacement by Type of Misfit Interaction