The article by Douglas serves as a starting point for a discussion of the theory of fit and its practice in Rasch measurement. Although we often speak of fit as a unitary concept, there are really two underlying questions being asked when fit is discussed.
The first question concerns fit of the data to the model. If the desirable properties of Rasch measurement are to hold, then the data must approximate the model. This is important in calibrating item sets, in equating test forms, in studies of bias and in studies of the underlying definition of the variable. This question must be answered for the data as a whole, before further analyses of the data are useful. It is similar to the question asked about any statistical analysis: are the specifications of the model approximated by the data under study? In Rasch measurement many global, item, and person fit statistics have been used to address this question, including the Wright-Panchapakesan (1969) statistics.
The second question concerns the degree to which the total score that an examinee earns on a test adequately summarizes the examinee's total set of responses. This question of response fit comes later in the analysis, at the point when a decision must be made about the individual examinee based on the results of his or her particular examination performance - decisions such as admission, assigning grades, promotion, graduation, certification. A variety of person fit or appropriateness techniques have been developed to answer this question. This is not a question of the utility of the data for analysis by the measurement model, but of the meaning (validity) of the measure for the individual. It is possible, and in practice inevitable, that a favorable answer to the question of the utility of the data for analysis by a Rasch measurement model does not guarantee a favorable answer to the question concerning validity for every individual tested. No matter how hard we try to construct potentially valid tests, there will always be individual examinees for whom those tests are not valid.
To understand the first question it is necessary to understand that models are abstractions designed to bring order to observations. Real data can never fit any model perfectly. That is why simulated data must be used to develop significance values for fit indices. The relevant question is one of robustness. How robust is a set of data to violations of the model's requirements? Can the analysis extract useful information from the data? A vital property of strong models, such as the Rasch models, is that the information extracted from the data can be useful even when the data do not fit the model very well. This is because the model constructs a strong frame of reference against which the particular properties of the data are revealed. Experience has shown that data analyses guided by Rasch measurement models are quite robust to violations of the model's requirements. In particular, individual measurement disturbances seldom have tangible effects on equating or bias studies.
The alert investigator can strengthen the "person-freed" item calibration property by a priori removing, from the calibration sample, individuals whose responses exhibit measurement disturbances without actually calculating person fit indices. The BICAL item calibration program, for example, made it easy for investigators to omit extreme raw scores from item calibration, i.e., performances with scores near or below the chance level where "guessing" might occur or near perfect performances where "carelessness" might occur. The BICAL program also produced, on request, two sets of item calibrations, one with all misfitting persons (based on total weighted fit) excluded and one with them included. The alert investigator could compare these two calibrations to determine the effect, if any, of the misfitting persons on the item calibrations.
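The a priori editing described above can be sketched in a few lines of Python. This is only an illustration, not BICAL's actual procedure: the function name `drop_extreme_scores`, the use of "number of items divided by number of response options" as the chance level, and the `top_margin` parameter are all assumptions made for the example.

```python
def drop_extreme_scores(responses, n_options=4, top_margin=1):
    """Return only the persons whose raw scores are not extreme.

    responses  : dict mapping person id -> list of 0/1 item responses
    n_options  : assumed number of answer choices per item (hypothetical);
                 the chance score is taken as n_items / n_options
    top_margin : assumed distance from a perfect score that still counts
                 as "near perfect" (hypothetical)
    """
    kept = {}
    for person, xs in responses.items():
        n_items = len(xs)
        raw = sum(xs)
        chance = n_items / n_options
        if raw <= chance:
            # At or below chance: "guessing" might have occurred.
            continue
        if raw >= n_items - top_margin:
            # Perfect or near perfect: "carelessness" might distort fit.
            continue
        kept[person] = xs
    return kept
```

Items would then be calibrated on the retained persons only. The cut-offs are a judgment call and should be chosen to suit the test in hand.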
As more powerful data editors and word processors became available, these features were dropped from subsequent Rasch calibration programs. The fact remains, however, that misfit editing of the data prior to final calibration often produces more stable item calibrations.
When in doubt, run two calibrations, one including and one excluding the misfitting and/or very low and very high scoring persons, and study the differences.
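One way to "study the differences" is to compare the two sets of item calibrations item by item, relative to their standard errors. The sketch below is an assumption about how such a comparison might be done, not a procedure given in this note: each calibration is centered on its own mean, and the difference for each item is divided by the joint standard error. The function name `compare_calibrations` and the flagging threshold of 2.0 are hypothetical.

```python
import math


def compare_calibrations(cal_a, cal_b, se_a, se_b, flag_at=2.0):
    """Compare two item calibrations of the same items.

    cal_a, cal_b : dicts mapping item id -> difficulty estimate (logits)
    se_a, se_b   : dicts mapping item id -> standard error of that estimate
    flag_at      : assumed |z| value at which a difference is flagged

    Returns a list of (item, difference, z, flagged) tuples.
    """
    items = sorted(cal_a)
    # Center each calibration so that a uniform shift of origin is ignored.
    mean_a = sum(cal_a[i] for i in items) / len(items)
    mean_b = sum(cal_b[i] for i in items) / len(items)
    rows = []
    for i in items:
        diff = (cal_a[i] - mean_a) - (cal_b[i] - mean_b)
        z = diff / math.sqrt(se_a[i] ** 2 + se_b[i] ** 2)
        rows.append((i, diff, z, abs(z) >= flag_at))
    return rows
```

Items flagged here are those whose calibrations change noticeably when the suspect persons are removed; if none are flagged, the misfitting persons have had little practical effect on the calibration.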
With regard to the validity of the total score or ability estimate for a person, a second set of concerns arises. Investigators often assume that the fit indices contained in calibration programs for items and persons are sufficient to guarantee the validity of the measure for the individual against all meaningful measurement disturbances. These global fit indices may provide adequate information for answering the question as to the utility of the data for analysis by the model. However, they only begin to provide the information necessary to answer the second question. Studies by Smith point to the need to use a combination of total and between fit statistics when investigating the validity of person measures or item difficulties (Smith, 1986, 1988; Smith & Hedges, 1982). The extent of care and thoroughness needed to validate person measures depends on the importance of the decision to be made with the measures.
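To make the idea of a between fit statistic concrete, the following Python sketch computes one for a single person on dichotomous items. It assumes the commonly quoted form in which items are partitioned into subsets, the observed and expected subset scores are compared, and the squared differences are scaled by the model variance of each subset score. The function names and the grouping argument are hypothetical, and the standardization of the resulting mean square is omitted.

```python
import math


def rasch_prob(ability, difficulty):
    """Rasch probability of a correct response to a dichotomous item."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def person_between_fit(responses, ability, difficulties, groups):
    """Unstandardized between-group fit mean square for one person.

    responses    : list of 0/1 responses, one per item
    ability      : the person's estimated measure (logits)
    difficulties : list of item difficulty estimates (logits)
    groups       : list of subset labels, one per item (e.g. content area
                   or difficulty band); the partition is the analyst's
                   a priori hypothesis
    """
    labels = sorted(set(groups))
    if len(labels) < 2:
        raise ValueError("at least two item subsets are required")
    total = 0.0
    for label in labels:
        observed = expected = variance = 0.0
        for x, d, g in zip(responses, difficulties, groups):
            if g != label:
                continue
            p = rasch_prob(ability, d)
            observed += x
            expected += p
            variance += p * (1.0 - p)
        # Squared subset-score residual relative to its model variance.
        total += (observed - expected) ** 2 / variance
    return total / (len(labels) - 1)
```

A person who fails the easy subset and passes the hard one produces a large value, while a total fit statistic computed over all items without regard to subsets may be less sensitive to this kind of systematic disturbance.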
It has been implied that the size of most testing programs makes it impractical to look closely at the validity of person measures. But recent efforts by the College Board (for PSAT and SAT tests) and the Australian Council for Educational Research (KIDMAPS for grade level achievement tests in New South Wales) show that the statistical results of person fit analysis can be expressed in terms that are accessible and useful to students and parents.
The primary tools for fit analysis in Rasch measurement have been the standardized chi-square statistics based on the work of Wright and Panchapakesan (1969) and further elaborated by Mead (1975), Wright (1977), Wright and Stone (1979), and Wright and Masters (1982). Since their inception these statistics have come under criticism from several fronts. Initial criticism was based on the fact that the squared differences between observed and predicted responses for item/person interactions were only approximately chi-square. Since, however, real data never fit any ideal model, all applications of chi-square are approximations.
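The squared differences between observed and predicted responses mentioned above can be illustrated with a short Python sketch for one person on dichotomous items. It computes the unweighted and weighted total fit mean squares from standardized residuals. The function names are hypothetical, the measures are assumed to have been estimated already, and the further transformation of the mean squares into standardized fit statistics is omitted.

```python
import math


def rasch_prob(ability, difficulty):
    """Rasch probability of a correct response to a dichotomous item."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def person_total_fit(responses, ability, difficulties):
    """Return (unweighted, weighted) total fit mean squares for one person.

    responses    : list of 0/1 responses, one per item
    ability      : the person's estimated measure (logits)
    difficulties : list of item difficulty estimates (logits)
    """
    sum_z2 = 0.0        # sum of squared standardized residuals
    sum_resid2 = 0.0    # sum of squared raw residuals
    sum_var = 0.0       # sum of model variances
    for x, d in zip(responses, difficulties):
        p = rasch_prob(ability, d)
        var = p * (1.0 - p)
        resid = x - p
        sum_z2 += resid ** 2 / var
        sum_resid2 += resid ** 2
        sum_var += var
    unweighted = sum_z2 / len(responses)   # mean of squared z-residuals
    weighted = sum_resid2 / sum_var        # information-weighted version
    return unweighted, weighted
```

For a person at 0 logits facing items at -1, 0 and 1 logits, the response string 1, 1, 0 yields mean squares below 1, while the reversed string 0, 1, 1 yields mean squares well above 1, which is the kind of outlying pattern these statistics are meant to expose.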
Later criticism was that the true distributional properties of these approximate chi-squares or their transformations were unknown. A variety of alternatives have been proposed (Andersen, 1973; van den Wollenberg, 1982; Yen, 1981). But study and practice have shown that these other statistics offer no useful advantage over the Wright-Panchapakesan statistics. Work by Smith on the distribution of standardized residuals and the null distributions of standardized fit statistics has shown that even though these statistics are not "true" chi-squares, they are regular enough to identify outliers reliably.
The most recent suggestion for an alternative fit statistic, based on the exact probabilities of a given person response pattern (Molenaar and Hoijtink, 1990), is discussed in the Douglas paper. The Wright-Panchapakesan statistics are computationally simpler than the Molenaar-Hoijtink statistic and are highly correlated with the exact probabilistic results. In addition, they can be summarized to answer a priori hypotheses that are inaccessible with the Molenaar-Hoijtink statistic.
The Wright-Panchapakesan (WP) statistics and their derivatives have offered an efficient and practical way to evaluate fit to the Rasch measurement models for 20 years. The WP approximations stand up well in comparison with possibly more precise tests such as likelihood-ratio chi-squares and the Molenaar-Hoijtink statistic. Studies of the distributional properties of WP statistics show that the tails of their distributions are regular enough to identify outliers reliably. There is no practical reason to use anything more complicated.
Richard M. Smith
American Dental Association
Andersen, E.B. (1973) A goodness of fit test for the Rasch Model. Psychometrika, 38, 123-140.
Mead, R.J. (1975) Analysis of fit to the Rasch Model. Ph.D. dissertation. University of Chicago.
Smith, R.M. (1986) Person fit in the Rasch Model. Educational and Psychological Measurement, 46, 359-372.
Smith, R.M. & Hedges, L.V. (1982) A comparison of likelihood ratio chi-square and Pearsonian chi-square tests of fit in the Rasch model. Educational Research and Perspectives, 9, 44-54.
van den Wollenberg, A.L. (1982) Two new test statistics for the Rasch model. Psychometrika, 47, 123-140.
Wright, B.D. (1977) Solving measurement problems with the Rasch Model. Journal of Educational Measurement, 14, 97-116.
Wright, B.D. & Masters, G.N. (1982) Rating scale analysis. Chicago: MESA Press.
Wright, B.D. & Panchapakesan, N.A. (1969) A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
Wright, B.D. & Stone, M.H. (1979) Best test design. Chicago: MESA Press.
Yen, W.M. (1981) Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.
Theory and practice of fit. Smith RM. Rasch Measurement Transactions, 1990, 3:4 p.78
The URL of this page is www.rasch.org/rmt/rmt34b.htm