The Rasch model has the delightful feature that the probability of the occurrence of any particular set of observations can be determined precisely from the latent parameters. Furthermore no set of observations is impossible, though many are highly improbable. Every conceivable set of observations has some non-zero probability and so could occur. What is surprising is that the probability of observing even the most likely set of observations can be rather small. None of this is news to those who have scrutinized simulated sets of observations. But how many have?
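The point about the most likely set of observations can be checked by brute force on a very short test. The following is a minimal sketch, assuming the dichotomous Rasch model, one child of ability 0.5 logits, and five items whose difficulties are invented for illustration; none of these values come from a real data set.

```python
import math
from itertools import product

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def string_probability(theta, difficulties, responses):
    """Probability of one complete response string, given the latent parameters."""
    prob = 1.0
    for b, x in zip(difficulties, responses):
        p = rasch_p(theta, b)
        prob *= p if x == 1 else 1.0 - p
    return prob

# Illustrative (assumed) parameters: one child, five items.
theta = 0.5
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Enumerate every conceivable response string and its probability.
probs = {s: string_probability(theta, difficulties, s)
         for s in product([0, 1], repeat=len(difficulties))}

most_likely = max(probs, key=probs.get)
least_likely = min(probs, key=probs.get)
print("most likely string:", most_likely, probs[most_likely])
print("least likely string:", least_likely, probs[least_likely])
```

Even with only five items, the single most likely string has a probability of less than one quarter, and every one of the 32 possible strings has a non-zero probability. On a test of realistic length, the probability of the most likely string is far smaller still.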
The procedural question is to decide the point at which the Rasch-model probability of a set of observations is so small that it is no longer reasonable to consider such a set of observations as the outcome of a Rasch measurement process. The field-tested solution is to calculate fit statistics which reflect, in some useful way, improbability of a set (or sub-set) of observations. For example, a demonstrably useful, but approximate, fit statistic, based on standardized mean-square residuals (Wright and Panchapakesan 1969), may indicate that a particular set of observations, or a set less probable than it in the same way, can be expected to occur by chance only once in 100 times.
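As a rough illustration of a fit statistic of this kind, the sketch below computes an unweighted mean-square of standardized residuals for one child's response string. It assumes the dichotomous Rasch model and treats the ability and item difficulties as known; the parameter values are invented for illustration. The further step of converting the mean-square into an approximate probability is omitted here.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mean_square_residual(theta, difficulties, responses):
    """Unweighted mean of squared standardized residuals for one response string.

    Each standardized residual is (x - p) / sqrt(p * (1 - p)); values near 1.0
    are what the model expects, and much larger values signal improbable strings.
    """
    total = 0.0
    for b, x in zip(difficulties, responses):
        p = rasch_p(theta, b)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(difficulties)

# Illustrative (assumed) parameters.
theta = 0.5
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

expected_pattern = [1, 1, 1, 0, 0]   # succeeds on easy items, fails on hard ones
reversed_pattern = [0, 0, 0, 1, 1]   # fails on easy items, succeeds on hard ones

ms_expected = mean_square_residual(theta, difficulties, expected_pattern)
ms_reversed = mean_square_residual(theta, difficulties, reversed_pattern)
print(ms_expected, ms_reversed)
```

The reversed pattern produces a mean-square well above 1.0, while the expected pattern produces one well below it. The size of the statistic orders the strings by improbability, which is the property relied on in the discussion that follows.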
Some purists attack this interpretation of the commonly-used fit statistics. Their position is that such a probability calculation can only be derived from a theoretical distribution and is only true when the distribution of the underlying parameters is of a particular ideal nature. Since this can never happen in practice, "all bets are off". Accordingly, they say, we cannot rely on any calculated probability, but we must adjust it in some way. Their argument may seem appealing, but it is naive. No observed statistic can ever have exactly the ideal distribution through which it is evaluated. But even were the probability calculations to be successfully adjusted, a more important problem would remain.
Suppose a test is given to 5,000 children. The misfit probability significance level is arbitrarily (there being no other way) chosen as .01. On inspection, it turns out that 50 of the children's fit statistics are at the .01 level or smaller. From an overall perspective the data clearly fit, since, on theoretical grounds, we expect 1% of the children's response strings to be at or below the .01 level and 1% are. Nevertheless, in any real testing situation, we are concerned not only about the sample, but also about the individual. We want the data to fit for each individual, i.e., "better" than sample theory requires. We really don't want any particular child's fit statistic to be at or below the .01 level. Consequently, we question and perhaps eliminate from the analysis the response strings of the 50 children. The hoped for result is that all alarming misfit will be eliminated from the data set.
The fallacy in this approach is immediately apparent when we eliminate these 50 children and then reanalyze the data set. The effect of eliminating these children is, almost always, to produce an overfit of the data to the model. A consequence of the removal of the individual outliers is to alter the structure of the very randomness in the data which is intrinsic to Rasch model estimation. On re-estimation, this causes the generation of a revised set of estimates (assuming the parameter estimates are not pre-set or "anchored") together with new fit statistics. Again, we see, alas, that around 50 of the remaining children have fit statistics at or below the .01 level. If the "estimate - trim misfits - re-estimate - trim misfits" process is repeated mindlessly, the observations of all 5,000 children will eventually be eliminated. This kind of "inability to eliminate misfit" has caused considerable anxiety among practitioners.
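The arithmetic of mindless trimming can be sketched without any estimation at all. The following is a toy calculation, not a Rasch analysis: it simply assumes that every re-estimation flags about 1% of whatever children remain (and at least one child), which is the behavior described above.

```python
def trim_rounds(n_children, alpha):
    """Count the children flagged in each 'estimate - trim - re-estimate' round.

    Toy assumption: each round flags about alpha of the remaining children,
    and never fewer than one, so the sample can only shrink.
    """
    flagged_per_round = []
    remaining = n_children
    while remaining > 0:
        flagged = max(1, round(alpha * remaining))
        flagged_per_round.append(flagged)
        remaining -= flagged
    return flagged_per_round

rounds = trim_rounds(5000, 0.01)
print("children flagged in first round:", rounds[0])
print("rounds until no children remain:", len(rounds))
print("children eliminated in total:", sum(rounds))
```

Under this assumption the process never settles: each round removes a few more children, and the loop ends only when the sample is empty.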
Let us look at the problem of misfitting response strings from a different vantage point: what are the odds of the occurrence of a data set, based on a realistic test length, in which none of the 5,000 children has a fit statistic at or below the .01 level? You can do the calculation for yourself, but I make it to be less than 1 in 1,000,000,000,000,000,000,000. In other words, a data set in which there are no misfits is so unlikely as to be a misfit itself!
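One way to do the calculation is sketched below. It assumes that the children's fit statistics are independent and that each has a probability of exactly .01 of falling at or below the .01 level, so that the chance of no child being flagged is .99 raised to the power 5,000.

```python
import math

n_children = 5000
alpha = 0.01   # chosen misfit significance level

# Expected number of children at or below the .01 level when the data fit.
expected_flagged = n_children * alpha

# Probability that not one of the children is flagged,
# assuming independent fit statistics with the stated level.
p_no_misfit = (1.0 - alpha) ** n_children
log10_p_no_misfit = n_children * math.log10(1.0 - alpha)

print("expected number flagged:", expected_flagged)
print("P(no child flagged):", p_no_misfit)
print("log10 of that probability:", log10_p_no_misfit)
```

The base-10 logarithm comes out a little below -21, so the probability is less than 1 in 1,000,000,000,000,000,000,000, in line with the figure quoted above.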
So, a mechanical, mathematical approach is doomed to failure. Either we choose a conventional limit, such as .01, and are in danger of discovering, in the course of successive analyses, that none of our data fit the model, or we choose a misfit cut-off limit that is so remote that all data are decreed to fit the model. In practice, of course, we compromise. A common method of "saving face" is to neglect to report the fit statistics obtained on re-estimation.
My proposal for this dilemma is to face the facts! The fit statistics are indicative, not absolute. The decision as to whether a set of data fits the model is a matter for the informed judgment of the analyst, based on the details of a measurable performance by a member of the test sample on the protocol, rather than a matter of some arbitrary cut-off rule applied to a column of numbers on a computer print-out.
"Whoa!" some practitioners cry, "I have to make fit decisions about strings of responses relating to test protocols and students about whom I know nothing. I have no idea what a measurable string of responses looks like. I have no choice but to use the numerical values of the fit statistics as they stand." In this case, it would seem that an arbitrary criterion has to be chosen, and there is no way of knowing, a priori, how closely that criterion aligns with a cut-off chosen on the basis of measurableness of performance. In this case, the choice of the criterion is a decision made apart from the measurement properties of the Rasch model, statistical sophistry notwithstanding.
Based on informed judgment, how, then, do we deal with the 50 "misfitting" children? First, arrange them in order of improbability of response strings, which is approximated by the size of the misfit statistic. Then examine the substantive details of the most unlikely response string. Does it appear to be the outcome of a valid measurement process? If it does, we may care to examine the next most unlikely child's behavior, but we can expect it to appear even more measurable. If this is what happens, none of the children need be eliminated; the data fit the model. On the other hand, if a worst-fitting set of responses does not appear to be in accord with uni-dimensional measurement, then eliminate all, or at least the contradictory part, of that child's responses and continue on to the next child, until the criterion of measurableness is met. If there are many apparently misfitting children, we could stratify them into layers of improbability in order to expedite determining the measurable performance threshold. With experience gained over similar protocols and samples, a close correspondence between the measurableness criterion and some values of fit statistics may become clear, but these must be expected to be entirely local to the situation.
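The review procedure just described can be sketched as a simple loop. In this sketch the function `looks_measurable` is hypothetical: it stands in for the analyst's substantive inspection of a child's response string and cannot be replaced by a formula. The child identifiers and misfit values are invented for illustration.

```python
def review_misfits(flagged, looks_measurable):
    """Review flagged children from most to least improbable.

    flagged: list of (child_id, misfit_statistic) pairs, larger = more improbable.
    looks_measurable: callable standing in for the analyst's judgment (hypothetical).
    Returns the ids of the children judged unmeasurable. Review stops at the first
    child whose responses appear to be the outcome of a valid measurement process,
    since less improbable strings are expected to appear even more measurable.
    """
    removed = []
    for child_id, _stat in sorted(flagged, key=lambda pair: pair[1], reverse=True):
        if looks_measurable(child_id):
            break
        removed.append(child_id)
    return removed

# Invented example: four flagged children and the analyst's verdict on each.
flagged = [("child_A", 2.6), ("child_B", 4.1), ("child_C", 3.0), ("child_D", 2.4)]
analyst_verdict = {"child_A": True, "child_B": False, "child_C": True, "child_D": True}

removed = review_misfits(flagged, lambda cid: analyst_verdict[cid])
print(removed)
```

Here only the worst-fitting child is set aside. The next most improbable string is judged measurable, so the remaining flagged children are retained without further inspection.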
Once unmeasurable performances have been removed, re-estimation will again produce around 1% of the fit statistics at or below the .01 level. But now this is no longer a cause for concern, since, once we determine where misfit begins, we also know where it ends. Such apparent misfit now confirms the stochastic nature of the data - that they do, indeed, fit the model.
John M. Linacre
University of Chicago
Where does misfit begin? Linacre JM. Rasch Measurement Transactions, 1990, 3:4 p.80
The URL of this page is www.rasch.org/rmt/rmt34c.htm