Diagnosing Person Misfit

"Nearly twenty years after Sato introduced his caution index, person-fit statistics still seem to be in the realm of potential...
* The research has been largely unsystematic,
* The research has been largely atheoretical,
* The research has not explored... applied settings."
Rudner et al., 1995, p.23

The problem of idiosyncratic responses has long been known: "one must expect that some subjects will do their task in a perfunctory or careless manner... [or] fail to understand the experiment or fail to read the... instructions carefully... It has seemed desirable, therefore, to set up some criterion by which we could identify those individual records which were so inconsistent that they should be eliminated from our calculations" (Thurstone & Chave, 1929, pp. 32-33). But general acceptance of a useful person-misfit criterion has been slow in coming.

Devised in 1975, Sato's caution index quantifies deviations from Guttman ordering: "The basic condition to be satisfied is that persons who answer a question 'favorably' all have higher scores than persons answering the same question 'unfavorably'" (Guttman, 1950, pp. 76-77). Guttman notes that this permits person response diagnosis: "Scale analysis can actually help pick out responses that were correct by guessing from an analysis of the pattern of errors" (Guttman, 1950, p. 81). A deficiency in Sato's approach, however, is insensitivity to item spacing. Items of equal difficulty cannot be Guttman ordered and so raise the caution index in a way irrelevant to person misfit. Another deficiency is Sato's requirement to group persons by total score. This makes Sato's index incalculable when there are missing data.
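The idea of measuring departure from Guttman ordering can be sketched in a few lines of Python. This is an illustration only: it assumes the covariance-ratio form of the caution index commonly given in the person-fit literature (one minus the covariance of a person's responses with the item totals, divided by the same covariance for a perfect Guttman string with the same raw score). That form is not quoted from Sato or from Rudner et al., and the names caution_index and data are hypothetical. The sketch requires complete 0/1 data, which reflects the missing-data limitation noted above.

```python
import numpy as np

def caution_index(data):
    """Caution index per person for a complete persons-by-items 0/1 matrix.

    Assumed form: C = 1 - cov(responses, item totals) / cov(Guttman string, item totals).
    Returns NaN where the index is undefined (zero or perfect raw scores,
    or item totals that do not vary).
    """
    data = np.asarray(data, dtype=float)
    item_totals = data.sum(axis=0)                    # successes per item
    # Easiest item first; tied items keep their original order (an arbitrary choice).
    order = np.argsort(-item_totals, kind="stable")
    result = np.full(data.shape[0], np.nan)
    for i, row in enumerate(data):
        raw_score = int(row.sum())
        guttman = np.zeros_like(row)
        guttman[order[:raw_score]] = 1.0              # perfect pattern, same raw score
        denominator = np.cov(guttman, item_totals)[0, 1]
        if abs(denominator) > 1e-12:
            result[i] = 1.0 - np.cov(row, item_totals)[0, 1] / denominator
    return result

# Three Guttman-consistent persons and one who succeeds only on a hard item.
example = [[1, 1, 1, 0],
           [1, 1, 0, 0],
           [1, 0, 0, 0],
           [0, 0, 0, 1]]
print(caution_index(example))
```

In this example the last two items have equal totals, so the "perfect" string depends on an arbitrary tie-break. That is the item-spacing weakness described above.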

Rudner et al. credit Wright (1977) with identifying a wide range of potential sources for idiosyncratic responses: "guessing, cheating, sleeping, fumbling, plodding, and cultural bias". Wright and his students are also credited with two stochastically-based solutions to the fit index problem, the statistics now known as INFIT and OUTFIT, whose distributional properties have been exhaustively investigated and reported by Richard Smith (1986, 1991).

After discussing these and various other indices, Rudner et al. chose INFIT, an information-weighted statistic, for their analysis of the NAEP data, but with probabilities computed from the reported person "plausible values" (= theta estimates with their error distributions) and 3-P item parameter estimates.

Rudner et al. chose INFIT because it
a) "is most influenced by items of median difficulty."
See "Chi-square fit statistics" (RMT 8:2 p. 360-361, 1994) for examples of INFIT and OUTFIT behavior.

b) "has a standardized distribution".
INFIT approximates a mean-square distribution (chi^2/d.f.) with expectation 1.0. Departure from 1.0 measures the proportion of excess (or deficiency) in data stochasticity. Rudner's criterion of 1.20 rejects response strings manifesting more than 20% unmodelled noise.

c) "has been shown to be near optimal in identifying spurious scores at the ability distribution tails."
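The mean-square behavior described under (b) can be made concrete with a small Python sketch. It assumes the dichotomous Rasch model for the response probabilities; Rudner et al. instead computed probabilities from plausible values and 3-P item parameters, so this is not their procedure. The function names and the example values are hypothetical. INFIT is taken here as the sum of squared residuals divided by the sum of the model variances (the information-weighted form), and OUTFIT as the unweighted mean of squared standardized residuals.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (an assumption here)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def person_fit(responses, ability, difficulties):
    """Return (infit, outfit) mean-squares for one person's 0/1 response string.

    Responses coded None are treated as missing and skipped.
    """
    squared_residuals = 0.0   # sum of (x - p)^2
    information = 0.0         # sum of p * (1 - p)
    squared_z = 0.0           # sum of (x - p)^2 / (p * (1 - p))
    count = 0
    for x, difficulty in zip(responses, difficulties):
        if x is None:
            continue
        p = rasch_probability(ability, difficulty)
        variance = p * (1.0 - p)
        squared_residuals += (x - p) ** 2
        information += variance
        squared_z += (x - p) ** 2 / variance
        count += 1
    return squared_residuals / information, squared_z / count

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
orderly = person_fit([1, 1, 1, 0, 0], 0.0, difficulties)   # Guttman-like string
erratic = person_fit([0, 0, 1, 1, 1], 0.0, difficulties)   # fails easy, passes hard
print("orderly infit, outfit:", orderly)
print("erratic infit, outfit:", erratic)
print("erratic string exceeds the 1.20 criterion:", erratic[0] > 1.20)
```

The orderly string gives mean-squares below 1.0 (less randomness than modelled), while the erratic string gives values well above the 1.20 criterion. Its OUTFIT exceeds its INFIT because the unweighted statistic is dominated by the far off-target items, whereas INFIT gives most weight to items near the person's ability.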

Rudner's INFIT mean-square distribution is reassuring for the NAEP Trial State Assessment (see Figure 1). Its mean is 0.97 and its standard deviation 0.17. But the tails, though statistically acceptable, invite investigation. Rudner's other two Figures show how unwanted examinee behavior is indicated by the tails.

In Figure 2, high mean-squares indicate unexpected successes or failures. Unexpected responses by low performers are bound to be improbable successes. These could be due to special knowledge or lucky guessing. Unexpected responses by high performers are bound to be improbable failures. These could be due to carelessness, slipping, misunderstandings or "special ignorance". In Figure 2, in the upper right quadrant, there are many more persons misfitting because of careless errors (or incomplete response strings) than, in the upper left quadrant, persons benefiting from lucky guessing.
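The distinction between improbable successes and improbable failures can be illustrated by listing the individual responses that drive a high mean-square. The Python sketch below assumes dichotomous Rasch probabilities and flags responses whose standardized residual reaches a cut-off of 2.0 in absolute value. The cut-off, the function name and the example values are all hypothetical choices made for illustration, not part of Rudner's analysis.

```python
import math

def unexpected_responses(responses, ability, difficulties, threshold=2.0):
    """List (item index, kind, standardized residual) for surprising 0/1 responses.

    Probabilities follow the dichotomous Rasch model (an assumption);
    threshold is an illustrative cut-off on |z|.
    """
    flagged = []
    for index, (x, difficulty) in enumerate(zip(responses, difficulties)):
        p = 1.0 / (1.0 + math.exp(difficulty - ability))
        z = (x - p) / math.sqrt(p * (1.0 - p))
        if abs(z) >= threshold:
            kind = "unexpected success" if x == 1 else "unexpected failure"
            flagged.append((index, kind, round(z, 2)))
    return flagged

difficulties = [-3.0, -1.0, 0.0, 2.0, 3.0]

# Low performer who succeeds on a hard item: a candidate lucky guess.
print(unexpected_responses([1, 0, 0, 1, 0], -2.0, difficulties))

# High performer who fails the easiest item: a candidate careless error.
print(unexpected_responses([0, 1, 1, 1, 0], 2.0, difficulties))
```

For the low performer only a success can be flagged, and for the high performer only a failure, which matches the two upper quadrants described above.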

Low mean squares indicate less randomness in the response strings than modelled. This could indicate a curriculum effect, i.e., competence at everything taught against a test that also includes difficult, untaught material. Another possibility is the effect of a time limit. When data are taken to be complete, comprising equally determined efforts to succeed on each item, then a time limit makes the last items in a test appear harder. Slow, but careful, workers get all earlier items correct. This higher success rate on early items makes them appear easier. When time runs out these plodders "fail" the later items. The lower success rate on the later items makes them appear harder. This interaction between time and item difficulty makes response strings too predictable and lowers mean-squares below 1.0.

Figure 3 suggests an unexpected interaction between high ability and calculator use in the NAEP Mathematics test. 1990 was the first year in which calculators were allowed. Items involving calculators misfit. Perhaps high ability persons found calculators as much a liability as an asset, and so committed unexpected errors on items they would have got right by hand. Again there is an excess of unlucky errors over lucky guesses in Figure 3.

Although Rudner reports that trimming unexpected response strings has minimal impact on the overall NAEP conclusions, examining and diagnosing the response strings of such individuals enables us to evaluate and improve our tests, discover when and when not to trust test results, and identify those examinees who require special personal attention for instruction, guidance and decision making.

Guttman, L. 1950. The Basis for Scalogram Analysis. pp. 60-90 in Stouffer, S.A., et al., Measurement and Prediction. New York: John Wiley.

Rudner LM, Skagg G, Bracey G, Getson PR. 1995. Use of Person-Fit Statistics in Reporting and Analyzing National Assessment of Educational Progress Results. NCES 95-713. Washington DC: National Center for Education Statistics.

Smith, R.M. (1986) Person fit in the Rasch model. Educational and Psychological Measurement. 46(2) 359-372.

Smith, R.M. (1991) The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement. 51(3) 541-565.

Thurstone, L.L., Chave, E.J. 1929. The Measurement of Attitudes. Chicago: University of Chicago Press.

Wright, B.D. 1977. Solving Measurement Problems with the Rasch model. Journal of Educational Measurement, 14(2), 108.

Diagnosing person misfit. Rudner L, Wright BD. … Rasch Measurement Transactions, 1995, 9:2 p.430
