How much item drift is too much?

When equating different forms of an examination across administrations, a common-item equating design is often used. It involves anchoring the common items at their known difficulty calibrations and then estimating the calibrations of the new items relative to those anchor items. In practice, the difficulty of some items changes over time, so the previous calibrations of those anchor items do not always reflect their relative difficulty in the dataset currently being equated. The difference between an item's anchored value and the value that would have been estimated had the item been unanchored is called displacement (Linacre, 2013).
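In symbols, the definition of displacement can be written as below. This is only a notational sketch: the symbols and the sign convention (re-estimated value minus anchored value) are assumptions introduced here for illustration and are not taken from the cited sources.

```latex
% d_i                     : displacement of anchor item i, in logits
% \hat{\delta}_i^{free}   : difficulty estimated from the current data with item i unanchored
% \delta_i^{anchor}       : difficulty at which item i was anchored
\[
  d_i = \hat{\delta}_i^{\,\mathrm{free}} - \delta_i^{\,\mathrm{anchor}}
\]
```

Under this assumed convention, a positive value means the item appears harder in the current data than its anchored calibration implies.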

When equating, it is common practice to unanchor those items that display excessive displacement and instead use the calibration implied by the current dataset. When only a few items must be unanchored and their displacements are symmetrically distributed around zero, there is little cause for alarm; but when many items show excessive displacement, one becomes concerned that the particular items selected for unanchoring may affect the equating. To illustrate, if only 2% of the anchor items are unanchored, the equating is unlikely to be much affected, but if 60% are unanchored, the psychometrician may wonder whether the right items have been unanchored. Although a substantive theoretical explanation for why an item changed in difficulty is preferred, in practice working psychometricians are often left with only numerical indices on which to base their decision. In those cases, the question becomes what threshold is useful for identifying excessive displacement.

On this issue, Wright and Douglas (1976) found that random displacements of less than 0.5 logits have little effect on the test instrument. Draba (1977) recommended 0.5 logits on the rationale that item difficulties typically range between -2.5 and +2.5 logits, so a shift of 0.5 logits represents 10 percent of that range. Other studies (Jones & Smith, 2006; Stahl & Muckle, 2007) have found that displacement values symmetrically distributed around zero have very little impact.
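The arithmetic behind the 10 percent figure attributed to Draba (1977) can be written out as follows, taking the typical range of -2.5 to +2.5 logits stated above at face value:

```latex
\[
  \frac{0.5}{2.5 - (-2.5)} = \frac{0.5}{5.0} = 0.10 = 10\%
\]
```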

At the American Board of Family Medicine (ABFM), we have defined a displacement with an absolute value greater than or equal to 0.6 logits as excessive and found this threshold to be useful. When we implement this criterion we typically find that 10% to 25% of our anchor items are flagged for excessive displacement. At the American Board of Pediatrics (ABP), we also find this displacement threshold useful for identifying items that should be unanchored. When we implement this criterion, we usually find that it flags between 5% and 15% of anchored items on our largest volume examinations (n>500), and between 10% and 30% of anchored items on our subspecialty examinations, which have much lower candidate volume.
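The flagging rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than either board's actual procedure: the function name `flag_displaced`, the item labels, and all calibration values are hypothetical, and displacement is assumed here to be the re-estimated (unanchored) value minus the anchored value.

```python
from statistics import mean

THRESHOLD = 0.6  # logits; the criterion described in the text


def flag_displaced(anchored, free, threshold=THRESHOLD):
    """Compute each anchor item's displacement and list the flagged items.

    anchored: dict of item label -> anchored difficulty (logits)
    free:     dict of item label -> difficulty re-estimated without anchoring
    Displacement is assumed to be free minus anchored (sign convention is
    an assumption of this sketch).
    """
    displacements = {item: free[item] - anchored[item] for item in anchored}
    # Flag items whose absolute displacement meets or exceeds the threshold.
    flagged = sorted(
        item for item, d in displacements.items() if abs(d) >= threshold
    )
    return displacements, flagged


# Hypothetical calibrations for five anchor items (made-up values).
anchored = {"A01": -1.20, "A02": -0.35, "A03": 0.10, "A04": 0.85, "A05": 1.60}
free = {"A01": -1.05, "A02": -1.10, "A03": 0.22, "A04": 1.70, "A05": 1.48}

displacements, flagged = flag_displaced(anchored, free)
share_flagged = len(flagged) / len(anchored)
# Mean displacement of the flagged items: a value near zero suggests the
# flagged displacements roughly balance around zero. (mean() raises an
# error on an empty list, so guard this call when nothing is flagged.)
mean_flagged = mean(displacements[item] for item in flagged)

print(flagged, share_flagged, round(mean_flagged, 3))
```

Reporting the share of flagged items alongside their mean displacement mirrors the two concerns raised in the text: how many anchors are being released, and whether their displacements are roughly symmetric around zero.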

Although the ABFM and the ABP have found the 0.6 logit criterion useful, this does not mean that it will be useful for everyone. Psychometricians working with other examinations may find different thresholds useful. Unanchoring items always involves a trade-off; for example, it implies that the construct differs somewhat across administrations. Of course, changes in the construct can happen and should be accommodated.

To illustrate, imagine a question about HIV being given on a test in 1986 and again in 1992. In 1986, the question would be about a rather obscure immunology topic, but by 1992 it would represent a current events topic. Answering the question correctly on those two occasions would represent two very different levels of immunology knowledge. Clearly, using an item calibration that better reflects the data will improve the data-model fit, which is important for interpreting the meaning of a measure. On the other hand, too much "flexibility" in permitting the items to float will cause the substantive understanding of the construct to become fuzzy and perhaps less useful. Finding the correct balance between the stability of the substantive meaning of the construct and the conformity of the respondents to that construct is difficult, and it is largely the reason why different thresholds for excessive displacement exist and are likely to persist. We have found that the 0.6 logit threshold typically restricts the flagging of items to those that might produce a noticeable effect on the test instrument, and usually flags fewer than 15% of the anchored items. This is useful for us. Please tell us what thresholds you use and why you find them useful.

Thomas O'Neill and Michael Peabody, American Board of Family Medicine
Rachael Jin Bee Tan and Ying Du, American Board of Pediatrics

Draba, R. (1977). The Identification and Interpretation of Item Bias. Research Memorandum No. 25, Statistical Laboratory, Department of Education, University of Chicago.

Jones, P. & Smith, R. (2006). Item Parameter Drift in Certification Exams and Its Impact on Pass-Fail Decision Making. Paper presented at NCME, San Francisco.

Linacre, J. M. (2013). Winsteps® Rasch measurement computer program User's Guide. Beaverton, Oregon: Winsteps.com.

Stahl, J. & Muckle, T. (2007). Investigating Drift Displacement in Rasch Item Calibrations. Rasch Measurement Transactions, 2007, 21:3 p. 1126-1127.

Wright, B.D. & Douglas, G. A. (1976). Rasch item analysis by hand. Research Memorandum No. 21, Statistical Laboratory, Department of Education, University of Chicago.

How much item drift is too much? T. O'Neill, M. Peabody, Rachael Jin Bee Tan, Ying Du … Rasch Measurement Transactions, 2013, 27:3 p. 1423-4


The Rasch Measurement SIG (AERA) thanks the Institute for Objective Measurement for inviting the publication of Rasch Measurement Transactions on the Institute's website, www.rasch.org.
