When equating different forms of an examination across administrations, a common-item equating design is often used. It involves anchoring the common items at their known difficulty calibrations and then estimating the calibrations of the new items relative to those anchor items. In practice, the difficulty of some items changes over time, so the previous calibrations of those anchor items do not always reflect their relative difficulty in the data set currently being equated. The difference between an item's anchored value and the value that would have been estimated had the item been unanchored is called displacement (Linacre, 2013).
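The definition above can be made concrete with a minimal sketch. It assumes that free (unanchored) estimates have already been obtained on the same scale as the anchor values, and it takes displacement to be the free estimate minus the anchored value, which is a sign convention chosen here for illustration. The `displacement` helper, the item labels, and all numbers are hypothetical; software such as Winsteps reports its own displacement statistic, which need not equal this simple difference.

```python
def displacement(anchored, free):
    """Return (free estimate - anchored value) per common item, in logits.

    The sign convention is an assumption made for this sketch.
    """
    return {item: round(free[item] - anchored[item], 2) for item in anchored}


# Hypothetical anchor calibrations and current-data estimates, in logits.
anchored = {"item_01": -1.20, "item_02": 0.35, "item_03": 1.10}
free = {"item_01": -1.05, "item_02": 1.05, "item_03": 0.40}

d = displacement(anchored, free)
for item, value in d.items():
    print(f"{item}: {value:+.2f}")
```

Under this convention, a positive value means the item looks harder in the current data than its anchor value implies, and a negative value means it looks easier.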
When equating, it is common practice to unanchor those items that display excessive displacement and instead use the calibration implied by the current data set. When only a few items must be unanchored and their displacements are symmetrically distributed around zero, there is little cause for alarm; but when many items show excessive displacement, one becomes concerned that the particular items selected for unanchoring may affect the equating. To illustrate, unanchoring only 2% of the anchor items seems unlikely to have much impact on the equating, but if 60% of the items are unanchored, the psychometrician may wonder whether the right items have been unanchored. Although a substantive theoretical explanation for why an item changed in difficulty is preferred, in practice working psychometricians are often left with only numerical indices on which to base their decisions. In those cases, the question becomes what threshold is useful for identifying excessive displacement.
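One simple numerical check on whether the unanchored items are roughly balanced around zero is to look at their mean displacement and at how many moved in each direction. The following is a minimal sketch under stated assumptions: the `summarize_balance` helper and the displacement values are hypothetical, and no cutoff for "balanced enough" is implied by the studies cited here.

```python
from statistics import mean


def summarize_balance(displacements):
    """Summarize the direction and average of a list of displacements (logits)."""
    return {
        "n": len(displacements),
        "positive": sum(1 for d in displacements if d > 0),
        "negative": sum(1 for d in displacements if d < 0),
        "mean": round(mean(displacements), 3),
    }


# Hypothetical displacements of the items chosen for unanchoring, in logits.
unanchored = [0.72, -0.65, 0.81, -0.90, 0.64, -0.61]

summary = summarize_balance(unanchored)
print(summary)
```

A mean near zero with similar counts in each direction is consistent with the symmetric case described above; a mean far from zero, or counts heavily skewed to one side, would suggest a systematic shift worth investigating.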
On this issue, Wright and Douglas (1976) found that random displacements of less than 0.5 logits have little effect on the test instrument. Draba (1977) recommends 0.5 logits on the rationale that item difficulties typically range between -2.5 and +2.5 logits, so a shift of 0.5 logits represents a 10 percent shift within that range. Other studies (Jones & Smith, 2006; Stahl & Muckle, 2007) have found that displacement values symmetrically distributed around zero have very little impact.
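Draba's arithmetic can be written out explicitly. The symbols $\Delta$ (the size of the shift) and $R$ (the width of the typical difficulty range) are introduced here only for illustration:

```latex
R = 2.5 - (-2.5) = 5.0 \text{ logits}, \qquad
\frac{\Delta}{R} = \frac{0.5}{5.0} = 0.10 = 10\%
```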
At the American Board of Family Medicine (ABFM), we have defined a displacement with an absolute value greater than or equal to 0.6 logits as excessive and have found this threshold to be useful. When we implement this criterion, we typically find that 10% to 25% of our anchor items are flagged for excessive displacement. At the American Board of Pediatrics (ABP), we also find this displacement threshold useful for identifying items that should be unanchored. When we implement this criterion, we usually find that it flags between 5% and 15% of anchored items on our largest-volume examinations (n > 500), and between 10% and 30% of anchored items on our subspecialty examinations, which have much lower candidate volume.
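A minimal sketch of applying the criterion described above: flag every anchor item whose absolute displacement is at least 0.6 logits and report the percentage flagged. The `flag_excessive` helper, the item labels, and the displacement values are hypothetical and are not ABFM or ABP data.

```python
THRESHOLD = 0.6  # logits; the criterion stated in the text is |displacement| >= 0.6


def flag_excessive(displacements, threshold=THRESHOLD):
    """Return the labels of items whose absolute displacement meets the threshold."""
    return [item for item, d in displacements.items() if abs(d) >= threshold]


# Hypothetical displacements for ten anchor items, in logits.
displacements = {
    "item_01": 0.15, "item_02": 0.70, "item_03": -0.22, "item_04": 0.05,
    "item_05": -0.60, "item_06": 0.31, "item_07": -0.48, "item_08": 0.59,
    "item_09": -0.10, "item_10": 0.02,
}

flagged = flag_excessive(displacements)
pct = 100 * len(flagged) / len(displacements)
print(f"Flagged {len(flagged)} of {len(displacements)} anchor items ({pct:.0f}%): {flagged}")
```

Because the comparison is inclusive, an item at exactly -0.60 is flagged while one at 0.59 is not. Passing a different `threshold` makes it easy to see how the flagged percentage responds to other candidate cutoffs.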
Although the ABFM and the ABP have found the 0.6 logit criterion useful, this does not mean that it will be useful for everyone. Psychometricians working with other examinations may find different thresholds to be useful. Unanchoring items always involves a trade-off; for example, it implies that the construct differs somewhat across administrations. Of course, changes in the construct can happen and should be accommodated.
To illustrate, imagine a question about HIV being given on a test in 1986 and again in 1992. In 1986, the question would be about a rather obscure immunology topic, but by 1992 it would represent a current events topic. Answering the question correctly on those two occasions would represent two very different levels of immunology knowledge. Clearly, using an item calibration that better reflects the data will improve the data-model fit, which is important for interpreting the meaning of a measure. On the other hand, too much "flexibility" in permitting the items to float will cause the substantive understanding of the construct to become fuzzy and perhaps less useful. Finding the correct balance between the stability of the substantive meaning of the construct and the conformity of the respondents to that construct is difficult, and is largely the reason why different thresholds for excessive displacement exist and are likely to persist. We have found that the 0.6 logit threshold typically restricts the flagging of items to those that might produce a noticeable effect on the test instrument, and usually flags fewer than 15% of the anchored items. This is useful for us. Please tell us what thresholds you use and why you find them useful.
Thomas O'Neill and Michael Peabody, American Board of Family Medicine
Rachael Jin Bee Tan and Ying Du, American Board of Pediatrics
Draba, R. (1977). The Identification and Interpretation of Item Bias. Research Memorandum No. 25, Statistical Laboratory, Department of Education, University of Chicago.
Jones, P. & Smith, R. (2006). Item Parameter Drift in Certification Exams and Its Impact on Pass-Fail Decision Making. Paper presented at NCME, San Francisco.
Linacre, J. M. (2013). Winsteps® Rasch measurement computer program User's Guide. Beaverton, Oregon: Winsteps.com.
Stahl, J. & Muckle, T. (2007). Investigating Drift Displacement in Rasch Item Calibrations. Rasch Measurement Transactions, 2007, 21:3 p. 1126-1127.
Wright, B. D. & Douglas, G. A. (1976). Rasch item analysis by hand. Research Memorandum No. 21, Statistical Laboratory, Department of Education, University of Chicago.
How much item drift is too much? T. O'Neill, M. Peabody, Rachael Jin Bee Tan, Ying Du. Rasch Measurement Transactions, 2013, 27:3 p. 1423-4
The Rasch Measurement SIG (AERA) thanks the Institute for Objective Measurement for inviting the publication of Rasch Measurement Transactions on the Institute's website, www.rasch.org.
The URL of this page is www.rasch.org/rmt/rmt273a.htm