Early Detection of Item Miskey on a CAT: The Use of Multiple Indices

An item on an operational test that has been keyed incorrectly represents a threat to score validity. A miskeyed item or items can cause more able examinees to have lower overall scores and less able examinees to have higher overall scores, thus reducing the ability to clearly discriminate between examinees. This is particularly true when test scores are used for classification, such as determining whether an examinee should be awarded a professional license or a certificate. A procedure that can detect miskeyed items early in an examination cycle improves the integrity of a testing program by reducing the likelihood of misclassifying examinees.

No one statistical index is completely reliable in the detection of a miskeyed item. In classical test statistics, the p-value and the point biserial correlation have frequently been used to identify miskeyed items. A low p-value and a negative point biserial are often interpreted as indicating an item miskey. While these outcomes can indicate an item miskey, they are also associated with other item characteristics, such as item ambiguity or multiple correct answers. For a fixed form test, these statistics may be sufficient; however, in a computerized adaptive test (CAT) environment the usefulness of the p-value and the point biserial is greatly diminished. CAT examinations are designed to present items to an examinee based on an estimate of the examinee's ability. As a result, the sample used to calculate p-value and point biserial estimates of items in an operational examination differs from the reference group used to establish the original item parameters. Further, calculating p-value and point biserial estimates from responses generated by a CAT examination results in statistics that are less stable due to the range restriction of the sample.

In item response theory (IRT) models, fit statistics are often used to identify problem items. Commonly, the weighted (infit) and unweighted (outfit) standardized mean square statistics are used to identify items that do not meet the expectations of the measurement model. However, the calculation of both infit and outfit depends on deviations from the model expectations, and a restricted sample range also affects these calculations, making it difficult to use them for identifying miskeyed items.
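As a point of reference, the two classical statistics discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function names and the six-examinee data set are hypothetical, and the point biserial is computed against the raw total score without removing the item itself.

```python
from math import sqrt

def p_value(responses):
    """Proportion of examinees scored correct on the item (0/1 scores)."""
    return sum(responses) / len(responses)

def point_biserial(responses, totals):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(responses)
    mean_x = sum(responses) / n
    mean_y = sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(responses, totals))
    var_x = sum((x - mean_x) ** 2 for x in responses)
    var_y = sum((y - mean_y) ** 2 for y in totals)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: the two lowest-scoring examinees are the only ones
# scored correct, the pattern a miskeyed item tends to produce.
item_scores = [0, 0, 1, 0, 0, 1]
total_scores = [38, 35, 12, 30, 33, 15]

item_p = p_value(item_scores)
item_rpb = point_biserial(item_scores, total_scores)
print(f"p-value = {item_p:.2f}, point biserial = {item_rpb:.2f}")
```

In this made-up fixed-form example the item shows a low p-value together with a negative point biserial. In a CAT, each item is seen only by examinees near its difficulty, so the spread of total scores (or ability measures) entering the correlation is much narrower.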

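There are three things to consider when identifying miskeys in an operational examination: which statistic or combination of statistics to use, the sample size required, and the false positive and false negative rates.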

Ideally, a single statistic would provide all the information needed to determine a miskey; however, it may be that a combination of statistics would be more useful. Sample size is important because a method that works well with a smaller sample would enable earlier analysis during a testing cycle, reducing the amount of time a miskeyed item is in use. Finally, it is important to understand the false positive and false negative rates, since too many false positives require manual inspection and too many false negatives would defeat the purpose of the process. Cut-off values can be established for each statistic, such that any item falling above or below an established set of values would be considered a likely candidate as a miskeyed item.

To explore this idea, a simulation was created to review the performance of readily available statistics and to determine whether, singly or in combination, they could provide a consistent identification of miskeyed items. The statistics investigated were p-value, point-measure correlation, infit, outfit, displacement, and upper asymptote. The upper asymptote statistic is available in the Winsteps item analysis and represents a four-parameter IRT model (4-PL) estimate of carelessness or inadvertent selection of a wrong answer. The expectation is that this value should be close to 1 for normal items and much smaller for miskeyed items.
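For orientation, a common textbook form of the four-parameter logistic model, written here with the discrimination fixed at 1 as in a Rasch context, is sketched below. The symbols are assumptions introduced for illustration (ability \theta, item difficulty b, lower asymptote c, upper asymptote d); the exact estimation used by Winsteps is not described in this note and may differ.

```latex
P(X = 1 \mid \theta) \;=\; c + (d - c)\,\frac{e^{\theta - b}}{1 + e^{\theta - b}},
\qquad 0 \le c < d \le 1,
\qquad \lim_{\theta \to \infty} P(X = 1 \mid \theta) = d
```

Under this form, d is the success rate approached by the most able examinees. For a correctly keyed item that rate is near 1. For a miskeyed item, able examinees who choose the truly correct option are scored as incorrect, so the observed success rate at high ability stays well below 1 and the estimated upper asymptote is small.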

A simulator program (Becker 2012) was used to administer ten replications of a variable-length CAT examination, each replication having 1200 examinees. Eight items out of a large pool of over 1400 items were selected to be miskeyed items. The simulator program generated test results using the candidate ability measure and the item difficulty to generate a probability of a correct response for each candidate/item interaction. A random number was then generated and, if that number was less than or equal to the probability, the candidate was scored as having answered the question correctly. However, when a candidate encountered a miskeyed item, if the random number was less than or equal to the probability, the candidate was scored as having answered the question incorrectly. The resulting matrix of answer strings was then analyzed using Winsteps, and the statistical indices described above were examined to assess their utility in identifying the miskeyed items.
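The response-generation step described above can be sketched as follows. This is not the Becker (2012) simulator: the function names are hypothetical, the dichotomous Rasch model is assumed for the probability of a correct response, and the sketch assumes that a miskeyed item fully reverses the scoring (the text states only the outcome when the random number is less than or equal to the probability).

```python
import math
import random

def rasch_probability(ability, difficulty):
    """Probability of a correct response (assumed dichotomous Rasch model)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def simulate_response(ability, difficulty, miskeyed, rng):
    """Return the scored response (1 = scored correct, 0 = scored incorrect)."""
    probability = rasch_probability(ability, difficulty)
    draw = rng.random()
    answered_correctly = draw <= probability
    if miskeyed:
        # Assumption: the wrong key reverses the score in both directions.
        return int(not answered_correctly)
    return int(answered_correctly)

# Hypothetical check: 1000 examinees two logits above an item's difficulty.
rng = random.Random(0)
n = 1000
normal_rate = sum(simulate_response(2.0, 0.0, False, rng) for _ in range(n)) / n
miskey_rate = sum(simulate_response(2.0, 0.0, True, rng) for _ in range(n)) / n
print(f"normal item: {normal_rate:.2f}, miskeyed item: {miskey_rate:.2f}")
```

With examinees well above the item's difficulty, the normal item is answered correctly most of the time, while the same item with a reversed key is scored correct only rarely.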

The analysis identified three statistics that, used in combination, gave the cleanest separation between miskeyed and normal items. These statistics were p-value, displacement, and upper asymptote. The cut-off values found to be most useful were as follows: p-value <= 0.20, displacement >= 1.5, and upper asymptote <= 0.4. Because items could receive different N counts depending on the selection algorithm used in the variable-length CAT, a further cut-off was established, setting an exposure minimum of 20. The ten replications with eight miskeyed items in each replication yielded 80 cases in which a miskeyed item should be flagged. Using these criteria, miskeyed items were flagged in 68 of the 80 instances (85%). Conversely, none of the normal items were flagged in the 14,640 cases. Of the 12 instances in which a miskeyed item was not flagged, 7 involved the same item, which was the hardest item in the miskey set. Logically, hard items are going to be the most difficult to detect as miskeys.
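The flagging rule can be expressed as a short sketch. The cut-off values are those reported above; everything else is an assumption made for illustration: the `ItemStats` structure, the item names and their statistics are hypothetical, the three criteria are assumed to apply jointly, and the exposure minimum is read as "at least 20 administrations".

```python
from dataclasses import dataclass

@dataclass
class ItemStats:
    name: str
    p_value: float
    displacement: float
    upper_asymptote: float
    exposure: int  # number of examinees who saw the item

# Cut-off values reported in the text.
P_VALUE_MAX = 0.20
DISPLACEMENT_MIN = 1.5
UPPER_ASYMPTOTE_MAX = 0.4
EXPOSURE_MIN = 20

def is_flagged(item):
    """Flag an item as a likely miskey (assumes all criteria must hold)."""
    if item.exposure < EXPOSURE_MIN:
        return False  # too few responses to evaluate
    return (
        item.p_value <= P_VALUE_MAX
        and item.displacement >= DISPLACEMENT_MIN
        and item.upper_asymptote <= UPPER_ASYMPTOTE_MAX
    )

# Hypothetical item statistics, not taken from the study.
items = [
    ItemStats("ITEM_A", 0.12, 2.3, 0.15, 85),   # meets every criterion
    ItemStats("ITEM_B", 0.55, 0.1, 0.98, 140),  # typical normal item
    ItemStats("ITEM_C", 0.10, 2.0, 0.20, 12),   # too little exposure
    ItemStats("ITEM_D", 0.18, 0.9, 0.35, 60),   # displacement too small
]
flagged = [item.name for item in items if is_flagged(item)]

# Rates reported in the text: 68 of 80 miskeys flagged, 0 of 14,640 normal.
hit_rate = 68 / 80
false_positive_rate = 0 / 14640
print(flagged, hit_rate, false_positive_rate)
```

Requiring all three criteria at once trades some sensitivity for a very low false positive rate, which matches the pattern of results reported above.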

Early Detection of Item Miskey on a CAT: The Use of Multiple Indices. John A. Stahl & Gregory M. Applegate … Rasch Measurement Transactions, 2013, 27:1 p. 1405-6


The Rasch Measurement SIG (AERA) thanks the Institute for Objective Measurement for inviting the publication of Rasch Measurement Transactions on the Institute's website, www.rasch.org.
