An item on an operational test that has been keyed incorrectly is a threat to score validity. Miskeyed items can cause more able examinees to receive lower overall scores and less able examinees to receive higher overall scores, reducing the ability to discriminate clearly between examinees. This is particularly true when test scores are used for classification, such as determining whether an examinee should be awarded a professional license or certificate. A procedure that detects miskeyed items early in an examination cycle improves the integrity of a testing program by reducing the likelihood of misclassifying examinees.
No single statistical index is completely reliable for detecting a miskeyed item. In classical test statistics, the p-value and the point biserial correlation have frequently been used to identify miskeyed items: a low p-value combined with a negative point biserial is often interpreted as indicating a miskey. While these outcomes can indicate a miskey, they are also associated with other item characteristics, such as item ambiguity or multiple correct answers. For a fixed-form test these statistics may be sufficient; in a computerized adaptive test (CAT) environment, however, their usefulness is greatly diminished. CAT examinations present items based on an estimate of the examinee's ability, so the sample used to calculate operational p-value and point biserial estimates differs from the reference group used to establish the original item parameters. Further, p-value and point biserial estimates calculated from CAT responses are less stable because of the restricted range of the sample.

In item response theory (IRT) models, fit statistics are often used to identify problem items. Commonly, the weighted (infit) and unweighted (outfit) standardized mean-square statistics are used to flag items that do not meet the expectations of the measurement model. However, both infit and outfit are calculated from deviations from model expectations, and a restricted sample range also affects these calculations, making them difficult to use for identifying miskeyed items.
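As a concrete illustration (not code from the study), the two classical indices mentioned above can be computed directly from item scores and total test scores; the function and variable names here are illustrative:

```python
import math

def item_stats(scores, totals):
    """Classical item statistics for one item.

    scores: 0/1 item scores, one per examinee.
    totals: total test scores for the same examinees.
    Returns (p_value, point_biserial).
    """
    n = len(scores)
    p = sum(scores) / n                       # p-value: proportion answering correctly
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)
    # mean total score among examinees who answered this item correctly
    mean_correct = sum(t for s, t in zip(scores, totals) if s == 1) / sum(scores)
    # point biserial: correlation between the 0/1 item score and the total score
    r_pb = (mean_correct - mean_t) / sd_t * math.sqrt(p / (1 - p))
    return p, r_pb
```

A miskeyed item tends to reverse the usual pattern: the examinees with the highest totals are scored wrong on it, driving the point biserial negative.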
There are three things to consider when identifying miskeys in an operational examination: which statistic (or combination of statistics) to use, the sample size required, and the rates of false positives and false negatives. Ideally a single statistic would provide all the information needed to determine a miskey, but a combination of statistics may be more useful. Sample size is important because a method that works well with a smaller sample enables earlier analysis during a testing cycle, reducing the time a miskeyed item remains in use. Finally, the false positive and false negative rates matter: too many false positives require manual inspection, while too many false negatives defeat the purpose of the procedure. Cut-off values can be established for each statistic, such that any item falling above or below the established values is considered a likely miskey candidate.
To explore this idea, a simulation was created to review the performance of readily available statistics and to determine whether, singly or in combination, they could provide consistent identification of miskeyed items. The statistics investigated were p-value, point-measure correlation, infit, outfit, displacement, and upper asymptote. The upper asymptote statistic is available in the Winsteps item analysis and represents a four-parameter IRT model (4-PL) estimate of carelessness or inadvertent selection of a wrong answer. The expectation is that this value should be close to 1 for normal items and much smaller for miskeyed items.
A simulator program (Becker, 2012) was used to administer ten replications of a variable-length CAT examination, each replication having 1,200 examinees. Eight items out of a pool of more than 1,400 items were selected to be miskeyed. The simulator generated test results using the candidate ability measure and the item difficulty to compute a probability of a correct response for each candidate/item interaction. A random number was then generated and, if that number was less than or equal to the probability, the candidate was scored as having answered the question correctly. However, when a candidate encountered a miskeyed item, if the random number was less than or equal to the probability, the candidate was scored as having answered the question incorrectly. The resulting matrix of answer strings was analyzed using Winsteps, and the statistical indices described above were examined to assess their utility in identifying the miskeyed items.
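The scoring rule described above can be sketched as follows. A Rasch response probability is assumed here, since the article does not state the exact model form, and the function name is illustrative:

```python
import math
import random

def simulate_response(theta, b, miskeyed, rng=random):
    """One candidate/item interaction under the scoring rule in the text.

    theta: candidate ability measure (logits, Rasch model assumed).
    b: item difficulty (logits).
    miskeyed: True if the item's answer key is wrong.
    Returns 1 if the interaction is scored correct, else 0.
    """
    p_correct = 1.0 / (1.0 + math.exp(-(theta - b)))
    u = rng.random()
    if u <= p_correct:
        # candidate would answer correctly; a miskey records it as wrong
        return 0 if miskeyed else 1
    # candidate would answer incorrectly; a miskey records it as right
    return 1 if miskeyed else 0
```

Under this rule a miskeyed item's observed proportion correct is the complement of its expected proportion, which is why able examinees (high p of a correct response) are penalized the most.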
The analysis identified three statistics that, used in combination, gave the cleanest separation between miskeyed and normal items: p-value, displacement, and upper asymptote. The most useful cut-off values were p-value <= 0.20, displacement >= 1.5, and upper asymptote <= 0.4. Because items receive different N counts under the selection algorithm of a variable-length CAT, a further cut-off was established requiring a minimum exposure of 20. The ten replications, with eight miskeyed items in each, presented 80 cases in which a miskeyed item could be flagged. Using these criteria, miskeyed items were flagged in 68 of the 80 instances (85%); conversely, none of the 14,640 normal-item cases was flagged. Of the 12 instances in which a miskeyed item was not flagged, 7 involved the same item, the hardest item in the miskey set. Logically, hard items are the most difficult to detect as miskeys.
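A minimal sketch of the combined screening rule, using the cut-off values reported above (function and parameter names are illustrative, not from the study's code):

```python
def flag_miskey(p_value, displacement, upper_asymptote, n_exposures):
    """Combined screening rule with the cut-offs reported in the text.

    An item is flagged as a likely miskey only when all four
    criteria hold simultaneously:
      - at least 20 exposures (stable enough statistics)
      - p-value <= 0.20 (item looks very hard)
      - displacement >= 1.5 (item drifted far from its banked difficulty)
      - upper asymptote <= 0.4 (able examinees still miss it)
    """
    return (n_exposures >= 20
            and p_value <= 0.20
            and displacement >= 1.5
            and upper_asymptote <= 0.4)
```

Requiring all three statistics (plus the exposure minimum) to agree is what drives the false-positive count to zero in the simulation: each statistic alone can be triggered by other item problems, but their conjunction rarely is.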
Becker, K. (2012). Pearson CAT Simulator. Chicago, IL: Pearson VUE.
Stahl, J. A., & Applegate, G. M. (2013). Early Detection of Item Miskey on a CAT: The Use of Multiple Indices. Rasch Measurement Transactions, 27:1, 1405-6.