In many high-stakes testing scenarios, samples tend to be reasonably comparable with regard to demographic characteristics across administrations. As part of the initial quality control checks, most psychometricians will investigate various demographic characteristics to gauge sample stability. Unfortunately, many psychometricians may be tempted to investigate only "visible" demographic variables, such as gender, ethnicity, and so on. Failing to investigate "invisible" demographic variables, such as whether an examinee is a first-time or repeat test-taker, or has previously failed the examination, can lead to serious equating errors. Consider the following example.
Suppose a data set is provided to a psychometrician for scoring. As part of the initial quality control checks, s/he learns that both the sample size and the visible demographic variables seem fairly comparable to previous administrations. On the surface the sample appears comparable to previous samples, so the psychometrician proceeds to investigate item quality and functioning. Preliminary item analyses reveal the items appear to be sound and functioning properly. With this assurance, the psychometrician begins developing item anchors for equating purposes. After several iterations of investigating displacement values and unanchoring items whose calibrations displace from those obtained in previous administrations, the psychometrician is satisfied with the remaining item calibrations and locks them down as anchors for the final scoring run.
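The displacement-screening step described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the item names, the 0.5-logit cutoff, and the single-pass logic are all assumptions (in practice the anchor set is re-estimated and re-screened iteratively).

```python
def select_anchors(bank_difficulty, new_difficulty, max_displacement=0.5):
    """Keep as anchors only items whose freely estimated calibration stays
    within `max_displacement` logits of the banked value.

    One-pass sketch; real workflows re-estimate after each unanchoring."""
    anchors, unanchored = {}, []
    for item, banked in bank_difficulty.items():
        displacement = new_difficulty[item] - banked
        if abs(displacement) <= max_displacement:
            anchors[item] = banked        # retain the banked calibration
        else:
            unanchored.append(item)       # let this item float in scoring
    return anchors, unanchored

# Illustrative calibrations (logits): Q2 displaces by 0.9 and is unanchored.
bank = {"Q1": -1.2, "Q2": 0.3, "Q3": 1.1}
new  = {"Q1": -1.1, "Q2": 1.2, "Q3": 1.0}
anchors, dropped = select_anchors(bank, new)
```

Note that this mechanical screen is exactly what goes wrong in the tale: when the sample's ability distribution shifts, stable items can show large displacements and be wrongly unanchored.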
Once the data are scored, the results are reviewed and compared to historical trends. Diagnostic results (e.g., fit statistics, separation and reliability estimates, etc.) appear sound, but some notable differences in pass/fail statistics and mean scaled scores are evident. Concerned, the psychometrician revisits the scoring process by reviewing syntax and reproducing all relevant data files. The examination data are rescored and the same results are produced. Still suspicious, the psychometrician combs both the new data set and last year's data set to identify anyone who had previously taken the exam. A list of repeat examinees is pulled and their scores are compared across the two administrations. It turns out that virtually all of the repeat examinees performed worse on the new examination. How could this be? These examinees have had additional training, education, and time to prepare for the examination.
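Pairing up repeat examinees and computing their score changes across administrations might look like the following sketch; the record layout, examinee IDs, and scaled scores are illustrative assumptions, not data from the actual exam.

```python
def repeat_examinee_score_changes(last_year, this_year):
    """Match examinees appearing in both administrations and report each
    one's score change (this year minus last year)."""
    previous = {r["id"]: r["score"] for r in last_year}
    return {r["id"]: r["score"] - previous[r["id"]]
            for r in this_year if r["id"] in previous}

# Illustrative records; examinee "C" is a first-time examinee and is skipped.
last = [{"id": "A", "score": 450}, {"id": "B", "score": 480}]
curr = [{"id": "A", "score": 430}, {"id": "B", "score": 455},
        {"id": "C", "score": 500}]
changes = repeat_examinee_score_changes(last, curr)
# Uniformly negative changes for repeaters are the red flag in the tale.
```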
Upon closer inspection the psychometrician is surprised to learn that a less obvious demographic characteristic had fluctuated among the examinees and caused this unusual scenario: a larger proportion of examinees were taking the examination because of a prior failure. This small, yet very important, artifact had a significant ripple effect on the quality of the final scores. The problem began when a less able sample interacted with the items and the psychometrician was deceived into thinking that many of the existing calibrations were unstable. As a result, the psychometrician unanchored many item calibrations that should have been left alone. When the new scale was established, it shifted, producing scores that lost their meaning across administrations.
Although item equating under the Rasch framework is quite simple and straightforward, it still requires careful attention. The scenario above illustrates how a significant problem can arise simply from failing to investigate one key demographic characteristic of the sample. When equating, it is critical to consider all types of sample characteristics, especially those that pertain to previous performance. An inconsistency in these demographics can produce apparent item instability, which in turn can go unnoticed when examining displacement values and creating item anchors. It is for this reason that many psychometricians use only first-time examinee data when equating exams. In any case, all psychometricians who equate examinations under the Rasch framework would be wise to add to their list of quality control checks a comprehensive investigation of demographic characteristics both before and after a scoring run is complete.
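As a concrete illustration of the recommended check, the sketch below compares category proportions for both visible and invisible demographic variables across two administrations. The field names, data layout, and 5% flagging threshold are assumptions chosen for illustration, not a prescribed standard.

```python
def proportion(records, field, value):
    """Proportion of records whose `field` equals `value`."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[field] == value) / len(records)

def flag_demographic_shifts(previous, current, fields, threshold=0.05):
    """Return (field, shift) pairs whose category proportion moved more
    than `threshold` between administrations."""
    flagged = []
    for field, value in fields:
        shift = abs(proportion(current, field, value) -
                    proportion(previous, field, value))
        if shift > threshold:
            flagged.append((field, round(shift, 3)))
    return flagged

# Illustrative data: the repeat-taker proportion jumps from 10% to 40%
# while the visible variable is stable.
prev = [{"gender": "F", "repeat": False}] * 9 + [{"gender": "F", "repeat": True}]
curr = [{"gender": "F", "repeat": False}] * 6 + [{"gender": "F", "repeat": True}] * 4

flags = flag_demographic_shifts(prev, curr,
                                [("gender", "F"), ("repeat", True)])
# Only the "invisible" repeat-taker variable is flagged.
```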
Kenneth D. Royal, University of North Carolina at Chapel Hill
Mikaela M. Raddatz, American Board of Physical Medicine and Rehabilitation
Royal K. & Raddatz M. (2013) A Cautionary Tale about Item Equating with Fluctuating Samples. Rasch Measurement Transactions, 27:2 p. 1417
Response on the Rasch Listserv, Sept. 6, 2013
I read with interest the recent note in RMT 27:2 by Royal and Raddatz that contained a cautionary tale about equating test forms for certification and licensure exams. By the end of the note, I was troubled by what I feel is a common misunderstanding about the properties of Rasch measurement.
Their tale begins with a test administration and the investigation of item quality and functioning before attempting to equate the current form to a previously established standard.
It is widely known that the properties of item invariance that allow equating under the Rasch model hold if and only if the data fit the model. The fit of the data to the model should therefore be investigated at the initial stage of equating. The authors state, "Preliminary item analyses reveal the items appear to be sound and functioning properly." One assumes that the fit of the data to the model was confirmed in this process, though it is not explicitly stated. As the story continues, we find that the equating solution does not, in fact, hold across the various subgroups represented in the analysis, and the calibration sample is subsequently altered to produce a different and more logical equating solution.
This suggests that the estimates of item difficulty were not freed from the distributional properties of the sample. Hence, the data cannot fit a Rasch model. One would hope that it is not necessary to reach the very end of the equating process before discovering that the estimates of item difficulty are not invariant and that the link constant developed for equating is not acceptable.
What, then, is the cause of the problem? Without independent confirmation, I would suggest that the fit statistics used in the preliminary analysis lacked the power to detect violations of this type of first-time vs. repeater invariance. This is easily corrected with the between-group item fit statistic available in Winsteps. It will not solve the problem of lack of fit to the Rasch model, but it will let you know there is a problem before you get too far into the equating process. Developing an item bank that measures both types of examinees fairly is an entirely different issue, and one that should be addressed. The lack of item invariance across subgroups is a classic definition of item bias.
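Smith's point can be illustrated with a crude invariance screen: estimate each item's difficulty separately for first-time and repeat examinees and flag large gaps. This is only a rough stand-in for Winsteps' between-group fit statistic, which is a properly standardized test; the response data and the 0.5-logit tolerance here are illustrative assumptions.

```python
import math

def item_logit_difficulty(responses):
    """Crude logit difficulty from a vector of 0/1 responses:
    ln(incorrect / correct). Assumes neither count is zero."""
    correct = sum(responses)
    incorrect = len(responses) - correct
    return math.log(incorrect / correct)

def invariance_screen(group_a, group_b, tolerance=0.5):
    """Flag items whose estimated difficulty differs between the two
    subgroups by more than `tolerance` logits."""
    flagged = []
    for item in group_a:
        gap = abs(item_logit_difficulty(group_a[item]) -
                  item_logit_difficulty(group_b[item]))
        if gap > tolerance:
            flagged.append(item)
    return flagged

# Illustrative data: Q2 is much harder for repeaters than for first-timers,
# so its calibration is not invariant across the two subgroups.
first_timers = {"Q1": [1, 1, 0, 1], "Q2": [1, 1, 1, 0]}
repeaters    = {"Q1": [1, 0, 1, 1], "Q2": [0, 0, 0, 1]}
flags = invariance_screen(first_timers, repeaters)
```

An item flagged by such a screen is exactly the kind that should be questioned before it is unanchored or trusted as a link item.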
Richard M. Smith, Editor
Journal of Applied Measurement