A Critique of 3-PL IRT Estimation

Benjamin Drake Wright was asked to respond to Green et al. (1989), which discusses "9 years of using a three-parameter model in the construction of major achievement batteries." Here is Ben's response:

1. Mathematical analysis shows that the 3p model is a non-converging, inestimable elaboration of the Rasch model. When the generic criteria for measurement identified by physicists (Campbell, 1920) and mathematicians (Luce & Tukey, 1964) and demanded by the founders of psychometrics: Thorndike (1927), Thurstone (1928, 1931), Guilford (1936) and Guttman (1950a) are really required, then only the Rasch model can be deduced (Brogden, 1977; Perline, Wright & Wainer, 1979; Roskam & Jansen, 1984; Wright & Linacre, 1987; Wright, 1985, 1988a, 1988b, 1989a, 1989b). Far from being a special case of some superfluous affectation, the Rasch model is the necessary and sufficient definition of measurement. It follows that only data that can be made to fit a Rasch model can be used to construct measures.
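For readers who want the two models side by side, they can be written as follows. The notation is an assumption, not Wright's: $\theta_n$ for the ability of person $n$, and $b_i$, $a_i$, $c_i$ for the difficulty, discrimination and lower asymptote ("guessing") of item $i$. The scaling constant that some 3p presentations include is omitted.

```latex
% Rasch model: one item parameter, the difficulty b_i
P(X_{ni} = 1 \mid \theta_n, b_i)
  = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}

% 3p model: adds a discrimination a_i and a lower asymptote c_i
P(X_{ni} = 1 \mid \theta_n, a_i, b_i, c_i)
  = c_i + (1 - c_i)\,
    \frac{\exp\bigl(a_i(\theta_n - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_n - b_i)\bigr)}
```

Algebraically, setting $a_i = 1$ and $c_i = 0$ for every item turns the second expression into the first. Points 3 and 4 below concern the two added parameters, $c_i$ and $a_i$.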

2. Empirical analyses by Lord and Stocking demonstrate this at length:
Pages 1015-1017 (Lord, 1968) testify that: "Successful results were obtained only after a hundred or so painstaking attempts (1015)." Item discriminations "are likely to increase without limit (1015)." Person abilities "tend to increase or decrease without limit (1016)." "Divergence of the entire iterative procedure may occur simply because the initial approximations are not good enough (1016)."

Pages 13, 15 and 19 (Lord, 1975) show that even for artificial data generated to fit the 3PL model exactly, only item difficulty (13) is satisfactorily recovered by LOGIST. If estimation were successful, then the dispersions of discrimination estimates (15) and guessing estimates (19) would be well estimated but, in fact, numerous of these estimates diverge by many standard errors from their generating parameter values.

Stocking (1989) compares BILOG to LOGIST unfavorably (26-28, 45) and details serious estimation problems in LOGIST (41-45). In particular: When analyzing data generated to fit the 3p model, "It is somewhat startling to find that changing starting values for item discriminations has such a large effect on the standard LOGIST procedure (24)." and "Running LOGIST to complete convergence allows too much movement away from the good starting values (25)."

More serious, "While there is no apparent bias in the ability estimates when obtained from true item parameters, the bias is significant when ability estimates are obtained from estimated item parameters. And in spite of the fact that the calibration and cross-validation samples are the same for each setting, the bias differs by test (18)." Stocking underlines this statement as well she might since it is only estimated item parameters that are available in real practice!

The startling magnitudes of bias found by Stocking are shown in her Figures 3-7 (56-60) and Figures 21-23 (74-76).

3. Guessing cannot and need not be estimated as an item asymptote. Guessing is inapt as an item characteristic. When guessing occurs, it is a person response anomaly, manifested occasionally by a few individuals on a few items which baffle those few persons (Wright, 1977, pp. 110-112). Only recurring lucky guessing on multiple choice items disturbs measurement. But when guesses are lucky, the consequences in the responses of the lucky guesser are clearly visible as improbable right answers. Whenever something must be done about the few lucky guesses which actually occur in multiple choice item response data, the few persons responsible for those occurrences are easy to find and reasonable corrections for any interference with measurement are easy to apply (Wright & Stone, 1979, pp. 170-190).
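A minimal sketch of how improbable right answers might be located under a Rasch model. It is an illustration only, not the procedure in Wright & Stone (1979): the function names, the 0.1 probability threshold, the item difficulties and the response strings are all hypothetical, and the person ability is treated as known rather than estimated.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a right answer for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def improbable_successes(responses, theta, difficulties, threshold=0.1):
    """Indices of items answered right although the model probability is below threshold."""
    return [i for i, (x, b) in enumerate(zip(responses, difficulties))
            if x == 1 and rasch_p(theta, b) < threshold]

def outfit_mean_square(responses, theta, difficulties):
    """Mean of squared standardized residuals; values far above 1 signal surprising responses."""
    total = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)

# Hypothetical data: seven items ordered from easy to hard, one person at theta = 0.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
theta = 0.0  # assumed known for this illustration
expected_pattern = [1, 1, 1, 0, 0, 0, 0]
lucky_pattern = [1, 1, 1, 0, 0, 0, 1]  # right answer on the hardest item

flagged = improbable_successes(lucky_pattern, theta, difficulties)
outfit_expected = outfit_mean_square(expected_pattern, theta, difficulties)
outfit_lucky = outfit_mean_square(lucky_pattern, theta, difficulties)
print(flagged, round(outfit_expected, 2), round(outfit_lucky, 2))
```

In this made-up case the single right answer on the hardest item is the only response flagged, and it alone raises the person's mean-square well above that of the unsurprising pattern.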

4. Variation in item discrimination is not only impossible to estimate without arbitrary impositions (because cross-weighting observed responses by ability estimates when discrimination is estimated, and then by discrimination estimates when ability is estimated, produces a regenerative feedback which escalates to infinity (Wright, 1977, pp. 103-104)) but, more devastating, modeling variation in item discrimination denies the development of construct validity because then the meaning of the variable cannot be based on item difficulty ordering. No fixed maps of item difficulty hierarchy, and hence construct definition, can be made because variation in discrimination forces the hierarchy of item difficulty to vary with person ability. Variation in item discrimination causes ICC's to cross. But when ICC's cross, there is no unique item ordering on which to build construct validity or set standards. Construct validity and criterion meaning disappear.
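The crossing of ICC's can be seen with two items. The sketch below is an illustration under assumed values: the two items, their difficulties and discriminations, and the helper names are hypothetical. It asks, at several abilities, which item has the lower probability of success, first with equal discriminations (Rasch) and then with unequal ones.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Success probability in the 3p form; a=1 and c=0 give the Rasch curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical items: difficulty b and discrimination a.
item_1 = {"b": 0.0, "a": 0.5}
item_2 = {"b": 0.5, "a": 2.0}

def harder_item(theta, use_discrimination):
    """Name the item with the lower success probability at ability theta."""
    a1 = item_1["a"] if use_discrimination else 1.0
    a2 = item_2["a"] if use_discrimination else 1.0
    p1 = icc(theta, item_1["b"], a1)
    p2 = icc(theta, item_2["b"], a2)
    return "item_1" if p1 < p2 else "item_2"

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
rasch_order = [harder_item(t, False) for t in abilities]
varying_order = [harder_item(t, True) for t in abilities]

print(rasch_order)    # item_2 is the harder item at every ability
print(varying_order)  # the harder item changes once ability passes the crossing point
```

With equal discriminations item_2 is harder for everyone. With the assumed unequal discriminations the curves cross at an ability of 2/3, so item_2 is harder below that point and item_1 is harder above it.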

5. What this means for practice is that:
a. Whenever one counts on raw scoring, i.e. counts right answers or Likert scale categories, then one is collecting data from which only a Rasch model can construct measures.
b. Whenever one estimates a regression analysis, growth study, t-test or means and standard deviations, one requires quantification of the dependent variable sufficiently linear and invariant to justify the arithmetic, i.e. one requires measures of the kind only Rasch models construct.
c. Whenever one aspires to understand the construct meaning of one's variables in terms of the calibrated item content by which they have been defined then one has decided to work with a model which specifies that the ICC's do not cross, i.e. a Rasch model.
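Point (a) rests on the raw score being a sufficient statistic under the Rasch model: once the count of right answers is known, which particular items were answered right carries no further information about ability. The sketch below checks this numerically for a three-item test. The item values and function names are hypothetical, and the second item set, with unequal discriminations, is included only for contrast.

```python
import math
from itertools import product

def p_correct(theta, b, a=1.0):
    """Logistic success probability; a=1 gives the Rasch model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pattern_prob(pattern, theta, items):
    """Probability of a whole response pattern, assuming local independence."""
    prob = 1.0
    for x, (b, a) in zip(pattern, items):
        p = p_correct(theta, b, a)
        prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_prob(pattern, theta, items):
    """P(pattern | raw score) at ability theta."""
    score = sum(pattern)
    same_score = [q for q in product((0, 1), repeat=len(items)) if sum(q) == score]
    return pattern_prob(pattern, theta, items) / sum(
        pattern_prob(q, theta, items) for q in same_score)

# Hypothetical items as (difficulty, discrimination) pairs.
rasch_items = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
varying_items = [(-1.0, 0.5), (0.0, 1.0), (1.0, 2.0)]
pattern = (1, 1, 0)

rasch_low = conditional_prob(pattern, -1.0, rasch_items)
rasch_high = conditional_prob(pattern, 2.0, rasch_items)
varying_low = conditional_prob(pattern, -1.0, varying_items)
varying_high = conditional_prob(pattern, 2.0, varying_items)
print(rasch_low, rasch_high)      # agree up to rounding: ability cancels out
print(varying_low, varying_high)  # differ: the pattern still depends on ability
```

Under the Rasch items the conditional probability is the same at both abilities, so the raw score exhausts what the responses say about the person. With unequal discriminations it is not, and a count of right answers no longer suffices.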

6. The purpose of test analysis is not to serve the test or the variety of good and bad items which happen to fall into the test. The purpose is to serve the measurement of the child taking the test. This means:
a. Using a measurement model which establishes a clear, simple and maintainable definition of good measurement. [When one uses 3p to recalibrate the same test over samples of varying ability (an exercise any test analyzer can easily perform), the 3p estimates of discrimination and guessing are conspicuously incoherent. And even the 3p item difficulties are unnecessarily disturbed when compared with the same pair of recalibrations done by a Rasch model analysis.]
b. Using fit statistics based on this good measurement model to maintain the quality of measurement (i) by using item misfit to detect and remove eccentric items which cannot be relied upon to evoke useful responses and (ii) person misfit to identify and diagnose anomalous patterns of person response. Should some person obtain some lucky guesses, they stand out like a sore thumb against the Rasch model. [The 3p model buries this individual person information by forcing item guessing parameters on everyone who takes the items whether they guess or not.] If something beneficial, not to mention legal, is to be done about guessing, then it must face those few persons who benefit from lucky guesses and not mistreat everyone else.

Benjamin Drake Wright, 12/18/95, in a Note to Allan Olson, Northwest Evaluation Association (NWEA).

Green, D.R., Yen, W.M., & Burket, G.R. (1989). Experiences in the application of item response theory in test construction. Applied Measurement in Education, 2(4), 297-312.

Lord, F.M. (1968). An analysis of the Verbal Scholastic Aptitude Test Using Birnbaum's Three-Parameter Model. Educational and Psychological Measurement, 28, 989-1020.

Lord, F.M. (1975). Evaluation with artificial data of a procedure for estimating ability and item characteristic curve parameters. (Research Report RB-75-33). Princeton, NJ: ETS.

Stocking, M.L. (1989). Empirical estimation errors in item response theory as a function of test properties. (Research Report RR-89-5). Princeton, NJ: ETS.

A Critique of 3-PL IRT Estimation. Benjamin Drake Wright … Rasch Measurement Transactions, 2013, 27:2 p. 1411-2


