A Critique of 3-PL IRT Estimation

Benjamin Drake Wright was asked to respond to Green et al. (1989), which discusses "9 years of using a three-parameter model in the construction of major achievement batteries." Here is Ben's response:

Does Green make sense? Following are some off-the-cuff reactions.

1. Mathematical analysis shows that the 3p model is a non-converging, inestimable elaboration of the Rasch model. When the generic criteria for measurement identified by physicists (Campbell, 1920) and mathematicians (Luce & Tukey, 1964) and demanded by the founders of psychometrics: Thorndike (1927), Thurstone (1928, 1931), Guilford (1936) and Guttman (1950a) are really required, then only the Rasch model can be deduced (Brogden, 1977; Perline, Wright & Wainer, 1979; Roskam & Jansen, 1984; Wright & Linacre, 1987; Wright, 1985, 1988a, 1988b, 1989a, 1989b). Far from being a special case of some superfluous affectation, the Rasch model is the necessary and sufficient definition of measurement. It follows that only data that can be made to fit a Rasch model can be used to construct measures.
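For orientation (an editorial addition, not part of Wright's note), the two models under discussion can be written in their usual dichotomous forms. The symbols are standard IRT notation assumed here rather than taken from the note: \theta_n is the ability of person n, b_i the difficulty of item i, a_i its discrimination, and c_i its lower ("guessing") asymptote.

```latex
% Rasch model: one item parameter (difficulty)
P(X_{ni} = 1 \mid \theta_n, b_i)
  = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}

% Three-parameter logistic (3PL) model: adds discrimination a_i and asymptote c_i
P(X_{ni} = 1 \mid \theta_n, a_i, b_i, c_i)
  = c_i + (1 - c_i)\,
    \frac{\exp\bigl(a_i(\theta_n - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_n - b_i)\bigr)}
```

Setting a_i = 1 and c_i = 0 for every item reduces the second expression to the first.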

2. Empirical analyses by Lord and Stocking demonstrate this at length:
Pages 1015-1017 (Lord, 1968) testify that: "Successful results were obtained only after a hundred or so painstaking attempts (1015)." Item discriminations "are likely to increase without limit (1015)." Person abilities "tend to increase or decrease without limit (1016)." "Divergence of the entire iterative procedure may occur simply because the initial approximations are not good enough (1016)."

Pages 13, 15 and 19 (Lord, 1975) show that even for artificial data generated to fit the 3PL model exactly, only item difficulty (13) is satisfactorily recovered by LOGIST. If estimation were successful, then the dispersions of discrimination estimates (15) and guessing estimates (19) would be well estimated but, in fact, many of these estimates diverge by many standard errors from their generating parameter values.

Stocking (1989) compares BILOG to LOGIST unfavorably (26-28, 45) and details serious estimation problems in LOGIST (41-45). In particular: When analyzing data generated to fit the 3p model, "It is somewhat startling to find that changing starting values for item discriminations has such a large effect on the standard LOGIST procedure (24)." and "Running LOGIST to complete convergence allows too much movement away from the good starting values (25)."

More serious, "While there is no apparent bias in the ability estimates when obtained from true item parameters, the bias is significant when ability estimates are obtained from estimated item parameters. And in spite of the fact that the calibration and cross-validation samples are the same for each setting, the bias differs by test (18)." Stocking underlines this statement as well she might since it is only estimated item parameters that are available in real practice!

The startling magnitudes of bias found by Stocking are shown in her Figures 3-7 (56-60) and Figures 21-23 (74-76).

3. Guessing cannot and need not be estimated as an item asymptote. Guessing is inapt as an item characteristic. When guessing occurs, it is a person response anomaly, manifested occasionally by a few individuals on a few items which baffle those few persons (Wright, 1977, pp. 110-112). Only recurring lucky guessing on multiple choice items disturbs measurement. But when guesses are lucky, the consequences in the responses of the lucky guesser are clearly visible as improbable right answers. Whenever something must be done about the few lucky guesses which actually occur in multiple choice item response data, the few persons responsible for those occurrences are easy to find and reasonable corrections for any interference with measurement are easy to apply (Wright & Stone, 1979, pp. 170-190).

4. Variation in item discrimination is not only impossible to estimate without arbitrary impositions (because cross-weighting observed responses by ability estimates when discrimination is estimated, and then by discrimination estimates when ability is estimated, produces a regenerative feedback which escalates to infinity (Wright, 1977, pp. 103-104)) but, more devastating, modeling variation in item discrimination denies the development of construct validity because then the meaning of the variable cannot be based on item difficulty ordering. No fixed maps of item difficulty hierarchy, and hence construct definition, can be made because variation in discrimination forces the hierarchy of item difficulty to vary with person ability. Variation in item discrimination causes ICC's to cross. But when ICC's cross, there is no unique item ordering on which to build construct validity or set standards. Construct validity and criterion meaning disappear.
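The crossing-ICC argument in point 4 can be illustrated numerically. The sketch below is an editorial illustration, not part of Wright's note: the two items, their parameter values, and the function names are hypothetical, and the guessing asymptote is left out so that only discrimination varies.

```python
import math

def icc(theta, difficulty, discrimination=1.0):
    """Logistic item characteristic curve without a guessing asymptote."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Hypothetical items as (difficulty, discrimination); values chosen for illustration only.
ITEMS = {"X": (0.0, 0.5), "Y": (0.5, 2.0)}

def harder_item(theta, equal_discrimination):
    """Return the item with the lower success probability at this ability."""
    probs = {}
    for name, (difficulty, discrimination) in ITEMS.items():
        # With equal discrimination every item gets a = 1 (the Rasch case).
        a = 1.0 if equal_discrimination else discrimination
        probs[name] = icc(theta, difficulty, a)
    return min(probs, key=probs.get)

for theta in (-1.0, 2.0):
    print(f"theta={theta:+.1f}  equal a: {harder_item(theta, True)}  "
          f"unequal a: {harder_item(theta, False)}")
```

With equal discrimination, item Y is the harder item at both abilities. With the unequal values assumed here, Y is harder at theta = -1 but X is harder at theta = 2, because the two curves cross (at theta = 2/3 for these values). The item ordering then depends on who is taking the test.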

5. What this means for practice is that:
a. Whenever one counts on raw scoring, i.e. counts right answers or Likert scale categories, then one is collecting data from which only a Rasch model can construct measures.
b. Whenever one estimates a regression analysis, growth study, t-test or means and standard deviations, one requires quantification of the dependent variable sufficiently linear and invariant to justify the arithmetic, i.e. one requires measures of the kind only Rasch models construct.
c. Whenever one aspires to understand the construct meaning of one's variables in terms of the calibrated item content by which they have been defined then one has decided to work with a model which specifies that the ICC's do not cross, i.e. a Rasch model.
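Point 5a rests on a property that can be checked directly: under a Rasch model, two response patterns with the same raw score have a likelihood ratio that does not depend on ability, so the count of right answers carries all the information about the person. The sketch below is an editorial illustration with hypothetical item values and function names; it compares equal discriminations (the Rasch case) with unequal ones.

```python
import math

def pattern_probability(theta, responses, difficulties, discriminations):
    """Probability of a whole 0/1 response pattern for one person."""
    prob = 1.0
    for x, b, a in zip(responses, difficulties, discriminations):
        p_right = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        prob *= p_right if x == 1 else 1.0 - p_right
    return prob

def likelihood_ratio(theta, pattern_1, pattern_2, difficulties, discriminations):
    """Ratio of the probabilities of two patterns at the same ability."""
    return (pattern_probability(theta, pattern_1, difficulties, discriminations)
            / pattern_probability(theta, pattern_2, difficulties, discriminations))

# Hypothetical three-item test; both patterns have a raw score of 2.
difficulties = [-1.0, 0.0, 1.0]
pattern_1 = [1, 1, 0]
pattern_2 = [0, 1, 1]
equal = [1.0, 1.0, 1.0]      # Rasch case
unequal = [0.5, 1.0, 2.0]    # discrimination varies by item

for theta in (-2.0, 0.0, 2.0):
    print(theta,
          likelihood_ratio(theta, pattern_1, pattern_2, difficulties, equal),
          likelihood_ratio(theta, pattern_1, pattern_2, difficulties, unequal))
```

In the equal-discrimination column the ratio is the same at every ability, so knowing which of the two patterns occurred says nothing further about the person once the raw score is known. In the unequal column the ratio changes with ability, which means a simple count of right answers no longer suffices.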

6. The purpose of test analysis is not to serve the test or the variety of good and bad items which happen to fall into the test. The purpose is to serve the measurement of the child taking the test. This means:
a. Using a measurement model which establishes a clear, simple and maintainable definition of good measurement. [When one uses 3p to recalibrate the same test over samples of varying ability (an exercise any test analyzer can easily perform), the 3p estimates of discrimination and guessing are conspicuously incoherent. And even the 3p item difficulties are unnecessarily disturbed when compared with the same pair of recalibrations done by a Rasch model analysis.]
b. Using fit statistics based on this good measurement model to maintain the quality of measurement (i) by using item misfit to detect and remove eccentric items which cannot be relied upon to evoke useful responses and (ii) person misfit to identify and diagnose anomalous patterns of person response. Should some person obtain some lucky guesses, they stand out like a sore thumb against the Rasch model. [The 3p model buries this individual person information by forcing item guessing parameters on everyone who takes the items whether they guess or not.] If something beneficial, not to mention legal, is to be done about guessing, then it must face those few persons who benefit from lucky guesses and not mistreat everyone else.
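The claim that lucky guesses "stand out like a sore thumb" can be sketched with squared standardized residuals under the Rasch model. This is an editorial illustration, not the procedure from Wright & Stone (1979): the person ability and item difficulties are assumed known rather than estimated, the values are hypothetical, and the mean of the squared residuals is only a simplified outfit-style summary.

```python
import math

def rasch_probability(theta, difficulty):
    """Rasch probability of a right answer."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def squared_standardized_residuals(theta, responses, difficulties):
    """(observed - expected)^2 / variance for each response of one person."""
    residuals = []
    for x, b in zip(responses, difficulties):
        p = rasch_probability(theta, b)
        residuals.append((x - p) ** 2 / (p * (1.0 - p)))
    return residuals

def mean_square(theta, responses, difficulties):
    """Unweighted mean of the squared standardized residuals."""
    z2 = squared_standardized_residuals(theta, responses, difficulties)
    return sum(z2) / len(z2)

# Hypothetical person of ability 0 on five items; the last item is far too hard.
theta = 0.0
difficulties = [-2.0, -1.0, 0.0, 1.0, 4.0]
plausible = [1, 1, 1, 0, 0]
lucky = [1, 1, 1, 0, 1]      # improbable right answer on the hardest item

z2_lucky = squared_standardized_residuals(theta, lucky, difficulties)
flagged = max(range(len(z2_lucky)), key=z2_lucky.__getitem__)
print("most surprising response: item index", flagged)
print("mean square, plausible:", mean_square(theta, plausible, difficulties))
print("mean square, lucky:    ", mean_square(theta, lucky, difficulties))
```

For these assumed values, the right answer on the hardest item dominates the person's residuals, while every other response looks ordinary. The anomaly is located in one person's response to one item; it is not spread across everyone who took that item.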

Benjamin Drake Wright, 12/18/95, in a Note to Allan Olson, Northwest Evaluation Association (NWEA).


Green, D.R., Yen, W.M., & Burket, G.R. (1989). Experiences in the Application of Item Response Theory in Test Construction. Applied Measurement in Education, 2(4), 297-312.

Lord, F.M. (1968). An analysis of the Verbal Scholastic Aptitude Test Using Birnbaum's Three-Parameter Model. Educational and Psychological Measurement, 28, 989-1020.

Lord, F.M. (1975). Evaluation with artificial data of a procedure for estimating ability and item characteristic curve parameters. (Research Report RB-75-33). Princeton, NJ: ETS.

Stocking, M.L. (1989). Empirical estimation errors in item response theory as a function of test properties. (Research Report RR-89-5). Princeton, NJ: ETS.

(Other references not included in Wright's Note)

A Critique of 3-PL IRT Estimation. Benjamin Drake Wright … Rasch Measurement Transactions, 2013, 27:2 p. 1411-2

The URL of this page is www.rasch.org/rmt/rmt272a.htm