IRT in the 1990s: Which Models Work Best? 3PL or Rasch?

Ben Wright's opening remarks in his invited debate with Ron Hambleton, Session 11.05, AERA Annual Meeting 1992.

Good morning! I was introduced as one of the debaters. I wonder if I might not turn out to be a debunker rather than a debater. We will find out as time goes on. As for the mysterious "one parameter" model mentioned by the Moderator, Gwyneth Boodoo, I don't know what that is, so I can't speak for it. To my knowledge, there are no "one parameter" models in psychometrics. There are, in fact, only two deliberate models widely used. One is the two-parameter Rasch Model. The two parameters (B, D) are explicit in Figure 1. The other is the four-parameter (a, b, c, θ) Birnbaum model, which sometimes has five when an upper asymptote is estimated (Barton & Lord, 1981).

I will defend the Rasch Model. Actually, even at two parameters (B, D) the comparison is misleading because the Rasch Model can have any number of parameters to the right of the log-odds statement, as long as they are connected with plus or minus signs. As long as you maintain the additivity of measurement construction, you can have twenty parameters off to the right, even for a dichotomous observation. That might be a somewhat complex data design, but we work with these things all the time these days, and fruitfully.
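As a sketch of the additive form described above: P is the probability of success, B and D are the two parameters of Figure 1, and the subscripts n (person) and i (item) are notational assumptions. The extra terms C_j (say, a rater severity) and T_k (say, a task difficulty) are hypothetical facets added purely for illustration.

```latex
\log\frac{P_{ni}}{1-P_{ni}} = B_n - D_i
\qquad\text{and, with more terms on the right,}\qquad
\log\frac{P_{nijk}}{1-P_{nijk}} = B_n - D_i - C_j - T_k
```

However many terms appear on the right, they enter only through plus or minus signs, which is the additivity the paragraph above insists on.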

Birnbaum Model (3-PL; for 2-PL, set c_i = 0; for 1-PL, set a_i = 1.7 and c_i = 0)	Rasch Model
Allan Birnbaum, 1957$ / 1968	Georg Rasch, 1952$ / 1960
imitates data	defines measures
contrived to fit observed MCQ ICC's	derived to construct scientific measurement
θ is the assumed, not actual, person sample distribution	B_n is the actual individual person ability
shared a_i and θ cause θ <–> a_i feedback: divergence unless constrained	B and D estimable separately: inevitable convergence
MCQ dichotomies only [1992: Eiji Muraki's Generalized Partial Credit Model]	any ordered observation: dichotomy, rating, ranking, counting
guessing accepted: c_i is a reliable item asset	guessing rejected: an unreliable person liability
discrimination variation welcomed: a_i is a useful item scoring weight	discrimination variation rejected: a misleading item-bias interaction
crossed ICC's accepted: natural and unavoidable; item-difficulty ordering is different for different persons	crossed ICC's rejected: prevents construct validity; item-difficulty ordering is the same for everyone
Figure 1. Comparison of Rasch and Birnbaum Models. ($ first written report)

I make measures for a living. Measures have a specific definition. They're the kind of thing where one more is always the same amount - like inches and pounds. You may have noticed, if you have read Thurstone (1925), Guilford (1936) or Thorndike (1904), or thought a little, that raw scores are not measures, neither are grade equivalents, age equivalents, percentiles or any of those things. In science, engineering, business and cooking, you need measures which have this simple essential property: one more is always the same amount, like the inches on this carpenter's ruler I am using for a pointer. To get that result, that kind of numbers, you need to use the additive construction of the Rasch model.

It is important to clarify what these two models are about. I will compare them to bring out how different they are. They're not at all cases of one another. Even the arithmetical trick of making parameters "a" and "c" disappear, so that the Birnbaum model looks like Rasch, doesn't make Birnbaum in spirit, purpose or function equivalent to Rasch. The two are opposite in philosophy and in practice.
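For reference, the arithmetical trick can be sketched in one common notation. This notation is an assumption, since the equations of Figure 1 are not reproduced in the text: x_{ni} is the scored response of person n to item i, and any fixed scaling constant is taken to be absorbed into a_i.

```latex
P(x_{ni}=1) = c_i + (1-c_i)\,\frac{\exp\big(a_i(\theta_n-b_i)\big)}{1+\exp\big(a_i(\theta_n-b_i)\big)}
\quad\xrightarrow{\;c_i = 0,\ a_i = a\;}\quad
P(x_{ni}=1) = \frac{\exp\big(a(\theta_n-b_i)\big)}{1+\exp\big(a(\theta_n-b_i)\big)}
```

Writing B_n = a θ_n and D_i = a b_i turns the right-hand expression into exp(B_n - D_i) / (1 + exp(B_n - D_i)), so the two formulas coincide algebraically. The argument of the paragraph above is that this coincidence says nothing about purpose or function.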

The Birnbaum model is designed to imitate data, to be as faithful to the data as possible, to accept any kind of data, whatever may come up. However, it is contrived primarily for MCQ response curves. Quite different from that is the Rasch model, which is not designed to fit any data, but instead is derived to define measurement. The Rasch model is a statement, a specification, of the requirements of measurement - the kind of statement that appears in Edward Thorndike's work, in Thurstone's work, in Guttman's (1950) work. Rasch is the one who made the deduction of the necessary mathematical formulation and showed that it was both sufficient and necessary for the construction of linear, objective measurement. It is also nice that there are sufficient statistics for these parameters, because that's a useful and robust device for getting estimates. The Birnbaum model has loose standards for incoming data. It hardly ever objects to anything because it's adjusted to adapt to whatever strangeness there is in the data. The Rasch model has tight standards. The two models are opposites - one loose, the other tight - in the standards they set for the data they will work with.
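The remark about sufficient statistics can be illustrated with a minimal sketch. It assumes the item difficulties are already known and uses made-up values in logits; the function names are hypothetical. Under the Rasch model the maximum likelihood ability estimate depends on the responses only through the raw score, so two different response patterns with the same score should give the same estimate.

```python
import math

def rasch_p(b, d):
    """Rasch probability of success for ability b on an item of difficulty d."""
    return 1.0 / (1.0 + math.exp(-(b - d)))

def ml_ability(responses, difficulties, iterations=50):
    """Newton-Raphson ML ability estimate, treating item difficulties as known."""
    b = 0.0
    for _ in range(iterations):
        probs = [rasch_p(b, d) for d in difficulties]
        # Gradient is the raw score minus the expected score: no item weights appear.
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        b += gradient / information
    return b

# Hypothetical item difficulties in logits (illustrative values only).
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]

# Two different response patterns with the same raw score of 3.
pattern_a = [1, 1, 1, 0, 0]
pattern_b = [0, 1, 0, 1, 1]

est_a = ml_ability(pattern_a, difficulties)
est_b = ml_ability(pattern_b, difficulties)
print(round(est_a, 4), round(est_b, 4))
```

The two printed estimates agree because the gradient involves only the sum of the responses. Whether pattern_b is plausible for that person is a separate question of fit, which is examined after estimation rather than absorbed into it.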

When you take a look in Figure 1 at Birnbaum's (1968) estimation equations and compare them to the estimation equations for the Rasch model, you notice something striking and troublesome. There's a cross-weighting of data and parameters in Birnbaum's equations. The discrimination estimates weight the data when you estimate person ability, and the person ability estimates weight the data when you estimate item discrimination. This cross-weighting guarantees divergence. It guarantees the failure to converge reported in almost every paper about the Birnbaum model since 1968. In Stocking (1989), people are advised never to iterate more than four times, because, if they do, the estimates will go further and further away, even from the generating parameters of artificial data made to fit the model perfectly, the very parameters the estimation is trying, and failing, to recover as it iterates.
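The cross-weighting can be made explicit with a sketch of the joint maximum likelihood conditions. This is not a reproduction of the equations in Figure 1. It assumes the guessing parameter is dropped (c_i = 0) to keep the algebra short, writes x_{ni} for the scored response of person n to item i, and takes P_{ni} to be the model probability, with log-odds B_n - D_i for Rasch and a_i(θ_n - b_i) for Birnbaum.

```latex
\text{Rasch:}\quad
\sum_i x_{ni} = \sum_i P_{ni}\ \ (\text{for each } B_n),
\qquad
\sum_n x_{ni} = \sum_n P_{ni}\ \ (\text{for each } D_i)

\text{Birnbaum } (c_i = 0):\quad
\sum_i a_i\,(x_{ni} - P_{ni}) = 0\ \ (\text{for each } \theta_n),
\qquad
\sum_n (\theta_n - b_i)\,(x_{ni} - P_{ni}) = 0\ \ (\text{for each } a_i)
```

In the Rasch conditions the data enter only as unweighted raw scores. In the Birnbaum conditions each residual is weighted by a_i when solving for θ_n and by (θ_n - b_i) when solving for a_i, so each set of estimates weights the data used to obtain the other.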

These models are opposites. In estimation procedure they function oppositely. The Birnbaum model is impossible to apply. Unless you cheat, you can't apply it at all. In contrast, Rasch is easy to apply. It takes very strange real data configurations to prevent Rasch estimation from converging. Data configurations so strange that, when you track them down, you discover that these were not the data you wanted to analyze. Another polarity. Birnbaum is hard to use. Many say impossible. Rasch is easy to use. I've never heard of anyone being unable to use the Rasch model, whatever ideology they may profess.

What about application? They are also opposite in application. Because the Birnbaum model is only for multiple choice dichotomies, that's where it stops - at a dying-out kind of item soon to be gone. The Rasch model, in contrast, is for any kind of ordered observation, any kind at all: a rating, a scoring, a ranking. We have handled successfully a tremendously wide variety of data structures. Birnbaum is narrow, about as narrow as you can get. Rasch is wide, about as wide as you can get. So wide its limits have not yet been encountered. There are at least nine different formulations, kinds of models, which we can analyze with the Rasch formulation.

Finally we come back to another aspect in which they are opposite. It is an echo of the earlier ones. In Birnbaum we have a kind of promiscuity. Guessing is accepted as a reliable item asset. Discrimination is welcomed as a useful scoring weight. And crossed item characteristic curves are accepted as natural and unavoidable.

In Rasch, instead of promiscuity, we have choosiness. We don't want guessing. We recognize it as an unreliable person liability. I never met an item that guessed, so I look for guessing among people. I do find some people who guess sometimes, but not all people guess and seldom on the same items. So I do a better job of identifying and controlling guessing when I look for it in persons' responses, label it and decide what I want to do with it.

Variation in discrimination is also rejected by Rasch as a symptom of item bias or multi-dimensionality. This phenomenon has been followed up empirically many times (e.g., Masters, 1988). The items which vary in discrimination have been demonstrated to be contaminated by item bias or to introduce extra dimensions.

What I want to talk about in the minute or two I have left is crossed ICC's. I reject them because they prevent construct validity. Here in Figure 2 is a beautiful word-recognition ruler constructed by another man who makes measures for a living, Dick Woodcock (1974). In the left column are the inches on Dick's ruler. They mean the same amount from one end to the other. In the center column is the range of this ruler: from 1st to 12th Grade. In the right column are the words that define this variable, that specify its definition. "Red" is a nice short easy word. It is recognized at the 1st Grade. But, when you get down to "heterogeneous", it takes a 12th Grader to nail it down. We have a continuous construct here, specified explicitly, which we can use to make sense out of children's measures. This construct gives the scale meaning. The identification of a stable ordering and spacing of items is decisive for construct validity.

Look at Figure 3. It needs to be the case that, whether you are a 1st, 2nd or 3rd Grader, "red", "away", "drink", "octopus" and "equestrian" remain in the same order of experienced difficulty, at the same spacing. The ruler has to be the same for every child measured whatever their grade. If the ruler changes, it is not a ruler. It's something else.

Look at Figure 4. To obtain the arrangement in Figure 3 and to keep it stable, we need a special kind of response curve. Here is the item response ogive deduced from the standard definition of measurement. These curves are parallel, in the sense that they don't cross. If you make the vertical axis log-odds instead of probabilities, you will find that these curves become straight lines that are exactly parallel. The important thing to see is that they don't cross each other.
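The parallelism on the log-odds scale can be checked numerically. The sketch below uses hypothetical item difficulties in logits, not values from the Woodcock scale, and hypothetical function names. It computes the log-odds gap between adjacent items at several ability levels.

```python
import math

def rasch_prob(b, d):
    """Rasch probability of success for ability b on an item of difficulty d."""
    return 1.0 / (1.0 + math.exp(-(b - d)))

def log_odds(p):
    """Convert a probability to log-odds (logits)."""
    return math.log(p / (1.0 - p))

# Hypothetical item difficulties in logits (illustrative values only).
difficulties = {"red": -2.0, "away": -1.0, "drink": 0.0}
abilities = [-3.0, -1.0, 0.0, 2.0]

def logit_gaps(b):
    """Log-odds gap between each item and the next harder one, at ability b."""
    logits = [log_odds(rasch_prob(b, d)) for d in difficulties.values()]
    return [round(logits[k] - logits[k + 1], 6) for k in range(len(logits) - 1)]

gaps_by_ability = {b: logit_gaps(b) for b in abilities}
for b, gaps in gaps_by_ability.items():
    print(b, gaps)
```

Under these assumptions the gaps equal the differences in item difficulty at every ability level, which is what it means for the log-odds lines to be parallel and for the spacing of the items to stay fixed.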

Now let's see what happens when we look at some Birnbaum curves. Figure 5 shows a handful of typical Birnbaum curves. They have different asymptotes, different slopes. It looks messy. It doesn't look like any measurement system that I would want to work with. Why? Because the curves cross.

In Figure 6 we see the consequence for the variable experienced by those three children. Incredible! Look at the 1st Grader. "Red" is easier than "away" is easier than "drink" is easier than "octopus". OK. But what happens to the 3rd Grader? For the 3rd Grader it's "away" that is easier. "Red" is harder even than "drink"! And "octopus" is now next to "red", instead of up near "equestrian". What is the definition of this variable? What is the construct defined here? What kind of ruler is this? It changes for every level of ability. I can't make a living with that kind of a ruler. No scientist, engineer, businessman or cook, who depends on measures of the kind this carpenter's ruler exemplifies, can work with that kind of ruler.
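The reordering can be reproduced with a small sketch. The (a, b, c) values below are hypothetical, chosen only so that the curves cross; they are not the parameters behind Figure 5 or Figure 6, and the function names are illustrative.

```python
import math

def three_pl(theta, a, b, c):
    """3-PL probability of success: slope a, location b, lower asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical (a, b, c) values chosen only to make the curves cross.
items = {
    "red":   (0.5, -1.0, 0.25),
    "away":  (2.0, -0.5, 0.20),
    "drink": (1.5,  0.0, 0.10),
}

def easiest_first(theta):
    """Item names sorted from easiest (highest success probability) to hardest."""
    return sorted(items, key=lambda name: -three_pl(theta, *items[name]))

low_order = easiest_first(-2.0)   # a less able reader
high_order = easiest_first(1.0)   # a more able reader
print(low_order)
print(high_order)
```

With these values the less able reader finds "red" easiest and "drink" hardest, while the more able reader finds "away" easiest and "red" hardest. The order of the items depends on who is being measured, which is the objection raised above.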

Let's go backwards. Much as I might be intrigued by the apparent sophistication of the Birnbaum curves in Figure 5, I cannot work with them. I must have orderly, cooperating curves like the Rasch curves in Figure 4, and I must find data that will serve this purpose. I cannot swallow whatever junk happens to come my way. I must be choosy and selective and careful when I construct my data. When I go to market I don't buy rotten fruit. I buy good fruit. When I make a salad, I only pick the parts that make a good salad. I have a recipe for what I want. I have a model for measurement.

I need to make the kind of a structure in Figure 3 - the same ruler for everybody, so I can have a useful and stable construct definition like Dick's word-recognition ruler in Figure 2.

A USEFUL RULER: WOODCOCK READING MASTERY TESTS
Mastery Scale	DIFFICULTY (Grade Scale at 50% Mastery)	SAMPLE TASK
25	1.1	is
41	1.3	red
58	1.4	down
70	1.5	black
86	1.7	away
101	1.8	cold
114	2.0	drink
124	2.2	shallow
143	2.8	through
159	3.3	octopus
174	4.1	allowable
192	5.7	hinderance
211	9.3	equestrian
240	12.9	heterogeneous
FIXED ITEM POSITIONS DEFINE VARIABLE
Figure 2. A useful, linear, invariant measuring instrument.

The Birnbaum model is data-centered: model must fit, else get a better model. It hardly ever objects to any item. The Rasch model is theory-centered: data must fit, else get better data. And in the search for better data, wonderful things are discovered about the nature of what you are measuring and the way that people can tell you about it. These discoveries are important events which develop and strengthen your construct and your ability to measure it. The Birnbaum model is patched up to chase after whatever pops up. The Rasch model is derived a priori, to define the criteria which data must follow to qualify for making measures.

Barton, M. A., & Lord, F. M. (1981). An upper asymptote for the three-parameter logistic item-response model. Princeton, NJ: Educational Testing Service.

Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Guttman, L. (1950). The basis for scalogram analysis. In Stouffer et al. (Eds.), Measurement and prediction. New York: Wiley.

Masters, G. N. (1988). Item discrimination: When more is worse. Journal of Educational Measurement, 25, 15-29.

Stocking, M. L. (1989). Empirical estimation errors in item response theory as a function of test properties (Research Report RR-89-5). Princeton, NJ: Educational Testing Service.

Thorndike, E. L. (1904). An introduction to the theory of mental and social measurement. New York: Teachers College, Columbia University.

Thurstone, L. L. (1925). A method of scaling psychological and educational data. Journal of Educational Psychology, 15, 433-451.

Woodcock, R. W. (1974). Woodcock Reading Mastery Tests. Circle Pines, MN: American Guidance Service.

IRT in the 1990s: Which Models Work Best? 3PL or Rasch? B.D. Wright … Rasch Measurement Transactions, 1992, 6:1, 196-200
