IRT in the 1990s: Which Models Work Best? 3PL or Rasch?

Ben Wright's opening remarks in his invited debate with Ron Hambleton, Session 11.05, AERA Annual Meeting 1992.

Good morning! I was introduced as one of the debaters. I wonder if I might not turn out to be a debunker rather than a debater. We will find out as time goes on. As for the mysterious "one parameter" model mentioned by the Moderator, Gwyneth Boodoo, I don't know what that is, so I can't speak for it. To my knowledge, there are no "one parameter" models in psychometrics. There are, in fact, only two deliberate models widely used. One is the two-parameter Rasch Model. The two parameters (B, D) are explicit in Figure 1. The other is the four-parameter (a, b, c, θ) Birnbaum model, which sometimes has five when an upper asymptote is estimated (Barton & Lord, 1981).

I will defend the Rasch Model. Actually, even at two parameters (B, D) the comparison is misleading because the Rasch Model can have any number of parameters to the right of the log-odds statement, as long as they are connected with plus or minus signs. As long as you maintain the additivity of measurement construction, you can have twenty parameters off to the right, even for a dichotomous observation. That might be a somewhat complex data design, but we work with these things all the time these days, and fruitfully.
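For reference, the two models can be written out as below. This is the usual textbook notation rather than a copy of the equations in Figure 1, so the subscripts n (person) and i (item), the extra facet C_j, and the scaling constant 1.7 are assumptions of this sketch.

```latex
% Rasch model: the log-odds of success are an additive function of the parameters
\log\frac{P_{ni}}{1 - P_{ni}} = B_n - D_i

% Further parameters may be appended, provided they enter with plus or minus signs
\log\frac{P_{nij}}{1 - P_{nij}} = B_n - D_i - C_j

% Birnbaum 3-PL model: discrimination a_i multiplies, and guessing c_i bounds, the curve
P_{ni} = c_i + (1 - c_i)\,
         \frac{\exp\left[1.7\,a_i(\theta_n - b_i)\right]}
              {1 + \exp\left[1.7\,a_i(\theta_n - b_i)\right]}
```

In the Rasch form every parameter sits on the right of the log-odds statement joined by a plus or minus sign. In the 3-PL form, a_i multiplies θ_n, which is the source of the interaction discussed later.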

| | Birnbaum Model: 3-PL | Rasch Model |
|---|---|---|
| Special cases | For 2-PL, set ci=0. For 1-PL, set ai=1.7, ci=0. | |
| Origin | Allan Birnbaum 1957\$ / 1968 | Georg Rasch 1952\$ / 1960 |
| Purpose | contrived to fit observed MCQ ICC's | derived to construct scientific measurement |
| Person parameter | θ is the assumed, not actual, person sample distribution | Bn is the actual individual person ability |
| Estimation | Shared ai and θ causes θ <–> ai feedback: divergence unless constrained | B and D estimable separately: inevitable convergence |
| Data | MCQ dichotomies only [1992: Eiji Muraki's Generalized Partial Credit Model] | any ordered observation: dichotomy, rating, ranking, counting |
| Guessing | guessing accepted (ci): reliable item asset | guessing rejected: unreliable person liability |
| Discrimination | discrimination variation welcomed (ai) as a useful item scoring weight | discrimination variation rejected |
| Crossed ICC's | accepted: natural and unavoidable; item-difficulty-ordering is different for different persons | rejected: prevents construct validity; item-difficulty-ordering is the same for everyone |

Figure 1. Comparison of Rasch and Birnbaum Models.
(\$ first written report)

I make measures for a living. Measures have a specific definition. They're the kind of thing where one more is always the same amount - like inches and pounds. You may have noticed, if you have read Thurstone (1925), Guilford (1936) or Thorndike (1904), or thought a little, that raw scores are not measures, neither are grade equivalents, age equivalents, percentiles or any of those things. In science, engineering, business and cooking, you need measures which have this simple essential property: one more is always the same amount, like the inches on this carpenter's ruler I am using for a pointer. To get that result, that kind of numbers, you need to use the additive construction of the Rasch model.
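A minimal sketch of why raw scores fail the "one more is always the same amount" test. It assumes a hypothetical 20-item test whose items are all equally difficult, in which case the Rasch ability estimate (relative to the common item difficulty) is simply the log-odds of the raw score. The function name and the test length are illustrative only.

```python
import math


def logit_measure(raw_score: int, n_items: int) -> float:
    """Log-odds of the raw score; defined only for non-extreme scores."""
    if not 0 < raw_score < n_items:
        raise ValueError("extreme scores (0 or n_items) have no finite measure")
    return math.log(raw_score / (n_items - raw_score))


N_ITEMS = 20  # hypothetical test length

# One more raw-score point in the middle of the test ...
mid_step = logit_measure(11, N_ITEMS) - logit_measure(10, N_ITEMS)
# ... versus one more raw-score point near the top of the test.
top_step = logit_measure(19, N_ITEMS) - logit_measure(18, N_ITEMS)

print(f"10 -> 11 correct: +{mid_step:.2f} logits")
print(f"18 -> 19 correct: +{top_step:.2f} logits")
```

The same one-point gain in raw score corresponds to a much larger step on the linear scale near the extremes than in the middle, so raw scores cannot serve as equal-interval units.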

It is important to clarify what these two models are about. I will compare them to bring out how different they are. They're not at all cases of one another. Even the arithmetical trick of making parameters "a" and "c" disappear, so that the Birnbaum model looks like Rasch, doesn't make Birnbaum in spirit, purpose or function equivalent to Rasch. The two are opposite in philosophy and in practice.

The Birnbaum model is designed to imitate data, to be faithful to the data as well as possible, to accept any kind of data, whatever may come up. However, it is contrived primarily for MCQ response curves. Quite different from that is the Rasch model, which is not designed to fit any data, but instead is derived to define measurement. The Rasch model is a statement, a specification, of the requirements of measurement - the kind of statement that appears in Edward Thorndike's work, in Thurstone's work, in Guttman's (1950) work. Rasch is the one who made the deduction of the necessary mathematical formulation and showed that it was both sufficient and necessary for the construction of linear, objective measurement. It is also nice that there are sufficient statistics for these parameters, because that's a useful and robust device for getting estimates. The Birnbaum model has loose standards for incoming data. It hardly ever objects to anything because it's adjusted to adapt to whatever strangeness there is in the data. The Rasch model has tight standards. The two models are opposites - one loose, the other tight - in the standards they set for the data they will work with.

When you take a look in Figure 1 at Birnbaum's (1968) estimation equations and compare them to the estimation equations for the Rasch model, you notice something striking and troublesome. There's a cross-weighting of data and parameters in Birnbaum's equations. The discrimination estimates weight the data when you estimate person ability, and the person ability estimates weight the data when you estimate item discrimination. This cross-weighting guarantees divergence. It guarantees the failure to converge reported in almost every paper about the Birnbaum model since 1968. In Stocking (1989), people are advised never to iterate more than four times, because, if they do, the estimates will go further and further away, even from the generating parameters of artificial data made to fit the model perfectly which the model is trying and failing to recover as it iterates.
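The cross-weighting can be seen in the joint maximum-likelihood equations. The sketch below is the standard textbook form, not a reproduction of Figure 1. For brevity it assumes the logistic model with c_i = 0 and the constant 1.7 absorbed into a_i; the full 3-PL equations carry additional weighting terms involving c_i. Here x_{ni} is the scored response (0 or 1) of person n to item i.

```latex
% Rasch: each parameter is estimated from its own marginal raw score
\sum_i x_{ni} = \sum_i P_{ni} \quad \text{(person } n\text{)}, \qquad
\sum_n x_{ni} = \sum_n P_{ni} \quad \text{(item } i\text{)}

% Birnbaum, with P_{ni} = \exp[a_i(\theta_n - b_i)] / \{1 + \exp[a_i(\theta_n - b_i)]\}:
% the ability equation is weighted by the discrimination estimates ...
\sum_i a_i \,(x_{ni} - P_{ni}) = 0 \quad \text{(ability } \theta_n\text{)}

% ... and the discrimination equation is weighted by the ability estimates
\sum_n (\theta_n - b_i)\,(x_{ni} - P_{ni}) = 0 \quad \text{(discrimination } a_i\text{)}
```

In the Rasch equations the raw scores are sufficient statistics, so the data enter unweighted. In the Birnbaum equations each set of estimates reweights the data used to obtain the other set, which is the feedback described above.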

These models are opposites. In estimation procedure they function oppositely. The Birnbaum model is impossible to apply. Unless you cheat, you can't apply it at all. In contrast, Rasch is easy to apply. It takes very strange real data configurations to prevent Rasch estimation from converging. Data configurations so strange that, when you track them down, you discover that these were not the data you wanted to analyze. Another polarity. Birnbaum is hard to use. Many say impossible. Rasch is easy to use. I've never heard of anyone being unable to use the Rasch model, whatever ideology they may profess.

What about application? They are also opposite in application. Because the Birnbaum model is only for multiple choice dichotomies, that's where it stops - at a dying-out kind of item soon to be gone. The Rasch model, in contrast, is for any kind of ordered observation, any kind at all: a rating, a scoring, a ranking. We have handled successfully a tremendously wide variety of data structures. Birnbaum is narrow, about as narrow as you can get. Rasch is wide, about as wide as you can get. So wide its limits have not yet been encountered. There are at least nine different formulations, kinds of models, which we can analyze with the Rasch formulation.

Finally we come back to another aspect in which they are opposite. It is an echo of the earlier ones. In Birnbaum we have a kind of promiscuity. Guessing is accepted as a reliable item asset. Discrimination is welcomed as a useful scoring weight. And crossed item characteristic curves are accepted as natural and unavoidable.

In Rasch, instead of promiscuity, we have choosiness. We don't want guessing. We recognize it as an unreliable person liability. I never met an item that guessed, so I look for guessing among people. I do find some people who guess sometimes, but not all people guess and seldom on the same items. So I do a better job of identifying and controlling guessing when I look for it in persons' responses, label it and decide what I want to do with it.

A USEFUL RULER

| WOODCOCK Mastery Scale | DIFFICULTY (Grade, 50% Mastery) | SAMPLE word |
|---|---|---|
| 25 | 1.1 | is |
| 41 | 1.3 | red |
| 58 | 1.4 | down |
| 70 | 1.5 | black |
| 86 | 1.7 | away |
| 101 | 1.8 | cold |
| 114 | 2.0 | drink |
| 124 | 2.2 | shallow |
| 143 | 2.8 | through |
| 159 | 3.3 | octopus |
| 174 | 4.1 | allowable |
| 192 | 5.7 | hinderance |
| 211 | 9.3 | equestrian |
| 240 | 12.9 | heterogeneous |

Labels in figure: A, B, C, D, E. FIXED ITEM POSITIONS DEFINE VARIABLE.

Figure 2. A useful, linear, invariant measuring instrument.

Variation in discrimination is also rejected by Rasch as a symptom of item bias or multi-dimensionality. This phenomenon has been followed up empirically many times (e.g., Masters, 1988). The items which vary in discrimination have been demonstrated to be contaminated by item bias or to introduce extra dimensions.

What I want to talk about in the minute or two I have left is crossed ICC's. I reject them because they prevent construct validity. Here in Figure 2 is a beautiful word-recognition ruler constructed by another man who makes measures for a living, Dick Woodcock (1974). In the left column are the inches on Dick's ruler. They mean the same amount from one end to the other. In the center column is the range of this ruler: from 1st to 12th Grade. In the right column are the words that define this variable, that specify its definition. "Red" is a nice short easy word. It is recognized at the 1st Grade. But, when you get down to "heterogeneous", it takes a 12th Grader to nail it down. We have a continuous construct here, specified explicitly, which we can use to make sense out of children's measures. This construct gives the scale meaning. The identification of a stable ordering and spacing of items is decisive for construct validity.

Figure 3. Five Rasch Items and Three Ability Levels. 1st = Low ability; 2nd = Medium ability; 3rd = High ability. Notice the 3 identical item-difficulty hierarchies (advancing from left to right).

Figure 4. Five Rasch Curves and Three Ability Levels. 1st = Low ability; 2nd = Medium ability; 3rd = High ability.

Look at Figure 3. It needs to be the case that, whether you are a 1st, 2nd or 3rd Grader, "red", "away", "drink", "octopus" and "equestrian" remain in the same order of experienced difficulty, at the same spacing. The ruler has to be the same for every child measured whatever their grade. If the ruler changes, it is not a ruler. It's something else.

Look at Figure 4. To obtain the arrangement in Figure 3 and to keep it stable, we need a special kind of response curve. Here is the item response ogive deduced from the standard definition of measurement. These curves are parallel, in the sense that they don't cross. If you make the vertical axis log-odds instead of probabilities, you will find that these curves become straight lines that are exactly parallel. The important thing to see is that they don't cross each other.
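A short sketch of the non-crossing property. The five item difficulties and three ability levels are hypothetical logit values chosen for illustration, not values read from Figures 2 to 4.

```python
import math

# Hypothetical item difficulties and abilities in logits (illustrative only).
ITEM_DIFFICULTY = {"red": -2.0, "away": -1.0, "drink": 0.0,
                   "octopus": 1.0, "equestrian": 2.0}
ABILITIES = {"1st": -1.5, "2nd": 0.0, "3rd": 1.5}


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))


def difficulty_order(ability: float) -> list[str]:
    """Items sorted from easiest to hardest as experienced at this ability."""
    return sorted(ITEM_DIFFICULTY,
                  key=lambda w: -rasch_probability(ability, ITEM_DIFFICULTY[w]))


# The item hierarchy experienced at each ability level.
orders = {grade: difficulty_order(b) for grade, b in ABILITIES.items()}

# Log-odds distance between the easiest and hardest item at each ability level.
gaps = {
    grade: log_odds(rasch_probability(b, ITEM_DIFFICULTY["red"]))
    - log_odds(rasch_probability(b, ITEM_DIFFICULTY["equestrian"]))
    for grade, b in ABILITIES.items()
}

for grade in ABILITIES:
    print(grade, orders[grade], round(gaps[grade], 6))
```

Under these assumptions the ordering is identical at all three ability levels, and the log-odds spacing between any two items is constant: the curves are parallel lines on the log-odds axis.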

Figure 5. Five Birnbaum Curves and Three Ability Levels. 1st = Low ability; 2nd = Medium ability; 3rd = High ability.

Figure 6. Five Birnbaum Items and Three Ability Levels. 1st = Low ability; 2nd = Medium ability; 3rd = High ability. Notice the 3 different item-difficulty hierarchies (advancing from left to right).

Now let's see what happens when we look at some Birnbaum curves. Figure 5 shows a handful of typical Birnbaum curves. They have different asymptotes, different slopes. It looks messy. It doesn't look like any measurement system that I would want to work with. Why? Because the curves cross.

In Figure 6 we see the consequence for the variable experienced by those three children. Incredible! Look at the 1st Grader. "Red" is easier than "away" is easier than "drink" is easier than "octopus". OK. But what happens to the 3rd Grader? For the 3rd Grader it's "away" that is easier. "Red" is harder even than "drink"! And "octopus" is now next to "red", instead of up near "equestrian". What is the definition of this variable? What is the construct defined here? What kind of ruler is this? It changes for every level of ability. I can't make a living with that kind of a ruler. No scientist, engineer, businessman or cook, who depends on measures of the kind this carpenter's ruler exemplifies, can work with that kind of ruler.
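The reordering can be reproduced with a small sketch. The (a, b, c) values below are hypothetical, chosen only so that the curves cross; they are not the parameters behind Figures 5 and 6, and the resulting orderings need not match those figures item for item.

```python
import math

# Hypothetical 3-PL parameters (a = discrimination, b = difficulty, c = guessing).
ITEMS = {
    "red":        (0.5, -1.0, 0.25),
    "away":       (2.0,  0.0, 0.00),
    "drink":      (1.0,  0.0, 0.20),
    "octopus":    (0.6,  1.0, 0.25),
    "equestrian": (1.5,  2.0, 0.10),
}
ABILITIES = {"1st": -1.5, "2nd": 0.0, "3rd": 1.5}


def birnbaum_probability(theta: float, a: float, b: float, c: float) -> float:
    """Probability of success under the 3-PL model with scaling constant 1.7."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def difficulty_order(theta: float) -> list[str]:
    """Items sorted from easiest to hardest as experienced at this ability."""
    return sorted(ITEMS, key=lambda w: -birnbaum_probability(theta, *ITEMS[w]))


# The item hierarchy experienced at each ability level.
orders = {grade: difficulty_order(theta) for grade, theta in ABILITIES.items()}

for grade, order in orders.items():
    print(grade, order)
```

With these values "red" is easier than "away" for the low-ability person but harder than "away" for the high-ability person, so the item hierarchy depends on who is being measured.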

Let's go backwards. Much as I might be intrigued by the apparent sophistication of the Birnbaum curves in Figure 5, I cannot work with them. I must have orderly, cooperating curves like the Rasch curves in Figure 4, and I must find data that will serve this purpose. I cannot swallow whatever junk happens to come my way. I must be choosy and selective and careful when I construct my data. When I go to market I don't buy rotten fruit. I buy good fruit. When I make a salad, I only pick the parts that make a good salad. I have a recipe for what I want. I have a model for measurement.

I need to make the kind of a structure in Figure 3 - the same ruler for everybody, so I can have a useful and stable construct definition like Dick's word-recognition ruler in Figure 2.

The Birnbaum model is data-centered: model must fit, else get a better model. It hardly ever objects to any item. The Rasch model is theory-centered: data must fit, else get better data. And in the search for better data, wonderful things are discovered about the nature of what you are measuring and the way that people can tell you about it. These discoveries are important events which develop and strengthen your construct and your ability to measure it. The Birnbaum model is patched up to chase after whatever pops up. The Rasch model is derived a priori, to define the criteria which data must follow to qualify for making measures.

Benjamin D. Wright

Barton, M. A., & Lord, F. M. (1981). An upper asymptote for the three-parameter logistic item-response model. Princeton, NJ: Educational Testing Service.

Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Guilford, J. P. (1936). Psychometric Methods. New York: McGraw-Hill.

Guttman, L. (1950). The basis for scalogram analysis. In Stouffer et al. (Eds.), Measurement and prediction. New York: Wiley.

Masters, G. N. (1988). Item discrimination: When more is worse. Journal of Educational Measurement, (24), 15-29.

Stocking, M. L. (1989). Empirical estimation errors in item response theory as a function of test properties (Research Report RR-89-5). Princeton, NJ: ETS.

Thorndike, E. L. (1904). An introduction to the theory of mental and social measurement. New York: Teachers College, Columbia University.

Thurstone, L. L. (1925). A method of scaling psychological and educational data. Journal of Educational Psychology, 15, 433-451.

Woodcock, R. W. (1974). Woodcock Reading Mastery Tests. Circle Pines, MN: American Guidance Service.

IRT in the 1990s: Which Models Work Best? 3PL or Rasch? B.D. Wright … Rasch Measurement Transactions, 1992, 6:1, 196-200

