Review of:
Item Response Theory by C.L. Hulin, F. Drasgow & C.K. Parsons, Homewood, IL: Dow Jones-Irwin, 1983
Applications of Item Response Theory ed. by R.K. Hambleton, Vancouver, B.C.: Educational Research Institute of British Columbia, 1983
Science builds on an evolving network of measurement. The realization that scientific measurement has special characteristics is not new. Campbell (1920) showed that what physical scientists mean by measurement requires an ordering system and the kind of additivity illustrated by physical concatenation. Campbell called this "fundamental measurement."
Fundamental Measurement
Intelligence test scores with which arithmetic could be done were called for by Terman and Thorndike before 1920 and required for measurement by Thurstone in the 1920's. Thurstone (1925) constructed some rather good interval scales with his absolute scaling. With his Law of Comparative Judgement he did even better, producing results that are successful instances of fundamental measurement (Thurstone, 1927). Insistence on order and additivity recurs in Guilford's (1936) definition of measurement. The significant consequence of additivity is the maintenance of the unit of measurement and hence of the invariance of comparisons of measures across the scale.
During the 1940's Guttman realized that a test score would be ambiguous in meaning unless the total score on the items was sufficient to reproduce the response pattern the score represented. This led Guttman (1950) to a criterion for judging whether data were good enough to build a scale. The data must demonstrate a joint order shared by items and persons.
During the 1950's the Danish mathematician Georg Rasch found that he could not obtain an invariance of test item characteristics over variations in persons unless he could represent the way persons and items interacted to produce a response in a special way. It had to be possible to represent a response by an exponential function in which the participation of person and item parameters could have a linear form (Rasch, 1960, p. 120).
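In present-day notation, the requirement can be sketched as follows. The symbols are an assumption of this sketch rather than Rasch's own: $B_n$ stands for the ability of person $n$, $D_i$ for the difficulty of item $i$, and $x_{ni}$ for the scored response (1 for right, 0 for wrong).

```latex
P(x_{ni} = 1 \mid B_n, D_i) = \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)},
\qquad
\ln \frac{P(x_{ni} = 1)}{P(x_{ni} = 0)} = B_n - D_i .
```

The log-odds of a right answer reduce to a simple difference between the person parameter and the item parameter, which is the linear form the paragraph above refers to.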
Rasch also noted that invariance could be maintained only when data could be gathered so that they cooperated with a stochastic response model that produced a joint order of response probabilities - similar to the joint order Guttman called for (Rasch, 1960, p. 117). As he worked with this first "Rasch model," Rasch discovered that the best (i.e., minimally sufficient) statistics from which to estimate person measures and item calibrations were none other than the good old unweighted sums of right answers for persons and for items - the familiar raw scores!
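The sufficiency of the raw score can be checked numerically. The following is a minimal sketch, assuming the dichotomous Rasch model in its usual logistic form; the function names and the three item difficulties are hypothetical. It shows that, once the raw score is fixed, the probability of each response pattern no longer depends on the person's ability.

```python
from itertools import product
from math import exp

def p_right(ability, difficulty):
    """Rasch probability of a right answer."""
    return exp(ability - difficulty) / (1.0 + exp(ability - difficulty))

def pattern_prob(pattern, ability, difficulties):
    """Probability of one 0/1 response pattern under local independence."""
    prob = 1.0
    for x, d in zip(pattern, difficulties):
        p = p_right(ability, d)
        prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_pattern_probs(ability, difficulties, score):
    """Distribution over the patterns that share a given raw score."""
    patterns = [pat for pat in product((0, 1), repeat=len(difficulties))
                if sum(pat) == score]
    probs = [pattern_prob(pat, ability, difficulties) for pat in patterns]
    total = sum(probs)
    return {pat: pr / total for pat, pr in zip(patterns, probs)}

difficulties = [-1.0, 0.0, 1.5]  # hypothetical item calibrations
low = conditional_pattern_probs(-2.0, difficulties, score=2)
high = conditional_pattern_probs(3.0, difficulties, score=2)

# The two conditional distributions agree although the abilities differ widely.
for pattern in low:
    print(pattern, round(low[pattern], 6), round(high[pattern], 6))
```

Because the ability parameter cancels out of the conditional distribution, the raw score carries all the information in the data about ability.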
Then Luce and Tukey (1964) showed that an additivity just as good for measurement as that produced by physical concatenation could be obtained from responses produced by the interaction of two kinds of objects (e.g., persons and items), if only this interaction were conducted so that its outcomes (e.g., item response data) were dominated by a linear combination of the two kinds of quantities implied (e.g., the differences between person measures and item calibrations). Luce and Tukey showed that, if care were taken, their "conjoint measurement" could produce results as fundamental as Campbell's fundamental measurement. For Luce and Tukey,
the moral seems clear: when no natural concatenation operation exists, one should try to discover a way to measure factors and responses (e.g., gather data) such that the 'effects' of different factors are additive (p. 4).
The realization that Rasch's models could do this when data were collected carefully enough followed (Brogden, 1979; Fischer, 1968; Keats, 1967). Perline, Wright and Wainer (1979) provide empirical demonstrations of the efficacy of the Rasch process in constructing fundamental measurement.
When Andersen (1977) showed that the sufficient statistics that enable fundamental measurement depend on scoring successive categories with the equivalent of successive integers - exactly the ordered category scoring most widely used on intuitive grounds - the Rasch model for dichotomously scored items was extended to response formats with more than two ordered categories (Andrich, 1978; Wright & Masters, 1982).
Common Practice
As this brief history shows, there has been enough successful theoretical and practical work on the nature and implementation of fundamental measurement to establish its necessity as a basic tool of science and its ready accessibility for educational researchers.
Fundamental measurement is obtainable from educational data and, in an intuitive form, educational researchers have relied on it for a long time. In their use of unweighted raw scores and successive integer category weights, educational researchers have been practicing the scoring required for fundamental measurement all along. Intuitive rather than explicit reliance, however, has meant that they have neither recognized nor enjoyed the benefits of fundamental measurement, and they have not built on its strengths to improve their understanding of education.
Illiterate Theory
The ready accessibility of fundamental measurement and its necessity for the successful practice of science is still unknown to most educational researchers and misunderstood by most psychometricians. In spite of 60 years of literature explaining the strong reasons for and illustrating the successful application of fundamental measurement in educational research, few psychometricians and fewer educational researchers attempt to construct fundamental measures.
Much current psychometric writing and practice goes on as though no knowledge concerning the theory and practice of fundamental measurement existed. In books claiming to provide the latest ways to make measurements with educational data, one might expect that an exposition of fundamental measurement would not only appear but would dominate the discussion, the choice of methods advanced, and the choice of data analyzed. In Item Response Theory by Hulin, Drasgow and Parsons and Applications of Item Response Theory, edited by Hambleton, there is no discussion of the nature, meaning, or practice of fundamental measurement.
Indeed, a theory of measurement is denied in Hulin, Drasgow and Parsons, who say "in social science we have no well-articulated meta-theory that specifies rules by which one can decide among competing (item response) theories on the basis of the propositions, assumptions, and conclusions of the theories" (p. vii). The possibility of a relevant theory is despaired of in Hambleton, where the "ineluctable conclusions" are "that no unidimensional item response model is likely to fit educational achievement data" and "the Rasch model is least likely to fit" (Traub, in Hambleton, p. 65).
Response models for one, two and three item parameters are described in detail (Hulin, Drasgow, & Parsons, Chapter 2, and Hambleton, Chapter 1), but their motivation through Thurstone's (1925) assumption of a latent response process of continuous distribution is unnecessarily abstract and circuitous. It would be so much better education and science to begin with a binomial model for the observable dichotomy as Rasch does (1960, pp. 73). That simple story is not only easier to follow but leads directly to fundamental measurement.
These two books are rife with the usual misunderstandings. The three parameter model is called a logistic model although, when guessing varies, Birnbaum (1968, p. 432) says it is not. The estimation virtues extolled by Swaminathan in a chapter featuring the three parameter model apply, in fact, only to the one parameter Rasch model. Indeed "the likelihood function (of the three parameter model) may possess several maxima" and its value at infinite ability "may be larger than the maximum value found" when ability is finite (Swaminathan, in Hambleton, p. 30).
The Rasch model is referred to as a special case of the three parameter model when, in fact, its singular significance for measurement is that it is a unique (necessary and sufficient) deduction from the (fundamental) measurement requirements of joint order and additivity.
The sum of discrimination estimates for items answered correctly, which cannot be a statistic because a statistic must be a function of data and not of other statistics, leads a double life. It is called a sufficient statistic (Bejar, in Hambleton, p. 11; Swaminathan, in Hambleton, p. 30) and then shown to be insufficient because "it is not possible to extend the conditional (estimation) approach to the two parameter model" so that "sufficient statistics exist for ability . . . only in the Rasch model" (Swaminathan, in Hambleton, p. 36).
Bejar worries that "the logistic model ignores the difficulty of the items answered correctly in assigning a score" (Bejar, in Hambleton, p. 14). But that is exactly what raw scores have always done. Although the score pays close attention to the difficulty of the test, it must be indifferent to which items within the test are answered correctly. This indifference is a necessary consequence of the local independence assumed by all of these models and required for any item banking.
The study of which items a person answers correctly, that is, the investigation of the joint order between observed person responses and estimated item difficulties, is the study of person fit. Does the person's data fit the measurement model? Is the person's performance valid? Indifference as to which items within a test are answered correctly is necessary for building measurement. But the measure constructed from this indifference is only valid when the person's performance pattern is stochastically consistent (jointly ordered) with the items' difficulties.
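The point can be illustrated with one fit statistic in common use in Rasch practice, the mean of squared standardized residuals. This is a sketch only: the statistic is not named in either book's treatment as summarized here, and the ability estimate, item difficulties, and response patterns are hypothetical. Both patterns have the same raw score, so they imply the same measure, but only one of them is jointly ordered with the item difficulties.

```python
from math import exp

def p_right(ability, difficulty):
    """Rasch probability of a right answer."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

def mean_square_fit(responses, ability, difficulties):
    """Mean squared standardized residual over the items a person took."""
    z_squared = []
    for x, d in zip(responses, difficulties):
        p = p_right(ability, d)
        z_squared.append((x - p) ** 2 / (p * (1.0 - p)))
    return sum(z_squared) / len(z_squared)

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical, easiest first
ability = 0.0                               # hypothetical ability estimate

consistent = [1, 1, 1, 0, 0]    # right on the easy items, wrong on the hard
inconsistent = [0, 0, 1, 1, 1]  # same raw score, reversed order

fit_consistent = mean_square_fit(consistent, ability, difficulties)
fit_inconsistent = mean_square_fit(inconsistent, ability, difficulties)
print(fit_consistent, fit_inconsistent)
```

Values near or below 1 are in line with the model; the reversed pattern produces a much larger value, flagging a raw score whose implied measure should not be taken at face value.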
Impractical Practice
Methods for using these response models are left to computer programs, mostly LOGIST. Lord help the poor researcher who hasn't got his LOGIST working. There are no explicit procedures, detailed examples, or even estimation equations provided. Any reader hoping to learn how to apply IRT will have to look elsewhere (e.g., Rasch, 1960; Spearritt, 1982; Wright & Stone, 1979). Even in Hambleton's Chapter 3, which is dedicated to LOGIST, there are no instructions for how to use it.
The basic problems with the two and three parameter models are made plain by the steps taken to deal with their symptoms. During "estimation in the two and three parameter models . . . the item parameter estimates drift out of bounds" (Swaminathan, in Hambleton, p. 34). Thus "Range restrictions (must be) applied to all parameters except the item difficulties" to control "the problem of item discriminations going to infinity" (Wingersky, in Hambleton, pp. 47-48). Worse than that, "items with vastly different discrimination and difficulty parameters (can) nonetheless have virtually identical ICCs in the ability interval" where the data are and the estimation work is to be done (Hulin, Drasgow, & Parsons, p. 100).
All this happens because an empirical binomial ICC expressed on a scale that must be defined by the same data contains only enough information to identify one item parameter. The only way more item parameters can be estimated is to assume that the particular persons participating in the item calibration are random examples of some "true" distribution of ability for which these items are always to be used.
The problem is aggravated by the way these models use ability in two ways at once. The first use is as a "difference" that specifies a distance between person ability and item difficulty. This difference is essential for the construction of measurement. The second use is as a "factor" to multiply item discrimination so that a different unit can be specified for each item. This interacts with the first use to confound estimation and prevent the construction of joint order or additivity. The nonlinear combination of item parameters prevents the algebraic separation of difficulty and discrimination and hence the derivation of sufficient statistics for estimating them. When ICCs are allowed to cross (e.g., Hambleton, pp. 45, 46, 163), the manifest difficulty order of items varies with ability. This prevents the construction of a variable that can be defined in any general way by the relative difficulties of its items.
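A small numerical sketch shows how crossing ICCs reverse the difficulty order of items. It assumes the usual two parameter logistic form, and the parameter values for the two items are hypothetical.

```python
from math import exp

def icc(ability, difficulty, discrimination=1.0):
    """Two parameter logistic ICC; discrimination 1.0 gives the Rasch form."""
    return 1.0 / (1.0 + exp(-discrimination * (ability - difficulty)))

# Hypothetical items: A is flat (low discrimination), B is steep.
item_a = {"difficulty": 0.0, "discrimination": 0.5}
item_b = {"difficulty": 0.5, "discrimination": 2.0}

low_ability, high_ability = -2.0, 2.0

# With unequal discriminations the easier item depends on who is asked.
a_low, b_low = icc(low_ability, **item_a), icc(low_ability, **item_b)
a_high, b_high = icc(high_ability, **item_a), icc(high_ability, **item_b)
print(a_low > b_low, a_high > b_high)

# With equal discriminations the order is the same at every ability.
rasch_a_low, rasch_b_low = icc(low_ability, 0.0), icc(low_ability, 0.5)
rasch_a_high, rasch_b_high = icc(high_ability, 0.0), icc(high_ability, 0.5)
print(rasch_a_low > rasch_b_low, rasch_a_high > rasch_b_high)
```

For the low ability person item A is the easier item, while for the high ability person item B is. Under equal discrimination, item A is easier for both.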
The estimations of item discrimination and person ability are based on a feedback between i) summing the product of observed response and current ability estimate over persons and ii) summing the product of observed response and current discrimination estimate over items (Birnbaum, 1968, pp. 421-422). This process cannot converge because the cumulative effect of the feedback between ability and discrimination pushes their estimates to infinity.
Yen describes the kind of trouble this can get one into:
The biggest surprise occurring with the CTBS/U interlevel linking related to the type of scale produced. As grade increased, the scale scores had decreasing standard deviations and corresponding decreasing standard errors of measurement and increasing item-discriminations.... This result was unexpected because the scaling procedure used with previous tests, Thurstone's absolute scaling, produced a scale with standard deviations increasing with grade. (Yen, in Hambleton, p. 139).
When the response model allows person ability and item discrimination to interact, this is bound to happen. The decreasing standard deviations are not describing children on an interval scale. They are describing the consequences of an estimation procedure that cannot keep item discriminations from drifting toward infinity.
Parameterizing guessing is much prized in theory. But "attempts to estimate the guessing parameter ... are not usually successful" (Hulin, Drasgow, & Parsons, p. 63). In one study, "40% of the guessing parameter estimates did not converge even with a sample size of 1593" (Ironson, in Hambleton, p. 160). Even when some estimates of guessing are obtained there are problems. "If a test is easy for the group (from which guessing parameters are estimated) and then administered to a less able group, the guessing parameters (from the more able group) may not be appropriate" (Wingersky, in Hambleton, p. 48). "When dealing with three parameter logistic ICCs, a nonzero guessing parameter precludes a convenient transformation to linearity" (Hulin, Drasgow, & Parsons, p. 173). None of this should be in the least surprising. The formulation of the model shows quite plainly that the explicit interaction between guessing and ability must make any guessing estimates inextricably sample dependent.
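The loss of linearity can be seen directly from the usual form of the three parameter model. The notation here is an assumption of this sketch: $\theta$ for ability, and $a_i$, $b_i$, $c_i$ for the discrimination, difficulty, and guessing parameters of item $i$.

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{\exp\bigl(a_i(\theta - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta - b_i)\bigr)}
            = \frac{c_i + \exp\bigl(a_i(\theta - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta - b_i)\bigr)},
\qquad
\ln \frac{P_i(\theta)}{1 - P_i(\theta)} = \ln\bigl(c_i + \exp\bigl(a_i(\theta - b_i)\bigr)\bigr) - \ln(1 - c_i) .
```

Only when $c_i = 0$ does the log-odds reduce to the linear expression $a_i(\theta - b_i)$. For any nonzero $c_i$ the guessing parameter sits inside the logarithm alongside ability and cannot be separated from it.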
Analysis of Fit
The analysis of fit for persons and items is addressed at length - two chapters in Hulin, Drasgow, and Parsons, and three in Hambleton. The techniques displayed and their sources are basically the same. But the substantive discussion and motivation are much better in Hulin, Drasgow, and Parsons. Unfortunately, as with estimation, the methods described are made to seem unnecessarily complex and arcane. If a reader should want to apply one of them, he or she will have to go elsewhere to learn how.
Harnisch and Tatsuoka call for person fit statistics with standard normal distributions and no correlations with ability (in Hambleton, pp. 114, 117-119). That is a good idea when fine tuning fit statistics against data simulated to fit. But they use their ideal to evaluate fit statistics applied to real data. With real data one would prefer fit statistics that are skewed by misfit and correlate negatively with ability (to detect the guessing of low ability persons). But these criteria are the opposite of the ideal Harnisch and Tatsuoka use in their comparisons.
A special technique for fit analysis used by Yen (in Hambleton, p. 126) on CTB items sounds attractive at first, but there is good reason to avoid it. Yen's first step is to smooth out the misfit occurring in her data by regressing observed scores of ability-groups on a monotonic function of their estimated expectations. Then she makes her "fit" comparisons, not against the regression residuals into which the misfit has just been pushed, but against the regression predictions from which the misfit has just been removed. The reason her "fit statistic is more stable" is that she is no longer analyzing misfit.
Ironson does a chapter (in Hambleton, Chapter 10) on item bias, but the topic is better dealt with in Chapter 5 of Hulin, Drasgow, and Parsons. The trouble with both approaches to bias is that bias found for groups is never uniformly present among members of the group or uniformly absent among those not in the group. For the analysis of item bias to do individuals any good, say, by removing the bias from their measures, it will have to be done on the individual level of the much more useful person fit analyses described in other chapters (Hulin, Drasgow, & Parsons, Chapter 4; Hambleton, Chapter 7).
Test Equating
Test equating appears in Chapter 6 of Hulin, Drasgow, and Parsons, only on the way to connecting tests written in more than one language, whereas Hambleton provides a full chapter (Chapter 11) on equating. Unfortunately item banking, the purpose of test equating, is not mentioned. Neither book describes the actual details of equating sufficiently well to enable the reader to learn how to do it.
There is a refreshing irony in the various accounts (Hulin, Drasgow, & Parsons, pp. 174, 202; Hambleton, pp. 45-46, 132, 178, 181-182, 191) of how equating with the three parameter model is actually accomplished. First of all, whereas the three parameter model is always claimed, the two parameter model is the one actually used to calibrate most of the tests to be equated. This happens because the guessing parameter proves too elusive to be kept as a variable. Then, when it comes to actually connecting two tests, even item discrimination is given up. The actual equating is based entirely on item difficulty, the one item parameter of the Rasch model. In other words, whenever these authors get involved in actually building a measuring system, even of only two tests, they are forced by what then happens to them to use the only response model that can build fundamental measurement, namely the one item parameter Rasch model.
The irony is particularly poignant in the Cook and Eignor attempt to distinguish between three methods for equating tests (in Hambleton, p. 191). Test equating must result in a single linear system of item calibrations and person measures (associated with test scores), which is invariant (but for scale and origin) over the data and also over the methods used to construct this system. Unless all three methods produce the same result, none of them has worked.
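Equating on item difficulty alone can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of any chapter in either book: two tests are calibrated separately, they share some common items, and the item names and calibration values are hypothetical. Since the two sets of calibrations differ only in origin, a single shift puts them on one scale.

```python
from statistics import mean

# Hypothetical calibrations (in logits) of the common items, one set per test.
common_on_a = {"item1": -0.8, "item2": 0.1, "item3": 1.0}
common_on_b = {"item1": -0.3, "item2": 0.7, "item3": 1.4}

# The shift that carries test B's origin onto test A's origin.
shift = mean(common_on_a.values()) - mean(common_on_b.values())

def to_a_scale(difficulty_on_b):
    """Re-express a test B calibration on test A's scale."""
    return difficulty_on_b + shift

# An item that appears only on test B (hypothetical value).
unique_b_item = 2.0
print(shift, to_a_scale(unique_b_item))

# Residual differences between the two calibrations of each common item
# show how well the link holds item by item.
link_residuals = {name: common_on_a[name] - to_a_scale(common_on_b[name])
                  for name in common_on_a}
print(link_residuals)
```

The residuals for the common items average zero by construction; any item with a large residual is behaving differently in the two tests and is a candidate for removal from the link.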
These authors claim concern with advancing science by providing better methods for building knowledge. But they offer models for item response data that systematically prevent the construction of fundamental measurement. The only way educational researchers can hope to build knowledge is to use models that insist on data from which fundamental measurement could be constructed - models that contain in their formulation the joint order and additivity conditions required for its construction.
For most of the models presented in these books to produce meaningful item estimates, it is necessary to assume that the persons who provided the data are random events sampled from the "right" simple standard distribution. Persons do not appear as individuals to be measured in these models. Person parameters are not even subscripted in most presentations. The scale reported is routinely standardized to give the person sample a mean of zero and a variance of one as though those particular persons were a true random sample of exactly the population of persons with whom the scale would always be used - and as though any individual person subsequently measured could be usefully understood as no more than a random instance of the particular sample of other persons with whom the item analysis was done.
Figments of Despair
Despair is the latent message in books like this - despair of ever constructing any good mental or psychological variables. Review the testimony. The item response models described are said to be:
Hard to understand. The "procedures involved in estimation are complex and require sophistication on the part of the user" (Bejar, in Hambleton, p. 3).
Difficult to use. "The difficulty in applying these models is stressed" (Bejar, in Hambleton, p. 1). "Working with IRT is an arduous process" and "the largest hurdle is the estimation" (Bejar, in Hambleton, p. 3). Would-be users are warned that they "must be ready to pay ... by investing substantial resources in parameter estimation and model monitoring" (Bejar, in Hambleton, p. 17).
Demanding of large samples of persons and many test items (Bejar, in Hambleton, p. 3; Ironson, in Hambleton, p. 160; Wingersky, in Hambleton, p. 46). "Even long tests and large samples do not necessarily allow accurate estimation of the guessing parameter" (Hulin, Drasgow, & Parsons, p. 100).
Unreliable in action. Discrimination eludes capture because of its interaction with ability. Guessing can't be laid hold of, not only because it is persons and not items who are doing it, but because when plausible estimates do emerge they are sample dependent.
Hopeless in any event. "Multiple-choice items ... are unlikely to be modelled very well by any unidimensional item response model" (Traub, in Hambleton, p. 57). "Speeded administration and an item format that permits guessing will each introduce another trait into the response process" (Traub, in Hambleton, p. 62). "Data from examinees who vary in their guessing propensities will result in systematically biased calibrations of items and systematically biased estimates of examinee ability" (Traub, in Hambleton, p. 63). As for using a model to select data good enough to make measurements with, "It will be a sad day indeed when our conception of measurable educational achievement narrows to the point where it coincides with the criterion of fit to a unidimensional item response model" (Traub, in Hambleton, p. 64).
If this were all we had to look forward to, the future of educational research would be dim indeed. Fortunately there is a readily accessible and remarkably hearty antidote to all the confusion and despair, even to Traub's terrors.
Foundations for Hope
Hope for the future of measurement in educational research can be found in the following:
The theory of fundamental measurement, the scoring for which we have been engaged in for decades, and the advantages of which we are poised to enjoy. This theory can be put in a form that is easy to understand. The unique response model for applying it, the one parameter Rasch model, is easy to grasp. Finally, what can be more straightforward than seeking the kind of person-item interactions that can be understood as showing us the difference between the person's ability and the item's difficulty in an orderly and uniform way?
The one-parameter model is easy to use. No special computer or computer program is required. Scores of researchers (including R.L. Thorndike) have long since written their own Rasch programs for their preferred computers and hand calculators. If the PROX approximation is used (Wright & Stone, 1979), the job can be done by hand.
Even the smallest data sets can be usefully addressed with this model. Wright and Stone (1979) get important information about the structure of Knox's Cube Test and the status of the children tested from no more than 14 items taken by 34 children.
The functioning of this model is robust. The Rasch model is so robust, in fact, that it is routinely used during every second cycle of LOGIST to resist the digressive pressure from the guessing and discrimination parameter estimates to stray (Wingersky, in Hambleton, p. 47). Whenever tests are to be equated, it is the Rasch item parameter, difficulty, that is used (Hulin, Drasgow, & Parsons, pp. 174, 202; Hambleton, pp. 45-46, 132, 178, 181-182, 191). This not only testifies in a natural way to the necessity of the Rasch model for the construction of measurement systems, but its many successes in practice (Wright & Bell, 1984) encourage us that there is something useful we can do.
The Rasch sufficient statistics are as relevant and as hopeful as any raw score ever was. That's because they come directly from raw scores. If the millions of tests given in the past 70 years were good for anything, if any test ever given and then summarized by the count of right answers was useful, then there is hope.
This hope is particularly strong today because the one parameter model gives us a clear and simple prescription for what a raw score is supposed to do. This puts us in the position of being able to review every performance pattern before we draw any final conclusions concerning its raw score or implied measure. Routine analyses of the joint order between estimated item difficulties and observed responses enable us to validate our measures (and calibrations) when they deserve it and to identify, diagnose, and learn from those situations where our analyses point out improbable inconsistencies.
The one parameter model has all the statistical virtues extolled by Swaminathan (in Hambleton, pp. 30, 33, 35). It works well on all kinds of data (Hulin, Drasgow, & Parsons, pp. 57, 95, 96; Hambleton, pp. 200, 221, 226). It does better than the three parameter model two out of three times even in the hands of a researcher who wishes with all his heart it were otherwise and finds it "surprising to observe the one parameter model performing better than the three parameter model . . . since [as he mistakenly believes] there is no theoretical reason to expect such a result" (in Hambleton, pp. 208-209).
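The PROX approximation mentioned above is simple enough to sketch in full. The following follows the procedure as it is commonly stated from Wright and Stone (1979): convert item and person scores to logits, then expand each set by a factor that allows for the spread of the other. The constants 2.89 and 8.35 are 1.7 squared and 1.7 to the fourth power. The function name and the tiny data set are hypothetical, and the details should be checked against the source before serious use.

```python
from math import log, sqrt
from statistics import mean, variance

def prox(responses):
    """PROX sketch. responses: one row of 0/1 item scores per person.

    Assumes every person took every item and that no person or item
    has a zero or perfect score (such cases must be set aside first).
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    item_scores = [sum(row[i] for row in responses) for i in range(n_items)]
    person_scores = [sum(row) for row in responses]

    # Item logits: log odds of a wrong answer, so harder items are higher.
    x = [log((n_persons - s) / s) for s in item_scores]
    # Person logits: log odds of a right answer.
    y = [log(r / (n_items - r)) for r in person_scores]

    u = variance(x)  # spread of item logits (test width)
    v = variance(y)  # spread of person logits (sample spread)
    denominator = 1.0 - u * v / 8.35

    item_expansion = sqrt((1.0 + v / 2.89) / denominator)
    person_expansion = sqrt((1.0 + u / 2.89) / denominator)

    x_mean = mean(x)
    difficulties = [item_expansion * (xi - x_mean) for xi in x]
    abilities = [person_expansion * yr for yr in y]
    return difficulties, abilities

# Hypothetical data: 6 persons by 4 items.
data = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
difficulties, abilities = prox(data)
print([round(d, 2) for d in difficulties])
print([round(b, 2) for b in abilities])
```

The item calibrations are centered at zero, and persons with the same raw score receive the same measure, as the sufficiency of raw scores requires.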
Fortunately the despair intimated in books like these and acted out in choices of methods that are unnecessarily complex and incorrigibly ineffective is not all there is. There is the theory and practice of fundamental measurement ready and waiting for any educational researcher seriously interested in building educational variables in order to measure the status and change of individuals. There are many Rasch built item banks doing good work (Wright & Bell, 1984). For a recent overview of Rasch thinking and doing there is the Australian Council for Educational Research Golden Jubilee Seminar on The Improvement of Measurement in Education and Psychology (Spearritt, 1982). Best of all, the University of Chicago Press (1980) has reprinted Rasch's own book on Probabilistic Models for Some Intelligence and Attainment Tests. No scholar of quantitative research method, whether student or professor, need remain unlettered in that great book!
by Benjamin D. Wright
MESA Research Memorandum Number 41
MESA PSYCHOMETRIC LABORATORY
References
ANDERSEN E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
ANDRICH D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
BIRNBAUM A. (1968). Some latent trait models. In F.M. Lord & M.R. Novick, (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
BROGDEN H.E. (1979). The Rasch model, the law of comparative judgement and additive conjoint measurement. Psychometrika, 42, 631-634.
CAMPBELL N.R. (1920). Physics: The Elements. London: Cambridge University Press.
FISCHER G. (1968). Psychologische Test Theorie. Bern: Huber.
GUILFORD J.P. (1936). Psychometric methods. New York: McGraw-Hill.
GUTTMAN L. (1950). The basis for scalogram analysis. In Stouffer et al. (Eds.), Measurement and prediction. New York: Wiley.
KEATS J.A. (1967). Test theory. Annual Review of Psychology, 18, 217-238.
LUCE R.D. & TUKEY J.W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.
PERLINE R., WRIGHT B.D. & WAINER H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-256.
RASCH G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut. (Reprinted 1980, Chicago: University of Chicago Press.)
SPEARRITT D. (Ed.). (1982). The improvement of measurement in education and psychology. Victoria: Australian Council for Educational Research.
THURSTONE L.L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433-451.
THURSTONE L.L. (1927). A law of comparative judgement. Psychological Review, 34, 273-286.
WRIGHT B.D. & BELL S.R. (1984). Item banks: What, why and how. Journal of Educational Measurement, 21(4), 331-345.
WRIGHT B.D. & MASTERS G.N. (1982). Rating scale analysis. Chicago: MESA Press.
WRIGHT B.D. & STONE M.H. (1979). Best test design. Chicago: MESA Press.
This appeared in Contemporary Education Review, Spring 1984, Volume 3, Number 1, pp. 281-288