ABSTRACT. Quantitative observations are based on counting observed events or levels of performance. Meaningful measurement is based on the arithmetical properties of interval scales. The Rasch measurement model provides the necessary and sufficient means to transform ordinal counts into linear measures. Imperfect unidimensionality and other threats to linear measurement can be assessed by means of fit statistics. The Rasch model is being successfully applied to rating scales.
Merbitz and associates[5] provide a sensitive and useful explanation of the hazards encountered when data are treated improperly. This misconstruction of data can be understood as the result of a confusion as to the relationship between observation and measurement - a confusion which can be speedily resolved with a little clarification.
Data are Always Ordinal
All observations begin as ordinal, if not nominal, data. Quantitative science begins with identifying conditions and events which, when observed, are deemed worth counting. This counting is the beginning of quantification. Measurement is deduced from well-defined sets of counts. The most elementary level is to count the presence, "1," or absence, "0," of the defined condition or event.
More information can be obtained when the conditions that identify countable events are ordered into successive categories which increase (or decrease) in status along some intended underlying variable. It then becomes possible to count, not just the presence (versus absence) of an event, but the number of steps up the ordered set of categories which the particular category observed implies.
When, for example, a rating scale is labeled: "none," "plenty," "nearly all," "all," the inarguable order of these labels from less to more can be used to represent them as a series of steps. The observation of "none" can be counted as zero steps up this rating scale, "plenty" as one step up, "nearly all" as two steps up and "all" as three. This counting has nothing to do with any numbers or weights with which the categories might have been tagged in addition to or instead of their labels. For instance, "plenty" might also have been labelled as "20" or "40" by the instrument designer, but the assertion of such a numerical category label would not alter the fact that, on this scale, "plenty" is just one step up the scale from "none."
All classifications are qualitative. Some classifications[1], like those above, can be ordered and so are more than nominal. Other classifications, such as one based on race, can usually not be ordered, though there may be perspectives from which an ordering becomes useful. This does not mean that nominal data, such as race or gender, cannot have powerful explanatory or diagnostic power. Nonparametric statistical techniques[3] can be useful in such cases. But it does mean that they are not measurement in the accepted sense of the word.
As Merbitz and colleagues[5] emphasize, this counting of steps says nothing about distances between categories, nor does it require that all test items employ the same rating scale. Whenever four category labels share the same ordering, however else they may differ in implied amounts, they can only be represented by exactly the same step counts, even though, after analysis, their calibrations may well differ. It would not make any difference to the method of step counting if the four ordered categories were labeled quite differently for another item, say, "none," "almost none," "just a little," "all." Even though the relative meanings and the intended amounts corresponding to the alternative sets of category labels are conspicuously different, their order is the same and so their step counts can only be the same (0, 1, 2, 3). This is always so no matter how the four ordered categories might be labeled.
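The step counting described above can be illustrated with a minimal Python sketch. The function name `step_counts` is ours, chosen only for illustration; the two label sets are the ones used in the text. The sketch shows that the counts depend on the order of the categories alone, not on what the labels say or on any numbers attached to them.

```python
def step_counts(ordered_labels):
    """Map each category label to the number of steps up from the lowest category."""
    return {label: steps for steps, label in enumerate(ordered_labels)}

# Two rating scales whose labels imply very different amounts
# but which share the same four-category ordering.
scale_a = ["none", "plenty", "nearly all", "all"]
scale_b = ["none", "almost none", "just a little", "all"]

counts_a = step_counts(scale_a)
counts_b = step_counts(scale_b)

print(counts_a)
print(counts_b)

# The step counts are identical, whatever the labels may suggest.
same_counts = list(counts_a.values()) == list(counts_b.values())
print(same_counts)
```

Any differences in the amounts implied by the labels can only emerge later, when the step calibrations are estimated from data.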
Measures are Always Interval/Ratio
What every scientist and layman means by a "measure" is a number with which arithmetic (and linear statistics) can be done, a number which can be added and subtracted, even multiplied and divided, and yet with results that maintain their numerical meaning. The original observations in any science are not yet measures in this sense. They cannot be measures because a measure implies the previous construction and maintenance of a calibrated measuring system with a well-defined origin and unit which has been shown to work well enough to be useful. Merbitz and his coworkers stress the importance of linear scales as a prerequisite to unequivocal statistical analysis. They are saying that something must be done with counts of observed events to build them into measures. A measurement system must be constructed from a related set of relevant counts and its coherence and utility established.
Confusing Counts with Measures
It is true that counts of concrete events are on a kind of ratio scale. They have an obvious origin in "none" and the counted events provide a raw unit of "one more event." The problem is that the events that are counted are specific rather than general, concrete rather than abstract, and varying rather than uniform in their import. Sometimes the next "one more event" implies, according to the labels assigned, a small increment as in the step up from "none" to "almost none." Sometimes the next event implies a big increment as in the step up from "none" to "plenty." Since, in either case, all that we can do at this stage is to count one more step, our raw counts, as they stand, are insensitive to any differing implications of the steps taken. To get at these implied step sizes we must construct a measuring system based on a coordinated set of observed counts. This requires a measurement analysis of the inevitably ordinal observations which always comprise the initial data in any science.
Even those counts which seem to be useful measures in one context may not be measures in another[8]. For example "seconds" would seem always to be a linear measure of time. But, surprising as it may seem at first, counting the number of seconds it takes a patient to walk across a room does not necessarily provide a linear measure of "patient mobility." For that, the "seconds" counted are just the raw data from which a measuring system has still to be constructed. It is naive to believe that a seemingly universal counter like "seconds," that is so often linear in physics and commerce, will necessarily also be linear in the measurement of patient mobility. To construct a linear measure of patient mobility based on elapsed time we must first count the seconds taken by a relevant sample of patients of varying mobility to cover a variety of relevant distances of varying magnitudes. Then we must analyze these counting data to discover whether a linear measure of "mobility" can be constructed from them and if so what its relation to "seconds" may be.
The Step from Observation to Measurement
Realization of the necessity of a progression from counting observations to measurement is not new. Serious recognition of the need to transform observations into measures goes back to the turn of the century. Edward Thorndike[10] called for it 80 years ago. Louis Thurstone[11] invented techniques which partially solved the problem in the 1920s. Finally, in 1953, Georg Rasch[7] devised a complete solution which has since been shown to be not only sufficient but also necessary for the construction of measures in any science. The phrase "in any science" is notable here since the Rasch relationship has been shown to be just as fundamental to the construction of a surveyor's yardstick as it is to the construction of other less familiar and more subtle measures.
Rasch's insight into the problem was simple and yet profound. First, he realized that, to be of any use at all, a measure must retain its quantitative status, within reason, regardless of the context in which it occurs. For a yardstick to be useful for measurement, it must maintain its length calibrations irrespective of what it is measuring. So too, each test or rating scale item must maintain its level of difficulty, regardless of who is responding to it. It also follows that the person measured must retain the same level of competence or ability regardless of which particular test items are encountered, so long as whatever items are used belong to the calibrated set of items which define the variable under study. The implementation of this essential concept of invariance or objectivity has been successfully extended in the past decade to the leniency (or severity) of raters and to the step structure of rating scales.
Second, Rasch recognized that the outcome of an interaction between an object-to-be-measured, such as a person, and a measuring-agent, such as a test item, cannot, in practice, be fully predetermined but must involve an additional, unavoidably unpredictable, component. This realization changes the way we can usefully specify what is supposed to happen when a person responds to an item from an "absolute" outcome to a "likely" outcome. The final measuring system requirements become: the more able the person, the more likely a success on any relevant item. The more difficult the item, the less likely a success for any relevant person.
From just these, in retrospect rather obvious, requirements, Rasch deduced a mathematical model which specifies exactly how to convert observed counts into linear (and ratio) measures. The model also specifies how to find out the extent to which any particular conversion has been successful enough to be useful. This "Rasch" model has since been demonstrated to be the one and only possible mathematical formulation for performing this essential function.
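The text above does not write the model out. As a hedged illustration, the following Python sketch uses the usual statement of the dichotomous Rasch model, in which the probability of success depends only on the difference between person ability and item difficulty, both expressed in logits. The function names and the ability and difficulty values are hypothetical, chosen only for the example. The sketch checks the two requirements stated above and the invariance of person comparisons across items.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    """Convert a probability into log-odds (logits)."""
    return math.log(p / (1.0 - p))

# Hypothetical values, for illustration only.
ability_a, ability_b = 1.0, -0.5
difficulties = [-1.0, 0.0, 2.0]

gaps = []
for d in difficulties:
    p_a = rasch_probability(ability_a, d)
    p_b = rasch_probability(ability_b, d)
    # The more able person is more likely to succeed on every item,
    # and both persons are less likely to succeed as items get harder.
    gap = log_odds(p_a) - log_odds(p_b)
    gaps.append(gap)
    print(f"difficulty {d:+.1f}: P(A)={p_a:.3f}  P(B)={p_b:.3f}  log-odds gap={gap:.3f}")
```

The log-odds gap between the two persons equals the difference in their abilities whichever item is used, which is the invariance described in the preceding paragraphs.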
Rasch's introduction of his discovery appears in his innovative 1960 book[7]. Detailed, elementary explanations of why, when and how to apply Rasch's idea to dichotomous (right/wrong, yes/no, present/absent) data are provided by Wright and Stone[14]. The extension of this to rating scales and other observations embedded in ordered categories is developed and explained in Wright and Masters[13].
This conversion from counts to measures is greatly facilitated by the use of a computer. Rasch analysis computer programs have been available since 1965. The two most recent and most versatile are BIGSCALE[12] and FACETS[4]. These programs analyze the original data for the possibility of a single latent variable along which the intended measuring agents, the items, can be calibrated and the intended objects of measurement, the subjects, can be measured. The programs then report: 1) the best possible unidimensional calibrations and measures which these data can support, 2) the reliabilities of these calibrations and measures in terms of their standard errors and 3) their internal validities in terms of detailed fit statistics.
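To give a sense of what such a conversion involves, here is a minimal Python sketch. It is not the algorithm of BIGSCALE or FACETS, whose internals are not described here. It assumes dichotomous items whose difficulties are already known, and it uses ordinary Newton-Raphson maximum likelihood to find the person measure whose expected score equals the observed raw score. The function name `estimate_measure` and the item difficulties are hypothetical.

```python
import math

def estimate_measure(raw_score, difficulties, tol=1e-8, max_iter=100):
    """Return (measure, standard_error) in logits for one raw score.

    Assumes dichotomous Rasch items with known difficulties (a simplification).
    """
    n_items = len(difficulties)
    if not 0 < raw_score < n_items:
        raise ValueError("extreme scores have no finite measure")

    def probabilities(ability):
        return [1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties]

    # Start from the log-odds of the raw score, then refine.
    ability = math.log(raw_score / (n_items - raw_score))
    for _ in range(max_iter):
        probs = probabilities(ability)
        expected = sum(probs)                         # expected raw score
        information = sum(p * (1.0 - p) for p in probs)
        change = (raw_score - expected) / information
        ability += change
        if abs(change) < tol:
            break

    # Standard error from the information at the final estimate.
    information = sum(p * (1.0 - p) for p in probabilities(ability))
    return ability, 1.0 / math.sqrt(information)

# Hypothetical item difficulties, in logits.
item_difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

for score in range(1, len(item_difficulties)):
    measure, se = estimate_measure(score, item_difficulties)
    print(f"raw score {score}: measure {measure:+.2f} logits, SE {se:.2f}")
```

Real programs must also calibrate the items from the same data and handle rating scale steps and missing responses, which this sketch leaves out.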
Choosing an Origin
The concept of measurement implies a count of some well-defined unit from a well-defined starting point usually called "none" or "zero." This implication can be visualized as a distance between two points on a line. To be useful, measures must be set up to begin counting their standard units from some convenient reference point defined to be their standard origin. The location of this origin is fundamentally arbitrary, although there are often frames of reference, or theories, for which a particular position is especially convenient. Consider temperature. The Celsius, Fahrenheit and Kelvin scales have different zero points. Each choice was made for good theoretical reasons. Each has been convenient for particular applications. But no one of them is universally superior, despite the exhortations of molecular thermodynamicists. It is the same for psychometric scales. Each origin is chosen for the convenience of its users. Should two users choose different origins, then, as with temperature, it must be a simple monotonic operation to transform measures relative to one origin into measures relative to another, or they are not talking about the same variable. However intriguing it may be theoretically, there is no measurement requirement to locate an absolute point of minimum intensity or to extrapolate a point such as that of "zero mobility."
A ratio scale does have a clear origin. But that origin is usually of more theoretical interest than practical utility. It is a simple arithmetical operation to convert measures from an interval scale to a ratio scale and vice versa. When interval scales are exponentiated, their arbitrary origins become the unit of the resulting ratio scale and their minus infinity becomes this ratio scale's origin. This mathematical result, by the way, reminds us that the seemingly unambiguous origins of ratio scales, however intriguing they may be theoretically, are necessarily unrealizable abstractions. See also "What is a Ratio Scale".
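The exponentiation just described can be checked numerically. In this Python sketch the function name `to_ratio` and the measures are hypothetical. It shows the interval origin (zero) becoming the ratio unit (one), equal interval differences becoming equal ratios, and ever lower interval measures approaching, but never reaching, the ratio origin.

```python
import math

def to_ratio(interval_measure):
    """Exponentiate an interval-scale measure to place it on a ratio scale."""
    return math.exp(interval_measure)

def to_interval(ratio_measure):
    """Take the logarithm of a ratio-scale measure to return to the interval scale."""
    return math.log(ratio_measure)

# The arbitrary interval origin becomes the unit of the ratio scale.
unit = to_ratio(0.0)
print(unit)

# Equal differences on the interval scale become equal ratios.
ratio_high = to_ratio(3.0) / to_ratio(1.0)   # interval difference of 2
ratio_low = to_ratio(1.0) / to_ratio(-1.0)   # interval difference of 2
print(ratio_high, ratio_low)

# Very low interval measures approach the ratio origin without reaching it.
near_origin = to_ratio(-50.0)
print(near_origin)
```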
The practical convenience of being able to measure length from some arbitrary origin, like the end of a yardstick, far outweighs the abstract benefit of measurement from some theoretically interesting "absolute" origin, such as the center of the universe. With an interval scale, once it is constructed from relevant counts, we can always answer questions such as "Is the distance from 'wheelchair' to 'unaided' more than twice as far as the distance from 'cane' to 'unaided'?" The convenient origin for this kind of question is the shared category 'unaided' rather than some abstract point tagged "complete mobility" or "complete immobility."
Why Treating Raw Scores as Measures Sometimes Seems to Work
In view of the clear difference between counts and measures, why do regressions and other interval-level statistical analyses of raw score counts and numerical category labels so often seem to work? Examples mentioned include Miller's "100 point" scale[6], the LORS-IIB[6], the FIM[2], and the Barthel Index[2]. This paradox is due to the monotonic relationship between scores and measures when data are complete and unedited. This guarantees that correlation analyses of scores and the measures they may imply will be quite similar. Further, the relationship between scores and measures is necessarily ogival because the closed interval between the minimum possible score and the maximum possible score must be extended to an open interval of measures from minus infinity to plus infinity. Toward the center of this ogive the relationship between score and measure is nearly linear. But the monotonicity between score and measure holds only when data are complete, that is, when every subject encounters every item, and no unacceptably flawed responses have been deleted. This kind of completeness is inconvenient and virtually impossible to maintain, since it permits no missing data and prevents tailoring item difficulties to person abilities. It is also no more necessary for measurement than it would be to require that all children be measured with exactly the same particular yardstick before we could analyze their growth. Further, the approximate linearity between central scores and their corresponding measures breaks down as scores approach their extremes and is strongly influenced by the step structure of the rating scale.
Consequently, as Merbitz and associates warn, it is foolish to count on raw scores being linear. It is always necessary to verify that any particular set of raw scores does, in fact, closely correspond to linear measures before subjecting them to statistical analysis[2]. Whatever the outcome of such a verification it is clearly preferable to convert necessarily nonlinear raw scores to necessarily linear measures and then to perform the statistical analyses on these measures.
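The ogival relationship between scores and measures can be seen in a deliberately simplified case. The Python sketch below assumes a complete test of dichotomous items that all share the same difficulty, so that the measure implied by a raw score is just the log-odds of that score. The test length of 20 and the name `score_to_logit` are hypothetical. The sketch prints how many logits one more raw score point is worth at each score level.

```python
import math

def score_to_logit(raw_score, max_score):
    """Log-odds of a raw score; undefined at the extreme scores 0 and max_score."""
    if not 0 < raw_score < max_score:
        raise ValueError("extreme scores correspond to infinite measures")
    return math.log(raw_score / (max_score - raw_score))

MAX_SCORE = 20  # hypothetical test length

# Logits gained by one more raw score point, starting from each score.
increments = {
    r: score_to_logit(r + 1, MAX_SCORE) - score_to_logit(r, MAX_SCORE)
    for r in range(1, MAX_SCORE - 1)
}

for r, gain in increments.items():
    print(f"score {r:2d} -> {r + 1:2d}: {gain:.2f} logits")
```

Near the middle of the score range the gains are small and nearly constant, which is why raw scores can pass for linear there. Toward either extreme the same one-point gain covers a much longer stretch of the measure scale.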
Unidimensionality
An occasional objection to Rasch measurement is its imposition on the data of a single underlying unidimensional variable. This objection is puzzling because unidimensionality is exactly what is required for measurement. Unidimensionality is an essence of measurement. In fact the importance of the Rasch model as the method for constructing measures is due, in part, to its deduction from the requirement of unidimensionality.
In actual practice, of course, unidimensionality is a qualitative rather than quantitative concept. No actual test can ever be perfectly unidimensional. No empirical situation can meet exactly the requirements for measurement which generate the Rasch model. This fact of life is encountered by every science. Even physicists make corrections for unavoidable multidimensionalities an integral part of their experimental technique. Nevertheless, the ideal of unidimensional measures must be approximated if generalizable results are to be obtained.
If a test comprising a mixture of medical and law items is used to make a single pass/fail decision, then the examination board, however inadvertently, has decided to use this mixed test as though it were unidimensional. This is regardless of any qualitative or quantitative arguments which might "prove" multidimensionality. Further, their practical decision does not make medicine and law identical or exchangeable anywhere but in their pass/fail actions. But their "unidimensional" behavior does testify that they are making medicine and law exchangeable for these pass/fail decisions. Unless each test item is to be treated as a test in itself, every test score is a compromise between the essential ideal of unidimensionality and the unavoidable exigencies of practice. The Rasch model fit statistics are there in order to evaluate the success of that compromise in each instance. It is the responsibility of test developers and test managers to use these validity statistics to identify the extent of the compromises they are making and to minimize their effects on practice.
The pursuit of approximate unidimensionality is undertaken at two levels. First, the test constructor makes every effort to produce a useful set of observable categories (rating scales) which are intended and expected to work together to gather unambiguous information along a single, useful underlying dimension. Test items, tasks, observation techniques and other aspects of the testing situation are organized to realize, as perfectly as possible, the variable which the test is intended to measure. Second, the test analyst collects a relevant sample of these carefully defined observations and evaluates the practical realization of that intention.
Before observations can be used to support any quantitative research or substantive decisions, the observations must be examined to see how well they fit together to define the intended underlying variable on a linear scale[9]. Rasch provides theory and technique. But the extent to which a particular set of observations is in accord with this theory is, indeed, an "empirical matter"[3]. Merbitz and coworkers[5] caution us against blindly accepting any total score without verifying that its meaning is in accord with the meanings of the scores on its component items. Assistance in doing this is provided by fit statistics which report the degree to which the observations match the specifications necessary for measurement. Misfitting items can be redesigned. Misfitting populations can be reassessed. Once the quality of the measures has been determined, the analyst, test constructor, and examination board are then, and only then, in a position to make informed decisions concerning the quantitative significance of their measures.
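As a hedged illustration of what a fit statistic reports, the Python sketch below computes two mean-square statistics for a single dichotomous item. It uses the conventional definitions, which are not spelled out in this text: the outfit mean-square is the average squared standardized residual, and the infit mean-square is the sum of squared residuals divided by the sum of the model variances. The person measures, the item difficulty and both response patterns are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    squared_residuals = 0.0
    variances = 0.0
    squared_standardized = 0.0
    for x, ability in zip(responses, abilities):
        p = rasch_probability(ability, difficulty)
        variance = p * (1.0 - p)       # model variance of the response
        residual = x - p               # observed minus expected
        squared_residuals += residual ** 2
        variances += variance
        squared_standardized += residual ** 2 / variance
    infit = squared_residuals / variances
    outfit = squared_standardized / len(responses)
    return infit, outfit

# Hypothetical person measures (logits) and item difficulty.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
difficulty = 0.0

orderly = [0, 0, 1, 1, 1]    # less able persons fail, more able succeed
reversed_ = [1, 1, 0, 0, 0]  # the opposite, contradicting the intended variable

orderly_fit = item_fit(orderly, abilities, difficulty)
reversed_fit = item_fit(reversed_, abilities, difficulty)
print("orderly  (infit, outfit):", orderly_fit)
print("reversed (infit, outfit):", reversed_fit)
```

Mean-squares well above 1 signal responses less predictable than the model expects, and so flag the item for the kind of review described above.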
The process of test evaluation is never finished. Every time we use our measuring agents, questions, or items to collect new information from new persons in order to estimate new measures, we must verify in those new data that the unidimensionality requirements of our measuring system have once again been sufficiently well approximated to maintain the quantitative utility of the measures produced. Whether a particular set of data can be used to initiate or to continue a unidimensional measuring system is an empirical question. The only way it can be addressed is to 1) analyze the relevant data according to a unidimensional measurement model, 2) find out how well and in what parts these data do conform to our intentions to measure and, 3) study carefully those parts of the data which do not conform, and hence cannot be used for measuring, to see if we can learn from them how to improve our observations and so better achieve our intentions.
Once interval scale measures have been constructed, it is then reasonable to proceed with statistical analysis in order to determine the predictive validity of the measures from a particular test. We can also then compare the measures produced by different test instruments, such as the FAS subscales, to see if they are measures of the same thing, like inches and centimeters, or different things, like inches and ounces.
Rasch Analysis and the Practice of Measurement
The Rasch measurement model has been successfully applied to testing in schools since 1965, with large scale implementations in Portland (OR), Detroit, Chicago and New York. Many medical specialty boards, including the National Board of Medical Examiners[9], employ it in their certification examinations. Pilot research at the Veterans Administration and Marianjoy Rehabilitation Center has demonstrated that useful measures of the degree of impairment can be constructed from ratings of the performance of handicapped individuals. New applications of the Rasch model are continually emerging; judge-awarded ratings are currently an area of active interest for the Board of Registry of the American Society of Clinical Pathologists and for a national group of occupational therapists centered at the University of Illinois.
We are grateful to Merbitz and colleagues[5] for raising the important topic of ordinal scales and inference and so permitting us to discuss this often misunderstood concept of measurement.
BENJAMIN D. WRIGHT AND JOHN M. LINACRE
MESA Research Memorandum Number 44
MESA PSYCHOMETRIC LABORATORY
References
1. Gresham GE. Letter to the editor. Arch Phys Med Rehabil 1989; 70:867.
2. Hamilton BB, Granger CV. Letter to the editor. Arch Phys Med Rehabil 1989; 70:861-2.
3. Johnston MV. Letter to the editor. Arch Phys Med Rehabil 1989; 70:861.
4. Linacre JM. FACETS Computer program for many-faceted Rasch analysis. Chicago: MESA Press, 1989.
5. Merbitz C, Morris J, Grip JC. Ordinal scales and foundations of misinference. Arch Phys Med Rehabil 1989; 70:308-12.
6. Miller LS. Letter to the editor. Arch Phys Med Rehabil 1989; 70:866.
7. Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960, and Chicago: University of Chicago Press, 1980.
8. Santopoalo RD. Letter to the editor. Arch Phys Med Rehabil 1989; 70:863.
9. Silverstein B, Kilgore K, Fisher W. Letter to the editor. Arch Phys Med Rehabil 1989; 70:864-5.
10. Thorndike EL. An introduction to the theory of mental and social measurement. New York: Teachers College, Columbia University, 1904.
11. Thurstone LL. A method of scaling psychological and educational data. J Educ Psychol 1925; 15:433-51.
12. Wright BD, Linacre JM, Schulz M. BIGSCALE Rasch analysis computer program. Chicago: MESA Press, 1989.
13. Wright BD, Masters GN. Rating scale analysis. Chicago: MESA Press, 1982.
14. Wright BD, Stone MH. Best test design. Chicago: MESA Press, 1979.
This appeared in
Archives of Physical Medicine and Rehabilitation
70 (12) pp. 857-860, November 1989
Our current URL is www.rasch.org
The URL of this page is www.rasch.org/memo44.htm