The Differences Between Scores and Measures

All observations originate as ordinal, if not nominal, data. Quantitative science begins with identifying conditions and events which, when observed, are deemed worth counting. The resulting counts are sometimes called "raw scores" to distinguish them from "weighted" or "scaled" scores. But usually they are just called "scores". As such, they are no more than counts of observed events - essential for the construction of measures, but not yet measures. Since "scores" are so often mistaken for "measures" and then misused as though they were measures, we take the trouble to refer to "scores" as "counts" so that their failure to be measures will remain explicit.

Counting is the beginning of quantification. Measurement is deduced from well-defined sets of counts. The most elementary level is to count the presence of the defined event. But more information can be obtained when the conditions that identify countable events can be ordered into successive categories which increase in status along some intended underlying variable. It then becomes possible to count, not just the presence of an event, but the number of steps up the ordered set of categories which the observed category implies.

When a rating scale is tagged: "none", "plenty", "all", the inarguable order of these tags enables their use as steps. The observation of "none" can be counted as 0 steps up this scale, "plenty" as 1 step up and "all" as 2. This counting has nothing to do with any numbers or weights which might be assigned to the categories. "Plenty" might have been tagged "20" or "40" by the test author but the assertion of such a numerical category tag would not alter the fact that on this scale "plenty" is just 1 step up from "none".

All classifications are qualitative. Some classifications can be ordered and so become more than nominal. Other classifications, like sex, are usually not ordered, though there may be perspectives from which an ordering becomes useful. This does not mean that nominal data cannot have explanatory power. It does mean that nominal data are not measurement in the accepted sense of the word.

Counting steps, however, says nothing about distances between categories, nor is it a requirement that all items employ the same category tags. It would make no difference to the step counting if the categories were tagged "none", "plenty", "all" for this item and "none", "almost none" and "all" for another item. Even though the relative meanings and intended amounts corresponding to the alternative tags are different, their order is the same and so their step counts can only be the same. Whenever category tags share the same ordering, even though they may differ in implied amounts, they can only be represented by the same step counts.
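
The step counting described above can be sketched in a few lines of Python. This is only an illustration: the function name `step_counts` and the third, numeric-looking set of tags are assumptions made for the example, not part of any published procedure.

```python
def step_counts(ordered_tags):
    """Map each category tag to its number of steps up from the lowest category.

    Only the order of the tags is used; their wording or any numbers
    printed on them play no part in the count.
    """
    return {tag: steps for steps, tag in enumerate(ordered_tags)}


# Two items with different tags but the same ordering.
item_a = step_counts(["none", "plenty", "all"])
item_b = step_counts(["none", "almost none", "all"])

# Hypothetical item whose middle category was tagged "20" by the test author.
item_c = step_counts(["0", "20", "100"])

print(item_a)  # {'none': 0, 'plenty': 1, 'all': 2}
print(item_b)
print(item_c)
```

Whatever the middle category is called, it is counted as 1 step up, so the three items yield identical step counts.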

What every scientist, cook and carpenter means by a "measure" is a number with which arithmetic can be done: a number which can be added and subtracted, and the differences of which can be multiplied and divided, with results that maintain their numerical meaning. The original observations in any science are never measures in this sense. They cannot be measures because a measure implies and requires the previous construction and maintenance of a necessarily abstract measuring system which has proven useful.

Counts of events form a primitive ratio scale. They have an origin at "none" and a raw unit of "one more". But the events counted are specific rather than general, concrete rather than abstract and thus varying rather than uniform in their import. Sometimes the next "one more" implies a small increment as in the step from "none" to "almost none". Sometimes the next event implies a big increment as in the step from "none" to "plenty". Since all we can do is to count one more step, any particular raw count is insensitive to the differing implications of the steps. To get at the empirical magnitudes of the step sizes we must construct a measuring system based on a coordinated set of observed counts. This requires a measurement analysis of the ordinal observations which comprise the initial data in every science.

Even counts of time-honored units like grams, centimeters and seconds, so useful as measures in many contexts, need not function as measures in others. Counting the seconds it takes a patient to cross a room does not provide a linear measure of patient mobility. To construct a linear measure of patient mobility based on seconds elapsed we must count the seconds taken by a relevant sample of patients of varying mobility to cover a variety of relevant distances. Then we must analyze these counting data to discover whether a linear measure of mobility can be constructed from them and, if so, what its relationship to elapsed seconds may be.

Thorndike stressed the necessity of ascending from counting to measuring in 1904. Thurstone spent the 1920's developing partial solutions. Then in 1953 Rasch invented the model which is necessary as well as sufficient for the construction of measures in any science. Rasch noticed that a measure must retain its quantitative status regardless of the context in which it occurs. This means that a test item is only useful for measuring persons among whom it maintains a constant difficulty, and a person is only useful for calibrating items among which he maintains a fixed ability. Rasch also noticed that the outcomes of interactions between persons and items could never be fully pre-determined but must always involve an unpredictable element. This leads to the (stochastic Guttman) requirement that:
The more able the person, the more likely a success on any item.
The more difficult the item, the less likely a success for any person.

The (Rasch) model necessary for converting counts into measures follows by deduction from these conjoint requirements.
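
For dichotomous (success/failure) observations the model is usually written P(success) = exp(B - D) / (1 + exp(B - D)), where B is the person's ability and D the item's difficulty, both in log-odds units (logits). The sketch below, with hypothetical abilities and difficulties chosen only for illustration, checks that this form satisfies both requirements and that the log-odds difference between two persons is the same on every item.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Natural log of the odds of success."""
    return math.log(p / (1.0 - p))


abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical person abilities
difficulties = [-1.0, 0.0, 1.5]           # hypothetical item difficulties

# Requirement 1: the more able the person, the more likely a success on any item.
for d in difficulties:
    probs = [rasch_probability(b, d) for b in abilities]
    print(d, [round(p, 3) for p in probs])

# Requirement 2: the more difficult the item, the less likely a success for any person.
for b in abilities:
    probs = [rasch_probability(b, d) for d in difficulties]
    print(b, [round(p, 3) for p in probs])

# The comparison of two persons does not depend on which item is used.
person_gaps = [
    log_odds(rasch_probability(1.0, d)) - log_odds(rasch_probability(-1.0, d))
    for d in difficulties
]
print(person_gaps)
```

Each entry of `person_gaps` is the difference in ability between the two persons (2 logits here), whichever item is used to compare them.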

"Measurement" implies a count of "standard" (hence abstract) units from a "standard" starting point. The mental picture is a distance between points on a line. But there is no measurement requirement to find "the" point of minimum intensity or to extrapolate an "absolute zero mobility". It is only necessary to anchor the scale by choosing some origin. Usually there are frames of reference for which particular choices are useful.

The seemingly non-arbitrary origin of a ratio scale is more theoretical than practical. Logarithms convert any ratio scale to an interval scale and exponentiation converts any interval scale to a ratio scale. The interval scale's origin becomes the unit of the ratio scale and the interval scale's minus infinity becomes the ratio scale's origin.
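
A small numerical sketch of this conversion, using made-up ratio-scale amounts: equal ratios become equal differences, the ratio scale's unit lands on the interval scale's origin, and minus infinity on the interval scale maps back to the ratio scale's origin.

```python
import math

# Hypothetical ratio-scale amounts, each double the one before.
ratio_values = [0.5, 1.0, 2.0, 4.0]

# Logarithms convert the ratio scale to an interval scale.
interval_values = [math.log(v) for v in ratio_values]

# Equal ratios (x2) have become equal differences (log 2).
gaps = [b - a for a, b in zip(interval_values, interval_values[1:])]

# Exponentiation converts the interval scale back to the ratio scale.
recovered = [math.exp(v) for v in interval_values]

print(interval_values)      # the ratio unit 1.0 sits at the interval origin 0.0
print(gaps)
print(recovered)
print(math.exp(-math.inf))  # minus infinity maps to the ratio origin
```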

The practical convenience of measuring length from an arbitrary origin, like the end of a yardstick, far outweighs the abstract benefit of measuring from some "absolute" origin, such as the center of the Universe. Once an interval scale is constructed from relevant counts, we can always answer ratio questions such as: Is the distance from "wheelchair" to "unaided" more than twice the distance from "cane" to "unaided"?

In view of the difference between counts and measures, why do analyses of raw score counts and Likert rating scale tags seem to "work"? When data are complete and all data are used, the relationship between scores and measures is monotonic. This makes covariation analyses of scores similar to covariation analyses of the measures they may imply. Further, for complete data, the relationship between scores and measures is ogival because the finite interval between the minimum score and the maximum score must extend to an infinite interval of measures. Toward the center of this ogive the relationship between score and measure is approximately linear.
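
The score-to-measure ogive can be tabulated for a small test. The sketch below assumes ten dichotomous items with hypothetical difficulties and finds, by bisection, the ability at which the expected score under the dichotomous Rasch model equals each raw score. The function names are illustrative, and bisection is just one simple way to solve the equation.

```python
import math


def expected_score(ability, difficulties):
    """Expected raw score on a complete set of dichotomous Rasch items."""
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties)


def measure_for_score(raw_score, difficulties, lo=-20.0, hi=20.0):
    """Ability (logits) whose expected score equals the raw score.

    Extreme scores correspond to infinite measures.
    """
    if raw_score <= 0:
        return -math.inf
    if raw_score >= len(difficulties):
        return math.inf
    for _ in range(100):  # bisection on a monotonic function
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item difficulties in logits, symmetric about zero.
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0]

measures = [measure_for_score(r, difficulties) for r in range(len(difficulties) + 1)]
for raw, measure in enumerate(measures):
    print(f"score {raw:2d}  measure {measure:+.2f}")
```

In the printed table the measures rise with the scores, one more score point spans fewer logits near the middle than near the ends, and the minimum and maximum scores map to minus and plus infinity.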

But the monotonicity between score and measure holds: only when data are complete, only when every subject responds to every item, only when no responses are disqualified. This means no missing data and no tailoring item difficulties to person abilities. Further, the approximate linearity between central scores and their corresponding measures breaks down as scores approach their extremes.
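
A sketch of how incomplete or tailored data break the link between score and measure. Six hypothetical items are split into an easy form and a hard form, and each person's measure is estimated only from the items that person answered, using the dichotomous Rasch model and the same bisection approach as any simple solver. All names and values are assumptions made for illustration.

```python
import math


def estimate_measure(responses, difficulties):
    """Ability (logits) estimated from the answered items only (None = missing)."""
    answered = [(x, d) for x, d in zip(responses, difficulties) if x is not None]
    raw = sum(x for x, _ in answered)
    if raw == 0 or raw == len(answered):
        raise ValueError("extreme score: no finite measure")
    lo, hi = -20.0, 20.0
    for _ in range(100):  # bisection: expected score rises with ability
        mid = (lo + hi) / 2.0
        expected = sum(1.0 / (1.0 + math.exp(-(mid - d))) for _, d in answered)
        if expected < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item difficulties: three easy items, then three hard items.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

easy_raw_2 = estimate_measure([1, 1, 0, None, None, None], difficulties)
hard_raw_2 = estimate_measure([None, None, None, 1, 1, 0], difficulties)
hard_raw_1 = estimate_measure([None, None, None, 1, 0, 0], difficulties)

print(f"raw 2 on easy items: {easy_raw_2:+.2f}")
print(f"raw 2 on hard items: {hard_raw_2:+.2f}")
print(f"raw 1 on hard items: {hard_raw_1:+.2f}")
```

The same raw score of 2 implies measures 3 logits apart on the two forms, and a raw score of 1 on the hard items implies a higher measure than a raw score of 2 on the easy items.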

An occasional objection to Rasch measurement is its imposition of unidimensionality. This objection is puzzling because unidimensionality is a meaning of measurement. The necessity of the Rasch model as "the" method for constructing measures is due to its deduction from the measurement requirement of unidimensionality.

In practice, unidimensionality is conceptual rather than factual, qualitative rather than quantitative. No actual test can be perfectly unidimensional. No empirical situation can complete the requirements for measurement which generate the Rasch model. This reality is encountered by every science. Physicists' corrections for the unavoidable multi-dimensionalities they encounter are an integral part of their experimental technique.

If a test containing a mixture of medical and law items is used to make a single pass-fail decision, then the examination board, however inadvertently, has decided to use the test as though it were unidimensional. This is quite beside any qualitative or quantitative arguments which might "prove" multi-dimensionality. The board's practice does not make medicine and law identical or exchangeable anywhere but in their pass/fail actions. But their "unidimensional" pass/fail decisions do testify that they are making medicine and law exchangeable and hence identical for their purpose. Unless each item is treated as a test in itself, every test score is a compromise between the essential ideal of unidimensionality and the unavoidable exigencies of practice.

Before observations can be used to support quantitative research, they must be examined to see how well they fit together to define the intended underlying variable. Rasch measurement provides theory and technique. But the extent to which a particular set of observations can serve measurement is empirical. No total score can be accepted before verifying that its meaning is enough in accord with the meanings of the scores of its item components to lead to a useful measure. Assistance in doing this is provided by fit statistics which report the degree to which the observations approximate the assumptions necessary for measurement, and hence quantify the validity of the data.
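
As an illustration of such fit statistics, the sketch below computes the two mean-square statistics commonly reported in Rasch analysis, outfit (the mean squared standardized residual) and infit (the information-weighted version), for a single dichotomous item. The person abilities and the item difficulty are hypothetical and treated as known; in a real analysis they would be estimated from the data.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses, abilities, difficulty):
    """Return (outfit, infit) mean-squares for one dichotomous item."""
    squared_residuals = []
    variances = []
    for x, b in zip(responses, abilities):
        p = rasch_probability(b, difficulty)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))  # model variance of the response
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: squared residuals weighted by their model variances.
    infit = sum(squared_residuals) / sum(variances)
    return outfit, infit


abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical, assumed known
difficulty = 0.0                          # hypothetical, assumed known

ordered = [0, 0, 1, 1, 1]    # low-ability persons fail, high-ability persons succeed
reversed_ = [1, 1, 0, 0, 0]  # the opposite of what the model leads us to expect

print(item_fit(ordered, abilities, difficulty))
print(item_fit(reversed_, abilities, difficulty))
```

Both statistics have an expected value near 1 when the data accord with the model. The orderly pattern gives values below 1, while the reversed pattern gives values well above 1, flagging responses that contradict the intended variable.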

The process of test evaluation is never finished. Every time items collect new information from new persons to estimate new measures, we must verify again that the unidimensionality requirements of the measuring system have been well enough approximated by these new data to maintain the quantitative utility of the measures produced. Whether a particular set of data can be used to initiate or to continue a measuring system is empirical. This empirical question is addressed by 1) analyzing the relevant data according to a unidimensional measurement model, 2) discovering how well and in what parts these data conform to the intentions to measure, and 3) examining carefully those parts of the data which do not conform, and hence cannot be used for measuring, to see whether we can learn from them how to improve our observations and so better achieve our intention to measure.

Once interval-scale measures are obtained, it becomes reasonable to proceed with statistical analysis in order to determine, for example, the predictive validity of some measures or to compare measures produced by different tests to see if they are measures of the same thing, like inches and centimeters, or different things, like inches and ounces.

Differences between scores and measures. Wright BD, Linacre JM. … Rasch Measurement Transactions, 1989, 3:3 p.63

The Rasch Measurement SIG (AERA) thanks the Institute for Objective Measurement for inviting the publication of Rasch Measurement Transactions on the Institute's website, www.rasch.org.
