It seems that the status quo in educational measurement has not changed one iota since the 1984 publication of Wright's "Despair and hope for educational measurement."
Arthur Ellen recently pointed out to me that the Winter 2000 issue of the Journal of Educational Measurement (37(4): 281-306) includes an article by Neil J. Dorans and Paul W. Holland entitled "Population invariance and the equatability of tests: basic theory and the linear case." The article starts off strong, with a first sentence that reads:
"The comparability of measurements made in differing circumstances by different methods and investigators is a fundamental pre-condition for all of science."
The phrase "by different methods" must mean different tests designed to measure the same construct. The methods by which these similar tests would be administered could be the same, if they were of the same format, but it isn't unreasonable to also expect multiple choice tests and performance assessments to scale together on a common construct, even though these involve quite different measuring methods.
But despite the great start, the authors then set themselves up to fail in the realization of this pre-condition, by not making a single reference to or utilization of fundamental measurement theory or principles (Krantz, et al. 1971, 1989, 1990; Perline, et al. 1979; Wright 1985; Fisher & Wright 1994; Michell 1999) in their entire article.
Their five requirements of test equating are listed (pp. 282-3):
1) the tests must address the same construct;
2) the tests must be of equal reliability;
3) the equating function for going from test Y to test X must be symmetrical with the function for going from X to Y;
4) examinees must obtain the same measure from either test (equity);
5) the equating function should be "population invariant".
This final phrase, taken from Lord (1980), is so awkward that the authors almost seem to be deliberately avoiding the terms sample- or scale-free.
Interestingly, Harris and Crouse (1993: 196) list four conditions for equating that they credit to Lord (1980). These are identical to four of the five requirements listed by Dorans and Holland, though the latter credit Lord only with the equity requirement. (Lord does not include the equal reliability requirement, which seems to be more of a convenience than a requirement.) Harris and Crouse state the invariance requirement as "the equating function should be invariant across populations." This is an awkward way of expressing the matter, since the populations of all relevant persons and items ought to define the variable. Irrelevant persons and items from other populations ought to define one or more other, different constructs. If the variable is invariant across what are presumed to be different populations, then, for the purposes of measuring this variable, they are different samples of the same population. It is very difficult to avoid the impression that these writers are putting themselves through contortions to avoid using the phrase "sample-free." Their success in not using those particular words, however, hardly prevents them from engaging with the concept, and the relevant issues have a long history of development in measurement theory, dating from Thurstone's (1959) sample- and scale-free methods of the 1920s.
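To make the symmetry and invariance requirements concrete, here is a minimal Python sketch of the linear case. It assumes the textbook mean-and-standard-deviation form of linear equating and a design in which each subgroup takes both tests; the function names and all score data are hypothetical illustrations, not Dorans and Holland's own procedure or index.

```python
from statistics import mean, pstdev


def linear_equating(x_scores, y_scores):
    """Slope and intercept of the linear function carrying X scores onto the Y scale.

    Matches means and standard deviations: y = mu_Y + (sd_Y / sd_X) * (x - mu_X).
    """
    slope = pstdev(y_scores) / pstdev(x_scores)
    intercept = mean(y_scores) - slope * mean(x_scores)
    return slope, intercept


def inverse(slope, intercept):
    """Algebraic inverse of y = slope * x + intercept."""
    return 1.0 / slope, -intercept / slope


def max_subgroup_gap(groups, score_points):
    """Largest gap between any subgroup's equating function and the pooled one.

    groups maps a subgroup name to (x_scores, y_scores) for that subgroup.
    """
    pooled_x = [x for xs, _ in groups.values() for x in xs]
    pooled_y = [y for _, ys in groups.values() for y in ys]
    a, b = linear_equating(pooled_x, pooled_y)
    gap = 0.0
    for xs, ys in groups.values():
        a_g, b_g = linear_equating(xs, ys)
        for x in score_points:
            gap = max(gap, abs((a_g * x + b_g) - (a * x + b)))
    return gap


# Hypothetical score data, invented purely for illustration.
group_a = ([10, 12, 14, 16, 18], [20, 24, 28, 32, 36])
group_b = ([11, 13, 15, 17, 19], [22, 26, 30, 34, 38])
group_c = ([10, 12, 14, 16, 18], [25, 27, 29, 31, 33])

# Symmetry: equating Y to X should be the inverse of equating X to Y.
slope_xy, intercept_xy = linear_equating(group_a[0], group_a[1])
slope_yx, intercept_yx = linear_equating(group_a[1], group_a[0])
inv_slope, inv_intercept = inverse(slope_xy, intercept_xy)
symmetric = (abs(slope_yx - inv_slope) < 1e-9
             and abs(intercept_yx - inv_intercept) < 1e-9)

# Invariance: subgroup functions should agree with the pooled function.
points = range(10, 20)
gap_invariant = max_subgroup_gap({"a": group_a, "b": group_b}, points)
gap_variant = max_subgroup_gap({"a": group_a, "c": group_c}, points)
```

In this sketch, groups a and b share one X-to-Y relationship, so their gap is negligible, while group c follows a different relationship and produces a gap of more than a score point. A gap of that kind is exactly the sort of data-based signal that the tests should not be treated as interchangeable for those groups.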
Not only is none of this history mentioned in the Dorans and Holland article, but the five equating requirements are disparaged by the authors as "not much of a theoretical underpinning for test equating" (p. 283). Dorans and Holland (p. 282) also state that "there is no systematic theory of test linking and equating." This apparent denial of the existence of measurement theory brings to mind similar denials in other IRT writings, as was pointed out by Wright (1984: 283). But were the equating requirements to be expressed in the terms of axiomatic measurement theory, they would have considerable theoretical underpinning and the authors could have taken more efficacious steps toward realizing their stated goal of measurement comparability across methods and investigators.
Perhaps it is possible to understand why measurement theory is being avoided. In a manner reminiscent of Whitehead's (1925) development of the "fallacy of misplaced concreteness", Fisher (1994) shows that test theorists commonly reduce construct validity to content validity. They do this by mistakenly confusing the particular sample of questions actually appearing on the test with the abstract population of all possible test questions that the sample may or may not represent. Thurstone (1959: 296) similarly points out that:
"There is a popular fallacy that a unit of measurement is a thing - such as a piece of yardstick. This is not so. A unit of measurement is always a process of some kind which can be repeated without modification in the different parts of the measurement continuum."
Dorans and Holland partake of this modern-day Pythagorean fallacy of misplaced concreteness, contending that
"It is our opinion that it is not at all that easy to identify the construct that a test measures. However, the most common way that this requirement is checked is by inspection of the content and wording of the test questions, in other words, by examining the two tests to see if they appear to competent judges to be measuring the same thing. This is, of course, easiest to judge when the two tests are built to the same specifications and follow the same test blueprint. This situation satisfies the equal construct requirement by definition."
As has been repeatedly shown, establishing construct validity necessarily entails an evaluation of the internal consistency of the observations facilitated by a test (Messick 1975, 1981; Benson 1998). Even when a single test is built from a single blueprint, there is no guarantee that it is construct-valid, i.e., that every item on it measures the intended construct, so why should the construction of two tests from the same test blueprint satisfy "the equal construct requirement by definition," as Dorans and Holland assert?
The failure to dissociate the concrete test content from the abstract measurement construct is intimately related to the unpopularity of scaling methods that make the hypothesis of a construct-valid quantitative variable vulnerable to falsification (Wilson 1971; Cliff 1982, 1992; Guttman 1985; Michell 1999). Thus Dorans and Holland (p. 286) address the failure to establish "population invariance" only briefly, acknowledging that some may be surprised and even offended by the suggestion that tests ought not to be linked when invariance is insufficient. Then they quickly qualify this recommendation. Fearing that some readers may think the authors are viewing "the requirement of population invariance as some sort of panacea," Dorans and Holland show how the requirement can be forced to hold in a trivial way, as though this somehow dilutes the fundamental importance of what Thurstone (1959: 228) called the "crucial experimental test" of invariance.
Because the internal consistency of the data is rarely evaluated, and is considered as secondary to content even when it is, Dorans and Holland say that, when tests are not equatable,
"The methods can still be carried out, and there is rarely any indication in the data to alert us that something inappropriate has been done. This lack of a data-based warning for inappropriate test equating is a problem that underlies all of test linking and it is the primary motivation for our interest in population invariance" (p. 283).
Had the authors started from the principles of item-and-examinee population invariance, i.e., sample independence, built into fundamental measurement theory, they would have found a wealth of data-based warnings for inappropriate equating. Foremost among these are methods in which model fit statistics and scatterplots are used to identify measures and calibrations that vary across samples. Experimental removal of inconsistent observations, or of entire subsets of observations, and comparison of these results with those obtained when the inconsistencies are left intact, is a long and respected tradition in the history of science (e.g., Kruskal, 1960).
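One such data-based warning can be sketched in a few lines of Python: compare item calibrations estimated separately in two samples and flag items whose standardized difference is large. This is a minimal sketch, assuming the calibrations (in logits) and their standard errors have already been estimated elsewhere; the function name, the item labels, the numbers, and the cutoff of 2.0 are hypothetical choices made for illustration.

```python
from math import sqrt


def flag_unstable_items(calib_a, se_a, calib_b, se_b, threshold=2.0):
    """Flag items whose calibrations differ across two samples.

    Each argument maps an item label to a calibration (logits) or its
    standard error. Calibrations are centered on the common items first,
    so a mere difference in scale origin is not mistaken for instability.
    Note that a strongly drifting item also shifts the centering mean.
    """
    items = sorted(set(calib_a) & set(calib_b))
    mean_a = sum(calib_a[i] for i in items) / len(items)
    mean_b = sum(calib_b[i] for i in items) / len(items)
    flagged = {}
    for i in items:
        diff = (calib_a[i] - mean_a) - (calib_b[i] - mean_b)
        z = diff / sqrt(se_a[i] ** 2 + se_b[i] ** 2)
        if abs(z) > threshold:  # cutoff is an assumed convention
            flagged[i] = z
    return flagged


# Hypothetical calibrations from two samples; q4 drifts in sample B.
calib_a = {"q1": -1.0, "q2": -0.5, "q3": 0.0, "q4": 0.5, "q5": 1.0}
calib_b = {"q1": -0.7, "q2": -0.2, "q3": 0.3, "q4": 1.8, "q5": 1.3}
se_a = {item: 0.15 for item in calib_a}
se_b = {item: 0.15 for item in calib_b}

flagged = flag_unstable_items(calib_a, se_a, calib_b, se_b)
```

With these invented numbers only q4 is flagged. Plotting one sample's calibrations against the other's, with confidence bands derived from the same standard errors, gives the scatterplot version of this check, and re-estimating with the flagged item removed gives the experimental comparison described above.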
Much more could be said about the issues raised in Dorans and Holland's paper, but what is of most intense interest is the fact that the research reported in this paper was prompted by Holland's experience as chair of a National Research Council committee that made recommendations about the feasibility of linking various state standardized assessments with the National Assessment of Educational Progress scales. The report of this committee (Feuer, et al. 1999) is available from the National Academy Press. The Executive Summary of this report states on page 1 that:
"The issues surrounding comparability and equivalency of educational assessments, although not new to the measurement and student testing literature, received broader public attention during congressional debate over the Voluntary National Tests (VNT) proposed by President Clinton in his 1997 State of the Union address. If there is any common ground shared by the advocates and opponents of national testing, it is the potential merits of bringing greater uniformity to Americans' understanding of the educational performance of their children. Advocates of the VNT argue that this is only possible through the development of a new test, while opponents have suggested that statistical linkages among existing tests might provide a basis for comparability."
"To help inform this debate, Congress asked the National Research Council (NRC) to study the feasibility of developing a scale to compare, or link, scores from existing commercial and state tests to each other and to the National Assessment of Educational Progress (NAEP). This question, stated in Public Law 105-78 (November 1997), was one of three, stemming from the debate over the VNT, that the NRC was asked to study. Under the auspices of the Board on Testing and Assessment, the NRC appointed the Committee on Equivalency and Linkage of Educational Tests in January 1998."
The page immediately after the title page states that "the members of the committee responsible for the report were chosen for their special competencies and with regard for appropriate balance." Just as is the case with the Dorans and Holland paper, however, axiomatic measurement theory is glaring in its absence from this report, suggesting that the special competencies of the committee members were inappropriately balanced away from a thorough and rigorous consideration of their charge from the President and Congress.
As a clue to the report's contents, consider that in a 13-page glossary of technical terms at the end of the report, there are no entries for "Sufficiency," "Invariance," "Construct validity," "Unidimensionality," "Parameter separation," "Additivity," or "Fundamental Measurement Theory," though the crucial importance of each of these for scientific measurement is well established and recognized. There are entries in the glossary for "Content congruence," "Content domain," "Content standard," and "Item Response Theory," and the term "scores" is routinely treated as though it means "measures".
Given the report's lack of measurement theoretical foundations, its prioritization of content validity over construct validity, and its focus on scores instead of measures, the committee's negative findings are to be expected:
1. Comparing the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale, is not feasible.
2. Reporting individual student scores from the full array of state and commercial achievement tests on the NAEP scale and transforming individual scores on these various tests and assessments into the NAEP achievement levels are not feasible.
In direct opposition to what we assume whenever we weigh apples in a supermarket, in the measurement of abilities we assume, nay, even grotesquely and counterproductively require, that the meaning of the measure depend on which brand of scale is used. This observation leads to the distasteful conclusion that, until the culture of the human sciences experiences a paradigmatic shift away from assumptions of sample- and scale-dependency toward the systematic deployment of metrological systems of data quality and quantity standards (Fisher 2000), society will not reap the benefits that stand to be gained from capitalizing on the established existence of mathematically rigorous, abstract quantities. Even if this shift and the gains to be realized will take decades or even centuries to occur, as the history of the metric system suggests, they will never happen if we do not make use of the tools at our disposal.
Fundamental measurement theory offers the tools, the criteria, and the standards through which fair and equitable scale- and sample-free metrics can be developed. We will not long be able to continue protesting about the injustices imposed upon us by the inferential constraints of the human sciences when we hold in our hands the keys to unlocking those constraints.
William P. Fisher, Jr.
Benson J. 1998. Developing a strong program of construct validation: a test anxiety example. Educational Measurement: Issues and Practice, 17(1):22.
Cliff N. 1982. What is and isn't measurement. In: Statistical and methodological issues in psychology and social sciences research, edited by G. Keren. Hillsdale, NJ: Lawrence Erlbaum.
Cliff N. 1992. Abstract measurement theory and the revolution that never happened. Psychological Science, 3:186-90.
Dorans NJ, Holland PW. 2000. Population invariance and the equatability of tests: basic theory and the linear case. Journal of Educational Measurement, 37(4):281-306.
Feuer MJ, Holland PW, Green BF, Bertenthal MW, Hemphill FC. 1999. Uncommon measures: equivalence and linkage among educational tests. Washington, DC: National Academy Press.
Fisher WP, Jr. 1994. The Rasch debate: validity and revolution in educational measurement. In: Objective measurement: theory into practice. Vol. II. Ed. M. Wilson. Norwood, New Jersey: Ablex.
Fisher WP, Jr, Wright BD, Eds. 1994. Applications of Probabilistic Conjoint Measurement. International Journal of Educational Research, 21(6):557-664.
Guttman L. 1985. The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1:3-10.
Harris DJ, Crouse JD. 1993. A study of criteria used in equating. Applied Measurement in Education, 6(3):195-240.
Krantz DH, Luce RD, Suppes P, Tversky A. 1971, 1989, 1990. Foundations of measurement. Vols 1-3 (authors vary). New York: Academic Press.
Kruskal WH. 1960. Some remarks on wild observations. Technometrics, 2: 1. Reprinted in Precision Measurement and Calibration, ed. H. H. Ku. 1969. Washington D.C.: National Bureau of Standards.
Lord FM. 1980. Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Messick S. 1975. The standard problem: meaning and values in measurement and evaluation. American Psychologist, 30:955-66.
Messick S. 1981. Constructs and their vicissitudes in educational and psychological measurement. Psychological Bulletin, 89:575-88.
Michell J. 1999. Measurement in psychology: A critical history of a methodological concept. Cambridge: Cambridge University Press.
Perline R, Wright BD, Wainer H. 1979. The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3(2):237-55.
Thurstone LL. 1959. The measurement of values. Chicago: University of Chicago Press, Midway Reprint Series.
Whitehead AN. 1925. Science and the Modern World.
Wilson TP. 1971. Critique of ordinal variables. Social Forces, 49:432-44.
Wright BD. 1984. Despair and hope for educational measurement. Contemporary Education Review, 3(1):281-8.
Wright BD. 1985. Additivity in psychological measurement. In: Measurement and personality assessment, edited by Roskam E. North Holland: Elsevier Science.
"Normal science, pathological science and psychometrics" (Joel Michell, Theory & Psychology, 10(5), 639-667, 2000) has been summarized thus:
"It examines the pathology of science, defined as a two-level breakdown in processes of critical inquiry: first a hypothesis is accepted without serious attempts being made to test it; and, second, this first-level failure is ignored. It is shown that the hypothesis upon which psychometrics stands, the hypothesis that some psychological attributes are quantitative, has never been critically tested. Furthermore, it is shown that psychometrics has avoided investigating this hypothesis through endorsing an anomalous definition of measurement. In this way, the failure to test this key hypothesis is not only ignored, but disguised. It is concluded that psychometrics is a pathology of science, and an explanation of this fact is found in the influence of Pythagoreanism upon the development of quantitative psychology." [Emphasis mine.]
Given Michell's use of the word "never", it sounds as though he himself is continuing to ignore 1) that Rasch models do in fact test the hypothesis that a variable is quantitative, and 2) the widespread use of these models in professional certification testing, educational research and practice, and in health care outcomes research. The rest of his arguments are well-formulated, however, and his work offers a leverage point for arguing in favor of improved measurement. Enjoy!
William P. Fisher, Jr.
Fisher, W.P. Jr. (2001) Invariant Thinking vs. Invariant Measurement. Rasch Measurement Transactions 14:4 p.778-81
The URL of this page is www.rasch.org/rmt/rmt144e.htm