Let us start by considering Harvey Goldstein (HG, 2012, p.153):
HG: "The specific literature on the 'Rasch' model, a particularly simple item-response model, is ... insistent that only a single dimension is needed in any given application,"
JML comment: The number of dimensions needed, or encountered, in a given application depends on the application, but, whenever we talk about "more" or "less" of something, we have declared the "something" to have the properties of a dimension. The goal of the Rasch model is to quantify that dimension in terms of additive units of "more"-ness. The complexity of the Rasch model matches this task.
WPF comment: Quantification is inherently linear along a single dimension of more and less. If quantification is desired, isolating those aspects of a construct that exhibit consistent variation is essential.
HG: "The specific literature on the 'Rasch' model .... displays a general unwillingness to explore further (see Goldstein 1980 for an illustrative example)."
JML comment: Rasch analyses are unusual in that every person, demographic group, item, item response option, even each individual response, can be reported with fit statistics, estimates and other indicators, as appropriate. Routine exploration of any dataset includes searching for secondary dimensions in the data, and determining their impact on the empirical functioning of the intended dimension. The depth and complexity of Rasch analysis have advanced considerably since 1980. For instance, the User Manual for BICAL, the leading Rasch software in 1980, was 95 pages of text, and BICAL had about 2,000 lines of computer code. An equivalent Rasch program in 2012, Winsteps, has a User Manual with 615 pages of text and has more than 70,000 lines of computer code.
WPF comment: The specific literature that refers to Rasch's work is wide ranging in the explorations of the infinite ways in which constructs can interact, overlap, or display anomalous features. Karabatsos (2003), for instance, examines 36 different ways of evaluating inconsistencies in person measures. In addition, a wide range of Rasch models for item bundles or testlets, multidimensional collections of constructs, multilevel models of group-level effects, and multifaceted situations of many kinds have emerged in the last 30 years.
HG: "proponents of this model regard the model as paramount and suggest that data should be constructed or modified to satisfy the model's assumptions."
JML comment: Social Scientists, indeed scientists of all types, construct or modify data to meet their intentions. For instance, Census Bureaus construct the data they want by writing appropriate questions. Analysis of Census data often requires that the data be modified, because the analytical question does not exactly match the question on the Census form.
Currently "data mining" methodology is in vogue and considered to be highly successful. Here are its stages (Fayyad et al., 1996): (1) Data Selection, (2) Data Pre-processing, (3) Data Transformation, (4) Data Mining, (5) Interpretation/Evaluation. Rasch methodology uses the same stages, but with (4) Rasch analysis. Stages (1) and (2) correspond to data construction and modification. A difference is that Rasch analysts tend to be more methodical and overt about their data procedures.
WPF comment: HG's objection is written in a grammatically correct English sentence. This sentence and manner of communication prioritize a model of a competent English reader able to understand written text. HG, like most other proponents of this model, regards it as paramount and assumes that readers will be able to construct or modify data to satisfy the model's assumptions. A measurement model is no different. Instruments are texts that are written, read and interpreted using the same cognitive operations employed in any act of reading. HG would no more attempt written communication in terms of a model that allows ungrammatical constructions, mixed languages and orthographies, or stray marks than measurement should be attempted in terms of models that legitimate just any kind of data. GIGO.
HG: "Thus, Andrich (2004) claims that this model satisfies the conditions of 'fundamental measurement' and as such attains the status of measurement in the physical sciences"
JML comment: From a practical perspective, most measurement in the physical sciences is based on additivity, "one more unit is the same amount extra, no matter how much there already is." Additivity can be demonstrated for Rasch parameter values (Rasch measures) (Wright 1988), so Rasch measures have the practical status of physical measures.
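The additivity described here can be checked numerically. The following is a minimal sketch, assuming the dichotomous Rasch model in its usual log-odds form; the function names and the difficulty and ability values are hypothetical, chosen only for illustration.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (values in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    """Log-odds of success corresponding to a probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical item difficulties and starting abilities, in logits.
gains = []
for difficulty in (-1.5, 0.0, 2.0):
    for ability in (-2.0, 0.0, 1.0):
        before = log_odds(rasch_probability(ability, difficulty))
        after = log_odds(rasch_probability(ability + 1.0, difficulty))
        gains.append(after - before)

# One more logit of ability adds one logit of log-odds, whatever the item
# and however much ability there already is.
print([round(g, 6) for g in gains])
```

The success probabilities themselves do not change by equal steps; it is the log-odds metric in which "one more unit" is the same amount extra.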
WPF comment: Measurement in physics is often misconstrued as primarily a matter of accessing concrete objects. On the contrary, the laws of science project unrealistic and unobservable phenomena, like balls rolling on frictionless planes, or objects left entirely to themselves with no external influence, or a point-like mass swinging on a weightless string. Rasch models are exactly like models in physics in this respect of positing unobservable ideals that serve as heuristic guides to inference and decision making.
HG: "- a view about measurement in the social sciences that in a slightly different context Gould (1981) has labelled 'physics envy'."
JML comment: "Overcoming Physics Envy" (Clarke & Primo, 2012) begins "How scientific are the social sciences? Economists, political scientists and sociologists have long suffered from an academic inferiority complex: physics envy. They often feel that their disciplines should be on a par with the 'real' sciences and self-consciously model their work on them, using language ('theory,' 'experiment,' 'law') evocative of physics and chemistry."
Yes, Rasch analysts also share this feeling. But is it a bad feeling? Haven't "theory," "experiment," "law" generated 400 years of obvious progress in physics and chemistry? Would social science be possible without theories and hypotheses to guide our thoughts, experiments to verify our conclusions, laws (observed regularities) to encapsulate those conclusions into communicable and useful forms? It is the same with measurement. "How much?" is a basic question in both "real" science and social science. Additive measures of well-defined variables are the most straightforward way for us to think about, communicate and utilize "much"-ness.
"Overcoming Physics Envy" ends "Rather than attempt to imitate the hard sciences, social scientists would be better off doing what they do best: thinking deeply about what prompts human beings to behave the way they do."
But "thinking deeply" is exactly what Rasch facilitates. The bulk of the raw data is segmented into well-behaved, understandable dimensions on which carefully-thought-out defensible inferences about human beings can be based. The ill-behaved remnants of the raw data are perplexing, perhaps inexplicable. We can think deeply about these remnants and perhaps generate new insights from them about human behavior, but these confusing remnants do not impede us from making progress.
WPF comment: Many social scientists have long been doing what they do best. Beginning from the emergence of qualitative methods in the 1960s and 1970s, there has been less and less concern with imitating any other field, while more and more effort has been invested in creative thinking. Recent studies of model-based reasoning in science (for instance, Nersessian, 2006, 2008) show that scientific thinking is not qualitatively different from any other kind of thinking. The goal is not to imitate physics or any one field, but to think productively in a manner common to all fields. Rasch (1960) explicitly draws from Maxwell's method of analogy, which is exactly the example Nersessian (2002) uses to illustrate model-based reasoning (Fisher, 2010).
Now let us consider Goldstein (2010), his response to Panayides et al. (2010). Goldstein asserts that the Rasch "model is inadequate, and that claims for its efficacy are exaggerated and technically weak." Here is the evidence Goldstein presents in support of this generalization.
HG: Around 1980, in the United Kingdom, "the advocates of using Rasch, notably Bruce Choppin, had a weak case and essentially lost the argument. It was this failure to make a convincing case that led to the dropping of the use of this model for the [United Kingdom]."
JML comment: Around 1980, a convincing case could not be made for any psychometric methodology, as my employer at the time, Mediax Associates, discovered. However, indications were more hopeful for Rasch than for any of its competitors. Linacre (1995) demonstrates that the deficiencies in the British educational system, confirmed by Bruce Choppin's application of Rasch methodology, were crucial in its rejection.
HG: "the essence of the criticisms remains and centres around the claim that the model provides a means of providing comparability over time and contexts when different test items are used."
JML comment: In 1980, the empirical evidence for comparability was weak, even though the theoretical basis was strong. By 1997, the empirical evidence was also strong (Masters, 1997). By 2012, so many testing agencies have maintained comparability for many years by using Rasch methodology that it is now routine.
WPF comment: Bond (2008) reports one such routinely maintained basis for comparability. Re-analysis of data from items used on tests over periods of 7 to 22 years at one major testing agency showed that "correlations between the original and new item difficulties were extremely high (.967 in mathematics, .976 in reading)." Bond continues, saying "the largest observed change in student scores moving from the original calibrations to the new calibrations was at the level of the minimal possible difference detectable by the tests, with over 99% of expected changes being less than the minimal detectable difference."
HG: "Misconceptions and inaccuracies. First, .... all claims about item characteristics being group-independent and abilities being test-independent, can be applied to [Classical, IRT and Rasch] types of model."
JML comment: Here is an experiment. Simulate a dataset of 1000 persons and 200 items according to each of the models. Split each dataset in two, the 500 higher-scoring persons, and the 500 lower-scoring persons. Analyze each pair of resulting datasets separately. To investigate group-independence, cross-plot the pairs of item difficulty estimates. Do they follow a statistically straight line? No, except for Rasch models or models that approximate Rasch models.
Now split the original datasets in two again, the 100 higher-scored items, and the 100 lower-scored items. Analyze the pairs of resulting datasets separately. To investigate test independence, cross-plot the two sets of person ability estimates. Do they follow a statistically straight line? No, except for Rasch models and estimation procedures that impose the same person distribution on both datasets. In summary, all claims cannot be applied to all models. Only Rasch models support the claims.
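The Rasch arm of this experiment can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' own procedure: abilities are drawn from a standard normal distribution, difficulties from a uniform range, estimation is a bare-bones joint maximum-likelihood (JMLE) routine written for this sketch, and a correlation stands in for inspecting the cross-plot. Repeating the experiment for other models would require simulating from and estimating those models as well, which is omitted here.

```python
import numpy as np

def simulate_rasch(n_persons=1000, n_items=200, seed=0):
    """Simulate dichotomous responses under a Rasch model (assumed distributions)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n_persons)   # person abilities, logits
    b = rng.uniform(-2.0, 2.0, n_items)       # item difficulties, logits
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random((n_persons, n_items)) < p).astype(float)

def jmle(x, n_iter=200):
    """Bare-bones JMLE: returns (abilities, difficulties); NaN for extreme person scores."""
    n_items = x.shape[1]
    r = x.sum(axis=1)
    keep = (r > 0) & (r < n_items)            # extreme scores have no finite estimate
    xk, rk = x[keep], r[keep]
    s = xk.sum(axis=0)
    if np.any(s == 0) or np.any(s == xk.shape[0]):
        raise ValueError("extreme item score; not handled in this sketch")
    theta = np.log(rk / (n_items - rk))
    b = np.log((xk.shape[0] - s) / s)
    b -= b.mean()
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        b -= np.clip((s - p.sum(axis=0)) / (p * (1 - p)).sum(axis=0), -1, 1)
        b -= b.mean()                         # fix the origin at the mean item difficulty
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        theta += np.clip((rk - p.sum(axis=1)) / (p * (1 - p)).sum(axis=1), -1, 1)
    full_theta = np.full(x.shape[0], np.nan)
    full_theta[keep] = theta
    return full_theta, b

x = simulate_rasch()

# Group independence: item difficulties from lower- and higher-scoring persons.
person_order = np.argsort(x.sum(axis=1), kind="stable")
_, b_low = jmle(x[person_order[:500]])
_, b_high = jmle(x[person_order[500:]])
item_r = np.corrcoef(b_low, b_high)[0, 1]

# Test independence: person abilities from lower- and higher-scored items.
item_order = np.argsort(x.sum(axis=0), kind="stable")
theta_hard, _ = jmle(x[:, item_order[:100]])
theta_easy, _ = jmle(x[:, item_order[100:]])
ok = np.isfinite(theta_hard) & np.isfinite(theta_easy)
person_r = np.corrcoef(theta_hard[ok], theta_easy[ok])[0, 1]

print("item difficulty correlation:", item_r)
print("person ability correlation:", person_r)
```

Because each analysis sets its own origin at the mean item difficulty, the two sets of person estimates are expected to differ by an offset; it is the straightness of the cross-plot, not identity of values, that is at issue.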
HG: "Secondly, ... a 2-dimensional set of items (representing different aspects of mathematics) could actually appear to conform to a (unidimensional) Rasch model, so that fitting the latter would be misleading."
JML comment: Yes, a dataset that balances two distinct dimensions can appear unidimensional on first inspection, so current Rasch best-practice is to include an investigation of the dimensionality of a dataset. All empirical datasets are multidimensional to some extent. In this example, the decision must be made as to whether the different aspects of mathematics (say, arithmetic and algebra) are different enough to be considered different "dimensions" (say, for the purpose of identifying learning difficulties) or are merely different strands of a superordinate dimension (say, for the purpose of Grade advancement).
WPF comment: Yes, Smith (1996) illustrates the value of a Principal Components Analysis of Rasch model residuals, showing its value in detecting multidimensionality when two or more constructs are roughly equally represented in an item set. PCA's strength in this situation is complemented by the sensitivity of the usual fit statistics when items primarily represent a single construct and only a few are off-construct or otherwise problematic.
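The residual analysis mentioned here can be sketched as follows. This is a minimal illustration, not the procedure of Smith (1996) or of any particular software: it standardizes the residuals from Rasch expected values and takes the eigenvalues of their inter-item correlation matrix. For brevity the generating parameters are used in place of estimated measures, which is an assumption of the sketch; the function name and the simulated data are hypothetical.

```python
import numpy as np

def residual_eigenvalues(x, theta, b):
    """Eigenvalues (largest first) of correlations among standardized Rasch residuals."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    z = (x - p) / np.sqrt(p * (1.0 - p))      # standardized residuals
    corr = np.corrcoef(z, rowvar=False)       # item-by-item correlation matrix
    return np.linalg.eigvalsh(corr)[::-1]

# Hypothetical unidimensional data: 1000 persons, 20 items.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 1000)
b = np.linspace(-1.5, 1.5, 20)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random(p.shape) < p).astype(float)

eig = residual_eigenvalues(x, theta, b)
print(eig[:3])
```

With data that fit a single dimension, the residuals are close to random noise and no eigenvalue stands far above the rest; a first eigenvalue well above that noise level would prompt a closer look at which items load on the contrast.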
HG: "Thirdly, the authors claim that there are no sample distributional assumptions associated with the Rasch model. This cannot be true, however, since all the procedures used to estimate the model parameters.... necessarily make distributional assumptions."
JML comment: Yes, different estimation methods make different assumptions. For instance, many Rasch maximum-likelihood estimation methods (including CMLE, JMLE, PMLE) make no assumptions about the distributions of the person abilities and item difficulties, but do assume that the randomness in the data is normally distributed. This assumption is routinely validated using fit statistics.
WPF comment: The term "assumption" here is misused. An assumption is something taken for granted, something left unexamined on the basis of its status as something in no need of attention. What HG refers to as assumptions are in fact the very opposite. What distinguishes the art and science of measurement from everyday assumptions about what are matters of fact is that very close attention is paid to the requirements that must be satisfied for inference to proceed.
HG: "Fourthly, ... the authors.. claim that a 'fundamental requirement' for measurement is that for every possible individual the 'difficulty' order of all items is the same. This is ... extremely restrictive. ... I also find it difficult to see any theoretical justification for such invariance to be a desirable property of a measuring instrument."
JML comment: The difficulty hierarchy of the items defines the latent variable. The easy items define what it means to be low on the latent variable. The hard items define what it means to be high on the latent variable. We measure a person's ability on a latent variable (for instance, "arithmetic") in order to make inferences about that person's arithmetic performance. If the definition of the latent variable changes depending on the person's ability level, then we cannot make general statements such as "division" is more difficult than "addition" (Wright, 1992). We must add the impractical restrictive phrase, "for people at such-and-such ability level". The inferential value of the latent variable is severely diminished.
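The point about a changing definition of the latent variable can be made concrete with two items. The sketch below assumes a two-parameter logistic form in which items may differ in discrimination, with the Rasch case obtained by holding discrimination at 1; the item parameters and function names are hypothetical.

```python
import math

def success_probability(ability, difficulty, discrimination=1.0):
    """Logistic item response function; discrimination 1.0 gives the Rasch case."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def easier_item(ability, item_a, item_b):
    """Return which of two (difficulty, discrimination) items is easier at this ability."""
    pa = success_probability(ability, *item_a)
    pb = success_probability(ability, *item_b)
    return "A" if pa > pb else "B"

abilities = (-2.0, 0.0, 2.0)

# Equal discriminations (Rasch): item A is easier for everyone.
rasch_order = [easier_item(t, (0.0, 1.0), (0.5, 1.0)) for t in abilities]

# Unequal discriminations: the item curves cross, so the order depends on ability.
twopl_order = [easier_item(t, (0.0, 0.5), (0.5, 2.0)) for t in abilities]

print(rasch_order)  # ['A', 'A', 'A']
print(twopl_order)  # ['A', 'A', 'B']
```

In the second case a statement such as "item B is harder than item A" holds only below the ability at which the curves cross, which is the restrictive qualification described above.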
WPF comment: Being unable to see any theoretical justification for invariance as a desirable property of a measuring instrument belies fundamental misconceptions of what instruments are and how they work. Invariance is the defining property of instruments, no matter if they are musical, surgical, or measuring. Without invariant measures, orchestras and laboratories would be impossible. "The scientist is usually looking for invariance whether he knows it or not. ... The quest for invariant relations is essentially the aspiration toward generality, and in psychology, as in physics, the principles that have wide applications are those we prize" (Stevens, 1951, p. 20). Perhaps HG terms invariance restrictive because he misconceives it in some kind of absolute way, as Guttman did. In actual practice, the uncertainty ranges within which items fall vary across different kinds of applications. Screening tolerates more uncertainty than accountability, which tolerates more than diagnosis, and which can in turn tolerate more than research investigations of very small effect sizes.
HG: "Fifthly, the authors do not seem to appreciate the problem of item dependency. .... There are all kinds of subtle ways in which later responses can be influenced by earlier ones."
JML comment: An advantage of Rasch methodology is that detailed analysis of Rasch residuals provides a means whereby subtle inter-item dependencies can be investigated. If inter-item dependencies are so strong that they are noticeably biasing the measures, then Rasch methodology supports various remedies. For instance, it may be advantageous to combine the dependent items into polytomous super-items (so effectively forming the items into testlets).
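Forming super-items is a simple data transformation. The following is a minimal sketch, assuming dichotomous responses held in a persons-by-items array and a list of column groups judged to be dependent; the function name, the example responses and the grouping are hypothetical. The resulting polytomous columns would then be analyzed with a polytomous Rasch model, such as a partial credit model.

```python
import numpy as np

def combine_into_super_items(responses, testlets):
    """Replace each group of dependent dichotomous items with one summed polytomous score.

    Standalone items keep their original order; super-items are appended after them.
    """
    responses = np.asarray(responses)
    grouped = {i for group in testlets for i in group}
    standalone = [i for i in range(responses.shape[1]) if i not in grouped]
    super_items = [responses[:, list(group)].sum(axis=1) for group in testlets]
    return np.column_stack([responses[:, standalone]] + super_items)

# Three persons, five items; items 2, 3 and 4 are assumed to share a common stimulus.
responses = [
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
combined = combine_into_super_items(responses, testlets=[(2, 3, 4)])
print(combined.tolist())  # [[1, 0, 2], [0, 0, 1], [1, 1, 3]]
```

The dependency among the grouped items is absorbed into the single 0-3 score, so it no longer counts as three independent pieces of evidence about the person.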
WPF comment: One of the significant reasons for requiring unidimensionality and invariance is, in fact, to reveal anomalous local dependency. "To the extent that measurement and quantitative technique play an especially significant role in scientific discovery, they do so precisely because, by displaying significant anomaly, they tell scientists when and where to look for a new qualitative phenomenon" (Kuhn, 1977, p. 205). As another writer put it, expect the unexpected or you won't find it (von Oech, 2001). If you begin with the intention of modeling dependencies, every data set and every instrument will be different, and all of the differences distinguishing them will be hidden in the modeled interactions. The predominance of modeling of this kind is precisely why the social sciences have made so little progress. Real progress will be made only when we implement uniform measurement standards capable of supporting the kind of distributed cognition common in language communities (Fisher, 2012), whether one defines those communities in terms of English or Mandarin, or in terms of Newton's Second Law and Le Système international d'unités [International System of Units].
HG: "Sixthly, ... This comes dangerously close to saying that the data have to fit the preconceived model rather than finding a model that fits the data. It is quite opposed to the usual statistical procedure whereby models (of increasing complexity) are developed to describe data structures. Indeed, the authors are quite clear that the idea of 'blaming the data rather than the model' is an important shift from standard statistical approaches. In my view that is precisely the weakness of the authors' approach."
JML comment: What is here perceived to be "dangerous" and "weakness", most of Science perceives to be necessary and strength. In general throughout Science, a theory is constructed that usefully explains and predicts important aspects of the data. This theory then becomes the screen through which future data are validated. Only if some future data cannot be coerced to conform to this theory, and those data are shown to be valid, is this theory bypassed in favor of some other theory and perhaps only for those data. Rasch theory is useful in that it constructs additive unidimensional measures from ordinal data. CTT and non-Rasch IRT may provide better statistical descriptions of specific datasets, but the non-linearity of their estimates and their sample-distribution-dependent properties render them less useful for inference.
WPF comment: Again, by writing in English and on a technical subject, HG must require readers who fit his preconceived model of the particular kind of person able to understand his text. When he takes the measure of the situation and puts it in words, he makes no effort whatsoever to find a model for his text that will fit any person at all who happens to approach it. He very restrictively requires readers capable of reading English and of comprehending somewhat technical terms. He gladly sets aside the vast majority of the world population who are unable to comprehend, or who are merely uninterested in, his text. In positing the Pythagorean theorem or Newton's laws, we do exactly the same kind of thing, focusing our attention on the salient aspects of a situation and ignoring the 99.999% of the phenomena that do not correspond. Our failure to do this more routinely in the social sciences says more about the way we misunderstand language, cognition, and our own instruments than it does about any kind of supposed shortcoming in Rasch theory.
HG: "Finally, ... The old Rasch formulation is just one, oversimple, special case. All of these models are in fact special kinds of factor analysis, or structural equation, models which have binary or ordered responses rather than continuous ones. As such they can be elaborated to describe complex data structures, including the study of individual covariates that may be related to the responses, multiple factors or dimensions, and can be embedded within multilevel structures."
JML comment: Rasch models construct additive measures (with known precision) from binary or ordered responses. Additive measures are ideal for further statistical analysis. Far from being obsolete, Rasch models are seen to be useful building-blocks on which to build elaborate statistical structures.
WPF comment: HG's observation assumes that measurement is primarily achieved by means of data analysis. But once an instrument is calibrated, and the item estimates persist in their invariant pattern across samples and over time, does not further data analysis become exceedingly redundant? Only the most counter-productive and obstructionist kind of person would resist the prospect of capitalizing on the opportunity to make great efficiency gains by fixing the unit at a standard value. Yes, Rasch mixture, multilevel, multifaceted, item bundle, etc. models are highly useful, but an important goal is to create a new metrological culture in the social sciences. Qualitative and quantitative data and methods need to be blended in the context of instruments tuned to the same scales. Only then will we find paths to new ways of harmonizing relationships.
HG: "Attempting to resurrect the Rasch model contributes nothing new."
JML comment: Only in the UK has the Rasch model needed resurrection. However, "attempting to resurrect the Rasch model" forces us to reconsider the philosophy underlying Social Science. Is Social Science to become exclusively qualitative with an endless accumulation of suggestive case studies but no counts of anything? Is Social Science to become exclusively quantitative with its focus solely on summary statistics and arcane descriptive models? Or is Social Science to become a synergistic blend of qualitative and quantitative? This is the ideal toward which Rasch methodology strives as it attempts to construct meaningful, sometimes new, qualitatively-defined unidimensional variables out of counts of inevitably messy ordered observations.
WPF comment: The point is to be able to persist in questioning, to continue the conversation. Statistical models can sometimes describe data to death, meaning that they become so over-parameterized that nothing of value can be generalized from that particular situation to any other. All models are wrong, as Rasch (1960, pp. 37-38; 1973/2011) stressed. But even though there are no Pythagorean triangles in the real world, they still prove immensely useful as heuristic guides to inference in tasks as concrete as real estate development, titling, and defending property rights. If we can resist the pressures exerted by HG and others bent on prematurely closing off questioning about potential general invariances, we may eventually succeed in creating real value in social science. But if we instead focus only on ephemeral local specifics inapplicable beyond their immediate contexts, we will continue to be subject to aspects of our existence that we do not understand.
John Michael Linacre (JML)
William P. Fisher, Jr. (WPF)
References
Andrich, D. (2004) Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care 42, Suppl. 1: 1-7.
Bond, T. (2008). Invariance and item stability. Rasch Measurement Transactions, 22(1), 1159 www.rasch.org/rmt/rmt221h.htm
Clarke K. A. & Primo D.M. (2012) Overcoming 'Physics Envy'. New York Times, April 1, 2012, New York Edition, p. SR9. www.nytimes.com/2012/04/01/opinion/sunday/the-social-sciences-physics-envy.html
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996). "From Data Mining to Knowledge Discovery in Databases". AI Magazine 17(3), 37-54.
Fisher, W. P., Jr. (2010). The standard model in the history of the natural sciences, econometrics, and the social sciences. Journal of Physics: Conference Series, 238(1), iopscience.iop.org/1742-6596/238/1/012016/pdf/1742-6596_238_1_012016.pdf.
Fisher, W. P., Jr. (2012, May/June). What the world needs now: A bold plan for new standards. Standards Engineering, 64(3), 1 & 3-5. ssrn.com/abstract=2083975
Goldstein, H. (1980) Dimensionality, bias, independence and measurement scale problems in latent trait test score models. British Journal of Mathematical and Statistical Psychology, 33, 2: 234-46.
Goldstein H. (2010) Rasch measurement: a response to Payanides [sic], Robinson and Tymms. www.bristol.ac.uk/cmm/team/hg/response-to-payanides.pdf
Goldstein, H. (2012) Francis Galton, measurement, psychometrics and social progress. Assessment in Education: Principles, Policy & Practice, 19(2), May 2012, 147-158. www.bristol.ac.uk/cmm/team/hg/full-publications/2012/Galton.pdf
Gould, S.J. (1981) The Mismeasure of Man. New York: W.W. Norton.
Karabatsos, G. (2003). A comparison of 36 person-fit statistics of Item Response Theory. Applied Measurement in Education, 16, 277-298.
Kuhn, T. S. (1977). The essential tension: Selected studies in scientific tradition and change. Chicago, Illinois: University of Chicago Press.
Linacre J.M. (1995) Bruce Choppin, visionary. Rasch Measurement Transactions, 8(4), p. 394. www.rasch.org/rmt/rmt84e.htm
Masters G.N. (1997) Where has Rasch Measurement Proved Effective? Rasch Measurement Transactions, 11(2), 568. www.rasch.org/rmt/rmt112j.htm
Nersessian, N. J. (2002). Maxwell and "the method of physical analogy": Model-based reasoning, generic abstraction, and conceptual change. In D. Malament (Ed.), Essays in the history and philosophy of science and mathematics (pp. 129-166). Lasalle, Illinois: Open Court.
Nersessian, N. J. (2006, December). Model-based reasoning in distributed cognitive systems. Philosophy of Science, 73, 699-709.
Nersessian, N. J. (2008). Creating scientific concepts. Cambridge, Massachusetts: MIT Press.
Panayides, P., Robinson, C., Tymms, P. (2010) The assessment revolution that has passed England by: Rasch measurement. British Educational Research Journal, 36 (4), 611-626. dro.dur.ac.uk/6405/
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980). Copenhagen, Denmark: Danmarks Paedagogiske Institut.
Rasch, G. (1973/2011, Spring). All statistical models are wrong! Comments on a paper presented by Per Martin-Löf, at the Conference on Foundational Questions in Statistical Inference, Aarhus, Denmark, May 7-12, 1973. Rasch Measurement Transactions, 24(4), 1309 www.rasch.org/rmt/rmt244d.htm
Smith, R. M. (1996). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3(1), 25-40.
Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: John Wiley & Sons.
Von Oech, R. (2001). Expect the unexpected (or you won't find it): a creativity tool based on the ancient wisdom of Heraclitus. New York: The Free Press.
Wright B.D. (1988) Rasch model from Campbell concatenation: additivity, interval scaling. Rasch Measurement Transactions, 2(1), 16. www.rasch.org/rmt/rmt21b.htm
Wright, B.D. (1992) IRT in the 1990s: Which Models Work Best? 3PL or Rasch? Rasch Measurement Transactions, 6(1), pp. 196-200. www.rasch.org/rmt/rmt61a.htm
Harvey Goldstein's Objections to Rasch Measurement: A Response from Linacre and Fisher. John Michael Linacre & William P. Fisher, Jr. … Rasch Measurement Transactions, 2012, 26:3 p. 1383-9
The URL of this page is www.rasch.org/rmt/rmt263g.htm