The New Rules of Measurement

is the striking title of a recent book edited by Susan E. Embretson and Scott L. Hershberger (Mahwah, NJ: Lawrence Erlbaum, 1999). There are 11 informative chapters packed with real-life Rasch-related applications. Solid theory is presented graphically and through practical implications, rarely as bald algebra.

But how I wish my copy had a global replace feature! In almost every instance where the letters IRT appear, one must replace them with Rasch. For instance, "IRT item parameters are not biased by the population ability distribution" (p. 2). As has been demonstrated repeatedly (e.g., RMT 6:2, 217), this is a characteristic of only the Rasch model and not at all a general characteristic of IRT models.

Old Rule 1. The standard error of measurement applies to all scores in a particular population.

New Rule 1. The standard error of measurement differs between persons with different response patterns, but generalizes across [similar] populations.

Of course, theorists in the classical tradition know that different raw scores have different standard errors. Nevertheless, "if the score distribution approaches normality, and if obtained scores do not extend over the entire possible range, the standard error of measurement is probably uniform at all score levels" (Guilford, 1965, p. 445). Indeed, a plot on p. 50 (reprinted below) of New Rules confirms that standard errors can be reasonably uniform across most of the range of raw scores. Also, since the easiest way to compute raw score standard errors is from reliability coefficients, most classical analysts never go beyond computing one global standard error estimate.

So what are the real implications of Rule 1? As New Rules points out, standard errors of measures increase to infinity as scores become extreme. Standard errors of raw scores, by contrast, decrease to zero, misleading the analyst into believing that zero and perfect scores imply exact knowledge of the location of examinees on the latent variable. Further, examinee measures (as opposed to raw scores) are each identified with their own standard error, irrespective of who, if anyone, takes the same test. Decisions can be made on an individual rather than group basis.
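
The contrast between the two kinds of standard error can be sketched numerically. The following is a minimal illustration, not taken from New Rules: it assumes a dichotomous Rasch model, a hypothetical 20-item test with difficulties evenly spaced from -2 to +2 logits, and the usual model-based formulas, in which the raw-score standard error is the square root of the summed binomial variances and the standard error of the measure is its reciprocal. The function names are invented for this sketch.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def score_variance(theta, difficulties):
    """Model variance of the raw score: the sum of p(1-p) over the items."""
    return sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)

def raw_score_se(theta, difficulties):
    """Model standard error of the raw score, in score points."""
    return math.sqrt(score_variance(theta, difficulties))

def measure_se(theta, difficulties):
    """Model standard error of the measure, in logits."""
    return 1.0 / math.sqrt(score_variance(theta, difficulties))

# Hypothetical test (an assumption of this sketch): 20 items from -2 to +2 logits.
DIFFICULTIES = [-2.0 + 4.0 * i / 19 for i in range(20)]

if __name__ == "__main__":
    for theta in (0.0, 2.0, 4.0, 8.0):
        print(
            f"ability {theta:4.1f} logits: "
            f"raw-score SE = {raw_score_se(theta, DIFFICULTIES):.3f}, "
            f"measure SE = {measure_se(theta, DIFFICULTIES):.3f}"
        )
```

As the ability moves away from the items, and the expected score approaches the maximum, the raw-score standard error shrinks toward zero while the standard error of the measure grows without limit.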

Old Rule 2. Longer tests are more reliable than shorter tests.

New Rule 2. Shorter tests can be more reliable than longer tests.

What? Is the Spearman-Brown prophecy formula revoked? No, as New Rules clarifies, the Spearman-Brown prophecy formula is not revoked. Provided everything stays the same, a longer test of the same sort of items is more reliable than a shorter test. But a longer test is not necessarily more reliable than a different, shorter test. Of course, classicists know this: "Internal-consistency reliability is the greatest when ... the variance of items is greatest. This is when the proportion passing an item is .50" (Guilford, 1965, p. 464). But classicists couldn't do much with this knowledge, because everyone had to take the same test, and test content was fixed. Now there are item banks and computer-adaptive testing. For instance, a 20-item on-target test can measure more reliably than a 30-item test on which an examinee achieves 80% success, and that can be more reliable than a 50-item test with 90% success.
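
The arithmetic behind the 20-, 30- and 50-item comparison can be checked directly. This is a rough sketch under a simplifying assumption that is not in the text: every item on a given test is treated as having the same success probability for the examinee, so that the statistical information is n × p × (1 − p). That gives 20 × 0.25 = 5.0, 30 × 0.16 = 4.8 and 50 × 0.09 = 4.5, so the shortest test yields the smallest standard error. The function name is invented for illustration.

```python
import math

def measure_se(n_items, p_success):
    """Standard error (logits) of a Rasch measure from n_items dichotomous items,
    assuming the examinee has the same success probability on every item."""
    information = n_items * p_success * (1.0 - p_success)
    return 1.0 / math.sqrt(information)

# The three tests described in the text: (label, number of items, success rate).
TESTS = [
    ("20 items, on target (50% success)", 20, 0.50),
    ("30 items, 80% success", 30, 0.80),
    ("50 items, 90% success", 50, 0.90),
]

if __name__ == "__main__":
    for label, n_items, p_success in TESTS:
        print(f"{label}: SE = {measure_se(n_items, p_success):.3f} logits")
```

Under this assumption the standard errors come out at roughly 0.447, 0.456 and 0.471 logits, in the order the text describes.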

Old Rule 3. Comparing test scores across multiple forms depends on test parallelism or test equating.

New Rule 3. Comparing test scores across multiple forms is optimal when test difficulty levels vary between persons.

What? Is test equating abolished? No, the emphasis has shifted. The goal is no longer to match the new test to the old test; it is to match the new test to the new person. Item banks are the key. (How did a reference to Wright & Bell, 1984, escape the editors of New Rules?) With pre-calibrated items, parallel forms and equi-percentile equating are obsolete.

Old Rule 4. Unbiased assessment of item properties depends on representative samples from the target population.

New Rule 4. Unbiased estimates of item properties may be obtained from unrepresentative samples.

What does bias mean? It means incorrect decisions due to poor test-to-sample targeting. What does representative mean? It means the sample ability distribution matches that of the population. Classical item selection criteria, such as p-value for item difficulty and discrimination index for item quality, are optimal for items targeted on the sample. If the distribution of the pilot sample does not match the distribution of the test population, replacing "bad" items could make the test worse, not better! But even under the best of circumstances, classical analysis is biased against those items which best measure the high and low performers.
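
The sample dependence of classical p-values, and the sample independence of Rasch item comparisons, can be illustrated with a small calculation. This is a sketch under stated assumptions, not a procedure from New Rules: the data are taken to fit the dichotomous Rasch model exactly, expected counts stand in for observed counts, the two ability samples and the two item difficulties (0 and 1 logits) are hypothetical, and the item comparison uses the pairwise conditional odds of "first item right, second wrong" against "first wrong, second right".

```python
import math

def rasch_p(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_p_value(abilities, b):
    """Classical item p-value: expected proportion of the sample succeeding on the item."""
    return sum(rasch_p(t, b) for t in abilities) / len(abilities)

def pairwise_difficulty_gap(abilities, b_i, b_j):
    """Log-odds of (item i right, item j wrong) against (item i wrong, item j right),
    computed from expected counts. Under the Rasch model this equals b_j - b_i."""
    n10 = sum(rasch_p(t, b_i) * (1.0 - rasch_p(t, b_j)) for t in abilities)
    n01 = sum((1.0 - rasch_p(t, b_i)) * rasch_p(t, b_j) for t in abilities)
    return math.log(n10 / n01)

# Hypothetical, deliberately unrepresentative samples (ability in logits).
LOW_SAMPLE = [-2.0, -1.5, -1.0, -0.5, 0.0]
HIGH_SAMPLE = [0.5, 1.0, 1.5, 2.0, 3.0]
B_EASY, B_HARD = 0.0, 1.0  # hypothetical item difficulties

if __name__ == "__main__":
    for name, sample in (("low", LOW_SAMPLE), ("high", HIGH_SAMPLE)):
        print(
            f"{name} sample: p-value of easier item = {expected_p_value(sample, B_EASY):.2f}, "
            f"difficulty gap = {pairwise_difficulty_gap(sample, B_EASY, B_HARD):.3f} logits"
        )
```

The p-value of the same item changes substantially from one sample to the other, while the estimated gap between the two item difficulties is the same in both, because the ability term cancels out of the pairwise odds.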

Now items are assessed on their own merits. Each item is chosen for the role it plays in constructing measures for those examinees on whom it is targeted, without giving misleading information about others who might happen to encounter it. Each item is designed to be as similar to the other items as possible, in the sense of measuring the same construct and eliciting the same type of behavior from respondents. Each item is also designed to be as different from the other items as possible, in the sense of obtaining its own share of brand-new information about the performance level of respondents.

These four rules are those identified by Susan Embretson (pp. 11-14). But New Rules reaches much further. For instance, a new rule is that raw scores have substantive implications (pp. 247-8). Another new rule is that the hierarchy of item difficulty reflects a meaningful, valid construct (pp. 248-9). An additional new rule is that examinee response patterns have diagnostic meaning (pp. 250-252). And still more rules emerge in chapter after chapter.

Guilford JP. 1965. Fundamental Statistics in Psychology and Education. New York: McGraw-Hill.

Wright BD, Bell SR. 1984. Item banks: what, why, how? Journal of Educational Measurement, 21:4, 331-345.

The New Rules for Measurement Embretson S.E. commented by Linacre, J.M. … Rasch Measurement Transactions, 1999, 13:2 p. 692

