Jansen, van den Wollenberg and Wierda (JWW, 1988) object to the bias in unconditional [joint] maximum likelihood estimation (JMLE, formerly UMLE or UCON) of Rasch parameters. Their comments on the necessity of the Rasch model for measurement and their algebra are impeccable. The practical consequences of their work, however, contradict their objections. The crucial question for practitioners is whether there is a convenient correction for JMLE bias which is accurate enough for practical purposes. Psychometricians can, in fact, find firm support for the use of JMLE in the articles by JWW.

Even though JWW perseverate on their discovery that the Wright-Douglas (1977) correction of ((K-1)/K) for JMLE bias (K is the number of test items) is slightly inexact for very short tests, the conclusion JWW actually report is that the difference between bias predicted by (K/(K-1)) and the bias JWW observe "practically disappears" (their statement) when tests have more than 10 items. This JWW statement goes further than the Wright and Douglas (1977) recommendation of ((K-1)/K) for tests of more than 20 items.

JMLE bias has no effect on the relative position of items and so no effect on substantive interpretations of the variable defined by the item calibrations. There are, however, two applications in which bias in item difficulties could become a practical problem. These are the effects of item bias on person measurement and on test equating.

PERSON MEASUREMENT BIAS?

What effect does JMLE item calibration bias have on person measures? The aim of testing is to provide person measures sufficiently accurate for fair evaluation. The ((K-1)/K) correction for bias is applied to the item difficulties after they are centered at zero. As a result, the measures most affected by error in the bias correction are those associated with the extreme scores R = 1 and R = K-1 (R is the number of right answers and K is the number of test items). To discover the maximum effect of JWW inaccuracy in ((K-1)/K) on person measures, I use the values JWW claim to be "correct" in their Tables 1-3, together with their associated item distributions and test lengths (K = 5, 10, 15, 20), to calculate the person measurement bias when R = 1.
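The order of operations described above (center first, then shrink by (K-1)/K) can be sketched as follows. This is a minimal illustration only: the function name `correct_jmle_items` and the raw estimates are hypothetical and are not taken from JWW or from any Rasch program.

```python
def correct_jmle_items(raw_difficulties):
    """Center JMLE item difficulties at zero, then multiply by (K-1)/K."""
    k = len(raw_difficulties)
    if k < 2:
        raise ValueError("the (K-1)/K correction needs at least two items")
    mean = sum(raw_difficulties) / k
    factor = (k - 1) / k
    return [(d - mean) * factor for d in raw_difficulties]


# Hypothetical uncorrected estimates for a 5-item test (illustrative values).
raw = [-2.5, -1.25, 0.0, 1.25, 2.5]
corrected = correct_jmle_items(raw)
```

Because the factor multiplies centered values, the relative order of the items is unchanged; only their spread shrinks.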

The relevant UFORM formulae (which are exact for the uniform tests used by JWW) are derived in Wright and Douglas (1975, 21-24, 32) and applied in Wright and Stone (1979, 143-151, 212-215). I express the maximum person measurement bias in log-odds units to show its tiny magnitude and in standard error units to show its statistical insignificance.
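For readers without the cited pages at hand, the following is a hedged sketch of the uniform-test approximation, assuming K items spread evenly over a width W centered at height H and a relative score f = R/K. The closed forms used here follow from integrating the Rasch success probability over a uniform spread of item difficulties; the function and argument names are my own, and the cited pages remain the authoritative statement of UFORM. The sketch does not reproduce the tables below, which also require JWW's reported values.

```python
import math


def uform_measure(f, width, height=0.0):
    """Measure (logits) for relative score f on a uniform test."""
    a = 1.0 - math.exp(-width * f)
    b = 1.0 - math.exp(-width * (1.0 - f))
    return height + width * (f - 0.5) + math.log(a / b)


def uform_standard_error(f, width, length):
    """Standard error (logits) of the measure for relative score f."""
    a = 1.0 - math.exp(-width * f)
    b = 1.0 - math.exp(-width * (1.0 - f))
    c = 1.0 - math.exp(-width)
    return math.sqrt((width / length) * c / (a * b))


# Illustrative case: K = 10 items over a 4-logit width, extreme score R = 1.
K, W = 10, 4.0
measure_r1 = uform_measure(1 / K, W)
se_r1 = uform_standard_error(1 / K, W, K)
# Dividing a bias in logits by se_r1 expresses it in standard error units.
```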

Maximum Measurement Bias Due to JMLE Item Calibration After Correction (K-1)/K, in Logits

Test Length K | Item Range -2,+2 | Item Range -3,+1 | Item Range -3,+3 | Item Range -4,+4 |
---|---|---|---|---|
5 | .05 | .11 | .20 | .43 |
10 | .02 | .06 | .09 | .18 |
15 | .02 | .05 | .06 | .15 |
20 | .02 | .04 | .05 | .10 |

Maximum Measurement Bias Due to JMLE Item Calibration After Correction (K-1)/K, in Standard Errors of Measurement

Test Length K | Item Range -2,+2 | Item Range -3,+1 | Item Range -3,+3 | Item Range -4,+4 |
---|---|---|---|---|
5 | .04 | .09 | .15 | .31 |
10 | .02 | .05 | .08 | .15 |
15 | .02 | .05 | .05 | .14 |
20 | .02 | .05 | .05 | .09 |

For method of computation see BEST TEST DESIGN, Wright and Stone, 1979, pages 143-151, 212-215.

One sees immediately that, even for K = 5, JMLE item bias is of no practical consequence as far as person measurement is concerned. Except for the 5 item, 8 logit test, a very rare configuration, maximum measurement bias is less than .21 logits (less than .16 standard errors of measurement!).

For tests of usual length and width - more than 10 items, less than 6 logits - the maximum measurement bias due to JWW's results is ALWAYS less than .09 logits (less than .08 standard errors of measurement!). Even these minute discrepancies only occur when scores are extreme, R = 1 or R = K-1. When tests are on target, observed scores cluster around K/2 where JMLE measurement bias is zero. It is clear that person measurement bias cannot be a reason to avoid JMLE.
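Why the discrepancy concentrates at extreme scores and vanishes near K/2 can be seen in a small numerical sketch. It assumes a symmetric 10-item test centered at zero and a purely hypothetical 3% error in the spread of the item difficulties; neither value comes from JWW, and the bisection solver is just one simple way to obtain the maximum likelihood measure for a given score.

```python
import math


def expected_score(b, difficulties):
    """Expected number right for a person at measure b under the Rasch model."""
    return sum(1.0 / (1.0 + math.exp(d - b)) for d in difficulties)


def person_measure(score, difficulties, lo=-20.0, hi=20.0, tol=1e-10):
    """Measure whose expected score equals the observed score (bisection)."""
    if not 0 < score < len(difficulties):
        raise ValueError("extreme scores 0 and K have no finite measure")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


K = 10
items = [-2.0 + 4.0 * i / (K - 1) for i in range(K)]  # symmetric about zero
misscaled = [1.03 * d for d in items]  # hypothetical 3% scaling error

# Change in the person measure caused by the scaling error, by raw score.
shift = {
    r: person_measure(r, misscaled) - person_measure(r, items)
    for r in (1, K // 2, K - 1)
}
```

In this sketch the shift at R = K/2 is zero up to solver tolerance, while the shifts at R = 1 and R = K-1 are equal in size and opposite in sign.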

TEST EQUATING BIAS?

What effect does JMLE item calibration bias have on test equating? The Rasch way to equate two tests is to include a subset of common items in both, to calibrate each test separately, to plot the resulting pairs of item estimates for the common items and to use the intercept of a line with a slope of one fitted to these common item points as the equating constant (Wright and Stone 1979, 108-118).
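For a line constrained to a slope of one, the least-squares intercept is simply the mean difference between the paired estimates of the common items. A minimal sketch, with hypothetical calibrations and a hypothetical function name:

```python
def equating_constant(common_a, common_b):
    """Intercept c of the slope-one line b = a + c fitted by least squares."""
    if len(common_a) != len(common_b) or not common_a:
        raise ValueError("need matching, non-empty lists of common items")
    return sum(b - a for a, b in zip(common_a, common_b)) / len(common_a)


# Hypothetical calibrations of four common items from two separate analyses.
test_a = [-1.2, -0.3, 0.4, 1.1]
test_b = [-0.7, 0.3, 0.8, 1.6]

shift = equating_constant(test_a, test_b)
test_a_on_b_scale = [d + shift for d in test_a]
```

Plotting the pairs before trusting the constant remains essential, since a common item that falls far from the slope-one line should be examined rather than averaged in.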

In this procedure inaccuracy in ((K-1)/K) tends to cancel, especially when tests are similar in length and difficulty (the usual situation) because then the inaccuracy is similar for the two calibrations. If, however, tests differ substantially in length and difficulty, then fitting a line with a slope adjusted to the distributions of common item difficulties can remove the effect of bias.

The least biased and most efficient way to equate two or more tests linked by a network of common items and/or common persons is to combine the data from each administration into one large matrix with a column for every item included in any test and a row for every person included in any sample, indicating missing data whenever a person does not take an item. The single Rasch analysis of this one large matrix provides item calibrations and person measures on a common linear scale for all items and all persons involved in any test (Wright and Linacre 1985, Schulz 1987, Wright, Schulz, Congdon and Rossner 1987).
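The data layout for this one-step approach can be sketched as below. The structure (a dictionary of responses per administration, with `None` marking an item a person did not take) is an assumption made for illustration and is not the input format of MICROSCALE or MSCALE.

```python
def combine_administrations(administrations):
    """Stack several administrations into one person-by-item matrix.

    Each administration maps person id -> {item id: 0 or 1}.
    Cells for items a person did not take are None (missing data).
    """
    persons, items, responses = [], [], {}
    for admin in administrations:
        for person, answers in admin.items():
            if person not in responses:
                responses[person] = {}
                persons.append(person)
            for item, x in answers.items():
                if item not in items:
                    items.append(item)
                responses[person][item] = x
    matrix = [[responses[p].get(i) for i in items] for p in persons]
    return persons, items, matrix


# Hypothetical forms A and B linked by the common item "i3".
form_a = {"p1": {"i1": 1, "i2": 1, "i3": 0}, "p2": {"i1": 1, "i2": 0, "i3": 0}}
form_b = {"p3": {"i3": 1, "i4": 0, "i5": 0}, "p4": {"i3": 1, "i4": 1, "i5": 0}}

persons, items, matrix = combine_administrations([form_a, form_b])
```

Any estimation routine applied to this matrix must skip the `None` cells when accumulating observed and expected scores.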

CONDITIONAL ESTIMATION?

JWW advocate a minimum chi-square pair-wise estimation as their cure for the effects of ((K-1)/K) inaccuracy on JMLE. They would have done better by their readers to remind them, instead, of the logically equivalent but statistically superior maximum likelihood pair-wise estimation described by Rasch (1960/1980, 171-172) and Choppin (1968) and applied extensively by Choppin (1976, 1977, 1978, 1983). This method has significant antecedents in Case V of Thurstone's 1927 Law of Comparative Judgement (1927a, 1927b), Bradley and Terry's 1952 method of paired comparisons and Luce's 1959 probabilistic theory of choice. It is easy to use and understand, and generalizes directly to rating scale and partial credit models (Wright and Masters 1982, 67-72, 82-85). Should a real situation actually arise where conditional estimation is seriously deemed worth the trouble, then the Rasch/Choppin pair-wise approach is the method of choice.
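The idea behind pair-wise maximum likelihood can be sketched through its Bradley-Terry form. For persons who succeed on exactly one of items i and j, the Rasch model gives the probability that the success is on item i as exp(-d_i) / (exp(-d_i) + exp(-d_j)), free of the person parameter. The sketch below maximizes the resulting paired-comparison likelihood with a standard iterative update; it is an illustration under that assumption, not a reproduction of Choppin's published procedure, and the function name and counts are hypothetical. It assumes every item has at least one "win" and that the items are connected through the comparisons.

```python
import math


def pairwise_ml_difficulties(wins, iterations=500):
    """Item difficulties (centered at zero) from pair-wise counts.

    wins[i][j] = number of persons right on item i and wrong on item j.
    """
    n = len(wins)
    strength = [1.0] * n  # strength_i stands for exp(-difficulty_i)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Fix the origin: geometric mean of strengths equals one.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return [-math.log(s) for s in strength]


# Hypothetical counts for three items (illustrative only).
counts = [[0, 30, 45], [10, 0, 28], [5, 12, 0]]
difficulties = pairwise_ml_difficulties(counts)
```

Only the pair-wise counts enter the computation, which is why the approach carries over to data with many missing responses.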

CONCLUSION

For practitioners working with tests of more than 10 items, the articles by Jansen, van den Wollenberg and Wierda give no reason at all to avoid unconditional maximum likelihood estimation of Rasch item calibrations and person measures. In fact, their articles provide data which firmly support the adequacy of this practice.

Benjamin D. Wright, 1988

MESA Research Memorandum Number 45

MESA PSYCHOMETRIC LABORATORY

REFERENCES

Bradley, R.A. and Terry, M.E. Rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika, 1952, 39, 324-45.

Choppin, B.H. An item bank using sample-free calibration. Nature, 1968, 219, 870-872.

Choppin, B.H. Recent developments in item banking. In D.N.M. de Gruijter and L.J.T. Van der Kamp (Eds.) Advances in Psychological and Educational Measurement. New York: Wiley, 1976.

Choppin, B.H. Developments in item banking. In R. Sumner (Ed.) Monitoring National Standards of Attainment in Schools. Windsor: NFER, 1977.

Choppin, B.H. Item Banking and the Monitoring of Achievement. Slough: NFER, 1978.

Choppin, B.H. A Fully Conditional Estimation Procedure for Rasch Model Parameters. Los Angeles: UCLA CSE Technical Report No. 196, ERIC Document No. ED 228267, 1983.

Jansen, P.G., Van den Wollenberg, A.L., Wierda, F.W. (1988) Correcting unconditional parameter estimates in the Rasch model for inconsistency. Applied Psychological Measurement 12(3) 297-306.

Luce, R.D. Individual Choice Behavior. New York: Wiley, 1959.

Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests. (Copenhagen: 1960) Chicago: MESA Press, 1992.

Schulz, E.M. One-step vertical equating with MSCALE. Presented at the Fourth International Workshop on Objective Measurement, University of Chicago, 17 April, 1987, and the American Educational Research Association Annual Meeting, Washington, 22 April, 1987.

Thurstone, L.L. A law of comparative judgment. Psychological Review, 1927a, 34, 273-86.

Thurstone, L.L. The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 1927b, 21, 384-400.

Van den Wollenberg, A.L., Wierda, F.W., Jansen, P.G. (1988) Consistency of Rasch model parameter estimation: A simulation study. Applied Psychological Measurement 12(3) 307-313.

Wright, B.D. and Douglas, G.A. Best test design and self-tailored testing. Research Memorandum No. 19, Statistical Laboratory, Department of Education, University of Chicago, 1975.

Wright, B.D. and Douglas, G.A. Best procedures for sample-free item analysis. Applied Psychological Measurement, 1977, 1, 281-294.

Wright, B.D. and Linacre, J.M. MICROSCALE. Westport: MEDIAX, 1985.

Wright, B.D. and Masters, G.N. Rating Scale Analysis. Chicago: MESA Press, 1982.

Wright, B.D., Schulz, E.M., Congdon, R.T. and Rossner, M. The MSCALE Program for Rasch Measurement. Chicago: MESA Press, 1987.

Wright, B.D. and Stone, M.H. Best Test Design. Chicago: MESA Press, 1979.

This appeared in

*Applied Psychological Measurement*

12 (3) pp. 315-318, September 1988.
