MESA Memo 25: Interpretation of Item Bias

Student grouping schemes based on test performance often have an adverse impact upon minorities. This underscores the need for a technique which helps determine whether the impact stems from biased items in the test or from differential abilities on the intended trait.

One way to detect biased items follows from the Rasch model and the way it estimates item difficulties. The model removes the ability distribution of any group used to estimate item difficulties; hence, difficulty estimates should be statistically equivalent for groups distinguished only by their ability distributions. If, on the contrary, difficulty estimates shift significantly from group to group, this suggests that the item interacts with particular characteristics of the groups such as race or sex.

Of course, difficulties estimated with distinct groups cannot be identical; at best, they can only be statistically equivalent. Therefore, the point at which a difficulty shift makes a difference requires some definition.

Rasch estimates are derived from a logistic transformation of the proportion right. For a typical test, difficulties range between +2.5 and -2.5 logits. Therefore, when an item estimate shifts a half logit (ten percent of the range from -2.5 to +2.5) for separate calibration samples, we might consider that shift to be conspicuous. Wright and Douglas (1975) show that if the shift is less than half a logit and unsystematic, then it has little effect on the accuracy of measurement for tests of more than twenty items approximately on target. This "half logit" rule works fairly well; however, we can qualify it with a statistical observation.
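To give a feel for the size of a half logit, the following minimal sketch converts between proportions and logits. It illustrates only the scale. It is not a calibration procedure, and the function names are ours.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a proportion p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    """Probability corresponding to a log-odds value x."""
    return 1.0 / (1.0 + math.exp(-x))

# A person exactly on target (ability equal to item difficulty)
# has an even chance of success.
p_on_target = inv_logit(0.0)

# If the same item were half a logit easier for this person,
# the chance of success would rise accordingly.
p_half_logit_easier = inv_logit(0.5)

print(round(p_on_target, 3), round(p_half_logit_easier, 3))
```

For an on-target item, a half-logit shift moves the chance of success from .50 to about .62.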

Associated with each estimate of item difficulty is a standard error. Together with the difficulty estimate, it can be used to compute a "t" statistic which evaluates the significance of the shift in estimates:

t = (D_i1 - D_i2) / sqrt(S_i1^2 + S_i2^2)

where D_i1 is the item difficulty estimate and S_i1 its standard error for group 1 on item i, and D_i2 is the estimate and S_i2 the standard error for group 2 on the same item i. The denominator is the standard error of the difference (the SED column in Tables I-IV). A t-value of more than 2 is often considered significant; however, in this case where we are looking at more than twenty items at a time, it seems reasonable (following Bonferroni) to increase the t-value for "statistical significance" to 2.4.
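The computation can be sketched as follows, assuming the t statistic is the difficulty shift divided by the standard error of the difference (the root of the summed squared standard errors). This assumption reproduces the SED and t columns reported for item 33 in Table I. The function names and the returned dictionary are illustrative only.

```python
import math

def shift_statistics(d1: float, s1: float, d2: float, s2: float):
    """Return (shift, SED, t) for one item calibrated in two groups.

    d1, s1: difficulty estimate and standard error in group 1
    d2, s2: difficulty estimate and standard error in group 2
    """
    shift = d1 - d2
    sed = math.sqrt(s1 ** 2 + s2 ** 2)  # standard error of the difference
    return shift, sed, shift / sed

def screen_item(d1, s1, d2, s2, half_logit=0.5, t_crit=2.4):
    """Report both screening criteria separately for one item."""
    shift, sed, t = shift_statistics(d1, s1, d2, s2)
    return {
        "shift": shift,
        "sed": sed,
        "t": t,
        "exceeds_half_logit": abs(shift) >= half_logit,
        "significant": abs(t) > t_crit,
    }

# Item 33 from Table I: boys .140 (.113), girls -.453 (.150)
result = screen_item(0.140, 0.113, -0.453, 0.150)
print({k: round(v, 3) if isinstance(v, float) else v for k, v in result.items()})
```

Reporting the two criteria separately lets the reader see items that pass one test but not the other.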

Adding this t-test qualification to the half logit "rule" guards against over-reacting to differences that could have resulted from chance alone. For example, estimates of an item's difficulty could shift more than a half logit because the item is off target for both calibration samples, not because it interacts with the group's race or sex. An off-target item shows up as large standard errors, and their magnitude makes an apparently dramatic shift not significant once the t-value is computed. Conversely, shifts of less than a half logit may be significant because the standard errors are small, a sign that the item is on target.

But like all simple techniques, this one has a shortcoming. The size of the standard errors for all items is associated with the size of the calibration sample. Given sufficiently large samples, "significant" t-values which refer to differences considerably less than a half logit might result.
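The dependence on sample size can be illustrated with a rough sketch. It assumes the usual asymptotic standard error of a Rasch item difficulty, 1 / sqrt(sum of p(1-p) over persons), and the best case in which every person is exactly on target (p = .5), so that SE = 2 / sqrt(N). Real standard errors are larger, and the shift of .2 logits is a hypothetical value chosen for illustration.

```python
import math

def on_target_item_se(n_persons: int) -> float:
    """Best-case SE of an item difficulty: every person has p = .5,
    so the information per person is .25 and SE = 2 / sqrt(N)."""
    return 1.0 / math.sqrt(n_persons * 0.25)

def t_for_shift(shift: float, n_per_group: int) -> float:
    """t statistic for a given shift when both groups have the same size."""
    se = on_target_item_se(n_per_group)
    sed = math.sqrt(se ** 2 + se ** 2)
    return shift / sed

# The same hypothetical shift of .2 logits at three calibration sample sizes.
for n in (100, 900, 10000):
    print(n, round(t_for_shift(0.2, n), 2))
```

Under these assumptions, a shift well below a half logit falls short of 2.4 with 100 persons per group but exceeds it comfortably with 10,000.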

The application of this technique to a sixty-six (66) item reading test taken by 8,000 black and white high school students shows that it can implicate items which once identified can be seen to have obvious sources of bias.

Four samples (white boys and girls and black boys and girls) of 450 students each whose scores ranged from 50-63 were randomly selected and combined to obtain four contrasting calibrations: white (900) v. black (900) and boy (900) v. girl (900). Estimates were compared and items with t>2.4 selected for content analysis.

The six items on Table I, Girl Items, can be grouped into three categories: boy-in-trouble, housekeeping and language arts. Item 51 (wet sleeping bag) and 23 (stealing money) portray boys in trouble. It is easy to suspect that such passage content distracts boys from concentrating on the reading task of getting the right answer. Boys will identify with the boys in trouble and become more concerned with figuring out how to avoid such trouble in their own lives than with finding the specified right answer. The "right answer" is trivial compared with learning to avoid accusations of personal impropriety or sleeping in the cold. It is useful to add that girls do better but not significantly better (1<t<2.4) on the four other items which portray boys-in-trouble.

TABLE I Girl Items (Items Easier for Girls than Boys)
Item Number	Item Name	Boy Diff (SE)	Girl Diff (SE)	Shift (B-G)	SED	t
33	border pattern	.140(.113)	-.453(.150)	.593	.188	3.2
46	lang keys	.760(.091)	.353(.108)	.406	.141	2.9
27	picnic litter	.702(.093)	-.307(.110)	.395	.144	2.7
51	wet sleep bag	.369(.104)	.078(.128)	.438	.165	2.7
36	good essay	.128(.114)	-.308(.141)	.436	.181	2.4
23	stealing money	-.381(.140)	-.904(.188)	.563	.234	2.4
mean shift for 6 items = .472

Items 33 and 27 are girl items because they are housekeeping items. Item 33 requires the reader to construct a border of the kind found on cloth or rug designs. To girls, most of whom take sewing, the word "border" is familiar and meaningful. Even more important, the construction of the correct border requires a willingness to follow directions as if one were cutting a dress pattern or preparing a recipe. Item 27, on the other hand, deals with cleaning up after a picnic. This item favors girls because they value cleanliness more than boys and because their experiences and training make the answer obvious to them. At class picnics, girls typically do "their share to clean-up litter," while boys do their best to amuse themselves until it's time to go home.

Items 46 and 36 are language arts items which generally favor girls. Item 46 requires students to use etymology and language tables while item 36 is easier if one already knows what makes a good essay. Also girls do better, but not significantly better (t=1.8 and t=2.1), on two other language arts items, one which requires selecting the best order for a series of sentences and one that requires the use of a book index. Since girls generally do better in language arts and enjoy it more than boys, it is not surprising that they do better on these "language arts" items.

TABLE II Boy Items (Items Easier for Boys than Girls)
Item Number	Item Name	Girl Diff (SE)	Boy Diff (SE)	Shift (G-B)	SED	t
1	astronaut	1.064(.086)	.453(.101)	.611	.133	4.6
21	football	.959(.089)	.373(.100)	.486	.134	3.6
16	train trestle	1.527(.077)	1.196(.081)	.331	.112	3.0
44	talking ants	-.111(.130)	-.614(.155)	.503	.202	2.5
2	father's grass	-.214(.136)	-.686(.160)	.472	.210	2.3
mean shift for 5 items = .481

The sources of bias for four of the five items on Table II are clear; the source for item 16, however, is elusive. Item 21 favors boys because it asks about football. Item 1 is a boy's adventure item; it is about an astronaut and of the same form as item 21. Item 2 is easy if a person has had some experience with using weed killer on lawns. It is safe to assume that boys have had more experience than girls helping father use poisons to "eradicate" weeds. The passage connected with item 44 is about ants. Given the expectation that girls disdain "bugs," it seems reasonable to find that boys do significantly better on this item.

Finally, item 16 appears to be about boys in trouble; therefore, we might expect girls to do better than boys on it. Boys, however, excel on this item. Perhaps this is because it reveals to boys how to stay out of trouble. Items 14 and 15, upon which girls do better, relate to "Jimmy" who is in trouble: caught on a railroad trestle as a train approaches. But, item 16 relates to "Bill" and his decision not to follow Jimmy across the trestle. Given the choice of nearly being killed as Jimmy or remaining safe as Bill, most boys would rather identify with Bill and safety. Experiences teach boys the awful and exciting terror of flirting with trains, but since most are like Bill, they survive boyhood adventures along the rails. Thus, just as in life, it is essential for them to identify with Bill and find the exact paragraph which "helps you know" that "Bill didn't follow Jimmy across the railroad trestle."

Whites do better than blacks on items shown on Table III because success on them requires knowledge of key words associated with "white culture." Item 2 turns on knowledge of the word "eradicate." Item 8 requires students to group words by categories. To answer item 37, the student must know the meaning of "assume," a word possibly more familiar to whites than blacks. The key word in item 55 is "guide word," and in item 33 it is "border." This last item is both a white and girl item; black boys have the most difficulty with it.

TABLE III White Items (Items Easier for Whites than Blacks)
Item Number	Item Name	Black Diff (SE)	White Diff (SE)	Shift (B-W)	SED	t
55	guide word	1.400(.071)	.882(.103)	.518	.125	4.1
2	father's grass	-.235(.119)	-.911(.214)	.676	.245	2.8
37	assume	.880(.080)	.480(.118)	.400	.143	2.8
8	word categories	-.008(.108)	-.569(.183)	.561	.212	2.6
33	border pattern	-.059(.106)	-.476(.175)	.535	.205	2.6
mean shift for 5 items = .538

Five of the six items on Table IV fall into two categories, while item 12 stands alone. Item 12 requires a student to interpret a slang expression ("Talking through his hat") in the context of a sentence. Given the skill with which young blacks create and use slang and informal figures of speech, it is not surprising that they do well on this item.

TABLE IV Black Items (Items Easier for Blacks than Whites)
Item Number	Item Name	White Diff (SE)	Black Diff (SE)	Shift (W-B)	SED	t
12	word play	.741(.108)	.175(.101)	.566	.148	3.8
58	lightening	1.087(.096)	.663(.086)	.424	.129	3.3
15	train trestle	2.175(.078)	1.870(.067)	.305	.103	3.0
28	picnic litter	1.053(.087)	1.158(.075)	.345	.115	3.0
24	stealing money	2.018(.079)	1.698(.068)	.320	.104	3.1
53	shark gills	-.054(.146)	-.531(.135)	.477	.199	2.4
mean shift for 6 items = .406

Items 24 and 28 are items whose answers are easy if one has a realistic view of life. Perhaps blacks develop a more realistic view of life than whites at an earlier age and, because of this, black students have greater insight into the right answers for these items. Item 24 is tied to a passage about Ron suspecting Carter of stealing money. Later, Ron discovers that his suspicions were unfounded. The item asks what was suggested but not stated in the passage. The answer is that Ron did not confront Carter with his suspicions. Since blacks more than whites are the victims of crime, they are more aware of the possibility of persons perpetrating crime. But since blacks more than whites are also the victims of violence stemming from altercations, they are acutely aware of the consequences which may follow from accusing someone of wrongdoing. Hence, that Ron did not talk to Carter about his suspicions is more obvious to blacks than whites.

Item 28 relates to a passage involving two boys eating their lunch in the park prior to a ball game. Tim eats and then does not want to take time to clean-up his litter because he wants to have more time to practice before the game, but Dave wants to clean-up before leaving. Dave does not think it is right to pollute the area, while Tim observes "Man, if you think you're goin' to cut down on pollution by picking up a couple of pieces of paper, you're crazy." Item 28 asks which idea would Tim probably believe. The answer is "one person alone cannot solve the problems of pollution." The reason this item is easier for blacks than whites seems obvious. So many black students taking this test either live in slum areas, or in areas on their way to becoming slums. These students are realistic enough to know that, alone, they cannot solve the problem of pollution in their neighborhoods. Other black students taking this test live in tidy working or middle-class neighborhoods and they also know that one person cannot solve the problems of pollution. Their parents often belong to block clubs organized to keep the neighborhood free from litter and pollution. Thus, that blacks would understand so clearly what Tim believes makes sense; the circumstances of their lives taught them the right answer.

Items 15, 53 and 58 seem to share no common element; however, all require students to locate a specific piece of information in the item's passage. Why should this kind of item be easier for blacks? The answer to this question is uncertain. One explanation could be that urban blacks are often subjected to so-called "canned," "teacher-proof" reading programs which emphasize, among other skills, locating specific information in the passages. Thus, blacks may have more experience and practice answering this kind of question.

The foregoing suggests that the use of Rasch estimates of item difficulty can implicate biased items. The next important task is to determine how much such biased items affect estimates of ability. Ultimately, we want to know whether the disproportionate numbers of blacks or boys in the lower ranks of a track system stem from biased items or from lower abilities.

In the test used for this paper, there are items which clearly favor boys, girls, blacks, and whites. While this bias may tend to cancel itself out over all of these 66 items, what if a test were constructed in which all items favored only one group? What would this mean for estimates of ability?

For each of Tables I-IV the mean difficulty shift over the items in the table was computed; the grand mean of these four values is .47. If this amount is added to a person's measure, the effect on scores of a totally biased test of sixty-six items is as shown in Table V.

TABLE V Effect of a Totally Biased 66-Item Test
Raw Score Level	Ability Level	Bias	Biased Ability	Biased Raw Score	Advantage due to Bias
55	2.02	+.47	2.49	58-59	+3 or 4
46	1.06	+.47	1.53	50-51	+4 or 5
34	.07	+.47	.54	39-40	+5 or 6
21	-.98	+.47	-.51	26-27	+5 or 6
11	-2.01	+.47	-1.54	15-16	+4 or 5

A boy, for example, who scored only 34 on the original test of sixty-six items could score 39-40 on a test biased entirely in his favor (with a bias of .47 logits). The "extra" five or six score points could mean the difference between passing and failing the test. Clearly, then, the use of a totally biased test could have dramatic consequences for the classification and placement of students.
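The arithmetic behind Table V can be approximated with a short sketch. The memo does not state how raw scores were converted to abilities, so the conversion below is an assumption: a log-odds transformation of the raw score multiplied by a hypothetical expansion factor of 1.25, chosen only because it roughly matches the Ability Level column. The results should be read as approximations to Table V, not as a reproduction of the original calculation.

```python
import math

N_ITEMS = 66
TABLE_MEAN_SHIFTS = [0.472, 0.481, 0.538, 0.406]  # mean shifts from Tables I-IV
BIAS = round(sum(TABLE_MEAN_SHIFTS) / len(TABLE_MEAN_SHIFTS), 2)  # grand mean
EXPANSION = 1.25  # hypothetical factor, not given in the memo

def ability_from_score(raw: float) -> float:
    """Approximate ability (logits) for a raw score on the 66-item test."""
    return EXPANSION * math.log(raw / (N_ITEMS - raw))

def score_from_ability(ability: float) -> float:
    """Inverse of ability_from_score: expected raw score for an ability."""
    return N_ITEMS / (1.0 + math.exp(-ability / EXPANSION))

def biased_score(raw: float, bias: float = BIAS) -> float:
    """Raw score expected if every item were easier by `bias` logits."""
    return score_from_ability(ability_from_score(raw) + bias)

for raw in (55, 46, 34, 21, 11):
    gain = biased_score(raw) - raw
    print(raw, round(ability_from_score(raw), 2), round(biased_score(raw), 1), round(gain, 1))
```

Under these assumptions the gain is largest for scores near the middle of the range and smaller toward the extremes, the same pattern as in Table V.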

In conclusion, the use of shifts in the estimates of Rasch item difficulties to implicate biased items appears to have potential. Even if an item with a significant shift has no obvious explanation for the source of its bias, the use of Rasch estimates would still make sense as a point of departure. Rasch item difficulty is the only fully estimable item parameter. Because the Rasch model corrects the estimate of item difficulty for the distribution of person ability in the group, it provides a clear and statistically defensible expectation for the variation of difficulty estimates across groups: they should be statistically equivalent. If they are not, the reason for the discrepancies in item estimates from separate calibration samples may not be obvious, but at least there is a statistical justification for reviewing items and perhaps removing them from the test. If, as is often the case, items are "cheap" to construct, a set of items which retain their difficulties across groups can eventually be constructed and used to measure males and females from different racial backgrounds on a single variable, "fairly."

The Identification and Interpretation of Item Bias