# The Identification and Interpretation of Item Bias

Student grouping schemes based on test performance often have an adverse impact upon minorities. This underscores the need for a technique which helps determine whether the impact stems from biased items in the test or from differential abilities on the intended trait.

One way to detect biased items follows from the Rasch model and the way it estimates item difficulties. The model removes the ability distribution of any group used to estimate item difficulties; hence, difficulty estimates should be statistically equivalent for groups distinguished only by their ability distributions. If, on the contrary, difficulty estimates shift significantly from group to group, this suggests that the item interacts with particular characteristics of the groups such as race or sex.
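This invariance claim can be made concrete. Under the Rasch model the log-odds of a correct response is simply ability minus difficulty, so for any person the difference in log-odds between two items equals the difference in the items' difficulties, whatever that person's ability. A minimal sketch in Python (the difficulties and abilities below are illustrative, not taken from the test):

```python
import math

def rasch_prob(ability, difficulty):
    """Rasch model: P(correct) depends only on (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    return math.log(p / (1.0 - p))

# For any person, the difference in log-odds between two items equals the
# difference in item difficulties -- independent of the person's ability.
item_a, item_b = -0.5, 1.2          # illustrative difficulties (logits)
for ability in (-2.0, 0.0, 2.0):    # persons of very different ability
    gap = log_odds(rasch_prob(ability, item_a)) - log_odds(rasch_prob(ability, item_b))
    print(round(gap, 6))            # always item_b - item_a = 1.7
```

It is this separability of person ability from item difficulty that justifies the expectation of statistically equivalent difficulty estimates across groups.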

Of course, difficulties estimated with distinct groups cannot be identical; at best, they can only be statistically equivalent. Therefore, the point at which a difficulty shift makes a difference requires some definition.

Rasch estimates are derived from a logistic transformation of the proportion right. For a typical test, difficulties range between +2.5 and -2.5 logits. Therefore, when an item estimate shifts a half logit (ten percent of the range from -2.5 to +2.5) between separate calibration samples, we might consider that shift conspicuous. Wright and Douglas (1975) show that if the shift is less than half a logit and unsystematic, then it has little effect on the accuracy of measurement for tests of more than twenty items that are approximately on target. This "half logit" rule works fairly well; however, we can qualify it with a statistical observation.

Associated with each estimate of item difficulty is a standard error. This, in concert with the estimate of difficulty, can be used to compute a "t" statistic which evaluates the significance of the shift in estimates.

$$t_{i12} = \frac{D_{i1} - D_{i2}}{\sqrt{S_{i1}^2 + S_{i2}^2}}$$

where $D_{i1}$ is the item difficulty estimate and $S_{i1}$ its standard error for group 1 on item $i$, and $D_{i2}$ and $S_{i2}$ are the estimate and standard error for group 2 on the same item. A t-value of more than 2 is often considered significant; however, since we are examining more than twenty items at a time, it seems reasonable (following Bonferroni) to raise the t-value required for "statistical significance" to 2.4.
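The computation is straightforward. A sketch in Python (the function name is our own), checked against the first row of Table I:

```python
import math

def shift_t(d1, s1, d2, s2):
    """t statistic for the shift in an item's difficulty between two groups.

    d1, d2: difficulty estimates (logits) from the two calibration samples
    s1, s2: their standard errors
    """
    return (d1 - d2) / math.sqrt(s1**2 + s2**2)

# Item 33 ("border pattern") from Table I: boys .140 (SE .113), girls -.453 (SE .150)
t = shift_t(0.140, 0.113, -0.453, 0.150)
print(round(t, 1))  # 3.2 -- above the Bonferroni-adjusted threshold of 2.4
```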

Adding this t-test qualification to the half logit "rule" guards against over-reacting to differences that could have resulted from chance alone. For example, estimates of an item's difficulty could shift more than a half logit because the item is off target for both calibration samples, not because it interacts with the group's race or sex. That the item is off target manifests itself as large standard errors, and their magnitude renders an apparently dramatic shift non-significant once the t-value is computed. Conversely, shifts of less than a half logit may be significant because the standard errors are small, a sign that the item is on target.

But like all simple techniques, this one has a shortcoming. The size of the standard errors for all items depends on the size of the calibration sample. Given sufficiently large samples, "significant" t-values may result from differences considerably less than a half logit.
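The two qualifications can be combined into a single screening rule: flag an item only when its shift is both conspicuous in size (more than half a logit) and statistically significant (t > 2.4). A hedged sketch, with the function and constant names being our own:

```python
import math

HALF_LOGIT = 0.5   # "conspicuous" shift: ten percent of a typical -2.5..+2.5 range
T_CRITICAL = 2.4   # Bonferroni-adjusted significance threshold

def flag_item(d1, s1, d2, s2):
    """Flag an item whose difficulty shift is both large and significant.

    Requiring both criteria guards against two errors:
    - a big but noisy shift (off-target item, large SEs) is not flagged;
    - with very large samples, a significant but tiny shift is not flagged either.
    """
    shift = abs(d1 - d2)
    t = shift / math.sqrt(s1**2 + s2**2)
    return shift > HALF_LOGIT and t > T_CRITICAL

# Item 33 from Table I: shift is large and significant -> flagged
print(flag_item(0.140, 0.113, -0.453, 0.150))   # True
# Same shift but with inflated standard errors (off-target item) -> not flagged
print(flag_item(0.140, 0.300, -0.453, 0.300))   # False
```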

The application of this technique to a sixty-six item reading test taken by 8,000 black and white high school students shows that it can implicate items which, once identified, can be seen to have obvious sources of bias.

Four samples (white boys, white girls, black boys, and black girls) of 450 students each, whose scores ranged from 50 to 63, were randomly selected and combined to obtain four contrasting calibrations: white (900) v. black (900) and boy (900) v. girl (900). Estimates were compared, and items with t > 2.4 were selected for content analysis.

The six items in Table I, Girl Items, can be grouped into three categories: boy-in-trouble, housekeeping, and language arts. Items 51 (wet sleeping bag) and 23 (stealing money) portray boys in trouble. It is easy to suspect that such passage content distracts boys from concentrating on the reading task of getting the right answer. Boys will identify with the boys in trouble and become more concerned with figuring out how to avoid such trouble in their own lives than with finding the specified right answer. The "right answer" is trivial compared with learning to avoid accusations of personal impropriety or sleeping in the cold. It is useful to add that girls do better, but not significantly better (1 < t < 2.4), on the four other items which portray boys in trouble.

TABLE I. Girl Items (Items Easier for Girls than Boys)

| Item Number | Item Name | Boy Diff (SE) | Girl Diff (SE) | Shift (B-G) | SED | t |
|---|---|---|---|---|---|---|
| 33 | border pattern | .140 (.113) | -.453 (.150) | .593 | .188 | 3.2 |
| 46 | lang keys | .760 (.091) | .353 (.108) | .406 | .141 | 2.9 |
| 27 | picnic litter | .702 (.093) | -.307 (.110) | .395 | .144 | 2.7 |
| 51 | wet sleep bag | .369 (.104) | .078 (.128) | .438 | .165 | 2.7 |
| 36 | good essay | .128 (.114) | -.308 (.141) | .436 | .181 | 2.4 |
| 23 | stealing money | -.381 (.140) | -.904 (.188) | .563 | .234 | 2.4 |

Mean shift for 6 items = .472

Items 33 and 27 are girl items because they are housekeeping items. Item 33 requires the reader to construct a border of the kind found on cloth or rug designs. To girls, most of whom take sewing, the word "border" is familiar and meaningful. Even more important, the construction of the correct border requires a willingness to follow directions as if one were cutting a dress pattern or preparing a recipe. Item 27, on the other hand, deals with cleaning-up after a picnic. This item favors girls because they value cleanliness more than boys and because their experiences and training make the answer obvious to them. At class picnics, girls typically do "their share to clean-up litter," while boys do their best to amuse themselves until it's time to go home.

Items 46 and 36 are language arts items which generally favor girls. Item 46 requires students to use etymology and language tables while item 36 is easier if one already knows what makes a good essay. Also girls do better, but not significantly better (t=1.8 and t=2.1), on two other language arts items, one which requires selecting the best order for a series of sentences and one that requires the use of a book index. Since girls generally do better in language arts and enjoy it more than boys, it is not surprising that they do better on these "language arts" items.

TABLE II. Boy Items (Items Easier for Boys than Girls)

| Item Number | Item Name | Girl Diff (SE) | Boy Diff (SE) | Shift (G-B) | SED | t |
|---|---|---|---|---|---|---|
| 1 | astronaut | 1.064 (.086) | .453 (.101) | .611 | .133 | 4.6 |
| 21 | football | .959 (.089) | .373 (.100) | .486 | .134 | 3.6 |
| 16 | train trestle | 1.527 (.077) | 1.196 (.081) | .331 | .112 | 3.0 |
| 44 | talking ants | -.111 (.130) | -.614 (.155) | .503 | .202 | 2.5 |
| 2 | father's grass | -.214 (.136) | -.686 (.160) | .472 | .210 | 2.3 |

Mean shift for 5 items = .481

The sources of bias for four of the five items in Table II are clear; the source for item 16, however, is elusive. Item 21 favors boys because it asks about football. Item 1 is a boy's adventure item; it is about an astronaut and of the same form as item 21. Item 2 is easy if a person has had some experience with using weed killer on lawns. It is safe to assume that boys have had more experience than girls helping father use poisons to "eradicate" weeds. The passage connected with item 44 is about ants. Given the expectation that girls disdain "bugs," it seems reasonable to find that boys do significantly better on this item.

Finally, item 16 appears to be about boys in trouble; therefore, we might expect girls to do better than boys on it. Boys, however, excel on this item. Perhaps this is because it reveals to boys how to stay out of trouble. Items 14 and 15, upon which girls do better, relate to "Jimmy" who is in trouble: caught on a railroad trestle as a train approaches. But item 16 relates to "Bill" and his decision not to follow Jimmy across the trestle. Given the choice of nearly being killed, like Jimmy, or remaining safe, like Bill, most boys would rather identify with Bill and safety. Experience teaches boys the awful and exciting terror of flirting with trains, but since most are like Bill, they survive boyhood adventures along the rails. Thus, just as in life, it is essential for them to identify with Bill and find the exact paragraph which "helps you know" that "Bill didn't follow Jimmy across the railroad trestle."

Whites do better than blacks on items shown in Table III because success on them requires knowledge of key words associated with "white culture." Item 2 turns on knowledge of the word "eradicate." Item 8 requires students to group words by categories. To answer 37, the student must know the meaning of "assume," a word possibly more familiar to whites than blacks. The key word in item 55 is "guide word," and in 33 it is "border." This last item is both a white and a girl item; black boys have the most difficulty with it.

TABLE III. White Items (Items Easier for Whites than Blacks)

| Item Number | Item Name | Black Diff (SE) | White Diff (SE) | Shift (B-W) | SED | t |
|---|---|---|---|---|---|---|
| 55 | guide word | 1.400 (.071) | .882 (.103) | .518 | .125 | 4.1 |
| 2 | father's grass | -.235 (.119) | -.911 (.214) | .676 | .245 | 2.8 |
| 37 | assume | .880 (.080) | .480 (.118) | .400 | .143 | 2.8 |
| 8 | word categories | -.008 (.108) | -.569 (.183) | .561 | .212 | 2.6 |
| 33 | border pattern | -.059 (.106) | -.476 (.175) | .535 | .205 | 2.6 |

Mean shift for 5 items = .538

Five of the six items in Table IV fall into two categories, while item 12 stands alone. Item 12 requires a student to interpret a slang expression ("talking through his hat") in the context of a sentence. Given the skill with which young blacks create and use slang and informal figures of speech, it is not surprising that they do well on this item.

TABLE IV. Black Items (Items Easier for Blacks than Whites)

| Item Number | Item Name | White Diff (SE) | Black Diff (SE) | Shift (W-B) | SED | t |
|---|---|---|---|---|---|---|
| 12 | word play | .741 (.108) | .175 (.101) | .566 | .148 | 3.8 |
| 58 | lightening | 1.087 (.096) | .663 (.086) | .424 | .129 | 3.3 |
| 15 | train trestle | 2.175 (.078) | 1.870 (.067) | .305 | .103 | 3.0 |
| 28 | picnic litter | 1.053 (.087) | 1.158 (.075) | .345 | .115 | 3.0 |
| 24 | stealing money | 2.018 (.079) | 1.698 (.068) | .320 | .104 | 3.1 |
| 53 | shark gills | -.054 (.146) | -.531 (.135) | .477 | .199 | 2.4 |

Mean shift for 6 items = .406

Items 24 and 28 are items whose answers are easy if one has a realistic view of life. Perhaps blacks develop a more realistic view of life than whites at an earlier age and, because of this, black students have greater insight into the right answers for these items. Item 24 is tied to a passage about Ron suspecting Carter of stealing money. Later, Ron discovers that his suspicions were unfounded. The item asks what was suggested but not stated in the passage. The answer is that Ron did not confront Carter with his suspicions. Since blacks more than whites are the victims of crime, they are more aware of the possibility of persons perpetrating crime. But since blacks more than whites are also the victims of violence stemming from altercations, they are acutely aware of the consequences which may follow from accusing someone of wrongdoing. Hence, that Ron did not talk to Carter about his suspicions is more obvious to blacks than whites.

Item 28 relates to a passage involving two boys eating their lunch in the park prior to a ball game. Tim eats and then does not want to take time to clean-up his litter because he wants to have more time to practice before the game, but Dave wants to clean-up before leaving. Dave does not think it is right to pollute the area, while Tim observes "Man, if you think you're goin' to cut down on pollution by picking up a couple of pieces of paper, you're crazy." Item 28 asks which idea Tim would probably believe. The answer is "one person alone cannot solve the problems of pollution." The reason this item is easier for blacks than whites seems obvious. Many of the black students taking this test live in slum areas, or in areas on their way to becoming slums. These students are realistic enough to know that, alone, they cannot solve the problem of pollution in their neighborhoods. Other black students taking this test live in tidy working- or middle-class neighborhoods, and they also know that one person cannot solve the problems of pollution. Their parents often belong to block clubs organized to keep the neighborhood free from litter and pollution. Thus, that blacks would understand so clearly what Tim believes makes sense; the circumstances of their lives taught them the right answer.

Items 15, 53 and 58 seem to share no common element; however, all require students to locate a specific piece of information in the item's passage. Why should this kind of item be easier for blacks? The answer to this question is uncertain. One explanation could be that urban blacks are often subjected to so-called "canned," "teacher-proof" reading programs which emphasize, among other skills, locating specific information in the passages. Thus, blacks may have more experience and practice answering this kind of question.

The foregoing suggests that the use of Rasch estimates of item difficulty can implicate biased items. The next important task is to determine how much such biased items affect estimates of ability. Ultimately, we want to know whether the disproportionate numbers of blacks or boys in the lower ranks of a track system stem from biased items or from lower abilities.

In the test used for this paper, there are items which clearly favor boys, girls, blacks, and whites. While this bias may tend to cancel itself out over all of these 66 items, what if a test were constructed in which all items favored only one group? What would this mean for estimates of ability?

In Tables I-IV the mean difficulty shift for the items in each table was computed; the grand mean is .47. If we add this amount to the measure of a person, the effect on scores of a totally biased test of sixty-six items can be seen in Table V.

TABLE V. Effect of a Totally Biased 66-Item Test

| Raw Score Level | Ability Level | Bias | Biased Ability | Biased Raw Score | Advantage due to Bias |
|---|---|---|---|---|---|
| 55 | 2.02 | +.47 | 2.49 | 58-59 | +3 or 4 |
| 46 | 1.06 | +.47 | 1.53 | 50-51 | +4 or 5 |
| 34 | .07 | +.47 | .54 | 39-40 | +5 or 6 |
| 21 | -.98 | +.47 | -.51 | 26-27 | +5 or 6 |
| 11 | -2.01 | +.47 | -1.54 | 15-16 | +4 or 5 |

A boy, for example, who scored only 34 on the original test of sixty-six items could score 39-40 on a test biased entirely in his favor (with the magnitude of bias, .47). The "extra" five or six score points could mean the difference between passing or failing the test. Clearly, then, the use of a totally biased test could have dramatic consequences for the classification and placement of students.
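The raw-score columns in Table V follow from the test characteristic curve: under the Rasch model, the expected raw score at a given ability is the sum of the item probabilities over all sixty-six items. A sketch with hypothetical item difficulties (the actual difficulties are not reproduced here; we assume them spread evenly over the -2.5 to +2.5 logit range described earlier):

```python
import math

def expected_raw_score(ability, difficulties):
    """Test characteristic curve: expected raw score = sum of Rasch probabilities."""
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties)

# Hypothetical: 66 item difficulties spread evenly over -2.5 to +2.5 logits
difficulties = [-2.5 + 5.0 * i / 65 for i in range(66)]

BIAS = 0.47  # grand mean shift from Tables I-IV
for ability in (0.07, -0.98):                # ability levels taken from Table V
    base = expected_raw_score(ability, difficulties)
    biased = expected_raw_score(ability + BIAS, difficulties)
    print(round(base), "->", round(biased))  # a gain of roughly five score points
```

With this hypothetical item spread, the gain near the middle of the score range comes out close to the five-or-six points shown in Table V.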

In conclusion, the use of shifts in the estimates of Rasch item difficulties to implicate biased items appears to have potential. Even if an item with a significant shift has no obvious explanation for the source of its bias, the use of Rasch estimates would still make sense as a point of departure. Rasch item difficulty is the only fully estimable item parameter, and the Rasch model corrects the estimate of item difficulty for the distribution of person ability in the group. The model therefore provides a clear and statistically defensible expectation for difficulty estimates from separate calibration samples: they should be statistically equivalent. If they are not, the reason for the discrepancies may not be obvious, but there is at least a statistical justification for reviewing the items and perhaps removing them from the test. If, as is often the case, items are "cheap" to construct, a set of items that retain their difficulties across groups can eventually be constructed and used to measure males and females from different racial backgrounds on a single variable, "fairly."

Robert E. Draba
MESA Memorandum No. 25
March 1977
