Detecting and Correcting Test Item Bias with a Logistic Response Model

Introduction

None of the techniques proposed to identify biased test items have satisfactorily solved the problem of how to measure persons fairly regardless of their race, sex, or cultural background. The extensive litigation on the use of tests to classify minority students is only one illustration of the confusion that shrouds the definition and identification of test bias. Until this confusion gives way to clarity, progress in research on the acquisition of basic skills will be thwarted because valid measures of achievement will not be available.

To replace confusion with clarity requires a lucid conception of unbiased measurement, one expressed in terms of an explicit measurement model that is theoretically rigorous and technically feasible. Both are important. A theory is only as useful as it is practical and a technique is only as meaningful as it is consistent with a coherent and comprehensive point of view. We propose to apply such a measurement model to answer the questions:

1. Which items are biased and for whom?

2. Which items define the trait to be measured?

3. Which persons are properly measured by items that define the trait?

The model we propose to use is the Rasch logistic response model (Rasch, 1960, 1961, 1966; Wright, 1968; Wright and Panchapakesan, 1969). This model can incorporate most of the previous work on bias because it begins with similar measurement assumptions. But since its procedures are rational extensions of the model, the analysis of bias can be pursued further and in a more systematic and integrated manner than has previously been done. In particular, these procedures identify items which can lead to a valid measure for any person and therefore can be used not only to detect and correct biased measurements for any group but also to detect a biased measurement for any individual. If our purpose is not only to construct unbiased tests but to maintain unbiased measurement in the subsequent course of testing, this capability to detect bias in every individual test protocol is vital. Only with such a capability on the individual level can we discover an unexpectedly biased test protocol and so be in a position to protect that particular individual from the damage of a biased measurement.

BACKGROUND

Previous Attempts to Detect Bias

Much current interest in test bias stems from litigation regarding the use of tests to classify and sort minorities for employment (Blumrosen, 1972) and educational (Beckwith, 1973) opportunities. The impact of this litigation and the concern for fairness in selection has stimulated the investigation of techniques to detect bias. These investigations divide into two categories depending upon whether the criterion of bias is external or internal.

Some studies have investigated the extent to which test scores predict success in college (Harris and Reitzel, 1967; Cleary, 1968; Bowers, 1970 and Temp, 1971) and success on the job (Sadacca and Brackett, 1973; Campbell, 1973). Cleary (1968) considers a test biased if the common regression line systematically mis-estimates minority performance (also Cardall and Coffman, 1964 and Anastasi, 1968, p.559). Systematic error is said to indicate that the test has "differential validity" (Green, 1975).

Even though shortcomings of the regression approach have been identified (Darlington, 1971 and Linn and Werts, 1971) and "selection models" offered as a better way to achieve fair selection (Einhorn and Bass, 1971; Thorndike, 1971; Darlington, 1971 and Cole, 1973), there remains a fundamental problem attending any technique that relies on an external criterion: namely, the assumption that the criterion is an unbiased measure. If the criterion itself is biased, then regression would make a fair test seem unfair (Hollman, 1973).

The difficulty of constructing an unbiased criterion (Petersen and Novick, 1974) encourages us to study techniques for detecting and correcting test bias which use only an internal criterion and hence only the information contained in the responses of persons to test items. Although some say there is no "objective statistical definition of bias without a criterion variable" (Potthoff, 1966, p.82), techniques for detecting bias have been proposed which use no external criterion. These techniques typically address one of two closely related questions: Which items are biased or which items define the trait?

Assuming that a test is any collection of items intended to measure a single trait (Green and Draper, 1972), related techniques have been used to determine whether the same items define the same trait for different groups. One compares the factor structures for different groups taking the same items (Matuszek and Oakland, 1972 and Hollman, 1973). Another uses the item-score point biserials for different groups taking the same items to select the most "effective" items for each group, constructs separate "tests" and compares them (Green and Draper, 1972). Similar factor structures or "tests" are then said to indicate that groups respond similarly and hence fairly to these items.

Even though these techniques attempt to answer a vital question, namely which items define the trait, they have limitations. Any technique which relies upon correlation as the indication of bias is vulnerable to the variation on the trait in the samples studied. Where groups differ in their trait variability, items can appear biased when they are not.

The other group of techniques focuses upon the question: Which items are biased? The use of chi-square and log-linear models to analyze responses to distractors as an indirect way to detect biased items (Veale and Foreman, 1975 and Maw, 1976) is intriguing, but its destiny is uncertain, because weighting distractors does not seem to have much impact upon the ordering of students (Hakstian and Kansup, 1975). However, this technique might become useful in revealing the reason a particular multiple-choice item is biased once such an item has been discovered by a more direct approach.

The direct approach to item bias begins with the unadjusted comparison of item difficulties calculated in the traditional manner (Litaker, 1974). An item for which the proportion right varies from group to group is suspect. The trouble with this approach is that it depends upon the unlikely assumption that abilities are equivalently distributed across groups. This shortcoming can only be corrected by removing the various distributions of abilities from the estimates of item difficulty.

Analysis of variance is one way to approximate this removal and so to study the relative difficulty of items across groups (Cardall and Coffman, 1964 and Cleary and Hilton, 1968). A significant item by group interaction implies that the relative difficulty of some items has not remained constant, an indication of bias. A problem with this approach, however, is the heterogeneity of item difficulty variance which is maximum for items at fifty percent difficulty and zero for items at zero or one hundred percent. This problem can be eased by a transformation such as the arc-sin used to stabilize item difficulty variance by Fremer (Cleary and Hilton, 1968, p.63). But this statistical convenience has no ready interpretation from a measurement perspective.

A refinement of the analysis of variance approach uses a cultural group by ability group contingency table. "An item is considered unbiased if for persons with the same ability in the area being measured, the probability of a correct response on the item is the same regardless of the population group membership of the individual" (Scheuneman, 1975). This definition of an unbiased item is very promising. What it needs is a further development of its mathematical foundations so that its potential can be more fully exploited.

Echternacht (1972) moves in this direction when he transforms traditional item difficulties into scaled probits for each of a series of groups and exhibits the ordered differences between these scaled difficulties for each pair of groups on a normal plot. He reasons that if the relative item difficulties are the same for a pair of groups, then the ordered differences should fall on a straight line with intercept and slope bearing only on the difference in ability distribution of the two groups. However, items may be biased and still lie near the normal line or unbiased and lie away from it. When item difficulties are similarly ordered but differently dispersed, we would ordinarily consider the test to be biased although the ordered differences could easily fall on a normal line. A line which is exceptionally steep or flat would indicate items which are not functioning in the same manner for the two groups. On the other hand, if an item is equally inappropriate for either group, the difference in difficulties will tend to be large even if no bias in the usual sense is present. Here within group fit is confounded with between group fit. The remedy for both problems resides in the use of a response model that permits items to be tested for fit both within and between groups and which provides an expected variance for testing the slope of the normal line.

The Rasch logistic response model can meet these requirements. Its application frees the comparison of item difficulties across groups from differences in the distribution of person ability within groups. It specifies a logistic transformation of the traditional item difficulties as the only reasonable transformation. The variance of item difficulty estimates now corresponds to the real situation in which information is maximum in the center and minimum at the extremes. The maximum likelihood estimation techniques applicable to the model lead to useful asymptotic estimates of the variance of parameter estimates. All this makes it possible to identify tests which are biased in ways which do not change the relative difficulties of items but rather their scale of measurement, to separate biased items from items which misfit for other reasons and to specify the magnitude of residual variance to be expected when items and persons together fit the measurement model. 1

[1. Durovic (1975) grasped the implications of the Rasch model for detecting item bias, but did not reach the systematic analysis of residuals from the model necessary to exploit fully the potential of the model for dealing with the test bias problem.]

The Requirements for Good Measurement

Only when we have a clear idea what is meant by a good measurement can we recognize observations which do not lead to valid measures. If we formulate the ideal model for a fair test and if that model adequately accounts for the observed situation, we have demonstrated measures free from bias or other contaminates in that situation. The extent to which the model does not account for the situation is an indication of the seriousness of the bias problem.

The discrepancies between the observed data and what would be expected according to the model, if the test were functioning as a valid measuring instrument, provide material for the diagnosis of the sources of invalidity. Procedures which we believe will prove useful for this sort of diagnosis at both the test development and application phases will be discussed later. First we need to formulate a model for what we will mean by a good measurement.

We wish to obtain measures for any person on some trait (such as reading comprehension). We will refer to the person's position on the trait as his "ability." At the very least a good measurement model should require that a valid test satisfy the following conditions:

1. A more able person always has a better chance of success on an item than does a less able person.

2. Any person has a better chance of success on an easy item than on a difficult one.

3. These conditions can only be the consequence of the person's and the item's position on the trait and so they must hold regardless of the race, sex, etc. of the person measured.

These simple conditions have far-reaching implications. They mean we will consider only tests composed of items which form a homogeneous set. This excludes, for example, tests scored by combining a mixture of math and reading items since a person high in reading but low in math could succeed on the difficult reading items but fail on the easy math items. It also excludes situations such as the observation of quantitative problem solving ability with word problems when the persons tested vary enough in their ability to read to vary substantially in their ability to understand the problem statements.

Implicit in these considerations is the notion that the difficulty of an item is an inherent property of the item which adheres to the item under all relevant circumstances without reference to any particular population of persons to whom the item might be administered. Analogously, a person's ability is defined as a characteristic of the person without reference to any particular set of items. This is equivalent to saying that the model describes an interaction between a person and an item as governed by two and only two parameters--an ability for the person and a difficulty for the item.

A major consequence of these considerations is that it is possible to derive an estimator for each parameter that is independent of all other parameters. All the information about a person's ability contained in his responses to a set of items is contained in the simple, unweighted count of the number of items which he answered correctly. Raw score is a sufficient statistic for ability. For item difficulty the sufficient statistic is the number of persons who responded correctly to that item. Traditional item analysis which characterizes items by the proportion answering them correctly and persons by the proportion of items answered correctly is, of course, entirely consistent with this result. However, traditional methods fail to take advantage of the potential for measurement that is implicit in this practice.
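The sufficiency of raw score can be checked numerically. The sketch below is only an illustration: it assumes the logistic form given later as equation (1), and the item difficulties, abilities, and function names are hypothetical. It shows that, once raw score is fixed, the probability of any particular response pattern no longer depends on the person's ability, although the unconditional probability of that pattern does.

```python
import math
from itertools import combinations

def rasch_prob(ability, difficulty):
    """Logistic success probability, as in equation (1)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pattern_prob(pattern, ability, difficulties):
    """Probability of one 0/1 response pattern, items assumed independent."""
    prob = 1.0
    for x, d in zip(pattern, difficulties):
        p = rasch_prob(ability, d)
        prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_prob(pattern, ability, difficulties):
    """P(pattern | raw score): this pattern over all patterns with that score."""
    n_items, score = len(pattern), sum(pattern)
    total = 0.0
    for right in combinations(range(n_items), score):
        other = [1 if i in right else 0 for i in range(n_items)]
        total += pattern_prob(other, ability, difficulties)
    return pattern_prob(pattern, ability, difficulties) / total

difficulties = [-1.0, 0.0, 0.5, 1.5]   # hypothetical item difficulties (logits)
pattern = [1, 1, 0, 0]                 # raw score of 2

low = conditional_prob(pattern, -1.0, difficulties)
high = conditional_prob(pattern, 2.0, difficulties)
print(low, high)  # the two values agree up to rounding error
```

Because the conditional probability is free of ability, nothing beyond the count of correct answers carries information about where the person stands on the trait.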

Problems with Traditional Methods

The traditional definition of item difficulty, stopping where it does, is sample dependent. It is only relevant to a particular group tested and changes if a group with a different distribution of ability is encountered. The proportion metric is distorted by floor and ceiling effects and so is not an adequate basis for linear analysis.
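A small numerical sketch makes the sample dependence concrete. The ability samples and item difficulties below are hypothetical, and the responses are assumed to follow the logistic model given later as equation (1). The same item shows a different expected proportion right in each sample, while the gap between two items, expressed in logits for any one person, does not move with ability.

```python
import math

def rasch_prob(ability, difficulty):
    """Logistic success probability, as in equation (1)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def proportion_correct(abilities, difficulty):
    """Expected traditional item difficulty (proportion right) in a sample."""
    return sum(rasch_prob(b, difficulty) for b in abilities) / len(abilities)

def logit_gap(ability, d_easy, d_hard):
    """Difference in log-odds of success between two items for one person."""
    p_easy = rasch_prob(ability, d_easy)
    p_hard = rasch_prob(ability, d_hard)
    return math.log(p_easy / (1 - p_easy)) - math.log(p_hard / (1 - p_hard))

item_difficulty = 0.5                      # hypothetical, in logits
low_group = [-2.0, -1.5, -1.0, -0.5, 0.0]  # hypothetical abilities
high_group = [0.0, 0.5, 1.0, 1.5, 2.0]

p_low = proportion_correct(low_group, item_difficulty)
p_high = proportion_correct(high_group, item_difficulty)
print(round(p_low, 3), round(p_high, 3))   # same item, different proportions

for b in (-1.0, 2.0):
    print(b, round(logit_gap(b, 0.0, 1.0), 6))  # same gap at both abilities
```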

To overcome these deficiencies would require adjusting the observations for the ability distribution of the sample that produced them to obtain estimates of item difficulty which are independent of the sample ability distribution. The floor and ceiling effects could be addressed by a non-linear model for the regression of score on ability which expanded the scale near its limits, increasing the standard errors of measurement in a way more realistically reflecting the poorer precision of measurement in the extreme regions.

There is also evidence that the rules for item selection that follow from the traditional approach do not lead to the optimal set of items (Birnbaum, 1968, p.465). The usual criteria for including an item are that it have a high discrimination (as shown by the point biserial correlation of item score with test score) and that the difficulty (proportion correct) be near fifty percent. Tucker (1946) demonstrated that the perfect achievement of these ideals would be unfortunate, since in that case a test of any number of items would be no better than a test of only one item.

The factors leading to unusually high discriminations are still more disturbing. High discriminations are frequently due to the influence of an extraneous variable rather than to a stronger relationship of the item with the intended trait. An item requiring knowledge of the word "sonata" might be an effective index of some form of intelligence for pupils from high socioeconomic backgrounds (Davis, 1948). It might also work well for low SES pupils so that within each group the more able student would be more likely to be familiar with the word. However, high SES pupils of a given ability are more likely to be familiar with "sonata" than are low SES pupils of the same ability because of differences in exposure to this culturally biased word. Typically, high SES students perform better on achievement tests. Items like "sonata," which classify pupils into SES groups, will also classify them into achievement groups. The greater the difference in the levels of group achievement, the more effective such culturally biased items will appear. If items are selected on the basis of high discrimination, culturally biased items will be selected, producing tests with greater and greater bias.

This fault will also flaw the common practice of describing the trait measured in terms of the items most correlated with total score, since it will lead to defining the trait in terms of the most biased items.

The Rasch Logistic Response Model

Georg Rasch (1960) provided a rethinking of the measurement problem which overcomes most of the deficiencies of traditional item analysis. 1 [1. Other response models with more than one item parameter have been proposed to overcome these deficiencies (Lord, 1968, 1975; Birnbaum, 1968; Bock, 1972). However, consistent estimators of the additional parameters do not in general exist (Neyman and Scott, 1948; Andersen, 1973) nor have attempts to apply these models been successful (Lord, 1975). As a result we will use Rasch's one-parameter model for which there are consistent estimators and which has proven useful in a wide variety of applications (Passmore, 1975) as the backbone for our proposed study.] Rasch's stochastic response model describes the probability of a successful outcome of a person on an item as a function of only the person's ability and the item's difficulty. Using only the traditional requirement that a measurement be based on a set of homogeneous items monotonically related to the trait to be measured, Rasch derived his measurement model in the form of a simple logistic expression and demonstrated that in this form the item and person parameters are statistically separable. Andersen (1973) elaborated and refined the mathematical basis for the model. Wright and Panchapakesan (1969) developed practical estimation procedures that made application of the model feasible.

Rasch's model, while based on the same requirement of the sufficiency of total score relied on by traditional methods, offers new and promising opportunities for advancing our understanding of test bias. Since the parameters of the model are separable, it is possible to derive estimators for each parameter entirely independently of the others. The logistic transformation assigns an ability of minus infinity to a score of zero and plus infinity to a score of one hundred percent. This eliminates the bounds on the ability range and puts the standard errors of measurement into a reasonable relationship with the information provided by observed score. The tests of item fit which are the basis for item selection are sensitive to high discriminations as well as to low and so lead to the selection of those items which form a consistent definition of the trait and to the rejection of exceptional items. Finally, the explicitness of the mathematical expression of the model facilitates statistical statements about the significance of individual person-item interactions and so makes both a very general and a very detailed analysis for bias possible.

The Rasch model provides an explicit framework for comparing observed with expected outcomes. The expected outcome of administering an item to a person is that predicted by the model assuming that the item is fair with respect to that person and that the person was adequately motivated to bring his full ability to bear on the item. The model permits us to assess the likelihood of the observed result, and hence, to make statements about the appropriateness of the particular item for the particular person.

These residual differences between observed and expected can be organized into a variety of indices, the particular form of which depends on the nature of the problem being investigated. Our current work suggests that many familiar disturbances, such as guessing, speededness, and cultural bias, may be diagnosed through linear analysis of residuals.

We propose to build on the traditional definition of item bias: that an item is biased if it measures differently in different groups, and on the analysis of residuals from the Rasch model. We anticipate that this approach will ultimately incorporate much of the previous work on test bias by exploiting the assumptions underlying that work to the fullest.

Fit Analysis for the Study of Test Bias

If the result of the interaction between person n and item i is denoted by Xni with the value "1" if the person's response is correct and "0" otherwise, the Rasch logistic model for the probability of the person's success on that item may be expressed as:

(1) P(Xni=1|Bn,Di)= exp(Bn-Di) / (1+exp(Bn-Di))

where Bn is the ability of person n and Di is the difficulty of item i. This will be written as Pni for convenience. The expectation and variance of Xni are

(2) E(Xni) = Pni and Var(Xni) = Pni(1-Pni)
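Equations (1) and (2) translate directly into code. The following is a minimal sketch; the function names and the ability and difficulty values are illustrative assumptions, not part of the memorandum.

```python
import math

def rasch_prob(ability, difficulty):
    """Equation (1): probability that person n succeeds on item i."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def rasch_variance(ability, difficulty):
    """Equation (2): variance of the 0/1 response, Pni(1-Pni)."""
    p = rasch_prob(ability, difficulty)
    return p * (1.0 - p)

# Illustrative abilities and difficulties in logits.
for b, d in [(0.0, 0.0), (1.0, 0.0), (-2.0, 0.0)]:
    print(b, d, round(rasch_prob(b, d), 3), round(rasch_variance(b, d), 3))
```

The variance is largest when ability equals difficulty and shrinks toward zero as the two move apart, which is the sense in which information is greatest in the center and least at the extremes.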

The analysis of fit is an analysis of the difference between the observed outcome, Xni, and the outcome expected under the model, i.e., Pni. This residual expressed in the proportion metric is:

(3) xni = Xni - Pni

We can standardize this "metric" residual by its standard deviation to obtain a "statistical" residual

(4) Zni = (Xni - Pni) / sqrt(Pni(1-Pni))

which leads naturally to the familiar statistics for significance tests of fit.

Many factors which might produce departures from the model can be expressed as linear functions of the residual in a logistic, or ability, metric. While it is not possible to compute a logistic residual directly, it is possible to transform the proportion residual into an approximately logistic residual by multiplying it by the derivative of (Bn-Di) with respect to Pni,

(5) d(Bn-Di)/dPni = 1/(Pni(1-Pni))

This approximates the logistic residual as: [1. Unfortunately, the logistic residual does not exist between plus and minus one. The extent to which this influences the analysis will be investigated. However, it is the extreme values which are of greatest interest to us and in any case the variance stabilizing weighted analysis eliminates this discontinuity.]

(6) Yni = (Xni - Pni) / (Pni(1-Pni))

for which

(7) E(Yni) = 0.0

and Var(Yni) = 1/(Pni(1-Pni))

As a result, the natural solution for the problem of heterogeneity of variance becomes weighted least squares where the weights are uniquely defined by the model as

(8) Wni = Pni(1-Pni)
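The three residuals and the weight can be computed together for a single person-item encounter. This is a sketch under the definitions of equations (3), (4), (6) and (8); the function names and the example values are assumptions made for illustration.

```python
import math

def rasch_prob(ability, difficulty):
    """Equation (1)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def residuals(x, ability, difficulty):
    """Return (xni, Zni, Yni, Wni) for one observed response x (0 or 1)."""
    p = rasch_prob(ability, difficulty)
    w = p * (1.0 - p)          # equation (8): weight
    metric = x - p             # equation (3): proportion-metric residual
    z = metric / math.sqrt(w)  # equation (4): standardized residual
    y = metric / w             # equation (6): approximate logistic residual
    return metric, z, y, w

# A person one logit below the item's difficulty who nevertheless succeeds.
metric, z, y, w = residuals(1, 0.0, 1.0)
print(round(metric, 4), round(z, 4), round(y, 4), round(w, 4))
```

Two identities follow from the definitions and are useful checks: Yni^2*Wni equals Zni^2, and Yni*Wni returns the proportion-metric residual xni. The logistic residual never falls strictly between minus one and plus one, since its magnitude is 1/Pni for a success and 1/(1-Pni) for a failure.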

To illustrate the relevance of the logistic residual to the study of test bias, consider the following common situation. An item has been successfully tested and calibrated on an appropriate sample so that a reliable estimate of its difficulty, Di, is available. However, when applied to a new sample from another population, the item turns out to be more difficult and less discriminating. This was found to be the case for several items in an analysis of a reading comprehension test (Wright and Mead, 1976a) which involved a test calibrated on white sixth graders and then administered to a large sample of black students. Items referring to natural science passages were found to be relatively more difficult and less discriminating for blacks.

We can develop these ideas formally by writing the probability of a success in the new sample, for which we suspect the item may be biased, as

(9) P(Xni = 1|Lni) = exp(Lni)/(1+exp(Lni))

If this differs for a new sample when compared to the calibrating sample by a change in difficulty and in discrimination, the probability may also be expressed as:

(10) P(Xni=1|Bn,Di,A0i,A1i) = exp(A0i+A1i(Bn-Di)) / (1+ exp(A0i+A1i(Bn-Di)))

where A0i is an index of the difficulty shift, centered at 0, and A1i is an index of item discrimination, centered at 1.

These probabilities are equal if the exponents are equal, therefore,

(11) Lni = A0i + A1i(Bn-Di) + Eni

The error term, Eni, is included to reflect our belief that the linear model will not explain every outcome exactly. Expression (11) can be rewritten as a residual model:

(12) Lni - (Bn-Di) = A0i + (A1i-1)(Bn-Di) + Eni

The left-hand side is now the difference between the observed logit, Lni, and the model logit, (Bn-Di), which is the residual we have approximated by Yni. Therefore, the regression of Yni on estimates of (Bn-Di) is sensitive to test bias of this form.
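The regression just described can be sketched as a weighted least squares line fit for one item, with Yni as the dependent variable, (bn-di) as the regressor, and Wni as the weight. The code below is an illustration only: the fitting routine is the ordinary closed-form weighted line fit, and the abilities, difficulty, and responses are hypothetical.

```python
import math

def rasch_prob(ability, difficulty):
    """Equation (1)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def weighted_line_fit(t, y, w):
    """Weighted least squares fit of y = a0 + slope * t; returns (a0, slope)."""
    sw = sum(w)
    t_bar = sum(wi * ti for wi, ti in zip(w, t)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (ti - t_bar) ** 2 for wi, ti in zip(w, t))
    sxy = sum(wi * (ti - t_bar) * (yi - y_bar) for wi, ti, yi in zip(w, t, y))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope

def item_bias_regression(responses, abilities, difficulty):
    """Regress logistic residuals on (b - d) for one item in a new sample."""
    t, y, w = [], [], []
    for x, b in zip(responses, abilities):
        p = rasch_prob(b, difficulty)
        t.append(b - difficulty)
        w.append(p * (1.0 - p))
        y.append((x - p) / (p * (1.0 - p)))
    return weighted_line_fit(t, y, w)

# Hypothetical new-sample data for one item calibrated at 0.0 logits.
abilities = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
responses = [0, 0, 0, 0, 1, 0, 1, 1]
shift, slope = item_bias_regression(responses, abilities, 0.0)
print(round(shift, 3), round(slope, 3))
```

The intercept plays the role of A0i and the slope the role of (A1i-1) in expression (12), so values near zero for both are what the model expects of an unbiased item. With so few persons the estimates are of course very unstable; the example only shows the mechanics.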

Tests of Item Fit: The fit of item i to the model for a particular sample of persons can be assessed with the statistic

(13) Vi^2 = Sum (n=1 to N) Zni^2 = Sum (n = 1 to N) Yni^2*Wni

This is approximately chi square distributed with N-1 degrees of freedom (Panchapakesan, 1969; Wright and Panchapakesan, 1969).1 [1. The degrees of freedom are (N-1)(L-1)/L when the same persons are used to calibrate and test the item.] It assesses how similar item i is to the typical item used to represent the variable of interest. Since Vi^2 will be large for both high and low discriminations, the items ultimately selected to define the variable will be the central items and not extremes in either direction.
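Statistic (13) is a sum of squared standardized residuals over the persons who took the item. A minimal sketch follows, taking the degrees of freedom as N-1 as stated in the text; the function names and data are hypothetical.

```python
import math

def rasch_prob(ability, difficulty):
    """Equation (1)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_fit(responses, abilities, difficulty):
    """Equation (13): return (Vi^2, Vi^2 / (N - 1)) for one item."""
    v2 = 0.0
    for x, b in zip(responses, abilities):
        p = rasch_prob(b, difficulty)
        v2 += (x - p) ** 2 / (p * (1.0 - p))   # Zni^2
    return v2, v2 / (len(responses) - 1)

# Hypothetical sample: the last person is able (3 logits) yet fails the item.
abilities = [-1.0, 0.0, 1.0, 3.0]
responses = [0, 1, 1, 0]
v2, mean_square = item_fit(responses, abilities, 0.0)
print(round(v2, 3), round(mean_square, 3))
```

A single surprising response contributes heavily: the able person's failure alone adds exp(3), about 20, to Vi^2, because the squared standardized residual for a failure is Pni/(1-Pni).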

The residual model (12) is a refinement of item fit analysis. When the analysis is performed on all persons in group j (where j designates a new sample), we have

(14) Yni = A0ij + A1ij(bn-di) + Eni

where A0ij is the difficulty shift for item i in sample j,

A1ij is the discrimination index (now centered at 0) for item i in sample j,

and (bn-di) is the estimate of (Bn-Di) from the calibrating sample.

The analysis of variance associated with this model is shown in Table I.

The measurement model specifies that the total weighted residual mean square will have an expectation of one when there are no departures from the model. As a result, this mean square may be used as an overall test of fit for group j on these items. The extent to which its square root exceeds one represents the inflation in the standard error of measurement due to misfit in the group j data. The mean square for "Remainder" also has an expectation of one and its square root is the inflation in the standard error of measurement that would remain if the effects of bias could be removed.

The mean squares for difficulty shift and discrimination variation provide statistical tests for the questions:

1) Do the items maintain their relative difficulties for the new sample regardless of its distribution of ability?

2) Is the scale of measurement implied by the items unchanged in the new sample?

If the answer to either question is no, we would look into the individual A's to determine which items are responsible for the problem. The A's furthest from zero would be the most suspect. Clinical investigation of the implicated items in terms of their content and the characteristics of the sample of persons should help diagnose the source of bias. It might then be possible to discard the data from the defective items and extract fair measures without retesting.

TABLE I
Analysis of Variance for Group j
Source of Variation	Degrees of Freedom	Sums of Squares
Difficulty Shifts	L-1	SSDFj = Sum (i=1 to L) A0ij^2 Sum (n=1 to N) Wni
Discrimination	L-1	SSDSj = Sum (i=1 to L) A1ij^2 Sum (n=1 to N) (bn-b.-di+d.)^2 Wni
Remainder	(nj-3)(L-1)	SSEj = SSRj - SSDFj - SSDSj
Residual	(nj-1)(L-1)	SSRj = Sum (i=1 to L) Sum (n=1 to N) Yni^2*Wni
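The partition in Table I can be sketched in code. The version below makes one assumption that should be stated loudly: within each item the regressor (bn-di) is centered at its weighted mean, so that the shift and discrimination sums of squares are orthogonal and the remainder is never negative. Table I centers at b. and d., which may not be the same thing. The function name and data are hypothetical.

```python
import math

def rasch_prob(ability, difficulty):
    """Equation (1)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def bias_anova(responses, abilities, difficulties):
    """Sums of squares for group j; responses[n][i] is 0/1 for person n, item i."""
    ss_shift = ss_disc = ss_resid = 0.0
    for i, d in enumerate(difficulties):
        t, y, w = [], [], []
        for n, b in enumerate(abilities):
            p = rasch_prob(b, d)
            w.append(p * (1.0 - p))
            y.append((responses[n][i] - p) / (p * (1.0 - p)))
            t.append(b - d)
        sw = sum(w)
        # Assumption: weighted centering within the item (see lead-in).
        t_bar = sum(wi * ti for wi, ti in zip(w, t)) / sw
        a0 = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (ti - t_bar) ** 2 for wi, ti in zip(w, t))
        a1 = sum(wi * (ti - t_bar) * yi for wi, ti, yi in zip(w, t, y)) / sxx
        ss_shift += a0 ** 2 * sw                      # difficulty shifts
        ss_disc += a1 ** 2 * sxx                      # discrimination
        ss_resid += sum(wi * yi ** 2 for wi, yi in zip(w, y))  # total residual
    return {
        "shift": ss_shift,
        "discrimination": ss_disc,
        "remainder": ss_resid - ss_shift - ss_disc,
        "residual": ss_resid,
    }

# Hypothetical group of four persons on two calibrated items.
abilities = [-1.0, 0.0, 1.0, 2.0]
difficulties = [-0.5, 0.5]
responses = [[1, 0], [0, 0], [1, 1], [1, 1]]
table = bias_anova(responses, abilities, difficulties)
for source, ss in table.items():
    print(source, round(ss, 3))
```

Dividing each sum of squares by the degrees of freedom listed in Table I would give the corresponding mean squares.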

One advantage of this approach is the opportunity to evaluate a variety of specific hypotheses about the nature of bias in a rigorous way. We can easily extend the approach of Table I to include data from several different groups and to analyze contrasts among groups. We are equally able to study in a similar straightforward manner any other conception of bias which can be expressed in terms of a linear model. While the analysis in Table I is parallel in many respects to analysis of variance of proportions (properly transformed) for group by item interaction, this approach is more flexible.

Tests of Person Fit: An extremely important feature of this approach is its explicit representation of the residual of each person-item interaction. These unique statistics can be used to analyze person fit in a way completely analogous to the analysis of item fit. This allows us to check for each person tested the quality of the observation obtained from him on each item. We can identify items which were inappropriate for that person, perhaps because they were too difficult and he guessed randomly, or too easy and he responded carelessly, or because they interacted unusually with some unique aspect of his experience.

The summary fit statistic for person n over all L items is

(15) Vn^2 = Sum (i=1 to L) Zni^2 = Sum (i=1 to L) Yni^2*Wni

which is approximately chi square distributed with L-1 degrees of freedom. 1 [1. The degrees of freedom are (L-1)(N-1)/N when persons measured are also used to calibrate items.] It is also possible to perform a linear analysis of residuals for each person similar to that done for items such as:

(16) Yni = A1n(bn-di) + eni, i = 1 to L

for which the constant term is omitted because

(17) Sum (i=1 to L) Yni*Wni = Sum (i=1 to L) xni = 0 for any person n

by definition of bn. Non-zero values of A1n would imply that the person responded to items differently than the typical person in the calibration sample. The analysis of variance is parallel to that shown in Table I without the difficulty shift.
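The person-level analysis of equations (15) to (17) can be sketched as follows. The ability estimate bn is taken here to be the value at which the expected score equals the observed raw score, found by simple bisection; that solver, the function names, and the item difficulties are illustrative assumptions rather than the procedure of the memorandum.

```python
import math

def rasch_prob(ability, difficulty):
    """Equation (1)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def ml_ability(raw_score, difficulties, tol=1e-10):
    """Solve Sum_i P(b, d_i) = raw score by bisection (needs 0 < score < L)."""
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if sum(rasch_prob(mid, d) for d in difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def person_fit(responses, difficulties):
    """Return (Vn^2, A1n, sum of xni) for one person over L items."""
    b = ml_ability(sum(responses), difficulties)
    v2 = num = den = sum_x = 0.0
    for x, d in zip(responses, difficulties):
        p = rasch_prob(b, d)
        w = p * (1.0 - p)
        v2 += (x - p) ** 2 / w          # equation (15)
        num += (b - d) * (x - p)        # Sum of Wni * (bn-di) * Yni
        den += w * (b - d) ** 2
        sum_x += x - p                  # equation (17): should vanish
    return v2, num / den, sum_x         # slope of (16), no constant term

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]          # hypothetical items
orderly = person_fit([1, 1, 1, 0, 0], difficulties)  # easy right, hard wrong
reversed_ = person_fit([0, 0, 1, 1, 1], difficulties)  # easy wrong, hard right
print([round(v, 3) for v in orderly])
print([round(v, 3) for v in reversed_])
```

Both protocols have the same raw score and so the same ability estimate, yet the reversed protocol has a much larger Vn^2 and a negative slope, which is the kind of individual misfit the text proposes to flag.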

This consequence of our approach to bias is important because it enables us to continually monitor the operation of every item for every person. It then becomes theoretically possible to identify items that are "fair" for each person rather than merely tests that are "fair" for every group. The utility of this theoretical possibility will be one of our main points of investigation.

Prospects for Success

The Rasch model, a natural extension of traditional item analysis, has been useful in a wide variety of situations (Passmore, 1975).

Two tests developed by American Guidance Service--KEYMATH (Connolly, Nachtman, and Pritchett, 1971) and Woodcock Reading Mastery Test (Woodcock, 1974)--were built on Rasch principles. This involved not only the selection and calibration of items but also the development of recording forms which relate the tested person's estimated ability, in a criterion way, to specific skills and deficiencies and, in a normative way, to his grade level.

Willmott and Fowles (1974), in connection with the Sixteen Plus Examining Project at the National Foundation for Educational Research of England, applied the model successfully to tests of reading ability, English comprehension, geography, science, mathematics and physics to obtain measurements "superior to those ordinarily available."

Rentz and Bashaw (1975) used the model to equate the seven reading tests in the Anchor Test Study. Their results are equivalent to the more costly and awkward methods used in the "official" equating but required less data, less time and substantially less processing budget.

The success of these studies clearly indicates the utility of the Rasch model. Now there is evidence that the analysis of residuals from the model can be useful in the detection of disturbances like bias. Rasch, in unpublished analyses of intelligence test data which had been shown earlier not to fit his model (Rasch, 1960, pp.80-107), identified the interference of the test administration factor of speed and showed that upon appropriate adjustment for this factor, the residual data did fit his model.

Our own analysis (Wright and Mead, 1976a) of the responses of white, black and Spanish speaking students to a well-known reading comprehension test revealed systematic differences in residuals for the three groups. Among whites, residuals centered around zero with a variance statistically equivalent to that predicted by the model. Among the others, particularly blacks, the residuals tended to be positive for items based on poetry passages and negative for items based on natural science passages, suggesting a cultural bias. This result and preliminary mathematical work on the possibilities for a general analysis of residuals (Wright and Mead, 1976b; Mead, 1976) have convinced us that the potential of this approach for detecting cultural bias is substantial.

Implications

The successful development of the proposed procedures for detecting and correcting bias will have implications for pupil evaluation and measurement research. They will allow practitioners to detect biased items, to identify items that define the intended trait for all groups, and to evaluate the test protocol of every person with respect to bias. This will enable evaluators to recognize individuals (and groups) who are "unfairly" measured by items (or tests) and to extract "fair" measures whenever possible. Only when such procedures are developed and demonstrated to be workable will it become possible to do objective research on the causes and consequences of test bias or to evaluate progress in the acquisition of basic skills.

We believe that success in developing and demonstrating the proposed procedures is very likely because the procedures follow from an explicit measurement model which embraces the traditional requirements of fair measurement.

by Benjamin D. Wright, Ronald Mead, and Robert Draba

MESA Research Memorandum Number 22
MESA PSYCHOMETRIC LABORATORY

October 1976

REFERENCES

Anastasi, A. Psychological testing. (3rd ed.) New York: Macmillan, 1968.

Andersen, E.B. Conditional inference and models for measuring. Copenhagen: Mentalhygiejnisk Forlag, 1973.

Beckwith, L. Constitutional requirements for standardized ability tests used in education. Vanderbilt Law Review, 1973, 26, 789-821.

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F. Lord and M. Novick (Eds.), Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley, 1968.

Blumrosen, A. Strangers in paradise: Griggs v. Duke Power Co. and the concept of employment discrimination. Michigan Law Review, 1972, 71, 59-110.

Bock, R.D. Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 1972, 37, 29-51.

Bowers, J. The comparison of GPA regression equations for regularly admitted and disadvantaged freshmen at the University of Illinois. Journal of Educational Measurement, 1970, 7, 219-225.

Campbell, Joel T., et al. An investigation of sources of bias in the prediction of job performance: A six year study. Final Project Report, Educational Testing Service, Princeton, N.J., 1973.

Cardall, C. and Coffman, W.E. A method for comparing the performance of different groups on the items of a test. Research Bulletin 64-61. Princeton, N.J.: Educational Testing Service, 1964.

Cleary, T.A. Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 1968, 5, 115-124.

Cleary, T.A. and Hilton, T.L. An investigation of item bias. Educational and Psychological Measurement, 1968, 28, 61-75.

Cole, N.S. Bias in selection. Journal of Educational Measurement, 1973, 10, 237-255.

Connolly, A.J., Nachtman, W. and Pritchett, E.M. Keymath: Diagnostic Arithmetic Test. Circle Pines, Minn.: American Guidance Service, 1971.

Darlington, R.B. Another look at "culture fairness". Journal of Educational Measurement, 1971, 8, 71-82.

Davis, A. Social-class influences upon learning. Cambridge: Harvard University Press, 1948.

Durovic, J. Definitions of test bias: A taxonomy and an illustration of an alternative model. Unpublished doctoral dissertation, State University of New York at Albany, 1975.

Echternacht, G. A quick method for determining test bias. Research Bulletin RB-72-17. Princeton, N.J.: Educational Testing Service, 1972.

Einhorn, H.J. and Bass, A.R. Methodology considerations relevant to discrimination in employment testing. Psychological Bulletin, 1971, 75, 261-269.

Green, D.R. What does it mean to say a test is biased? Paper presented at American Educational Research Association, Washington, D.C., 1975.

Green, D.R. and Draper, J.F. Exploratory studies of bias in achievement tests. Paper presented at the American Psychological Association Annual Convention, Honolulu, Hawaii, September 1972.

Harris, J. and Reitzel, J. Negro freshman performance in a predominantly non-Negro university. Journal of College Student Personnel, 1967, 8, 366-368.

Hakstian, A. and Kansup, W. A comparison of several methods of assessing partial knowledge in multiple choice tests: II. Testing procedures. Journal of Educational Measurement, 1975, 12, 231-239.

Hollman, T. Differential validity: A problem with tests or criteria. Paper presented at Midwest Psychological Association, May 1973.

Linn, R.L. and Werts, C.E. Considerations for studies of test bias. Journal of Educational Measurement, 1971, 8, 1-4.

Litaker, R. An investigation of item bias in a language skills examination. Unpublished doctoral dissertation, University of Georgia, 1974.

Lord, F.M. Evaluation with artificial data of a procedure for estimating ability and item characteristic curve parameters. Research Bulletin 75-33. Princeton, N.J.: Educational Testing Service, 1975.

Lord, F.M. An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 1968, 28, 989-1020.

Matuszek, P. and Oakland, T. A factor analysis of several reading readiness measures for different social, economic and ethnic groups. Paper presented at American Educational Research Association, Chicago, Illinois, 1972.

Maw, C. Item response pattern and group differences: An application of the log_e linear model. Unpublished dissertation proposal abstract, University of Chicago, 1976.

Mead, R.J. Assessing the fit of data to the Rasch model. Paper presented at American Educational Research Association, San Francisco, 1976.

Neyman, J. and Scott, E.L. Consistent estimates based on partially consistent observations. Econometrica, 1948, 16, 1-32.

Panchapakesan, N. The simple logistic model and mental measurement. Unpublished doctoral dissertation, University of Chicago, 1969.

Passmore, D.L. Theory and application of Rasch measurement models - a bibliography. Unpublished manuscript, Rochester Institute of Technology, 1975.

Petersen, N.S. and Novick, M.R. An evaluation of some models for test bias. ACT Technical Bulletin No. 23. Iowa City, Iowa: The American College Testing Program, 1974.

Potthoff, R.F. Statistical aspects of the problem of biases in psychological tests (No. 479). Chapel Hill, N.C.: Institute of Statistics Mimeo Series, Department of Statistics, University of North Carolina, August 1966.

Rasch, G. Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut, 1960.

Rasch, G. An individualistic approach to item analysis. In P.F. Lazarsfeld and N.W. Henry (Eds.), Readings in Mathematical Social Sciences. Chicago: Science Research Associates, 1966(a).

Rasch, G. An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 1966(b), 19, 49-57.

Rentz, R.R. and Bashaw, W.L. Equating reading tests with the Rasch model. Athens, Georgia: Educational Resource Laboratory, 1975.

Sadacca, R. and Brackett, J. The validity and discriminatory impact of the Federal Entrance Examination. A report to the Urban Institute, Washington, D.C., 1973.

Scheuneman, J. A new method of assessing bias in test items. Paper presented at American Educational Research Association, Washington, D.C., 1975.

Snedecor, G.W. and Cochran, W.G. Statistical methods. (6th ed.) Ames, Iowa: Iowa State University Press, 1967, p.327.

Temp, G. Validity of the SAT for blacks and whites in thirteen integrated institutions. Journal of Educational Measurement, 1971, 8, 245-251.

Thorndike, R.L. Concepts of culture-fairness. Journal of Educational Measurement, 1971, 8, 63-70.

Tucker, L.R. Maximum validity of a test with equivalent items. Psychometrika, 1946, 11, 1-14.

Veale, J. and Foreman, C. Cultural validity of items and tests: A new approach. (Statistics Unit Measurement Research Center Tech Report 1). Iowa City, Iowa: Westinghouse Learning Corporation, 1975.

Willmott, A. and Fowles, D. The objective interpretation of test performance: The Rasch model applied. Atlantic Highlands, N.J.: NFER Publishing Co., Ltd., 1974.

Woodcock, R.W. Woodcock Reading Mastery Tests. Circle Pines, Minnesota: American Guidance Service, 1974.

Wright, B.D. Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1968.

Wright, B.D. and Panchapakesan, N. A procedure for sample-free item analysis. Educational and Psychological Measurement, 1969, 29, 23-37.

Wright, B.D. and Douglas, G.A. Best test design and self-tailored testing. Research Memorandum No. 19, Statistical Laboratory, Department of Education, University of Chicago, 1975.

Wright, B.D. and Douglas, G.A. Better procedures for sample-free item analysis. Research Memorandum No. 20, Statistical Laboratory, Department of Education, University of Chicago, 1975.

Wright, B.D. and Mead, R.J. Calfit: Sample-free item calibration with a Rasch measurement model. Research Memorandum No. 18, Statistical Laboratory, Department of Education, University of Chicago, 1975.

Wright, B.D. and Mead, R.J. Fit analysis of a reading comprehension test. Prepared for American Educational Research Association Training Presession, San Francisco, 1976a.

Wright, B.D. and Mead, R.J. Analysis of residuals from the Rasch model. Prepared for American Educational Research Association Training Presession, San Francisco, 1976b.

The URL of this page is www.rasch.org/memo22.htm