DIF Detection Examined: Which item has the real Differential Item Functioning?

Differential item functioning (DIF) occurs when an item manifests a different level of difficulty with one group, the focal group, than with another, the reference group. This would seem simple to detect, but as Donoghue & Allen (D&A, 1993) point out, DIF detection requires the analyst to think!

D&A examine the Mantel-Haenszel statistic, a simple odds estimator whose characteristics are identical to Rasch PROX DIF detection when the MH statistic works as intended (RMT 1989 3:2 51-53).

MH aggregates odds estimated from many two-by-two contingency tables, one table for each possible raw score on a test when "thin matching" is used. Each table has four cells: counts of subjects in reference and focal groups who succeed and fail on the item. D&A discover that MH malfunctions when cell counts are low. In fact, MH is incalculable when any cell count is zero. To avoid this, D&A investigate combining contingency tables (accumulating cell counts) into score group strata, i.e., "thick matching". Unfortunately, they are unable to specify a clearly superior thick matching method. Their endeavors remind us that the mindless application of any method invites confusion.
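The aggregation described above can be sketched in a few lines. This is a minimal illustration, assuming the standard Mantel-Haenszel common odds ratio (the sum of A·D/N over the sum of B·C/N across tables). The function names, the tuple layout, and the toy counts are hypothetical and are not taken from D&A.

```python
import math

def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio.

    Each table is (ref_right, ref_wrong, focal_right, focal_wrong) for one
    raw score (thin matching) or one score stratum (thick matching).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty table carries no information
        num += a * d / n
        den += b * c / n
    return num / den

def pool(tables, width):
    """Thick matching: add the cell counts of `width` adjacent tables."""
    pooled = []
    for start in range(0, len(tables), width):
        group = tables[start:start + width]
        pooled.append(tuple(sum(cells) for cells in zip(*group)))
    return pooled

# Toy counts for two adjacent raw scores (made up for illustration).
thin = [(20, 10, 10, 10), (40, 10, 20, 10)]
thick = pool(thin, width=2)

print(mh_odds_ratio(thin), mh_odds_ratio(thick))
# Log of the odds ratio gives the DIF size in logits;
# a positive value means the item favors the reference group.
print(math.log(mh_odds_ratio(thin)))
```

Pooling leaves the estimate unchanged here only because both toy tables were built with the same odds ratio. With real data the choice of strata can move the estimate.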

Rasch DIF detection is also based on odds estimation, but has the advantage that no matching or stratifying is needed. But Rasch techniques do require the analyst to think!
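To show what "based on odds estimation" can look like without any score matching, here is a rough sketch. It uses centered log-odds of failure as a crude stand-in for an item calibration; it is not a Rasch estimation routine, and the function names, the 0.5 count adjustment, and the toy responses are all assumptions made for illustration.

```python
import math

def log_odds_calibrations(responses):
    """Centered log-odds item difficulties from a people-by-items 0/1 matrix."""
    n_people = len(responses)
    n_items = len(responses[0])
    raw = []
    for i in range(n_items):
        right = sum(person[i] for person in responses)
        wrong = n_people - right
        # 0.5 is added to each count so extreme items stay finite (an assumption).
        raw.append(math.log((wrong + 0.5) / (right + 0.5)))
    centre = sum(raw) / n_items
    return [d - centre for d in raw]

def dif_by_group(responses, groups):
    """Focal-minus-reference difference in calibration for each item."""
    ref = [r for r, g in zip(responses, groups) if g == "ref"]
    focal = [r for r, g in zip(responses, groups) if g == "focal"]
    ref_cal = log_odds_calibrations(ref)
    focal_cal = log_odds_calibrations(focal)
    return [f - r for r, f in zip(ref_cal, focal_cal)]

# Toy data: two items, four people per group (made up).
responses = [[1, 1], [1, 1], [1, 0], [0, 0],   # reference group
             [1, 0], [1, 0], [1, 0], [0, 1]]   # focal group
groups = ["ref"] * 4 + ["focal"] * 4

dif = dif_by_group(responses, groups)
print(dif)  # positive values: relatively harder for the focal group
```

Because each set of calibrations is centered on its own mean, the differences in this sketch always sum to zero.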

Construct a test from well-behaved items. Give the test to a reference group who perform in the intended way, and a focal group for whom some items may exhibit some degree of DIF. Calibrate the items on the two groups. Plot the item difficulty calibrations against each other. The slope of the best-fit line is not 1! The dispersion of the focal calibrations is greater than the reference calibrations! As Richard Smith pointed out to me, DIF affects item dispersion.

This mystified me. I simulated four data sets of 100 items and 500 people. One-fourth of the people were in the focal group. One-fourth of the items exhibited DIF against them. The DIF introduced in the four data sets was 0.5, 1.0, 1.5, and 2.0 logits respectively. The generating item difficulties were distributed uniformly on [-2, 2], resulting in an expected item standard deviation of about 1.2 logits for the reference group. The Table shows the standard deviations of the generating (Gen.) and estimated (Est.) item difficulties for the four data sets.

	Standard Deviations of Item Difficulties (logits)
	Reference Group		Focal Group
DIF	Gen.	Est.	Gen.	Est.
0.5	1.17	1.16	1.18	1.24
1.0	1.17	1.16	1.23	1.25
1.5	1.17	1.15	1.32	1.39
2.0	1.17	1.17	1.43	1.54

While the reference group calibrations do not spread out any more than the generated item difficulties, the focal group difficulties spread out considerably. The reason is immediately apparent. DIF provides additional difficulty to some items and so is an extra source of difficulty variance, inflating the item variance for the focal group.
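The inflation can be checked with a little arithmetic. If a proportion p of the items is shifted by d logits, and the shift is uncorrelated with item difficulty, the focal-group variance equals the reference-group variance plus p(1-p)d². The sketch below compares that expression with a direct calculation. It uses evenly spaced difficulties as a stand-in for the uniform draw and puts DIF on every fourth item; both are assumptions, since the item assignment in the original simulation is not given.

```python
import math
import statistics

def focal_sd(ref_difficulties, dif_flags, dif_size):
    """Population SD of the difficulties after adding dif_size to flagged items."""
    shifted = [b + dif_size * f for b, f in zip(ref_difficulties, dif_flags)]
    return statistics.pstdev(shifted)

n_items = 100
step = 4 / (n_items - 1)
ref = [-2 + step * i for i in range(n_items)]              # evenly spaced on [-2, 2]
flags = [1 if i % 4 == 0 else 0 for i in range(n_items)]   # every fourth item has DIF

ref_sd = statistics.pstdev(ref)
p = sum(flags) / n_items                                   # proportion of DIF items

for dif_size in (0.5, 1.0, 1.5, 2.0):
    direct = focal_sd(ref, flags, dif_size)
    # Variance added by DIF when the shift is uncorrelated with difficulty.
    approx = math.sqrt(ref_sd ** 2 + p * (1 - p) * dif_size ** 2)
    print(f"DIF {dif_size}: direct {direct:.2f}, approximation {approx:.2f}")
```

The two columns differ slightly because, in this particular layout, the flagged items are not perfectly uncorrelated with difficulty.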

If the number of people in the focal group is small, so that item calibration standard errors are large, then DIF may not appear as statistically significant item difficulty shifts. But it will still be seen as a difference in the dispersion of the two sets of item difficulties.
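A small sketch shows why a small focal group can hide DIF from a significance test. It assumes the usual t-like statistic for a difference between two calibrations (the difference divided by the joint standard error) and uses the lower bound 2/sqrt(N) for the standard error of a dichotomous item, which is reached only when every person is exactly on target. The function names and group sizes are hypothetical.

```python
import math

def min_standard_error(n_people):
    """Smallest possible SE (logits) of a dichotomous item calibration.

    Each response contributes at most 0.25 of information, so SE >= 2 / sqrt(N).
    """
    return 2 / math.sqrt(n_people)

def dif_t(focal_difficulty, ref_difficulty, focal_se, ref_se):
    """Difference between two calibrations divided by its joint standard error."""
    joint_se = math.sqrt(focal_se ** 2 + ref_se ** 2)
    return (focal_difficulty - ref_difficulty) / joint_se

# Hypothetical case: 0.5 logits of DIF, 30 focal people, 300 reference people.
t = dif_t(0.5, 0.0, min_standard_error(30), min_standard_error(300))
print(round(t, 2))
```

Even with these best-case standard errors, the statistic is about 1.3, short of the conventional 2.0, although the half-logit shift is real.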

As the number of DIF-affected items increases, the test items form two clusters: items with DIF and items without DIF. When comparisons are made with "average" item difficulties, it will seem that some items now exhibit DIF against the focal group, and some DIF against the reference group. If the groups are large enough, both sets of items will be reported as significantly biased! Where's the DIF?

This effect was reported years ago when calculators were introduced into classrooms. The focal group had calculators and found simple math problems easier. Items were reported to have DIF both against and in favor of the focal group. A little thought, however, reveals that virtually all DIF must be in favor of the focal group, and against the reference group!

So I performed another simulation: 100 items and 250 people, in which 50% of the items exhibit 1.5 logits of DIF against the focal group, which contains 50% of the people. The standard deviation of the item difficulties for the reference group was 0.6 logits, for the focal group 1.0 logits. I have plotted the familiar control line plot with the identity line drawn through the average of the item difficulties. Notice that in addition to the DIF in the upper left quadrant, some items lie outside the control lines in the bottom right and show DIF in favor of the focal group!

In the second plot, the analyst has thought about the substance of the test and repositioned the identity line through the mean of the items chosen to manifest no DIF. Control lines are redrawn accordingly. This plot shows clearly two sets of items: DIF items in the upper left, and items without DIF within the control lines. This simple stratagem is beyond even the most thoughtful MH analysis.
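Repositioning the identity line amounts to changing the reference point from which DIF is measured. The sketch below illustrates this with four made-up calibrations; the function name and the numbers are hypothetical, and the choice of anchor items stands in for the analyst's substantive judgment.

```python
from statistics import mean

def dif_contrasts(ref, focal, anchor=None):
    """Focal-minus-reference contrasts after lining up the two calibrations.

    anchor=None lines them up on the mean of all items;
    otherwise on the mean of the item indices listed in anchor.
    """
    idx = list(range(len(ref))) if anchor is None else list(anchor)
    shift = mean(focal[i] for i in idx) - mean(ref[i] for i in idx)
    return [f - r - shift for r, f in zip(ref, focal)]

# Made-up calibrations, each set centered on its own mean.
# Items 0 and 1 have no DIF; items 2 and 3 carry 1.5 logits of DIF
# against the focal group.
ref = [-0.5, 0.5, -0.5, 0.5]
focal = [-1.25, -0.25, 0.25, 1.25]

by_all_items = dif_contrasts(ref, focal)
by_anchor = dif_contrasts(ref, focal, anchor=[0, 1])
print(by_all_items)  # identity line through the mean of all items
print(by_anchor)     # identity line through the mean of the no-DIF items
```

With the line through the mean of all items, the contrasts are -0.75, -0.75, 0.75, 0.75, so the unaffected items appear to favor the focal group. With the line through the two anchor items, the contrasts are 0, 0, 1.5, 1.5.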

Donoghue J.R., Allen N.L. (1993) Thin versus thick matching in the Mantel-Haenszel procedure for detecting DIF. Journal of Educational Statistics 18(2) 131-154.

Later note: This suggests that the maxim, "The item with the largest DIF is the one with the real (as opposed to artificial) DIF" (Andrich & Hagquist, 2012), though useful, is not definitive. The maxim implies that there is a unique reference point from which to measure DIF. We can see that different choices of reference point will result in different items having the largest DIF, and so in different items being declared to have real DIF.

Andrich D., Hagquist C. (2012) Real and Artificial Differential Item Functioning. Journal of Educational and Behavioral Statistics 37 387-416.

Real and Artificial DIF detection examined: Which item has the real Differential Item Functioning? Luppescu S. … Rasch Measurement Transactions, 1993, 7:2 p.285-6
