Sample Size and Item Calibration or Person Measure Stability

How big a sample is necessary to obtain usefully stable item calibrations?
Or how long a test is necessary to obtain usefully stable person measure estimates?

The Rasch model is blind to what is a person and what is an item, so the numbers are the same.

With a small sample, Rasch analysis has the same limitations as any other statistical analysis:
1. Less precise estimates (bigger standard errors)
2. Less powerful fit analysis
3. Less robust estimates (more likely that accidents in the data will distort them).

Each time we calibrate a set of items on different samples of similar examinees, we expect slightly different results. In principle, as the size of the samples increases, the differences become smaller. If each sample were only 2 or 3 examinees, results could be very unstable. If each sample were 2,000 or 3,000 examinees, results might be essentially identical, provided no other sources of error are in action. But large samples are expensive and time-consuming. What is the minimum sample that gives useful item calibrations, that is, calibrations that we can expect to be similar enough to maintain a useful level of measurement stability?

Polytomies
The extra concern with polytomies is that you need at least 10 observations per category; see, for instance, Linacre J.M. (2002) Understanding Rasch measurement: Optimizing rating scale category effectiveness. Journal of Applied Measurement 3:1, 85-106, or Linacre J.M. (1999) Investigating rating scale category utility. Journal of Outcome Measurement 3:2, 103-122. For the Andrich Rating-Scale Model (in which all items share the same rating scale), this requirement is almost always met. For the Masters Partial-Credit Model (in which each item defines its own rating scale), 100 responses per item may be too few. Otherwise, the actual sample sizes could be smaller than with dichotomies because there is more information in each polytomous observation.
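The 10-observations-per-category guideline can be checked by simple counting before any estimation. The following is a minimal sketch, not a Winsteps feature: the function name `sparse_categories` and the example data are hypothetical. Counting per item corresponds to the Partial-Credit case; pooling all items into one list corresponds to the Rating-Scale case.

```python
from collections import Counter

def sparse_categories(responses_by_item, categories, minimum=10):
    """Return {item: {category: count}} for every category observed
    fewer than `minimum` times (hypothetical helper, for illustration)."""
    flagged = {}
    for item, responses in responses_by_item.items():
        counts = Counter(responses)
        # Counter returns 0 for categories that were never observed.
        low = {c: counts[c] for c in categories if counts[c] < minimum}
        if low:
            flagged[item] = low
    return flagged

# Hypothetical data: two 3-category items, 100 responses each.
data = {
    "A": [0] * 40 + [1] * 45 + [2] * 15,
    "B": [0] * 60 + [1] * 36 + [2] * 4,
}

# Partial-Credit view: each item has its own rating scale.
per_item = sparse_categories(data, categories=[0, 1, 2])

# Rating-Scale view: all items share one scale, so pool the responses.
pooled = sparse_categories({"all items": data["A"] + data["B"]},
                           categories=[0, 1, 2])

print("Per item:", per_item)
print("Pooled:  ", pooled)
```

In this made-up example, item B has only 4 observations in its top category when counted alone, yet the pooled counts all exceed 10, which is the pattern the paragraph above describes.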

Person Measure Estimate Stability
The requirements are symmetric for the Rasch model, so you need as many items for a stable person measure as you need persons for a stable item measure. Consequently, 30 items administered to 30 persons (with reasonable targeting and fit) should produce statistically stable measures (±1.0 logits, 95% confidence). This is confirmed by Azizan et al. (2020).

The first step is to clarify "similar enough." Just as no person has a height stable to within .01 or even .1 inches, no item has a difficulty stable to within .01 or even .1 logits. In fact, stability to within ±.3 logits is the best that can be expected for most variables. Lee (RMT 6:2 p.222-3) reports that in many applications a change of one logit corresponds to an advance of one grade level. So when an item calibration is stable to within a logit, the item will be targeted at the correct grade level.

For groups of items, Wright & Douglas (Best Test Design and Self-Tailored Testing, MESA Memo. 19, 1975) report that, when calibrations deviate in a random way from their optimal values, "as test length increases above 30 items, virtually no reasonable testing situation risks a measurement bias [for the examinees] large enough to notice." For even shorter tests, measures based on item calibration with random deviations up to 0.5 logits are "for all practical purposes free from bias."

Theoretically, the stability of an item calibration is its modelled standard error. For a sample of N examinees that is reasonably targeted on the items and that responds to the test as intended, average item p-values are in the range 0.5 to 0.87, so modelled item standard errors are in the range 2/sqrt(N) < SE < 3/sqrt(N) (Wright & Stone, Best Test Design, p.136), i.e., 4/SE² < N < 9/SE². The lower end of the range applies when the sample is targeted on items with a 40%-60% success rate, the higher end when success rates are more extreme than 15% or 85%. As a rule of thumb, at least 8 correct responses and 8 incorrect responses are needed for reasonable confidence that an item calibration is within 1 logit of a stable value.
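The constants 2 and 3 in that range can be checked numerically. The sketch below is a simplification: it assumes every one of the N examinees has the same success probability p on the item, so that the modelled standard error is 1/sqrt(N·p·(1-p)). The function names are illustrative and do not come from Wright & Stone.

```python
import math

def item_se(n, p):
    """Modelled S.E. (logits) of a dichotomous item calibration, assuming
    all n examinees have the same success probability p on the item."""
    return 1.0 / math.sqrt(n * p * (1.0 - p))

def sample_size_for(se, p):
    """Sample size that yields the target S.E. under the same assumption."""
    return 1.0 / (se ** 2 * p * (1.0 - p))

# The multiplier of 1/sqrt(N) depends only on the p-value.
for p in (0.5, 0.6, 0.85, 0.87):
    multiplier = 1.0 / math.sqrt(p * (1.0 - p))
    print(f"p = {p:.2f}: SE = {multiplier:.2f} / sqrt(N)")
```

Under this assumption the multiplier is 2 at p = 0.5 and close to 3 at p = 0.87, which matches the two ends of the stated range.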

What, then, is the sample size needed to have 99% confidence that no item calibration is more than 1 logit away from its stable value?

A two-tailed 99% confidence interval is ±2.6 S.E. For this interval to be ±1 logit, the S.E. must be 1/2.6 logits. This gives a minimum sample in the range 4*(2.6)² < N < 9*(2.6)², i.e., 27 < N < 61, depending on targeting. Thus, a sample of 50 well-targeted examinees is conservative for obtaining useful, stable estimates. 30 examinees is enough for well-designed pilot studies. The Table suggests other ranges. Inflate these sample sizes by 10%-40% if there are major sources of unmodelled measurement disturbance, such as different testing conditions or alternative curricula.
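The same arithmetic can be repeated for other precisions and confidence levels. This is a minimal sketch with a hypothetical function name. It uses z = 2.6 for 99% as in the text, and assumes z = 2.0 for 95% and rounding to the nearest whole examinee, because those choices reproduce the ranges printed in the Table.

```python
def sample_size_range(half_width, z):
    """Return (best-targeting, poor-targeting) minimum sample sizes for
    calibrations stable within +/- half_width logits, using the bounds
    4/SE^2 < N < 9/SE^2 with SE = half_width / z."""
    se = half_width / z
    return round(4.0 / se ** 2), round(9.0 / se ** 2)

# z values are assumptions chosen to match the article's rounded figures.
for half_width in (1.0, 0.5):
    for label, z in (("95%", 2.0), ("99%", 2.6)):
        low, high = sample_size_range(half_width, z)
        print(f"+/- {half_width} logit, {label}: {low} to {high}")
```

With the exact normal quantiles (1.96 and 2.58) the bounds come out slightly smaller, so the rounded z values err on the conservative side.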

If much larger samples are conveniently available, divide them into smaller, homogeneous samples of males, females, young, old, etc. in order to check the stability of item calibrations in different measuring situations.
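One simple way to carry out such a check is to calibrate the items separately in each subsample and compare the results item by item. The sketch below assumes the calibrations have already been estimated elsewhere. The function names, the tolerance, and the example values are hypothetical. Each set of calibrations is centered on the mean of the common items so that both share the same origin.

```python
def centered(calibrations):
    """Shift calibrations so that their mean is zero."""
    mean = sum(calibrations.values()) / len(calibrations)
    return {item: d - mean for item, d in calibrations.items()}

def unstable_items(calib_a, calib_b, tolerance=1.0):
    """Return {item: difference} for common items whose centered
    calibrations differ by more than `tolerance` logits."""
    common = sorted(set(calib_a) & set(calib_b))
    a = centered({i: calib_a[i] for i in common})
    b = centered({i: calib_b[i] for i in common})
    return {i: b[i] - a[i] for i in common if abs(b[i] - a[i]) > tolerance}

# Hypothetical calibrations (logits) from two subsamples.
group_1 = {"Q1": -1.0, "Q2": 0.0, "Q3": 1.0}
group_2 = {"Q1": -1.5, "Q2": -0.5, "Q3": 2.0}

print(unstable_items(group_1, group_2, tolerance=0.75))
```

The tolerance should reflect the level of stability that matters for the application, for example the 0.5 or 1 logit values discussed above.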

Small sample size? You can certainly perform useful exploratory work using Rasch analysis with a small sample. One of the foundational books in Rasch analysis, "Best Test Design" (Wright & Stone, 1979), is based on the analysis of a sample of 35 children and 18 items. The problem is not Rasch analysis; the problem is that a small sample is small for any type of definitive statistical analysis. However, one way of strengthening your findings is to analyze your data, and then simulate 100 datasets using the measures estimated from your data (using, for instance, the Winsteps "simulate data" option). Then analyze the 100 datasets. You can then draw the distributions of the crucial statistics in the 100 datasets and locate your dataset among them. The closer your empirical dataset is to the center of the distribution of the 100, the more believable are your findings.
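The simulation idea can be sketched outside Winsteps for the dichotomous case. The code below is an illustration under stated assumptions, not the Winsteps procedure: the person and item measures are made up, the "empirical" dataset is itself a simulated stand-in, the chosen statistic (variance of person raw scores) is only a placeholder for whatever statistics are crucial in your study, and no re-estimation of measures is performed.

```python
import math
import random
from statistics import pvariance

def simulate_dichotomous(person_measures, item_difficulties, rng):
    """Simulate one persons x items matrix of 0/1 responses under the
    dichotomous Rasch model: P(success) = 1 / (1 + exp(-(b - d)))."""
    data = []
    for b in person_measures:
        row = []
        for d in item_difficulties:
            p_success = 1.0 / (1.0 + math.exp(-(b - d)))
            row.append(1 if rng.random() < p_success else 0)
        data.append(row)
    return data

def raw_score_variance(data):
    """Placeholder statistic: variance of the person raw scores."""
    return pvariance([sum(row) for row in data])

def percentile_rank(observed, simulated_values):
    """Proportion of simulated values below the observed value,
    counting ties as half."""
    below = sum(v < observed for v in simulated_values)
    ties = sum(v == observed for v in simulated_values)
    return (below + 0.5 * ties) / len(simulated_values)

rng = random.Random(2024)
# Hypothetical measures standing in for estimates from a real analysis.
persons = [rng.gauss(0.0, 1.0) for _ in range(35)]
items = [-2.0 + 4.0 * i / 17 for i in range(18)]
# Stand-in for the empirical dataset; replace with the observed responses.
empirical = simulate_dichotomous(persons, items, rng)

simulated = [raw_score_variance(simulate_dichotomous(persons, items, rng))
             for _ in range(100)]
rank = percentile_rank(raw_score_variance(empirical), simulated)
print(f"Empirical statistic sits at percentile rank {rank:.2f}")
```

A percentile rank near 0.5 places the empirical dataset near the center of the simulated distribution; a rank near 0 or 1 suggests the data behave differently from what the estimated measures predict.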

Question: How can I justify to a Journal editor that we used IRT/Rasch analysis with a sample size of only 200 when the Journal expects at least 1,000?

Answer: The least number of participants depends on the IRT method you are using. For Rasch, it is 30 participants - see table. In many medical applications for obscure diseases and afflictions, researchers have trouble finding even 30 patients (good!), but their research is published because doing something for those 30 patients is much better than doing nothing! So, your explanation to the Journal editor must emphasize that the social and other benefits of your study outweigh the small statistical deficiencies. In fact, we would be surprised if the findings with 1,000 participants differed noticeably from the findings with 200 participants.

Item Calibrations or person measures stable within	Confidence	Minimum sample size range (best to poor targeting)	Size for most purposes
± 1 logit	95%	16 † -- 36	30 (minimum for dichotomies)
± 1 logit	99%	27 † -- 61	50 (minimum for polytomies)
± ½ logit	95%	64 -- 144	100*
± ½ logit	99%	108 -- 243	150 ‡
Definitive or High Stakes	99%+ (Items)	250 -- 20*test length	250
Adverse Circumstances	Robust	450 upwards	500

Azizan, N. H., Mahmud, Z., & Rambli, A. (2020). Rasch Rating Scale Item Estimates using Maximum Likelihood Approach: Effects of Sample Size on the Accuracy and Bias of the Estimates. International Journal of Advanced Science and Technology, Vol. 29, No. 4s, pp. 2526-2531.

Bamber, D., & van Santen, J. P. H. (1985). How many parameters can a model have and still be testable? Journal of Mathematical Psychology, 29, 443-73.
This gives a rule for when a statistical model over-parameterizes the data. It looks like the rule approximates "number of free parameters must be < square-root (number of data-points)". Since we are usually more concerned about item fit than person fit, there must be more persons than items. So, for a 100-item dichotomous test, we would need a sample of 100+ persons.

Wright, B. D. & Douglas, G. A. (1975). Best Test Design and Self-Tailored Testing. Research Memorandum No.19, Statistical Laboratory, Department of Education, University of Chicago
"They allow the test designer to incur item discrepancies, that is item calibration errors, as large as 1.0. This may appear unnecessarily generous, since it permits use of an item of difficulty 2.0, say, when the design calls for 1.0, but it is offered as an upper limit because we found a large area of the test design domain to be exceptionally robust with respect to independent item discrepancies."

Wright, B. D. & Douglas, G. A. (1976). Rasch item analysis by hand. Research Memorandum No. 21, Statistical Laboratory, Department of Education, University of Chicago
"In other work we have found that when [test length] is greater than 20, random values of [item calibration] as high as 0.50 have negligible effects on measurement."

Wright, B. D. & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational & Psychological Measurement 29 1 23-48

Wright, B. D. & Stone, M. H. (1979). Best Test Design, p.98 - "random uncertainty of less than .3 logits," referencing MESA Memo 19: Best Test Design and Self-Tailored Testing.
The .3 logits value also appears in Solving Measurement Problems with the Rasch Model. Journal of Educational Measurement 14 (2) pp. 97-116, Summer 1977 (and MESA Memo 42).

Sample Size and Item Calibration Stability. Linacre JM. … Rasch Measurement Transactions, 1994, 7:4 p.328

The Rasch Measurement SIG (AERA) thanks the Institute for Objective Measurement for inviting the publication of Rasch Measurement Transactions on the Institute's website, www.rasch.org.
