# Sample Size and Item Calibration [or Person Measure] Stability

How big a sample is necessary to obtain usefully stable item calibrations?
Or how long a test is necessary to obtain usefully stable person measure estimates?

The Rasch model is blind to what is a person and what is an item, so the numbers are the same.

Rasch is the same as any other statistical analysis with a small sample:
1. Less precise estimates (bigger standard errors)
2. Less powerful fit analysis
3. Less robust estimates (more likely that accidents in the data will distort them).

Each time we calibrate a set of items on different samples of similar examinees, we expect slightly different results. In principle, as the size of the samples increases, the differences become smaller. If each sample were only 2 or 3 examinees, results could be very unstable. If each sample were 2,000 or 3,000 examinees, results might be essentially identical, provided no other sources of error are in action. But large samples are expensive and time-consuming. What is the minimum sample that gives useful item calibrations: calibrations that we can expect to be similar enough to maintain a useful level of measurement stability?

## Polytomies

The extra concern with polytomies is that at least 10 observations per category are needed; see, for instance, Linacre, J.M. (2002). Understanding Rasch measurement: Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106, or Linacre, J.M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103-122.

For the Andrich Rating-Scale Model (in which all items share the same rating scale), this requirement is almost always met.

For the Masters Partial-Credit Model (in which each item defines its own rating scale), 100 responses per item may be too few.

Otherwise the actual sample sizes could be smaller than with dichotomies because there is more information in each polytomous observation.

## Person Measure Estimate Stability

The requirements are symmetric for the Rasch model, so you need as many items for a stable person measure as you need persons for a stable item measure. Consequently, 30 items administered to 30 persons (with reasonable targeting and fit) should produce statistically stable measures (±1.0 logits, 95% confidence). This is confirmed by Azizan et al. (2020).

The first step is to clarify "similar enough." Just as no person has a height stable to within .01 or even .1 inches, no item has a difficulty stable to within .01 or even .1 logits. In fact, stability to within ±.3 logits is the best that can be expected for most variables. Lee (RMT 6:2 p.222-3) finds that in many applications a one-logit change corresponds to a one-grade-level advance. So when an item calibration is stable to within a logit, it is targeted at the correct grade level.

For groups of items, Wright & Douglas (Best Test Design and Self-Tailored Testing, MESA Memo. 19, 1975) report that, when calibrations deviate in a random way from their optimal values, "as test length increases above 30 items, virtually no reasonable testing situation risks a measurement bias [for the examinees] large enough to notice." For even shorter tests, measures based on item calibration with random deviations up to 0.5 logits are "for all practical purposes free from bias."
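The Wright & Douglas robustness claim can be illustrated numerically: perturb a set of item calibrations by random errors of up to ±0.5 logits and see how much a person's estimated measure shifts. This is a minimal sketch, not their method; the item bank, the ability value, and all function names here are invented for illustration.

```python
import math
import random

def expected_score(theta, difficulties):
    """Expected raw score for ability theta on dichotomous Rasch items."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def theta_for_score(target, difficulties, lo=-10.0, hi=10.0):
    """Ability whose expected raw score equals `target` (bisection search)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

random.seed(1)
true_b = [-2.0 + 4.0 * i / 29 for i in range(30)]           # 30 items, -2..+2 logits
miscal_b = [b + random.uniform(-0.5, 0.5) for b in true_b]  # random calibration errors

theta = 0.5                                    # a person's "true" ability
score = expected_score(theta, true_b)          # expected score under true difficulties
theta_hat = theta_for_score(score, miscal_b)   # measure using miscalibrated items
print(f"measurement bias: {theta_hat - theta:+.3f} logits")
```

With 30 items, the random errors largely cancel, so the bias is typically a small fraction of a logit, consistent with "virtually no reasonable testing situation risks a measurement bias large enough to notice."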

Theoretically, the stability of an item calibration is its modelled standard error. For a sample of N examinees that is reasonably targeted at the items and that responds to the test as intended, average item p-values are in the range 0.5 to 0.87, so that modelled item standard errors are in the range 2/sqrt(N) < SE < 3/sqrt(N) (Wright & Stone, Best Test Design, p.136), i.e., 4/SE² < N < 9/SE². The lower end of the range applies when the sample is targeted on items with a 40%-60% success rate, the higher end when the sample obtains success rates more extreme than 15% or 85%. As a rule of thumb, at least 8 correct responses and 8 incorrect responses are needed for reasonable confidence that an item calibration is within 1 logit of a stable value.
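The two-way relation between sample size and standard error can be sketched directly from these bounds (the function names are my own, not from the article):

```python
import math

def se_range(n):
    """Modelled item S.E. range for a sample of n examinees:
    2/sqrt(n) (well targeted) to 3/sqrt(n) (poorly targeted)."""
    return 2 / math.sqrt(n), 3 / math.sqrt(n)

def n_range(se):
    """Sample size range needed to reach a given S.E.: 4/SE^2 to 9/SE^2."""
    return 4 / se ** 2, 9 / se ** 2

lo, hi = se_range(100)       # e.g. a sample of 100 examinees
print(f"SE with N=100: {lo:.2f} to {hi:.2f} logits")   # 0.20 to 0.30
n_lo, n_hi = n_range(0.385)  # the S.E. needed for 99% confidence within 1 logit
print(f"N for SE=0.385: {n_lo:.0f} to {n_hi:.0f}")     # about 27 to 61
```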

What, then, is the sample size needed to have 99% confidence that no item calibration is more than 1 logit away from its stable value?

A two-tailed 99% confidence interval is ±2.6 S.E. wide. For a ±1 logit interval, this S.E. is ±1/2.6 logits. This gives a minimum sample in the range 4×(2.6)² < N < 9×(2.6)², i.e., 27 < N < 61, depending on targeting. Thus, a sample of 50 well-targeted examinees is conservative for obtaining useful, stable estimates. 30 examinees is enough for well-designed pilot studies. The Table suggests other ranges. Inflate these sample sizes by 10%-40% if there are major sources of unmodelled measurement disturbance, such as different testing conditions or alternative curricula.
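The sample-size ranges in the Table follow from this one calculation, repeated for each precision and confidence level. A minimal sketch (using z = 2.0 for 95% and z = 2.6 for 99%, as the article does):

```python
Z = {"95%": 2.0, "99%": 2.6}   # z-values as used in the article

def sample_range(half_width, confidence):
    """Minimum sample (best targeting, poor targeting) so that an item
    calibration is within ±half_width logits at the given confidence."""
    se = half_width / Z[confidence]              # required standard error
    return round(4 / se ** 2), round(9 / se ** 2)

for w, c in [(1.0, "95%"), (1.0, "99%"), (0.5, "95%"), (0.5, "99%")]:
    lo, hi = sample_range(w, c)
    print(f"±{w} logit, {c} confidence: {lo} -- {hi}")
```

This reproduces the Table's ranges: 16--36, 27--61, 64--144, and 108--243.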

If much larger samples are conveniently available, divide them into smaller, homogeneous samples of males, females, young, old, etc. in order to check the stability of item calibrations in different measuring situations.

Small sample size? You can certainly perform useful exploratory work using Rasch analysis with a small sample. One of the foundational books in Rasch analysis, "Best Test Design" (Wright & Stone, 1979), is based on the analysis of a sample of 35 children and 18 items. The problem is not Rasch analysis; the problem is that a small sample is small for any type of definitive statistical analysis, Rasch or otherwise. However, one way of strengthening your findings is to analyze your data, then simulate 100 datasets using the measures estimated from your data (using, for instance, the Winsteps "simulate data" option), and then analyze the 100 datasets. You can then draw the distributions of the crucial statistics in the 100 datasets and locate your dataset among them. The closer your empirical dataset is to the center of the distribution of the 100, the more believable are your findings.
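The mechanics of this simulate-and-locate check can be sketched without Winsteps. This is a toy illustration, not the Winsteps procedure: the person and item measures are invented stand-ins for estimates from a real (small) dataset, and the statistic chosen here (variance of person raw scores) is arbitrary.

```python
import math
import random

random.seed(7)

def simulate(persons, items):
    """One simulated dichotomous dataset under the Rasch model."""
    return [[1 if random.random() < 1 / (1 + math.exp(-(t - b))) else 0
             for b in items] for t in persons]

def statistic(data):
    """Any statistic of interest; here, the variance of person raw scores."""
    scores = [sum(row) for row in data]
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)

# Measures as if estimated from the empirical dataset (invented here):
persons = [random.gauss(0, 1) for _ in range(35)]
items = [random.gauss(0, 1) for _ in range(18)]

sim_stats = sorted(statistic(simulate(persons, items)) for _ in range(100))
empirical = statistic(simulate(persons, items))   # stand-in for the real data
rank = sum(s < empirical for s in sim_stats)
print(f"empirical statistic falls at rank {rank} of 100 simulated values")
```

A rank near 50 places the empirical dataset in the center of the simulated distribution; a rank near 0 or 100 flags it as atypical of data that fit the estimated measures.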

Question: How can I justify to a Journal editor that we used IRT/Rasch analysis with a sample size of only 200 when the Journal expects at least 1,000?

Answer: The minimum number of participants depends on the IRT method you are using. For Rasch, it is 30 participants - see the Table. In many medical applications for obscure diseases and afflictions, researchers have trouble finding even 30 patients (good!), but their research is published because doing something for those 30 patients is much better than doing nothing! So, your explanation to the Journal editor must emphasize that the social and other benefits of your study outweigh the small statistical deficiencies. In fact, we would be surprised if the findings with 1,000 participants differed noticeably from the findings with 200 participants.

| Item calibrations (or person measures) stable within | Confidence | Minimum sample size range (best to poor targeting) | Size for most purposes |
|---|---|---|---|
| ± 1 logit | 95% | 16 † -- 36 | 30 (minimum for dichotomies) |
| ± 1 logit | 99% | 27 † -- 61 | 50 (minimum for polytomies) |
| ± ½ logit | 95% | 64 -- 144 | 100 |
| ± ½ logit | 99% | 108 -- 243 | 150 |
| Definitive or High Stakes | 99%+ (items) | 250 -- 20×(test length) | 250 |
| Adverse Circumstances | Robust | 450 upwards | 500 |

John Michael Linacre

Explanatory notes:
1. † Peter Kruyen (2012) "Using short tests and questionnaires for making decisions about individuals: When is Short too Short?" proposes that "as a general rule test users should strive for using tests containing at least 20 items to ensure that decisions about individuals can be made with sufficient certainty." https://www.nwo.nl/en/news-and-events/news/2012/short-questionnaires-not-suitable-for-individual-diagnostics.html. The same rule would apply for making decisions about items.
2. "For a ±1 logit interval this S.E. is ±1/2.6 logits."
An estimate's S.E. is the modelled standard deviation of the normal distribution of the observed estimate around its "true" value. Suppose we want to be 99% confident that the "true" item difficulty is within 1 logit of its reported estimate. Then the estimate needs to have a standard error of 1.0 logits divided by 2.6, i.e., 1/2.6 = 0.385 logits, or less.
3. "This gives a minimum sample in the range 4*(2.6)² < N < 9*(2.6)²"
With optimum targeting of a dichotomous test, the modelled probability of each response is p = 0.5. The modelled binomial variance, 0.5 × 0.5 = 0.25, is the information in a response, so N perfectly targeted observations have information N × 0.25 = N/4. This means that the S.E. of an estimate produced by N perfectly targeted observations is S.E. = sqrt(4/N).
Similarly, for N extremely off-target observations (on a reasonable dichotomous test), p = 0.13 or p = 0.87. For these, the modelled binomial variance, 0.13 × 0.87 ≈ 1/9, is the information in a response, so N extremely off-target observations have information ≈ N/9. This means that the S.E. of an estimate produced by N extremely off-target observations is S.E. = sqrt(9/N).
So, for N observations, the minimum S.E. is sqrt(4/N) and a reasonable maximum S.E. is sqrt(9/N). The range of minimum N that produces an S.E. of 0.385 logits or better, regardless of targeting, is given by sqrt(4/N) = 1/2.6 for the best case (lower limit) and sqrt(9/N) = 1/2.6 for the worst reasonable case (upper limit), i.e., 4×(2.6)² < N < 9×(2.6)².
4. One observation of a polytomous item (or of a person administered a polytomy) ≈ (number of categories − 1) dichotomous observations.
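Notes 3 and 4 can be checked numerically. This is a sketch with invented function names: the S.E. formula follows directly from the binomial information p(1−p) per response, and the polytomy rule converts observation counts to their rough dichotomous equivalents.

```python
import math

def calibration_se(n, p):
    """S.E. of an item calibration from n responses with success probability p.
    Information per response = binomial variance = p * (1 - p),
    so S.E. = 1 / sqrt(n * p * (1 - p))."""
    return 1.0 / math.sqrt(n * p * (1.0 - p))

def dichotomous_equivalent(n_observations, n_categories):
    """Note 4: one polytomous observation carries roughly the information
    of (number of categories - 1) dichotomous observations."""
    return n_observations * (n_categories - 1)

n = 27  # the best-case minimum sample for ±1 logit at 99% confidence
print(f"best case  (p = 0.50): SE = {calibration_se(n, 0.50):.3f}")  # ≈ 1/2.6
print(f"worst case (p = 0.13): SE = {calibration_se(n, 0.13):.3f}")
print(f"27 five-category observations ≈ {dichotomous_equivalent(27, 5)} dichotomies")
```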

Azizan, N. H., Mahmud, Z., & Rambli, A. (2020). Rasch rating scale item estimates using maximum likelihood approach: Effects of sample size on the accuracy and bias of the estimates. International Journal of Advanced Science and Technology, 29(4s), 2526-2531.

Bamber, D., & van Santen, J. P. H. (1985). How many parameters can a model have and still be testable? Journal of Mathematical Psychology, 29, 443-73.
This gives a rule for when a statistical model over-parameterizes the data. It looks like the rule approximates "number of free parameters must be < square-root (number of data-points)". Since we are usually more concerned about item fit than person fit, there must be more persons than items. So, for a 100-item dichotomous test, we would need a sample of 100+ persons.

Wright, B. D. & Douglas, G. A. (1975). Best Test Design and Self-Tailored Testing. Research Memorandum No.19, Statistical Laboratory, Department of Education, University of Chicago
"They allow the test designer to incur item discrepancies, that is item calibration errors, as large as 1.0. This may appear unnecessarily generous, since it permits use of an item of difficulty 2.0, say, when the design calls for 1.0, but it is offered as an upper limit because we found a large area of the test design domain to be exceptionally robust with respect to independent item discrepancies."

Wright, B. D. & Douglas, G. A. (1976). Rasch item analysis by hand. Research Memorandum No. 21, Statistical Laboratory, Department of Education, University of Chicago
"In other work we have found that when [test length] is greater than 20, random values of [item calibration] as high as 0.50 have negligible effects on measurement."

Wright, B. D. & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational & Psychological Measurement 29 1 23-48

Wright, B. D. & Stone M. H. (1979). Best Test Design, p.98 - "random uncertainty of less than .3 logits," referencing MESA Memo 19: Best Test and Self-Tailored Testing.
Also .3 logits in Solving Measurement Problems with the Rasch Model. Journal of Educational Measurement 14 (2) pp. 97-116, Summer 1977 (and MESA Memo 42)

Sample Size and Item Calibration Stability. Linacre JM. … Rasch Measurement Transactions, 1994, 7:4 p.328
