Sample Size and Item Calibration [or Person Measure] Stability

How big a sample is necessary to obtain usefully stable item calibrations?
Or how long a test is necessary to obtain usefully stable person measure estimates?

The Rasch model is blind to what is a person and what is an item, so the answer to both questions is the same number.

Rasch analysis faces the same problems as any other statistical analysis with a small sample:
1. Less precise estimates (larger standard errors).
2. Less powerful fit analysis.
3. Less robust estimates (accidents in the data are more likely to distort them).

Each time we calibrate a set of items on different samples of similar examinees, we expect slightly different results. In principle, as the size of the samples increases, the differences become smaller. If each sample were only 2 or 3 examinees, the results could be very unstable. If each sample were 2,000 or 3,000 examinees, the results might be essentially identical, provided no other sources of error were in action. But large samples are expensive and time-consuming. What is the minimum sample that gives useful item calibrations: calibrations that we can expect to be similar enough to maintain a useful level of measurement stability?

Polytomies

The extra concern with polytomies is that you need at least 10 observations per category; see, for instance, Linacre, J.M. (2002). Understanding Rasch measurement: Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85-106, or Linacre, J.M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103-122.

For the Andrich Rating-Scale Model (in which all items share the same rating scale), this requirement is almost always met.

For the Masters Partial-Credit Model (in which each item defines its own rating scale), even 100 responses per item may be too few.

Otherwise the actual sample sizes can be smaller than for dichotomies, because each polytomous observation carries more information. A simple category-count check is sketched below.
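
A minimal sketch of such a check, assuming responses are held in a persons-by-items NumPy array of integer categories (the data, the 0-3 category range, and the threshold of 10 are all illustrative, not part of the original analysis):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical response matrix: rows = persons, columns = items,
    # entries = rating-scale categories 0-3 (all names and sizes illustrative).
    responses = rng.integers(0, 4, size=(150, 10))

    # Andrich Rating-Scale Model: all items share one scale, so pool across items.
    print("pooled category counts:", np.bincount(responses.ravel(), minlength=4))

    # Masters Partial-Credit Model: each item has its own scale, so check per item.
    for item in range(responses.shape[1]):
        counts = np.bincount(responses[:, item], minlength=4)
        thin = np.flatnonzero(counts < 10)
        if thin.size:
            print(f"item {item}: categories {thin.tolist()} have fewer than 10 observations")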

Person Measure Estimate Stability

The requirements are symmetric for the Rasch model, so you need as many items for a stable person measure as you need persons for a stable item measure. Consequently, 30 items administered to 30 persons (with reasonable targeting and fit) should produce statistically stable measures.

The first step is to clarify "similar enough." Just as no person has a height stable to within .01 or even .1 inches, no item has a difficulty stable to within .01 or even .1 logits. In fact, stability to within ±.3 logits is the best that can be expected for most variables. Lee (RMT 6:2, p. 222-3) found that in many applications a change of one logit corresponds to an advance of one grade level. So an item calibration that is stable to within one logit is targeted at the correct grade level.

For groups of items, Wright & Douglas (Best Test Design and Self-Tailored Testing, MESA Memo. 19, 1975) report that, when calibrations deviate in a random way from their optimal values, "as test length increases above 30 items, virtually no reasonable testing situation risks a measurement bias [for the examinees] large enough to notice." For even shorter tests, measures based on item calibrations with random deviations of up to 0.5 logits are "for all practical purposes free from bias."

Theoretically, the stability of an item calibration is its modelled standard error. For a sample of N examinees that is reasonably targeted at the items and that responds to the test as intended, average item p-values are in the range 0.5 to 0.87, so that modelled item standard errors are in the range 2/sqrt(N) < S.E. < 3/sqrt(N) (Wright & Stone, Best Test Design, p. 136), i.e., 4/S.E.² < N < 9/S.E.². The lower end of the range applies when the sample is targeted on the items with a 40%-60% success rate; the higher end applies when the sample's success rates are as extreme as 15% or 85%. As a rule of thumb, at least 8 correct responses and 8 incorrect responses are needed for reasonable confidence that an item calibration is within 1 logit of a stable value.

What, then, is the sample size needed to have 99% confidence that no item calibration is more than 1 logit away from its stable value?

A two-tailed 99% confidence interval is ±2.6 S.E. wide. For a ±1 logit interval, this S.E. is 1/2.6 logits. This gives a minimum sample in the range 4*(2.6)² < N < 9*(2.6)², i.e., 27 < N < 61, depending on targeting. Thus, a sample of 50 well-targeted examinees is conservative for obtaining useful, stable estimates; 30 examinees is enough for well-designed pilot studies. The Table below suggests other ranges; the short computation after this paragraph reproduces its arithmetic. Inflate these sample sizes by 10%-40% if there are major sources of unmodelled measurement disturbance, such as different testing conditions or alternative curricula.
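
As a check on that arithmetic, here is a minimal sketch in Python (the function name and the rounding convention are mine; the text uses a normal deviate of 2 for 95% confidence and 2.6 for 99%):

    def sample_size_range(half_width, z):
        """Minimum sample N for an item S.E. of (half_width / z) logits,
        using the text's bounds 4/S.E.**2 < N < 9/S.E.**2 (best to poor targeting)."""
        se = half_width / z
        return round(4 / se**2), round(9 / se**2)

    # The table's rows (z = 2 for 95% confidence, z = 2.6 for 99%):
    print(sample_size_range(1.0, 2.0))   # (16, 36)
    print(sample_size_range(1.0, 2.6))   # (27, 61)
    print(sample_size_range(0.5, 2.0))   # (64, 144)
    print(sample_size_range(0.5, 2.6))   # (108, 243)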

If much larger samples are conveniently available, divide them into smaller homogeneous samples (males, females, young, old, etc.) in order to check the stability of the item calibrations across different measuring situations.

Small sample size? You can certainly perform useful exploratory work with Rasch analysis on a small sample. One of the foundational books in Rasch analysis, Best Test Design (Wright & Stone, 1979), is based on the analysis of a sample of 35 children and 18 items. The problem is not Rasch analysis; the problem is that a small sample is small for any type of definitive statistical analysis. However, one way of strengthening your findings is to analyze your data, then simulate 100 datasets from the measures estimated from your data (using, for instance, the Winsteps "simulate data" option), and analyze the 100 datasets. You can then plot the distributions of the crucial statistics across the 100 datasets and locate your own dataset among them. The closer your empirical dataset is to the center of the distribution, the more believable are your findings. A sketch of this procedure follows.
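
A minimal sketch of this procedure for dichotomous data, done here with NumPy rather than Winsteps (the person measures, item difficulties, and the chosen statistic are illustrative placeholders for the estimates from your own analysis):

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholders for the measures estimated from your own data
    # (e.g. person measures and item difficulties exported from Winsteps):
    theta = rng.normal(0.0, 1.0, size=35)   # person measures, in logits
    b = np.linspace(-2.0, 2.0, 18)          # item difficulties, in logits

    def simulate(theta, b, rng):
        """One dichotomous dataset generated from the Rasch model."""
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        return (rng.random(p.shape) < p).astype(int)

    def statistic(data):
        """A stand-in 'crucial statistic': the S.D. of the person raw scores."""
        return data.sum(axis=1).std()

    sims = np.array([statistic(simulate(theta, b, rng)) for _ in range(100)])
    observed = statistic(simulate(theta, b, rng))  # in practice: your empirical dataset

    # Locate the empirical value among the 100 simulated ones.
    print(f"simulated: mean {sims.mean():.2f}, S.D. {sims.std():.2f}; observed: {observed:.2f}")
    print(f"percentile of observed: {(sims < observed).mean() * 100:.0f}")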

Item calibrations       Confidence      Minimum sample size range     Size for most purposes
stable within                           (best to poor targeting)
--------------------------------------------------------------------------------------------
± 1 logit               95%             16† -- 36                     30 (minimum for dichotomies)
± 1 logit               99%             27† -- 61                     50 (minimum for polytomies)
± ½ logit               95%             64 -- 144                     100
± ½ logit               99%             108 -- 243                    150
Definitive or           99%+ (items)    250 -- 20*(test length)       250
high stakes
Adverse circumstances   Robust          450 upwards                   500

John Michael Linacre

    Explanatory notes:
  1. † Peter Kruyen (2012), "Using short tests and questionnaires for making decisions about individuals: When is short too short?", proposes that "as a general rule test users should strive for using tests containing at least 20 items to ensure that decisions about individuals can be made with sufficient certainty." http://www.nwo.nl/en/news-and-events/news/2012/short-questionnaires-not-suitable-for-individual-diagnostics.html The same rule would apply to making decisions about items.
  2. "For a ±1 logit interval this S.E. is ±1/2.6 logits."
    An estimate's standard S.E. is the modelled standard deviation of the normal distribution of the observed estimate around its "true" value. Suppose we want to be 99% confident that the "true" item difficulty is within 1 logit of its reported estimate. Then the estimate needs to have a standard error of 1.0 logits divided by 2.6 or less = 1/2.6 = 0.385 logits.
  3. "This gives a minimum sample in the range 4*(2.6)² < N < 9*(2.6)²"
    With optimum targeting of a dichotomous test, the modeled probability of each response is p=0.5. Then the modeled binomial variance = 0.5*0.5 = the information in a response. Thus N perfectly targeted observations have information N * 0.5 * 0.5 = N/4. This means that the S.E. of an estimate produced by N perfectly targeted observations is S.E. = sqrt(4/N)
    Similarly, for N extremely off-target observations (for a reasonable dichotomous test), p=0.13 or p=0.87. For these, the modeled binomial variance = 0.13*0.87 = the information in a response. N extremely off-target observations have information N * 0.13 * 0.87 = N/9. This means that the S.E. of an estimate produced by N perfectly targeted observations is S.E. = sqrt(9/N)
    So, for N observations, the minimum S.E. is sqrt (4/N) and a reasonable maximum S.E. is sqrt(9/N). So, the minimum range of N to produce an S.E. of 0.385 logits or better regardless of targeting is sqrt(4/N) = 1/2.6 for the best case (lower limit)
    and
    sqrt(9/N) = 1/2.6 for the worst reasonable case (upper limit)
    i.e., 4*(2.6)² < N < 9*(2.6)*sup2; is the range of minimum values of N to produce the desired S.E. (or better).
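
As a numeric check of note 3, here is a minimal sketch (the choice of N = 61, the 99%-confidence poor-targeting sample size from the table, is just an example):

    import math

    N = 61  # e.g. the 99%-confidence, poor-targeting sample size from the table

    # Best targeting: p = 0.5, so each response carries variance 0.5*0.5 = 0.25.
    print(math.sqrt(1 / (N * 0.5 * 0.5)), math.sqrt(4 / N))    # identical: S.E. = sqrt(4/N)

    # Extreme off-targeting: p = 0.13 or 0.87, variance 0.13*0.87 = 0.1131, about 1/9.
    print(math.sqrt(1 / (N * 0.13 * 0.87)), math.sqrt(9 / N))  # nearly equal: S.E. ≈ sqrt(9/N)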


Wright, B.D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29(1), 23-48.

Wright, B.D., & Douglas, G.A. (1975). Best test design and self-tailored testing. MESA Memorandum No. 19. Chicago: Department of Education, University of Chicago.

Wright, B.D., & Douglas, G.A. (1976). Rasch item analysis by hand. Research Memorandum No. 21. Chicago: Statistical Laboratory, Department of Education, University of Chicago.

Wright & Douglas (1976), "Rasch Item Analysis by Hand": "In other work we have found that when [test length] is greater than 20, random values of [item calibration] as high as 0.50 have negligible effects on measurement."

Wright & Douglas (1975) "Best Test Design and Self-Tailored Testing": "They allow the test designer to incur item discrepancies, that is item calibration errors, as large as 1.0. This may appear unnecessarily generous, since it permits use of an item of difficulty 2.0, say, when the design calls for 1.0, but it is offered as an upper limit because we found a large area of the test design domain to be exceptionally robust with respect to independent item discrepancies."

Wright & Stone (1979), Best Test Design, p. 98: "random uncertainty of less than .3 logits," referencing Wright & Douglas (1975), Best Test Design and Self-Tailored Testing, MESA Memorandum No. 19. Also .3 logits in Wright, B.D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116 (and MESA Memo 42).


Bamber, D., & van Santen, J.P.H. (1985). How many parameters can a model have and still be testable? Journal of Mathematical Psychology, 29, 443-473.
This gives a rule for when a statistical model over-parameterizes the data. The rule approximates "the number of free parameters must be less than the square root of the number of data points." Since we are usually more concerned about item fit than person fit, the parameters of concern are the item difficulties: a test of L items given to N persons yields N*L data points, so the rule requires L < sqrt(N*L), i.e., N > L. So, for a 100-item dichotomous test, we would need a sample of 100+ persons, as the check below illustrates.
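
A minimal sketch of that check (the function and the example numbers are my illustration, not Bamber & van Santen's formulation):

    import math

    def testable(n_free_parameters, n_data_points):
        """Approximate testability rule, as paraphrased above: the number of
        free parameters should be below sqrt(number of data points)."""
        return n_free_parameters < math.sqrt(n_data_points)

    # 100-item dichotomous test: the parameters of concern are the 100 item
    # difficulties, and N persons supply 100*N data points, so N must exceed 100.
    print(testable(100, 100 * 100))  # False: exactly 100 persons is not enough
    print(testable(100, 100 * 101))  # True: 101 persons satisfies the rule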


Sample Size and Item Calibration Stability. Linacre JM. … Rasch Measurement Transactions, 1994, 7:4 p.328


