How big a sample is necessary to obtain usefully stable item calibrations?
Or how long a test is necessary to obtain usefully stable person measure estimates?
The Rasch model is blind to what is a person and what is an item, so the numbers are the same.
A Rasch analysis of a small sample has the same limitations as any other statistical analysis of a small sample:
1. Less precise estimates (bigger standard errors)
2. Less powerful fit analysis
3. Less robust estimates (more likely that accidents in the data will distort them).
Each time we calibrate a set of items on different samples of similar examinees, we expect slightly different results. In principle, as the size of the samples increases, the differences become smaller. If each sample were only 2 or 3 examinees, results could be very unstable. If each sample were 2,000 or 3,000 examinees, results might be essentially identical, provided no other sources of error are in action. But large samples are expensive and time-consuming. What is the minimum sample to give useful item calibrations, that is, calibrations that we can expect to be similar enough to maintain a useful level of measurement stability?
Polytomies 

The extra concern with polytomies is that you need at least 10 observations per category. See, for instance, Linacre J.M. (2002) Understanding Rasch measurement: Optimizing rating scale category effectiveness. Journal of Applied Measurement 3:1, 85-106, or Linacre J.M. (1999) Investigating rating scale category utility. Journal of Outcome Measurement 3:2, 103-122. For the Andrich Rating-Scale Model (in which all items share the same rating scale), this requirement is almost always met. For the Masters Partial-Credit Model (in which each item defines its own rating scale), 100 responses per item may be too few. Otherwise the actual sample sizes could be smaller than with dichotomies because there is more information in each polytomous observation.
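The 10-observations-per-category guideline can be checked by simple counting. The following is a minimal sketch, not a feature of any particular Rasch program: the function name `sparse_categories` and the response data are hypothetical, chosen only for illustration.

```python
from collections import Counter

MIN_PER_CATEGORY = 10  # guideline quoted in the text above


def sparse_categories(responses, categories, minimum=MIN_PER_CATEGORY):
    """Return {category: count} for categories observed fewer than `minimum` times."""
    counts = Counter(responses)
    return {c: counts.get(c, 0) for c in categories if counts.get(c, 0) < minimum}


# Hypothetical responses to one 4-category item (scored 0..3) from 40 persons.
item_responses = [0] * 4 + [1] * 12 + [2] * 15 + [3] * 9
print(sparse_categories(item_responses, categories=range(4)))
```

Under a Partial-Credit analysis the check would be run on each item separately; under a Rating-Scale analysis the responses to all items sharing the scale could be pooled before counting, which is why the requirement is then much easier to meet.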
Person Measure Estimate Stability 
The requirements are symmetric for the Rasch model so you need as many items for a stable person measure as you need persons for a stable item measure. Consequently, 30 items administered to 30 persons (with reasonable targeting and fit) should produce statistically stable measures. 
The first step is to clarify "similar enough." Just as no person has a height stable to within .01 or even .1 inches, no item has a difficulty stable to within .01 or even .1 logits. In fact, stability to within ±.3 logits is the best that can be expected for most variables. Lee (RMT 6:2 p.222-3) discovers that in many applications one logit change corresponds to one grade level advance. So when an item calibration is stable within a logit, it will be targeted at a correct grade level.
For groups of items, Wright & Douglas (Best Test Design and Self-Tailored Testing, MESA Memo. 19, 1975) report that, when calibrations deviate in a random way from their optimal values, "as test length increases above 30 items, virtually no reasonable testing situation risks a measurement bias [for the examinees] large enough to notice." For even shorter tests, measures based on item calibration with random deviations up to 0.5 logits are "for all practical purposes free from bias."
Theoretically, the stability of an item calibration is its modelled standard error. For a sample of N examinees that is reasonably targeted at the items and that responds to the test as intended, average item p-values are in the range 0.5 to 0.87, so that modelled item standard errors are in the range 2/sqrt(N) < SE < 3/sqrt(N) (Wright & Stone, Best Test Design, p.136), i.e., 4/SE^2 < N < 9/SE^2. The lower end of the range applies when the sample is targeted on items with 40%-60% success rate, the higher end when the sample obtains success rates more extreme than 15% or 85% success. As a rule of thumb, at least 8 correct responses and 8 incorrect responses are needed for reasonable confidence that an item calibration is within 1 logit of a stable value.
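The multipliers 2 and 3 in the standard-error range can be recovered from the p-values under a simplifying assumption that is not stated in the text: every examinee is treated as having the same success probability p on the item, so the item's statistical information is N*p*(1-p) and its standard error is the reciprocal square root of that. The function name `item_se` is hypothetical; this is a sketch of the arithmetic only.

```python
import math


def item_se(n_persons, p_success):
    """Approximate modelled S.E. (logits) of a dichotomous item calibration.

    Assumption for illustration: all persons share the same success
    probability on the item, so information = N * p * (1 - p).
    """
    information = n_persons * p_success * (1.0 - p_success)
    return 1.0 / math.sqrt(information)


for p in (0.5, 0.87):
    multiplier = item_se(1, p)  # equals SE * sqrt(N)
    print(f"p = {p:.2f}: SE is about {multiplier:.2f}/sqrt(N)")
```

With p = 0.5 the multiplier is exactly 2, and with p = 0.87 it is just under 3, which agrees with the quoted range 2/sqrt(N) < SE < 3/sqrt(N).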
What, then, is the sample size needed to have 99% confidence that no item calibration is more than 1 logit away from its stable value?
A two-tailed 99% confidence interval is ±2.6 S.E. wide. For a ±1 logit interval, this S.E. is 1/2.6 logits. This gives a minimum sample in the range 4*(2.6)^2 < N < 9*(2.6)^2, i.e., 27 < N < 61, depending on targeting. Thus, a sample of 50 well-targeted examinees is conservative for obtaining useful, stable estimates. 30 examinees is enough for well-designed pilot studies. The Table suggests other ranges. Inflate these sample sizes by 10%-40% if there are major sources of unmodelled measurement disturbance, such as different testing conditions or alternative curricula.
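The same arithmetic can be applied to other stability targets. The sketch below assumes the bounds 4/SE^2 < N < 9/SE^2 given earlier, a multiplier of 2.6 for 99% confidence as in the text, and a multiplier of 2.0 for 95% confidence (an assumption inferred from the Table's 16 and 36, not stated in the text). The function name `sample_size_range` is hypothetical, and results are rounded to the nearest whole examinee as in the text.

```python
def sample_size_range(half_width_logits, z):
    """Return (best-targeting N, poor-targeting N) for calibrations to lie
    within +/- half_width_logits at the confidence level implied by z."""
    se = half_width_logits / z          # largest acceptable standard error
    return round(4 / se**2), round(9 / se**2)


# (half-width in logits, confidence label, z multiplier)
targets = [(1.0, "95%", 2.0), (1.0, "99%", 2.6), (0.5, "95%", 2.0), (0.5, "99%", 2.6)]
for half_width, label, z in targets:
    low, high = sample_size_range(half_width, z)
    print(f"+/-{half_width} logit at {label}: {low} to {high} examinees")
```

Under these assumptions the four ranges come out as 16 to 36, 27 to 61, 64 to 144, and 108 to 243, matching the first four rows of the Table.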
If much larger samples are conveniently available, divide them into smaller, homogeneous samples of males, females, young, old, etc. in order to check the stability of item calibrations in different measuring situations.
Small sample size? You can certainly perform useful exploratory work using Rasch analysis with a small sample. One of the foundational books in Rasch analysis, "Best Test Design" (Wright & Stone, 1979), is based on the analysis of a sample of 35 children and 18 items. The problem is not Rasch analysis, the problem is that a small sample is small for any type of definitive statistical analysis. There would be the same problem with any other type of statistical analysis. However, one way of strengthening your findings is to analyze your data, and then simulate 100 datasets using the measures estimated from your data (using, for instance, the Winsteps "simulate data" option). Then analyze the 100 datasets. You can then draw the distributions of the crucial statistics in the 100 datasets and locate your dataset among them. The closer your empirical dataset is to the center of the distribution of the 100, the more believable are your findings.
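The simulation check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Winsteps procedure: the person and item measures are made-up stand-ins for estimates from a real analysis, the "empirical" dataset is itself simulated as a placeholder, and the variance of person raw scores stands in for whichever crucial statistic matters in a given study. A fuller version would re-estimate the Rasch measures in each simulated dataset; that step is omitted here. All function names are hypothetical.

```python
import math
import random


def simulate_rasch(person_measures, item_difficulties, rng):
    """One simulated dichotomous dataset: rows are persons, columns are items."""
    data = []
    for b in person_measures:
        row = []
        for d in item_difficulties:
            p = 1.0 / (1.0 + math.exp(-(b - d)))  # Rasch success probability
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


def raw_score_variance(data):
    """Stand-in 'crucial statistic': variance of person raw scores."""
    scores = [sum(row) for row in data]
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def percentile_among(observed, simulated):
    """Percentage of simulated values at or below the observed value."""
    return 100.0 * sum(v <= observed for v in simulated) / len(simulated)


rng = random.Random(12345)
# Hypothetical measures standing in for estimates from an empirical analysis.
persons = [rng.gauss(0.0, 1.0) for _ in range(35)]
items = [-2.0 + 4.0 * i / 17 for i in range(18)]
empirical = simulate_rasch(persons, items, rng)  # placeholder for real data
observed = raw_score_variance(empirical)

simulated = [raw_score_variance(simulate_rasch(persons, items, rng)) for _ in range(100)]
pct = percentile_among(observed, simulated)
print(f"observed statistic falls at percentile {pct:.0f} of the 100 simulations")
```

A percentile near 50 places the empirical dataset near the center of the simulated distribution; a percentile near 0 or 100 suggests the data depart from what the estimated measures would be expected to produce.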
Item Calibrations stable within | Confidence | Minimum sample size range (best to poor targeting) | Size for most purposes
± 1 logit | 95% | 16 - 36 | 30 (minimum for dichotomies)
± 1 logit | 99% | 27 - 61 | 50 (minimum for polytomies)
± ½ logit | 95% | 64 - 144 | 100
± ½ logit | 99% | 108 - 243 | 150
Definitive or High Stakes | 99%+ (Items) | 250 - 20*test length | 250
Adverse Circumstances | Robust | 450 upwards | 500
John Michael Linacre
Wright B & Panchapakesan N 1969. A procedure for sample-free item analysis. Educational & Psychological Measurement 29(1), 23-48.
Wright B & Douglas G 1975. Best test design and self-tailored testing. MESA Memorandum No. 19. Department of Education, Univ. of Chicago.
Wright B & Douglas G 1976. Rasch item analysis by hand. Research Memorandum No. 21, Statistical Laboratory, Department of Education, Univ. of Chicago.
Wright & Douglas (1976) "Rasch Item Analysis by Hand": "In other work we have found that when [test length] is greater than 20, random values of [item calibration] as high as 0.50 have negligible effects on measurement."
Wright & Douglas (1975) "Best Test Design and Self-Tailored Testing": "They allow the test designer to incur item discrepancies, that is item calibration errors, as large as 1.0. This may appear unnecessarily generous, since it permits use of an item of difficulty 2.0, say, when the design calls for 1.0, but it is offered as an upper limit because we found a large area of the test design domain to be exceptionally robust with respect to independent item discrepancies."
Wright & Stone (1979) "Best Test Design" p.98: "random uncertainty of less than .3 logits," referencing MESA Memo 19: Best Test Design and Self-Tailored Testing, Benjamin D. Wright & Graham A. Douglas, 1975. Also .3 logits in Solving Measurement Problems with the Rasch Model. Journal of Educational Measurement 14(2), pp. 97-116, Summer 1977 (and MESA Memo 42).
Bamber D & van Santen JPH 1985. How many parameters can a model have and still be testable? Journal of Mathematical Psychology 29, 443-473.
This gives a rule for when a statistical model overparameterizes the data. It looks like the rule approximates "number of free parameters must be < square-root (number of datapoints)".
Since we are usually more concerned about item fit than person fit, there must be more persons than items. So, for a 100-item dichotomous test, we would need a sample of 100+ persons.
Sample Size and Item Calibration Stability. Linacre JM. … Rasch Measurement Transactions, 1994, 7:4 p.328