Because candidate sample sizes vary across certification
exams, this study explores the impact of small sample sizes. The results show that while item calibration
is less accurate with a small sample, candidate separation reliability can be acceptable regardless
of the candidate sample size.
Mary E. Lunz, Ph.D.
Comparison of Item Performance with Large and Small Samples
A candidate population of fewer than 50 is usually
considered a small sample for multiple choice exams. Indeed, the concept of the multiple choice
exam was developed to test large samples of candidates effectively and
efficiently. However, the multiple
choice format has become so well accepted that it is now often applied to small
as well as large samples. To
better understand the impact of candidate sample size on test item performance,
a test with a large sample of over 1,000 candidates and a test with a small
sample of fewer than 20 candidates were analyzed using the Rasch model. The criteria for reviewing item performance
were 1) item separation reliability, 2) item discrimination, and 3) the error
associated with the calibrated difficulty of each item.
Item separation reliability is an indication of
the reproducibility of the item difficulties.
High item separation reliability means that there is a high probability
that items will maintain the same difficulty estimates across
examinations. For the large sample test,
the item separation reliability was 1.00, while for the small sample test it was
.75, indicating that the test item variable is less well defined when a small
sample is used to calibrate the items.
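
The source does not give the formula behind these reliabilities. A minimal sketch, assuming the common Rasch convention that separation reliability is the share of observed variance in the calibrations that is not attributable to measurement error, might look like the following. The function name and the difficulty and error values are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the study's data.

```python
from statistics import mean, pvariance


def separation_reliability(measures, standard_errors):
    """Share of observed variance in the measures that is not measurement error.

    Assumed convention: (observed variance - mean error variance) / observed variance.
    """
    observed_var = pvariance(measures)
    error_var = mean([se ** 2 for se in standard_errors])
    if observed_var <= 0:
        return 0.0
    # Clamp at zero: error variance can exceed observed variance in tiny samples.
    return max(0.0, (observed_var - error_var) / observed_var)


# Hypothetical item difficulties in logits (population variance is exactly 1.0).
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]

# Same items, two hypothetical levels of calibration error.
reliability_precise = separation_reliability(difficulties, [0.06] * 5)
reliability_noisy = separation_reliability(difficulties, [0.5] * 5)

print(round(reliability_precise, 2), round(reliability_noisy, 2))
```

With the spread of difficulties held fixed, larger standard errors alone pull the reliability down, which is the direction of the difference reported between the large and small samples.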
An important factor in item assessment is the
discrimination capability of each test item. It is sometimes difficult to
interpret discrimination calculated from small samples because chance
occurrences may affect a candidate's response to an item. For example, an able
candidate may get an easy item incorrect by, say, reading the item too quickly
and overlooking the correct option, or by misreading the
item. Alternatively, a less able candidate may get a difficult item correct,
perhaps because they happen to be familiar with the particular fact the item
asks about, or because they make a lucky guess.
Item discrimination for the small sample ranged from -.44 to .65 with an
average of .19. Item discrimination for the large sample ranged from -.12 to
.51 with an average of .22. Thus, the
pattern of item discrimination for large and small samples is different in
range, but fairly similar on average.
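
The source does not say which discrimination index was used. As an illustration only, the sketch below assumes a point-biserial correlation between each candidate's score on the item (0 or 1) and their total test score, and shows how a single chance response can move the index in a sample of ten. All names and scores are hypothetical.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((x - mean_item) * (y - mean_total)
              for x, y in zip(item_scores, total_scores))
    ss_item = sum((x - mean_item) ** 2 for x in item_scores)
    ss_total = sum((y - mean_total) ** 2 for y in total_scores)
    if ss_item == 0 or ss_total == 0:
        return float("nan")  # undefined when everyone gives the same response
    return cov / sqrt(ss_item * ss_total)


# Ten hypothetical candidates, ordered from least to most able.
totals = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

# The item cleanly separates the four weakest candidates from the rest.
clean_responses = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Same item, but the most able candidate misreads it and answers incorrectly.
one_slip_responses = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

disc_clean = point_biserial(clean_responses, totals)
disc_one_slip = point_biserial(one_slip_responses, totals)

print(round(disc_clean, 2), round(disc_one_slip, 2))
```

In this made-up example one careless response lowers the index from about .85 to about .52. In a sample of 1,000 the same slip would barely register, which is consistent with the wider range of discrimination values seen for the small sample.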
The biggest difference in the performance of
items for large and small samples is the error of measurement. For the large sample the average error of
measurement for item calibrations was .06 logits with a range of .06 to .09 logits, while
for the small sample, the average error of measurement for item calibrations was .68 logits
with a range of .47 to 1.84 logits. For both large and small samples, items
with calibrated difficulties in the center of the scale (p-values of 50%-60%) have
lower measurement errors than items with
extremely high or low difficulty calibrations (p-values of 5% or 100%). The small sample analysis had more items that all
candidates answered correctly, and therefore higher measurement errors.
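
Both effects described above (more candidates means less error, and centered items have less error than extreme ones) follow from the usual asymptotic standard error of a Rasch item calibration, which is the reciprocal square root of the summed response variances P(1 - P). The sketch below is a simplification that assumes every candidate has the same ability; the sample sizes of 1,000 and 18 are stand-ins for "over 1,000" and "fewer than 20", and all names are hypothetical.

```python
from math import exp, sqrt


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


def item_standard_error(abilities, difficulty):
    """Asymptotic standard error (in logits) of an item difficulty estimate."""
    probabilities = [rasch_probability(b, difficulty) for b in abilities]
    information = sum(p * (1.0 - p) for p in probabilities)
    return 1.0 / sqrt(information)


# Simplifying assumption: all candidates sit at 0 logits.
large_sample = [0.0] * 1000
small_sample = [0.0] * 18

# A centered item (expected p-value 50%) in each sample.
se_large_centered = item_standard_error(large_sample, 0.0)
se_small_centered = item_standard_error(small_sample, 0.0)

# A very easy item (expected p-value about 95%) in the small sample.
se_small_easy = item_standard_error(small_sample, -3.0)

print(round(se_large_centered, 2),
      round(se_small_centered, 2),
      round(se_small_easy, 2))
```

Under these assumptions the centered item has a standard error of roughly .06 logits with 1,000 candidates and roughly .47 logits with 18, and the very easy item exceeds 1 logit in the small sample. The error grows without bound as the p-value approaches 100%, since P(1 - P) then approaches zero.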
In summary, item difficulty and discrimination
are not as accurately measured using a small sample; however, the items are
calibrated with sufficient accuracy to produce an acceptable level of candidate separation reliability (.89 for the large
sample and .86 for the small sample), suggesting acceptably accurate measurement
of candidate performance regardless of sample size.