The standard error of measurement (S.E.) is widely used as a stopping rule for computer-adaptive tests (CAT). For instance, if the current measure estimate is more than 1.96 S.E.s from the pass-fail measure, then there is 95% confidence in the pass-fail decision; at 2.58 S.E.s, there is 99% confidence. But how many items are needed to reach a desired S.E.?
If a person has probability, P, of succeeding on a dichotomous item (such as a multiple-choice question), then the statistical information in the response is P*(1-P). Information accumulates across the administered items, so the standard error of the estimated measure is
S.E. = 1/sqrt(information) = 1/sqrt(sum(P*(1-P)))
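The formula and the stopping rule can be sketched in Python. This is an illustration only: the function names, the 30-item example, and the measures of 1.0 and 0.0 logits are assumptions made for the example, not values from a real test.

```python
import math


def standard_error(probabilities):
    """S.E. in logits: 1 / sqrt(sum of P*(1-P)) over the administered items."""
    information = sum(p * (1 - p) for p in probabilities)
    return 1 / math.sqrt(information)


def decision_reached(estimate, cut_point, se, z=1.96):
    """True when the estimate is more than z S.E.s from the pass-fail measure."""
    return abs(estimate - cut_point) > z * se


# Hypothetical example: 30 items, each with success probability 0.7.
se = standard_error([0.7] * 30)
print(round(se, 3))  # 0.398

# Hypothetical measures: estimate 1.0 logits, pass-fail point 0.0 logits.
print(decision_reached(estimate=1.0, cut_point=0.0, se=se))  # True
```

Passing `z=2.58` instead of the default applies the 99% confidence rule.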
The largest information, and so the smallest standard error, occurs when P=0.5, i.e., when the CAT items are targeted exactly on the person. But this can produce an unsatisfactory testing experience for the examinee, so higher probabilities of success are targeted, such as P=0.7 (items are selected so that the person achieves about 70% success on the administered items) or P=0.8 (80% success). If every administered item has the same success probability P, then N items give S.E. = 1/sqrt(N*P*(1-P)), so at least 1/(S.E.^2 * P*(1-P)) items are needed. This table shows the minimum number of items that must be administered to reach a specific S.E. at each targeting probability:
Minimum number of CAT items administered, by targeting probability of success (rows) and S.E. in logits (columns)

| Targeting Probability of Success | 0.5 | 0.4 | 0.3 | 0.2 | 0.15 | 0.1 |
|---|---|---|---|---|---|---|
| P=0.5 | 16 | 25 | 45 | 100 | 178 | 400 |
| P=0.6 | 17 | 27 | 47 | 105 | 186 | 417 |
| P=0.7 | 20 | 30 | 53 | 120 | 212 | 477 |
| P=0.8 | 25 | 40 | 70 | 157 | 278 | 625 |
| P=0.9 | 45 | 70 | 124 | 278 | 494 | 1112 |
The penalty for going from P=0.5 to P=0.6 targeting is the administration of about 4% more items. From P=0.5 to P=0.7 is about 19% more items. From P=0.5 to P=0.8 is about 56% more items. P=0.9 almost triples the test length. Because the number of items is inversely proportional to the square of the S.E., an S.E. of 0.15 logits requires about 11 times as many items as an S.E. of 0.5 logits.
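The entries in the table above can be reproduced with a few lines of Python. This is a minimal sketch, assuming every administered item has the same success probability P, so that the total information is N*P*(1-P). The function name `min_items` is illustrative only. Exact fractions are used so that binary floating-point rounding does not push an exact count, such as 25 items for P=0.8 and S.E.=0.5, up to the next integer.

```python
import math
from fractions import Fraction


def min_items(p_success, target_se):
    """Smallest N with 1/sqrt(N*P*(1-P)) <= target_se.

    Assumes all N items share the same success probability P.
    """
    p = Fraction(str(p_success))   # exact decimal, e.g. 0.8 -> 4/5
    se = Fraction(str(target_se))
    return math.ceil(1 / (se ** 2 * p * (1 - p)))


# Rebuild the rows of the table: one row per targeting probability.
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(p, [min_items(p, se) for se in (0.5, 0.4, 0.3, 0.2, 0.15, 0.1)])
```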
Minimum number of items for 95% confidence (|t| >= 1.96) in the pass-fail decision, by targeting probability of success (rows) and logit distance of the ability estimate from the pass-fail point (columns)

| Targeting Probability of Success | 1 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| P=0.5 | 16 | 19 | 25 | 32 | 43 | 62 | 97 | 171 | 385 | 1537 |
| P=0.6 | 17 | 20 | 26 | 33 | 45 | 65 | 101 | 178 | 401 | 1601 |
| P=0.7 | 19 | 23 | 29 | 38 | 51 | 74 | 115 | 204 | 458 | 1830 |
| P=0.8 | 25 | 30 | 38 | 49 | 67 | 97 | 151 | 267 | 601 | 2401 |
| P=0.9 | 43 | 53 | 67 | 88 | 119 | 171 | 267 | 475 | 1068 | 4269 |
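The pass-fail table follows from requiring the distance d from the pass-fail point to be at least 1.96 S.E.s, which gives N >= (1.96/d)^2 / (P*(1-P)). The sketch below is one way to compute this, again assuming every item has the same success probability P. The function name and its `z` parameter are illustrative, not taken from any CAT software.

```python
import math
from fractions import Fraction


def min_items_for_decision(p_success, distance, z=1.96):
    """Smallest N for which `distance` is at least z standard errors.

    Assumes all N items share the same success probability P.
    """
    p = Fraction(str(p_success))   # exact decimals avoid float rounding
    d = Fraction(str(distance))
    z = Fraction(str(z))
    return math.ceil((z / d) ** 2 / (p * (1 - p)))


# One row of the table: P=0.8 targeting, distances from 1 logit to 0.1 logits.
distances = (1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
print([min_items_for_decision(0.8, d) for d in distances])
```

Calling the function with `z=2.58` gives the corresponding test lengths for 99% confidence.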
When administering many items in a CAT, it is also wise to consider item response times: "Utilizing Response Time Distributions for Item Selection in CAT," Zhewen Fan, Chun Wang, Hua-Hua Chang, and Jeffrey Douglas, Journal of Educational and Behavioral Statistics, 2012.
John Michael Linacre
Computer-Adaptive Tests (CAT), Standard Errors and Stopping Rules, Linacre J.M. Rasch Measurement Transactions, 2006, 20:2 p. 1062
The URL of this page is www.rasch.org/rmt/rmt202f.htm
Website: www.rasch.org/rmt/contents.htm