Item writers often find it difficult to write multiple
choice items that comply with good item-writing guidelines. This study shows
that the extra effort spent writing good items is worthwhile.
Manager, Test Development and Analysis
Consequences of Flawed Items
Many guidelines for writing good multiple choice items are
intended to reduce the measurement error that results when candidates who potentially
know the information being tested get an item wrong due to the construction of
the item. Two examples of item flaws that may introduce such measurement error
are multiple true/false items and items with negative stems.
Multiple true/false items violate the principle that items
should be focused on a single idea or issue. Multiple true/false items usually consist
of a minimal stem and options that are conceptually unrelated. Candidates
are required to assess each option independently and determine whether it
is true or false. For example:
The common cold:
A. is transmitted through saliva only.
B. is evident in a chest X-ray.
C. will most often clear up after two days.
D. is treatable with Tamiflu.
Items with negative stems require candidates to select from
the options the one that does NOT meet the conditions described in the
stem. Candidates may answer these items incorrectly because they skim over and miss
the negative word in the stem, and mistakenly choose a response that meets the
conditions in the stem. In addition, these items do not assess what the
candidate actually knows, but rather whether the candidate can identify an incorrect response
to the issue presented in the stem. For example, a candidate can answer the
question below without knowing the color of a pomegranate.
Which of the following is NOT red?
This study looked at the consequences of using items with these flaws
in terms of 1) item difficulty and 2) candidate outcomes.
This study is patterned after a study of items administered to medical school
students by Downing (2005). The analysis was conducted on a group of 138 items, of which 69
were flawed and 69 were unflawed. The item flaws were multiple
true/false items and items with negative stems.
Item p-value is the percentage of candidates who answered
the item correctly. The table below shows that the average p-value for the
flawed items was lower than for the unflawed items and for the total items,
indicating that the flawed items are more difficult for candidates to answer correctly.
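The p-value comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual analysis: the response matrix, the item groupings, and the function names `item_p_values` and `mean_p_value` are all hypothetical.

```python
from statistics import mean


def item_p_values(responses):
    """Return the percentage of candidates who answered each item correctly.

    `responses` holds one row per candidate; each row holds one 0/1 score
    per item (1 = correct).
    """
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [
        100.0 * sum(row[i] for row in responses) / n_candidates
        for i in range(n_items)
    ]


def mean_p_value(p_values, item_indices):
    """Average the p-values of the items at the given indices."""
    return mean(p_values[i] for i in item_indices)


# Hypothetical toy data: 4 candidates x 4 items (not the study's data).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]
unflawed = [0, 1]  # assumed indices of unflawed items
flawed = [2, 3]    # assumed indices of flawed items

p = item_p_values(responses)
print("flawed:", mean_p_value(p, flawed))
print("unflawed:", mean_p_value(p, unflawed))
print("total:", mean(p))
```

A lower average for the flawed group than for the unflawed group is the pattern the study reports.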
For the purposes of this study, the passing standard was set arbitrarily at a score of
65% correct. Candidate outcomes were
then determined based on the total items, the flawed items only, and the unflawed items
only. Only 37% of the candidates passed when the flawed items were used, compared
to 71% of the candidates passing when the unflawed items were used, and 52% passing
based on total items.
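The outcome comparison can be sketched the same way: score each candidate on a chosen subset of items and count how many reach the 65% standard. The data below are hypothetical, the function names are illustrative, and treating a score of exactly 65% as a pass is an assumption, since the article does not say how ties at the cut score were handled.

```python
def percent_correct(row, item_indices):
    """Percentage of the selected items that one candidate answered correctly."""
    return 100.0 * sum(row[i] for i in item_indices) / len(item_indices)


def pass_rate(responses, item_indices, cut_score=65.0):
    """Percentage of candidates scoring at or above the cut score.

    Assumption: a score equal to the cut score counts as a pass.
    """
    passed = sum(
        1 for row in responses
        if percent_correct(row, item_indices) >= cut_score
    )
    return 100.0 * passed / len(responses)


# Hypothetical toy data: 4 candidates x 6 items (not the study's data).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
unflawed = [0, 1, 2]        # assumed indices of unflawed items
flawed = [3, 4, 5]          # assumed indices of flawed items
total = unflawed + flawed

for label, items in [("flawed", flawed), ("unflawed", unflawed), ("total", total)]:
    print(label, pass_rate(responses, items))
```

The same candidates receive different pass/fail decisions depending on which item set is scored, which is the effect the study describes.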
While this study is simulated from real data, it confirms the
impact of flawed items found by Downing. It also provides concrete evidence
that supports eliminating multiple true/false items and items with negative stems.
Downing, S. M. (2005). The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Advances in Health Sciences Education, 10, 133-143.