The field of survey-based measurement is maturing. Calibrated survey tools are proliferating rapidly in a number of fields. Much of this work stops with the calibration of an instrument that was not designed with the intention to produce invariant measures based in sufficient statistics. The description of data-based calibrations and measures is often the only goal of publications reporting this work. That is, little or no attention is typically accorded a theory of the construct measured. This is so, even though it has long since been recognized that "there is nothing so practical as a good theory" (Lewin, 1951, p. 169), and even though the practicality of theory has been made glaringly evident in the success of the Lexile Framework for Reading (Stenner, Burdick, Sanford, & Burdick, 2006).
Accordingly, even when instruments are precisely calibrated, survey content still tends to dominate the reporting of the results and the implicit definition of the construct. But is it not likely, in this scenario, that supposedly different constructs measured in supposedly different units might actually be one and the same? Is not the real proof of understanding a capacity to construct parallel instruments from theory in a way that results in equivalent measures across samples? Would not a focus on experimental tests like this work to spur consensus on what is measurable and on what works to change measures in desired directions?
To advance survey-based science in this direction, item writers and survey data analysts should follow nineteen basic rules of thumb to create surveys that
• are likely to provide data of a quality high enough to meet the requirements for measurement specified in a probabilistic conjoint measurement (PCM) model (Suppes, Krantz, Luce & Tversky, 1989),
• implement the results of the PCM tests of the quantitative hypothesis in survey and report layouts, making it possible to read interpretable quantities off the instrument at the point of use with no need for further computer analysis (Masters, Adams & Lokan, 1994), and
• are joined with other surveys measuring the same variable in a metrology network that ensures continued equating (Masters, 1985) with a single, reference standard metric (Fisher, 1997).
An initial experiment in this direction has been sponsored under the auspices of the National Center for Special Education Accountability Monitoring (NCSEAM). With the November 2004 reauthorization of the Individuals with Disabilities Education Act (IDEA), states are required to report parents' and families' perceptions of the quality of the services received by them and their children. NCSEAM designed and piloted surveys in a research study intended to provide states with scientifically defensible measures. The research began with intensive qualitative research into the constructs to be measured, involving literature reviews, focus sessions with stakeholders in several states, and Rasch analyses of survey data from several other states. This work paved the way for the intentional conceptualization of theories pertaining to several distinct constructs. A pilot study employing these surveys was itself designed so as to demonstrate as conclusively as possible the invariant comparability of the measures across independent samples of items. The success of this research has culminated in the practical application of the NCSEAM surveys to the new reporting requirement by a number of states, and in the emergence of a small community of special education and early intervention researchers, administrators, parents, and advocates who are learning how to use these tools to assess, compare, and improve the quality of programs.
For those wishing to emulate this program, the following recommendations are offered:
1. Make sure all items are expressed in simple, straightforward language. Use common words with meanings that are as unambiguous as possible.
2. Restrict each item to one idea. This means avoiding conjunctions (and, but, or), synonyms, and dependent clauses. A conjunction indicates the presence of at least two ideas in the item. Having two or more ideas in an item is unacceptable because there is no way to tell from the data which single idea or combination of ideas the respondent was dealing with. If two synonymous words really mean the same thing, only one of them is needed. If the separate words are both valuable enough to include, they need to be expressed in separate items. Dependent (if, then) clauses require the respondent to think conditionally or contingently, adding a further and usually unrecoverable layer of interpretation behind the responses that may muddy the data.
3. Avoid "Not Applicable" or "No Opinion" response categories. It is far better to instruct respondents to skip irrelevant items than it is to offer them the opportunity in every item to seem to provide data, but without having to make a decision.
4. Avoid odd numbers of response options. Middle categories can attract disproportionate numbers of responses. Like "Not Applicable" options, middle categories allow respondents to appear to be providing data, but without making a decision. If someone really cannot decide which side of an issue they come down on, it is better to let them decide on their own to skip the question. If the data then show that two adjacent categories turn out to be incapable of sustaining a quantitative distinction, that evidence will be in hand and can inform future designs.
5. Have enough response categories: not too few and not too many. Do not assume that respondents can make only one or two distinctions in their responses, and do not simply default to the usual four response options (Strongly Agree, Agree, Disagree, Strongly Disagree, or Never, Sometimes, Often, and Always, for instance). The LSU HSI PFS, for example, employs a six-point rating scale and is intended for use in the Louisiana statewide public hospital system, which provides most of the indigent care in the state. About 75% of the respondents in the study reported have less than a high school education, but they provided consistent responses to the questions posed. Part of the research question raised in any measurement effort concerns determining the number of distinctions that the variable is actually capable of supporting, as well as determining the number of distinctions required for the needed comparisons. Starting with six (adding in Very Strongly Agree/Disagree categories to the ends of the continuum) or even eight (adding Absolutely Agree/Disagree extremes) response options gives added flexibility in survey design. If one or more categories blend with another and are not much used, the categories can be combined. Research that starts with fewer categories, though, cannot work in the other direction and create new distinctions. More categories have the added benefit of boosting measurement reliability, since, given the same number of items, an increase in the number of functioning (used) categories increases the number of distinctions made among those measured.
6. Write questions that will provoke respondents to use all of the available rating options. This will maximize variation, important for obtaining high reliability. This is a start at conceptualizing a theory. What kinds of questions will be most likely to consistently provoke agreeable responses, no matter how agreeable a respondent is? Conversely, what kinds of questions will be most likely to consistently provoke disagreeable responses, no matter how agreeable a respondent is? What is it that makes the variable evolve in this manner, along the hierarchy defined by the agreeability continuum of the questions? Articulating these questions in advance and writing survey items that put an explicit theory into play propels measurement into higher likelihoods of obtaining the desired invariance.
7. Write enough questions and have enough response categories to obtain an average error of measurement low enough to provide the needed measurement separation reliability, given sufficient variation. Reliability is a strict mathematical function of error and variation and ought to be more deliberately determined via survey design than it currently is. For instance, if the survey is to be used to detect a very small treatment effect, measurement error will need to be very low relative to the variation, and discrimination will need to be focused at the point where the group differences are effected, if statistically significant and substantively meaningful results are to be obtained. On the other hand, a reliability of .70 will suffice to simply distinguish high from low measures. Given that there is as much error as variation when reliability is below .70, and it is thus not possible to distinguish two groups of measures in data this unreliable, there would seem to be no need for instruments in that range.
8. Before administering the survey, divide the items into three or four groups according to their expected ratings. If any one group has significantly fewer items than the others, write more questions for it. If none of the questions are expected to garner very low or very high ratings, reconsider the importance of step 6 above.
9. Order the items according to their expected ratings and consider what it is about some questions that makes them easy (or agreeable or important, etc.), and what it is about other questions that makes them difficult (or disagreeable, unimportant, etc.). This exercise in theory development is important because it promotes understanding of the variable. After the first analysis of the data, compare the empirical item order with the theoretical item order. Do the respondents actually order the items in the expected way? If not, why not? If so, are there some individuals or groups who did not? Why?
10. Consider the intended population of respondents and speculate on the average score that might be expected from the survey. If the expected average score is near the minimum or the maximum possible, the instrument is off target. Targeting and reliability can be improved by adding items that provoke responses at the unused end of the rating scale. Measurement error is lowest in the middle of the measurement continuum, and increases as measures approach the extremes. Given a particular amount of variation in the measures, more error reduces reliability and less error increases it. Well-targeted instruments enhance measurement efficiency by providing lower error, increased reliability, and more statistically significant distinctions among the measures for the same number of questions asked and rating options offered.
11. If it is possible to write enough questions to calibrate a bank of more items than any one respondent need ever see, design the initial calibration study to have two forms that each have enough items to produce the desired measurement reliability. Use the theory to divide the items into three equal groups (one common to both forms and one unique to each form), with equal numbers of items in each group drawn from each theoretical calibration range. Make sure that each form is administered to samples of respondents from the same population who vary with respect to the construct measured, and who number at least 200. Convincing demonstrations of metric invariance and theoretical control of the construct become possible when the separate-sample calibrations of the items common to the two forms plot linearly and correlate highly, and when the common- and separate-form items each produce measures of their respective samples that also plot linearly and correlate highly.
12. Be sure to obtain enough demographic information from respondents to be able to test hypotheses concerning the sources of failures of invariance. It can be frustrating to see significant differences in calibration values and be unable to know if they are due to sex, age, ethnic, educational, income or other identifiable differences.
13. As soon as data from 30-50 respondents are obtained, and before more forms are printed and distributed, analyze the data and examine the rating scale structure and the model fit using a measurement analysis that evaluates each item's rating scale independently. Make sure the analysis was done correctly by checking responses in the Guttman scalogram against a couple of respondents' surveys, and by examining the item and person orders for the expected variable. Identify items with poorly populated response options and consider combining categories or changing the category labels. Study the calibration order of the category transitions and make sure that a higher category always represents more of the variable; consider combining categories or changing the category labels for items with jumbled or reversed structures. Test out recodes in another analysis; check their functioning, and then examine the item order and fit statistics, starting with the fit means and standard deviations. If some items appear to be addressing a different construct, ask if this separate variable is relevant to the measurement goals. If not, discard or modify the items. If so, use these items as a start at constructing another instrument.
14. When the full calibration sample is obtained, maximize measurement reliability and data consistency. First identify items with poor model fit. If an item is wildly inconsistent, with a mean square or standardized fit statistic markedly different from all others, examine the item itself for reasons why its responses should be so variable. Does it perhaps pertain to a different variable? Does the item ask two or more very different questions at once? It may also be relevant to find out which respondents are producing the inconsistencies, as their identities may suggest reasons for their answers. If the item itself seems to be the source of the problem, it may be set aside for inclusion in another scale, or for revision and later re-incorporation. If the item is functioning in different ways for different groups of respondents, then the data for the two groups on this item ought to be separated into different columns in the analysis, making the single item into two. Finally, if the item is malfunctioning for no apparent reason and for only a very few otherwise credible respondents, it may be necessary to temporarily omit only the specific, especially inconsistent responses from the calibration. Then, after the highest reliability and maximum data consistency are achieved, another analysis should be done, one in which the inconsistent responses are restored to the data. The two sets of measures should then be compared in plots to determine how much the inconsistencies affect the results.
15. The instrument calibration should be compared with calibrations of other similar instruments used to measure other samples from the same population. Do similar items calibrate at similar positions on the measurement continuum? If not, why not? If so, how well do the pseudo-common items correlate and how near the identity line do they fall in a plot? If the rating scale category structures are different, are the transition calibrations meaningfully spaced relative to each other?
16. Calibration results should be fed back onto the instrument itself. When the variable is found to be quantitative and item positions on the metric are stable, that information should be used to reformat the survey into a self-scoring report. This kind of worksheet builds the results of the instrument calibration experiment into the way information is organized on a piece of paper, providing quantitative results (measure, error, percentile, qualitative consistency evaluation, interpretive guidelines) at the point of use. No survey should be considered a finished product until this step is taken.
17. Data should be routinely sampled and recalibrated to check for changes in the respondent population that may be associated with changes in item difficulty.
18. For maximum utility, the instrument should be equated with other instruments intended to measure the same variable, creating a reference standard metric.
19. Everyone interested in measuring the variable should set up a metrology system, a way of maintaining the reference standard metric via comparisons of results across users and brands of instruments. To ensure repeatability, metrology studies typically compare measures made from a single homogeneous sample circulated to all users. This is an unrealistic strategy for most survey research, so a workable alternative would be to occasionally employ two or more previously equated instruments in measuring a common sample. Comparisons of these results should help determine whether there are any needs for further user education, instrument modification, or changes to the sampling design.
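Some of the checks described in recommendations 7, 9, and 13 reduce to short computations. The Python sketch below is illustrative only and is not taken from the NCSEAM study or from any particular Rasch software. It assumes that measures and their standard errors have already been estimated elsewhere, that reliability is computed as error-adjusted variance divided by observed variance, and that "poorly populated" means fewer than a chosen minimum count per category. The function names and the default threshold of 10 are assumptions made for this example.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def separation_reliability(measures, standard_errors):
    """Recommendation 7: reliability as a function of error and variation.

    Assumes reliability = (observed variance - mean error variance)
    / observed variance, and that every standard error is positive.
    Returns (reliability, separation).
    """
    observed_var = pvariance(measures)
    error_var = mean([se ** 2 for se in standard_errors])
    # The error-adjusted variance cannot be negative.
    true_var = max(observed_var - error_var, 0.0)
    reliability = true_var / observed_var
    separation = math.sqrt(true_var / error_var)
    return reliability, separation


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def rank_agreement(theoretical, empirical):
    """Recommendation 9: Spearman correlation between the theoretical
    item order and the empirical item calibrations."""
    x, y = average_ranks(theoretical), average_ranks(empirical)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sparse_categories(responses, categories, min_count=10):
    """Recommendation 13: list response categories used fewer than
    min_count times for one item (the threshold is an assumption)."""
    counts = Counter(responses)
    return [c for c in categories if counts.get(c, 0) < min_count]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    measures = [-1.0, 0.0, 1.0, 2.0]
    errors = [0.5, 0.5, 0.5, 0.5]
    print(separation_reliability(measures, errors))

    expected_order = [1, 2, 3, 4, 5]
    calibrations = [-1.2, -0.3, -0.5, 0.8, 1.5]
    print(rank_agreement(expected_order, calibrations))

    item_responses = [1, 1, 2, 3, 3, 3]
    print(sparse_categories(item_responses, [1, 2, 3, 4], min_count=2))
```

A rank correlation well below 1 between the theoretical and empirical item orders would prompt the "If not, why not?" question in recommendation 9, and any category returned by the last function is a candidate for combining or relabeling as described in recommendation 13.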
William P. Fisher, Jr.
Fisher, W. P., Jr. (1997). What scale-free measurement means to health outcomes research. Physical Medicine & Rehabilitation State of the Art Reviews, 11(2), 357-373.
Lewin, K. (1951). Field theory in social science: Selected theoretical papers (D. Cartwright, Ed.). New York: Harper & Row.
Masters, G. N. (1985, March). Common-person equating with the Rasch model. Applied Psychological Measurement, 9(1), 73-82.
Masters, G. N., Adams, R. J., & Lokan, J. (1994). Mapping student achievement. International Journal of Educational Research, 21(6), 595-610.
Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2006). How accurate are Lexile text measures? Journal of Applied Measurement, 7(3), 307-322.
Suppes, P., Krantz, D., Luce, R., & Tversky, A. (1989). Foundations of measurement, Volume II: Geometric and probabilistic representations. New York: Academic Press.
Survey Design Recommendations, Fisher, W.P. Rasch Measurement Transactions, 2006, 20:3 p. 1072-4
The URL of this page is www.rasch.org/rmt/rmt203f.htm