Journal of Applied Measurement

GUIDELINES FOR MANUSCRIPTS

Reprinted from Smith, R.M., Linacre, J.M., and Smith, Jr., E.V. (2003). Guidelines for manuscripts. *Journal of Applied Measurement, 4*, 198-204.

Included in this editorial are guidelines for manuscripts submitted to the **Journal of Applied Measurement** that involve applications of Rasch measurement. These guidelines may also be of use to those attempting to publish Rasch measurement applications in other journals whose editors and reviewers may not be familiar with these methods.

Following the guidelines, we provide a list of references that may assist individuals in gaining an overview of some of the material discussed in the guidelines. The guidelines and the list of references are by no means exhaustive. If you feel an important reference has been left out, or if you have a recommendation for the guidelines, please e-mail us your suggestions (Richard Smith via www.jampress.org).

Finally, we consider this a work in progress and thank William Fisher and George Karabatsos for comments on an earlier version. We will attempt to incorporate ideas and references as we receive them. Please periodically visit the journal website at www.jampress.org for the most recent updates.

A. Describing the Problem

1. Adequate references, including at least a reference to Rasch (1960) when appropriate.

2. Adequate theory, including at least an exact algebraic representation of the Rasch model(s) used and a citation for the primary developer(s).

3. Adequate description of the measurement problem, including a hypothesized definition of the latent variable, identification of the facets under investigation, and a description of the rating scales or response formats.

4. Rationale for using Rasch measurement techniques. This may include, for example, a preference for the unique properties that Rasch models embody, the goal of establishing generalized reference-standard metrics, or an empirical justification, such as a comparison of the generalizability of the estimated parameters obtained from competing models. Addressing the rationale for using Rasch measurement is particularly important when reviewers are more familiar with the philosophy behind Item Response Theory (IRT) or Classical True-Score Theory (CTT).

B. Describing the Analysis

1. Name and citation or adequate description of software or estimation methodology employed.

2. Provide a rationale for the choice of fit statistics and the criteria employed to indicate adequate fit to the model requirements. This should include some acknowledgment of the Type I error rate that the critical values imply. Note: The mean square is not a symmetric statistic; a value of 0.7 is further from 1.0 than is 1.3. Using a 1.3/0.7 cutoff for mean squares therefore applies different Type I error rates to the upper and lower tails of the mean square distribution.
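This asymmetry is easy to verify numerically. The sketch below (plain Python, no particular Rasch package assumed) compares the two conventional cutoffs on the log scale, where mean squares are approximately symmetric:

```python
import math

# Mean squares are bounded below by 0 but unbounded above, so equal-looking
# distances from 1.0 are not equivalent. On the log scale, 0.7 is farther
# from 1.0 than 1.3 is.
low, high = 0.7, 1.3
print(abs(math.log(low)))   # distance of 0.7 from 1.0 on the log scale (~0.357)
print(abs(math.log(high)))  # distance of 1.3 from 1.0 on the log scale (~0.262)

# A lower cutoff symmetric (on the log scale) to an upper cutoff of 1.3
# would be its reciprocal, about 0.77, not 0.7.
print(round(1 / high, 3))
```

The reciprocal-pair idea (e.g., 1/1.3 and 1.3) is one simple way to equalize the implied tail probabilities, though exact Type I error rates still depend on the sampling distribution of the mean square.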

C. Reporting the Analysis

1. Map of linear variable as defined by items.

2. Map of distribution of sample on linear variable.

3. Report on functioning of rating scale(s), and of any procedures taken to improve measurement (e.g., category collapsing).

Note: It is extremely difficult to make decisions about the use of response categories in the rating scale or partial credit model if there are fewer than 30 persons in the sample or fewer than 10 observations in each category. You may want to defer that task until your samples are somewhat larger. If the person distribution is skewed, you may need even larger samples, since one tail of the distribution will not be well populated. The same is true if the sample mean is offset from the mean of the item difficulties; this leaves few observations in the extreme categories of the items opposite the concentration of persons.
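As a quick screening step before any collapsing decision, category frequencies can simply be tallied. A minimal sketch with made-up ratings on a 1-4 scale:

```python
from collections import Counter

# Hypothetical vector of observed ratings on a 1-4 scale.
responses = [2, 3, 3, 1, 4, 2, 3, 2, 4, 1, 3, 2]

counts = Counter(responses)

# Flag categories observed fewer than 10 times -- too sparse, per the
# guideline above, for stable category/threshold decisions.
sparse = sorted(cat for cat, n in counts.items() if n < 10)
print(sparse)  # in this tiny illustration, every category is sparse
```

With a sample this small, every category falls below the 10-observation guideline, which is exactly the situation in which category-collapsing decisions should be deferred.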

4. Investigation of secondary dimensions in items, persons, etc., using, for example, fit statistics and other analyses of the residuals.

Note: Finding that all of the point-biserial correlations in a rating scale or partial credit analysis are greater than 0.30 lends little support to the concept of unidimensionality. The median point-biserial in rating scale or partial credit data is often well above 0.70; in that situation, a number of items in the 0.30 to 0.40 range would be a good sign of multidimensionality.

5. Investigation of local idiosyncrasies in items, persons, etc.

Note: Fit statistics for small sample sizes are very unstable; one or two unusual responses can produce a large fit statistic. Count the number of item/person standardized residuals that are larger than 2.0. You might be surprised how few there are. Do you want to drop an item because of a few unexpected responses?
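Counting the large standardized residuals is straightforward. A small illustration with hypothetical values (a 3-person by 4-item residual matrix):

```python
# Hypothetical matrix of standardized residuals (persons x items).
residuals = [
    [ 0.3, -1.1,  0.8,  2.4],
    [-0.5,  0.2, -2.1,  0.7],
    [ 1.9, -0.4,  0.6, -0.8],
]

# Collect residuals whose absolute value exceeds 2.0.
large = [r for row in residuals for r in row if abs(r) > 2.0]
print(len(large), "of", sum(len(row) for row in residuals))  # 2 of 12
```

Here a large item fit statistic could be driven by just the two flagged responses, which is the point of the note: inspect the responses before dropping the item.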

6. Report Rasch separation and reliabilities, not KR-20 or Alpha.

Note: Reliability was originally conceptualized as the ratio of the true variance to the observed variance. Since the true-score model offered no way to estimate the standard error of measurement (SEM), a variety of methods (e.g., KR-20, Alpha) were developed to estimate reliability without knowing the SEM. In the Rasch model it is possible to approach reliability as originally intended, rather than relying on a less-than-ideal substitute.
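As a sketch of that original conception, Rasch person reliability and separation can be computed directly from the person measures and their standard errors. The numbers below are illustrative, not from any real analysis, and the variable names are not tied to any particular program:

```python
# Hypothetical person measures (logits) and their standard errors.
measures = [-1.2, -0.4, 0.1, 0.8, 1.5, 2.0]
ses      = [0.45, 0.40, 0.38, 0.40, 0.44, 0.52]

n = len(measures)
mean = sum(measures) / n
observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
error_var = sum(se ** 2 for se in ses) / n   # mean square measurement error
true_var = observed_var - error_var          # error-adjusted ("true") variance

reliability = true_var / observed_var        # true variance / observed variance
separation = (true_var / error_var) ** 0.5   # person separation index
print(round(reliability, 2), round(separation, 2))  # 0.87 2.58
```

Because the Rasch analysis supplies a standard error for every measure, the true variance is obtained by direct subtraction rather than by an internal-consistency approximation such as KR-20 or Alpha.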

7. Report on applicable validity issues.

Note: This is of particular importance when attempting to convey the results of Rasch analysis to non-Rasch oriented readers. Attempts should be made to address the validity issues raised by Messick (1989, 1995), Cherryholmes (1988), and the Medical Outcomes Trust (1995). See Smith (2001) for one interpretation and Fisher (1994) for connecting qualitative mathematical criteria for meaningfulness with quantitative mathematical criteria.

8. Any special measurement concerns?

For example: Missing data: *not administered* or what? Folded data: how resolved? Nested data: how accommodated? Loosely connected facets: how were differences in local origins removed? Measurement vs. description facets: how disentangled?

9. For tests of statistical significance, in addition to the test statistics, degrees of freedom, and p-values, we encourage authors to report and interpret effect sizes and/or confidence intervals.

D. Style and Terminology

1. Use *Score* for *Raw Score* and *Measure* or *Calibration* for Rasch-constructed linear measures.

2. We do not encourage the use of *Item Response Theory* as a term for Rasch measurement.

3. Rescale from logits to a user-oriented scaling.

4. If appropriate, attempt to convey the results in graphical format.

5. Do not use inappropriate language when discussing reliability and validity (e.g., "the test is reliable and valid"). It is the measures that are reliable, and it is the inferences made from the item and person measures and fit information that are valid for specific purposes.

E. Common Oversights

1. Do not take the mean and standard deviation of point-biserial correlations. These statistics are more non-linear than the raw scores. It is best to report the median and inter-quartile range, or to apply a Fisher *z*-transformation before calculating a mean.

2. When comparing the results of several calibrations of the same data, do not use the item and person reliability as criteria for improvement. These indices suffer from the same floor and ceiling effects as their true score counterparts and hence may not accurately reflect increases in reliability. If an increase in reliability is one of your criteria for improvement, use the item and person separation indices to compare the results of multiple calibrations as these indices do not suffer from the same deficiencies.

*References*

Cherryholmes, C. (1988). Construct validity and the discourses of research. *American Journal of Education, 96*, 421-457.

Medical Outcomes Trust Scientific Advisory Committee. (1995). Instrument review criteria. *Medical Outcomes Trust Bulletin*, 1-4.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), *Educational Measurement* (3rd ed., pp. 13-103). New York: Macmillan.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. *American Psychologist, 50*, 741-749.

*Rasch Measurement Models*

Adams, R. J., Wilson, M. R., and Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. *Applied Psychological Measurement, 21*, 1-24.

Andrich, D. (1978). A rating formulation for ordered response categories. *Psychometrika, 43*, 561-574.

Andrich, D. (1988). *Rasch models for measurement*. Sage university paper series on quantitative measurement in the social sciences. Newbury Park, CA: Sage Publications.

Bond, T. G., and Fox, C. M. (2001). *Applying the Rasch model: Fundamental measurement in the human sciences*. London: Erlbaum.

Fischer, G. H., and Molenaar, I. W. (1995). *Rasch models: Foundations, recent developments, and applications*. New York: Springer-Verlag.

Linacre, J. M. (1989). *Many-facet Rasch measurement*. Chicago: MESA Press.

Masters, G. N. (1982). A Rasch model for partial credit scoring. *Psychometrika, 47*, 149-174.

Rasch, G. (1960). *Probabilistic models for some intelligence and attainment tests*. Copenhagen: Danish Institute for Educational Research (Expanded edition, 1980. Chicago: University of Chicago Press).

Wright, B. D., and Masters, G. N. (1982). *Rating scale analysis*. Chicago: MESA Press.

Wright, B. D., and Mok, M. (2000). Rasch models overview. *Journal of Applied Measurement, 1*, 83-106.

Wright, B. D., and Stone, M. H. (1979). *Best test design*. Chicago: MESA Press.

*Rationale for Using Rasch Models*

Andersen, E. B. (1977). Sufficient statistics in latent trait models. *Psychometrika, 42*, 69-81.

Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J.A. Keats, R. Taft, R.A. Heath, and S.H. Lovibond (Eds.), *Mathematical and Theoretical Systems* (pp. 7-16). North Holland: Elsevier Science Publishers.

Andrich, D. (1995). Distinctive and incompatible properties of two common classes of IRT models for graded responses. *Applied Psychological Measurement, 19*, 101-119.

Andrich, D. (2001, October). *Controversy and the Rasch model: A characteristic of a scientific revolution*. Paper presented at the meeting of the International Conference on Objective Measurement: Focus on Health Care, Chicago, IL.

Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch's paradigm: A reflection for the next generation. *Journal of Applied Measurement, 3*, 325-359.

Bond, T. G., and Fox, C. M. (2001). *Applying the Rasch model: Fundamental measurement in the human sciences*. London: Erlbaum.

Choppin, B. (1985). Lessons for psychometrics from thermometry. *International Journal of Educational Research* (formerly *Evaluation in Education*), *9*, 9-12.

Fisher, W. P., Jr. (1993). Scale-free measurement revisited. *Rasch Measurement Transactions, 7*, 272-273. www.rasch.org/rmt/rmt71.htm.

Fisher, W. P., Jr. (1995). Opportunism, a first step to inevitability? *Rasch Measurement Transactions, 9*, 426. www.rasch.org/rmt/rmt92.htm.

Fisher, W. P., Jr. (1996). The Rasch alternative. *Rasch Measurement Transactions, 9*, 466-467. www.rasch.org/rmt/rmt94.htm.

Linacre, J. M. (1996). The Rasch model cannot be "disproved"! *Rasch Measurement Transactions, 10*, 512-514. www.rasch.org/rmt/rmt103.htm.

Perline, R., Wright, B. D., and Wainer, H. (1979). The Rasch model as additive conjoint measurement. *Applied Psychological Measurement, 3*, 237-256.

Romanoski, J., and Douglas, G. (2002). Test scores, measurement, and the use of analysis of variance: An historical overview. *Journal of Applied Measurement, 3*, 232-242.

Smith, R. M. (1992). *Applications of Rasch measurement*. Chicago: MESA Press.

Wright, B. D. (1967). Sample-free test calibration and person measurement. In B. S. Bloom (Chair), *Invitational Conference on Testing Problems* (pp. 84-101). Princeton, NJ: Educational Testing Service.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. *Journal of Educational Measurement, 14*, 97-116.

Wright, B. D., and Linacre, J. M. (1989). Observations are always ordinal; measurements, however, must be interval. *Archives of Physical Medicine and Rehabilitation, 70*, 857-860. Available at www.rasch.org/memo44.htm.

Wright, B. D., and Masters, G. N. (1982). *Rating scale analysis*. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). *Best test design*. Chicago: MESA Press.

*Estimation Methodology*

Fischer, G. H., and Molenaar, I. W. (1995). *Rasch models: Foundations, recent developments, and applications*. New York: Springer-Verlag.

Linacre, J. M. (1989). *Many-facet Rasch measurement*. Chicago: MESA Press.

Linacre, J. M. (1999). Estimation methods for Rasch measures. *Journal of Outcome Measurement, 3*, 382-405.

Wright, B. D., and Masters, G. N. (1982). *Rating scale analysis*. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). *Best test design*. Chicago: MESA Press.

*Assessing Dimensionality and Fit*

Andersen, E. B. (1973). A goodness-of-fit test for the Rasch model. *Psychometrika, 38*, 123-140.

Bond, T. G., and Fox, C. M. (2001). *Applying the Rasch model: Fundamental measurement in the human sciences*. London: Erlbaum.

Engelhard, Jr., G. (1994). Examining rater errors in the assessment of written composition with a many-facet Rasch model. *Journal of Educational Measurement, 31*, 93-112.

Engelhard, Jr., G. (1996). Clarification to "Examining rater errors in the assessment of written composition with a many-facet Rasch model". *Journal of Educational Measurement, 33*, 115-116.

Fischer, G. H., and Molenaar, I. W. (1995). *Rasch models: Foundations, recent developments, and applications*. New York: Springer-Verlag.

Glas, C. A. W. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. *Psychometrika, 53*, 525-546.

Kelderman, H. (1984). Loglinear Rasch model tests. *Psychometrika, 49*, 223-245.

Linacre, J. M. (1992). Prioritizing misfit indicators. *Rasch Measurement Transactions, 9*, 422-423.

Linacre, J. M. (1998a). Structure in Rasch residuals: Why principal component analysis? *Rasch Measurement Transactions, 12*, 636.

Linacre, J. M. (1998b). Detecting multidimensionality: Which residual data-type works best? *Journal of Outcome Measurement, 2*, 266-283.

Linacre, J. M., and Wright, B. D. (1994). Chi-square fit statistics. *Rasch Measurement Transactions, 8*, 360-361.

Smith, Jr., E. V. (2002). Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. *Journal of Applied Measurement, 3*, 205-231.

Smith, R. M. (1991a). *IPARM: Item and person analysis with the Rasch model*. Chicago: MESA Press.

Smith, R. M. (1991b). The distributional properties of Rasch item fit statistics. *Educational and Psychological Measurement, 51*, 541-565.

Smith, R. M. (1996a). A comparison of methods for determining dimensionality in Rasch measurement. *Structural Equation Modeling, 3*, 25-40.

Smith, R. M. (1996b). Polytomous mean square fit statistics. *Rasch Measurement Transactions, 10*, 516-517.

Smith, R. M. (2000). Fit analysis in latent trait measurement models. *Journal of Applied Measurement, 1*, 199-218.

Smith, R. M., Schumacker, R. E., and Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. *Journal of Outcome Measurement, 2*, 66-78.

Wright, B. D. (1991a). Diagnosing misfit. *Rasch Measurement Transactions, 5*, 156.

Wright, B. D. (1991b). Factor item analysis versus Rasch item analysis. *Rasch Measurement Transactions, 5*, 134-135.

Wright, B. D. (1996a). Comparing Rasch measurement and factor analysis. *Structural Equation Modeling, 3*, 3-24.

Wright, B. D. (1996b). Local dependence, correlation, and principal components. *Rasch Measurement Transactions, 10*, 509-511.

Wright, B. D., and Linacre, J. M. (1994). Reasonable mean-square fit values. *Rasch Measurement Transactions, 8*, 370.

Wright, B. D., and Masters, G. N. (1982). *Rating scale analysis*. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). *Best test design*. Chicago: MESA Press.

*Rating Scale Category Effectiveness*

Andrich, D. (1996). Category ordering and their utility. *Rasch Measurement Transactions, 9*, 465-466.

Andrich, D. (1998). Thresholds, steps, and rating scale conceptualization. *Rasch Measurement Transactions, 12*, 648-649.

Linacre, J. M. (1991). Step disordering and Thurstone thresholds. *Rasch Measurement Transactions, 5*, 171.

Linacre, J. M. (1999). Investigating rating scale category utility. *Journal of Outcome Measurement, 3*, 102-122.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. *Journal of Applied Measurement, 3*, 86-106.

Stone, M., and Wright, B. D. (1994). Maximizing rating scale information. *Rasch Measurement Transactions, 8*, 386.

Wright, B. D., and Linacre, J. M. (1992). Disordered steps? *Rasch Measurement Transactions, 6*, 225.

Wright, B. D., and Masters, G. N. (1982). *Rating scale analysis*. Chicago: MESA Press.

Zhu, W., Updyke, W. F., and Lewandowski, C. (1997). Post-hoc Rasch analysis of optimal categorization of an ordered-response scale. *Journal of Outcome Measurement, 1*, 286-304.

*Reliability and Validity*

Fisher, Jr., W. P. (1994). The Rasch debate: Validity and revolution in educational measurement. In M. Wilson (Ed.), *Objective measurement: Theory into practice*, Vol. 2 (pp. 36-72). Norwood: Ablex Publishing Corporation.

Fisher, Jr., W. P. (1997). Is content validity valid? *Rasch Measurement Transactions, 11*, 548.

Linacre, J. M. (1993). Rasch-based generalizability theory. *Rasch Measurement Transactions, 7*, 283-284.

Linacre, J. M. (1995). Reliability and separation nomograms. *Rasch Measurement Transactions, 9*, 421.

Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? *Rasch Measurement Transactions, 9*, 455-456.

Linacre, J. M. (1999). Relating Cronbach and Rasch reliabilities. *Rasch Measurement Transactions, 13*, 696.

Smith, Jr., E. V. (2001). Reliability of measures and validity of measure interpretation: A Rasch measurement perspective. *Journal of Applied Measurement, 2*, 281-311.

Wright, B. D. (1995). Which standard error? *Rasch Measurement Transactions, 9*, 436-437.

Wright, B. D. (1996). Reliability and separation. *Rasch Measurement Transactions, 9*, 472.

Wright, B. D. (1998). Interpreting reliabilities. *Rasch Measurement Transactions, 11*, 602.

Wright, B. D., and Masters, G. N. (1982). *Rating scale analysis*. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). *Best test design*. Chicago: MESA Press.

*Metric Development and Score Reporting*

Linacre, J. M. (1997). Instantaneous measurement and diagnosis. In R.M. Smith (Ed.), *Physical Medicine and Rehabilitation State of the Art Reviews*, Vol. 11: Outcome Measurement (pp. 315-324). Philadelphia: Hanley & Belfus, Inc.

Ludlow, L. H., and Haley, S. M. (1995). Rasch model logits: Interpretation, use, and transformations. *Educational and Psychological Measurement, 55*, 967-975.

Smith, Jr., E. V. (2000). Metric development and score reporting in Rasch measurement. *Journal of Applied Measurement, 1*, 303-326.

Smith, R. M. (1991). *IPARM: Item and person analysis with the Rasch model*. Chicago: MESA Press.

Smith, R. M. (1992). *Applications of Rasch measurement*. Chicago: MESA Press.

Smith, R. M. (1994). Person response maps for rating scales. *Rasch Measurement Transactions, 8*, 372-373.

Stanek, J., and Lopez, W. (1996). Explaining variables. *Rasch Measurement Transactions, 10*, 518-519.

Woodcock, R. W. (1999). What can Rasch-based scores convey about a person's test performance? In S. E. Embretson and S. L. Hershberger (Eds.), *The new rules of measurement: What every psychologist and educator should know*. Mahwah, NJ: Erlbaum.

Wright, B. D., Mead, R. J., and Ludlow, L. H. (1980). *Kidmap: Research memorandum number 29*. Chicago: MESA Press.

Wright, B. D., and Stone, M. H. (1979). *Best test design*. Chicago: MESA Press.

Zhu, W. (1995). Communicating measurement. *Rasch Measurement Transactions, 9*, 437-438.