A recent paper in Medical Care¹ has raised considerable interest due to its reporting of disordered thresholds in data collected routinely in different countries from patients who have experienced a stroke.
A Rasch-Andrich threshold is the point at which two adjacent categories are equally probable, and it defines the boundary between categories, in our case those of the polytomous Functional Independence Measure (FIM®)². Where thresholds are ordered, a person location between category boundaries ensures that the probability of a response in that category is larger than that of any other single category³. However, if thresholds are disordered, a person location between category boundaries will not give that category the greatest probability of being observed. In our work, for example, being observed in a higher category is taken to imply higher independence. But with disordered thresholds, as in the Figure above, an observation of "2" is more likely than an observation of "1" even for a person with a measure of -0.4 (vertical arrow), which is low even for "1". Thus, from this perspective, disordering of thresholds is a violation of the measurement construct in that there is discordance between the category probabilities and the underlying trait.
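The relationship between threshold order and the most probable category can be sketched numerically. The following is a minimal illustration under the partial credit model, not the analysis used in the paper: the two threshold sets are hypothetical, the function names are ours, and categories are numbered from 0 rather than from 1 as on the FIM.

```python
import math

def category_probabilities(theta, thresholds):
    """Partial credit model probabilities for categories 0..m at person location theta."""
    # Category x has exponent sum_{k<=x} (theta - tau_k); category 0 has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def modal_categories(thresholds, lo=-6.0, hi=6.0, steps=1201):
    """Categories that are the most probable response at some point on a grid of locations."""
    seen = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        probs = category_probabilities(theta, thresholds)
        seen.add(probs.index(max(probs)))
    return seen

# Hypothetical threshold sets, chosen only for illustration.
ordered = [-1.0, 0.0, 1.0]
disordered = [0.5, -0.5, 1.0]  # first two thresholds reversed

# At a threshold, the two adjacent categories are equally probable.
at_threshold = category_probabilities(-1.0, ordered)

print(sorted(modal_categories(ordered)))     # [0, 1, 2, 3]
print(sorted(modal_categories(disordered)))  # [0, 2, 3]
```

With the ordered set every category is the most probable response somewhere along the trait. With the reversed pair, category 1 is never the most probable response at any location.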
Table 1. Threshold estimates for FIM motor items.
|Dressing Upper Body||0.03||-1.09||-0.6||-0.28||0.04||0.54||1.39|
|Dressing Lower Body||0.47||-1.14||-0.46||-0.20||-0.02||0.41||1.41|
|*Walk / Wheelchair||0.24||0.16||-0.15||-0.72||-0.96||-0.27||1.95|
What does disordering look like in practice, and when does it occur? Table 1 gives the estimates for the thresholds taken from the data of 895 stroke patients which formed the basis of the Medical Care paper. The analysis used the unrestricted (partial credit) model. A likelihood ratio test (p<.001) showed that the rating scale model was less suitable. The asterisked items have disordered thresholds, with the "stairs" item displaying a particularly bizarre pattern. At this stage most items misfit the model with an overall standardized mean-square item fit with mean of -0.360 and SD of 4.462, where a mean of 0 and SD of 1 is expected.
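A check for disordering in the rows of Table 1 can be sketched as follows. This assumes that the first value in each row is the item location and that the remaining six values are the thresholds between the seven FIM categories (the six values sum to approximately zero in each row, which is consistent with that reading, but the column headings are not shown here). The function name is hypothetical.

```python
# Rows copied from Table 1. Assumption: value 1 is the item location,
# values 2-7 are the six thresholds between the seven FIM categories.
table1 = {
    "Dressing Upper Body": [0.03, -1.09, -0.60, -0.28, 0.04, 0.54, 1.39],
    "Dressing Lower Body": [0.47, -1.14, -0.46, -0.20, -0.02, 0.41, 1.41],
    "Walk / Wheelchair": [0.24, 0.16, -0.15, -0.72, -0.96, -0.27, 1.95],
}

def disordered_pairs(thresholds):
    """Return 1-based (k, k+1) threshold pairs where threshold k+1 is not above threshold k."""
    return [
        (k + 1, k + 2)
        for k in range(len(thresholds) - 1)
        if thresholds[k + 1] <= thresholds[k]
    ]

flags = {item: disordered_pairs(row[1:]) for item, row in table1.items()}

for item, pairs in flags.items():
    status = "ordered" if not pairs else f"disordered at threshold pairs {pairs}"
    print(f"{item}: {status}")
```

Under this reading, only the asterisked Walk / Wheelchair row is flagged, with its first four thresholds running in descending order.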
Figure 1. Category probability curves for Bathing item.
The items fall into three types with respect to their thresholds: those that are ordered; those that have one or two thresholds disordered; and those where many of the thresholds are disordered. Figure 1 shows how categories should work: as the trait of independence increases, so, in a monotonically increasing fashion, does the probability of affirming a higher category. This expected relationship breaks down slightly for the eating item (Figure 2), where at no point would categories one and two be the most probable response. For the bladder management item, this relationship is largely absent, and the item would appear to be working as a dichotomy (Figure 3).
Figure 2. Category probability curves for Eating item.
Figure 3. Category probability curves for the Bladder Management item.
How can this deviation from the expected pattern of response come about? An obvious place to start is the distribution of responses across the categories. In the Medical Care paper the analysis was based upon admission data. Might it be that many of the categories implying more independence had null or low frequencies? Table 2, in which disordered items are flagged, shows that this was not the case.
Table 2. Category frequencies of FIM motor items.
|Item / Category||1||2||3||4||5||6||7|
|Dressing Upper Body||159||133||86||115||104||107||118|
|Dressing Lower Body||233||170||78||102||64||86||89|
|*Walk / Wheelchair||292||69||41||50||83||181||85|
Although there is clear variation in the distribution of responses across items, all categories had sufficient numbers for estimation⁴. Note that the "grooming" item, which is disordered, has a similar distribution (but in the opposite direction) to the "bathing" item, which is ordered. Furthermore, the conditional pairwise estimation procedure employed in RUMM2020 estimates threshold parameters from all the data, not just from adjacent categories, enhancing the stability of estimates³.
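The frequency check described above can be sketched against the rows of Table 2. The cutoff of 10 observations per category is an assumption made for illustration only; the text states that all categories had sufficient numbers but does not give the criterion applied.

```python
# Category frequencies copied from Table 2 (categories 1..7).
table2 = {
    "Dressing Upper Body": [159, 133, 86, 115, 104, 107, 118],
    "Dressing Lower Body": [233, 170, 78, 102, 64, 86, 89],
    "Walk / Wheelchair": [292, 69, 41, 50, 83, 181, 85],
}

MIN_COUNT = 10  # assumed cutoff, not taken from the source

# For each item, list the categories whose frequency falls below the cutoff.
sparse = {
    item: [cat for cat, n in enumerate(counts, start=1) if n < MIN_COUNT]
    for item, counts in table2.items()
}
smallest = min(min(counts) for counts in table2.values())

print(sparse)
print("smallest category frequency:", smallest)
```

For the three rows shown, no category falls below the assumed cutoff, including those of the disordered Walk / Wheelchair item.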
Another reason for the disordering may be that different rehabilitation facilities around Europe assign values to the FIM in different ways. Certainly there are different traditions across Europe in the way in which, for example, patients are bathed within rehabilitation facilities⁵. The extent of training also varies. Two countries, Sweden and Italy, have extensive training programs, yet the data from these countries were just as disordered as elsewhere. Furthermore, ordered thresholds were not necessarily associated with the absence of Differential Item Functioning (DIF) across countries. Figure 4 shows the ICC by country for the "bathing" item, which was ordered. However, there was significant DIF for this item (F=10.22; p<0.001), suggesting that the expected category at any given level of the trait could vary by country.
Figure 4. Plot of ICC by country for "bathing" item.
The rating scale model has been used previously for analysis of the FIM⁶. Has the use of the unrestricted (partial credit) model contributed to this dilemma? Although the log-likelihood test shows a significantly worse fit for the rating scale model, when that model is used the extent of disordered thresholds is greater still. Indeed, every item is disordered under the rating scale model. Thus it would seem, in this data set at least, that the choice of model does not explain why disordered thresholds are more common than in previous reports.
Prior to seeking a solution to these problems, how does the total raw score reflect the change in category response across the items? At first sight, in Table 3, it would appear that there is an appropriate increase in raw score as each category increases, perhaps with just the exception of the walk/wheelchair item (this is taken from the SPSS file and includes extremes). Thus higher performing patients are rated in higher categories. However, exploratory post-hoc tests suggest that raw scores cannot discriminate across some categories in six of the eight disordered items, but can do so in all the ordered items.
Table 3. FIM average motor raw score for each category of each item.
|Item / Category||1||2||3||4||5||6||7|
|Dressing Upper Body||22.1||32.9||44.1||52.9||59.4||71.8||82.2|
|Dressing Lower Body||25.4||38.7||48.5||62.6||67.6||76.3||85.9|
|*Walk / Wheelchair||28.7||28.6||37.3||42.3||54.2||58.9||84.0|
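The "first sight" reading of Table 3 can be sketched as a simple monotonicity check on the category means. This only tests whether each mean exceeds the previous one; it is not the post-hoc testing referred to in the text, and the function name is hypothetical.

```python
# Mean motor raw scores copied from Table 3 (categories 1..7).
table3 = {
    "Dressing Upper Body": [22.1, 32.9, 44.1, 52.9, 59.4, 71.8, 82.2],
    "Dressing Lower Body": [25.4, 38.7, 48.5, 62.6, 67.6, 76.3, 85.9],
    "Walk / Wheelchair": [28.7, 28.6, 37.3, 42.3, 54.2, 58.9, 84.0],
}

def non_increasing_steps(means):
    """Return (category, next category) pairs where the mean raw score fails to rise."""
    return [
        (c, c + 1)
        for c in range(1, len(means))
        if means[c] <= means[c - 1]
    ]

reversals = {item: non_increasing_steps(means) for item, means in table3.items()}

for item, pairs in reversals.items():
    print(item, pairs if pairs else "strictly increasing")
```

Among the rows shown, only Walk / Wheelchair fails the check, between categories 1 and 2 (28.7 versus 28.6).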
What can be done about the apparent disordering of thresholds? In the Medical Care paper we rescored items on an individual basis to try to improve fit to the model. As thresholds are estimated with respect to all categories, not just adjacent categories, the final solution was not at all obvious from category probability curves such as those presented above. For example, the "bladder" item worked with three categories (Figure 5), while the "eating" item had to be dichotomized.
Figure 5. "Bladder" item after rescoring.
In the paper it was shown that the "eating", "bowel management" and "toileting" items had to be dichotomized; "bladder management" and "grooming" tritomized; "walk/wheelchair", "transfer tub" and "stairs" were collapsed into four categories, with the remainder working as seven-category items. The paper went on to split items for DIF by country, and came up with a working solution, effectively using the FIM motor items at the country level as an item bank, linked by five common items. The final category frequencies for the rescored items are given in Table 4 (this excludes the solution after splitting for country DIF, which makes matters much more complicated; and a couple of additional patients became extreme).
Table 4. Category frequencies after rescoring.
|Dressing Upper Body||159||133||86||115||104||107||116|
|Dressing Lower Body||233||170||78||102||64||86||87|
|*Walk / Wheelchair||292||243||181||83|
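Collapsing categories amounts to recoding the original scores and pooling their frequencies. The sketch below applies one such recoding to the Walk / Wheelchair frequencies from Table 2. The mapping is inferred from the frequencies and is an assumption, not a rescoring stated in the text.

```python
# Walk / Wheelchair frequencies for original categories 1..7 (Table 2).
walk_counts = [292, 69, 41, 50, 83, 181, 85]

# Assumed recoding: category 1 stays alone, 2-5 are merged, 6 and 7 stay separate.
recode = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 3}

# Pool the original frequencies into the four rescored categories.
collapsed = [0] * (max(recode.values()) + 1)
for category, n in enumerate(walk_counts, start=1):
    collapsed[recode[category]] += n

print(collapsed)  # [292, 243, 181, 85]
```

This reproduces the first three Walk / Wheelchair frequencies in Table 4. The top category differs (85 here, 83 in Table 4), which would be consistent with the note that a couple of additional patients became extreme after rescoring.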
The rescoring solution we found is "messy" in that some items retain all their categories, while others have a variable reduction in the number of categories. Also, as we have seen, although there is an increase in raw score across all categories for most items, there is a suggestion from post hoc tests that the raw score cannot discriminate across some categories, and these occur where thresholds are disordered. Technically, given fit to the Rasch model, items should not be collapsed further, but prior to splitting for DIF, the case can be made that these data still do not fit the model. Furthermore, there is the issue of differential fit between countries. Single-country analysis had shown different fit and different rescoring solutions. For example, the UK items "transfer bed" and "transfer toilet", which were ordered in the pooled data, were collapsed into four categories for the UK analysis (Figure 6).
Figure 6. Category structure for UK FIM motor scale after rescoring.
What about the use of different software? The results produced by a parallel run with Winsteps are substantively equivalent to those shown here, being limited to minor numerical differences.
It is our contention that scales should work adequately at admission to rehabilitation services, else they should not be used for assessment purposes, or as the basis for outcome measurement. The requirement is for invariance across time. Furthermore, the scale must be invariant across any relevant clinical subtypes if data are to be pooled for the diagnostic group; across diagnostic groups if they are to be pooled at the level of the rehabilitation unit, and across countries if international comparisons are to be made.
The fact that scales work in different ways across different diagnoses and countries should not be surprising given the recent insights provided by modern psychometric methods. The Medical Care paper demonstrated that despite cultural variations, a solution could be found that facilitated the pooling of data. Should we then be so worried about the lack of invariance for some items given we now have the technology to accommodate such variations?
The issue of the disordered thresholds may warrant further effort on the part of FIM users. This involves two aspects: the fundamental aspect of whether or not disordered thresholds are to be taken seriously as a violation of measurement; and the practical aspect of achieving a solution which is common across countries (and the same applies to diagnoses, or to centers within a country). This will require some clear thinking as to what such disordering means for clinical practice, for outcome measurement, and for the pooling of data of the kind undertaken at Buffalo. Given the extent of the FIM database held in Buffalo, this is at least one outcome scale where the users have the capacity to investigate these matters thoroughly, from a well-established database.
Alan Tennant BA, PhD. Professor of Rehabilitation Studies, The University of Leeds, UK.
References:
1. Tennant A, Penta M, Tesio L, Grimby G, Thonnard J-L, Slade A, Lawton G, Simone A, Carter J, Lundgren-Nilsson A, Tripolski M, Ring H, Biering-Sorensen F, Marincek C, Burger H, Phillips S. Assessing and adjusting for cross cultural validity of impairment and activity limitation scales through Differential Item Functioning within the framework of the Rasch model: the Pro-ESOR project. Medical Care 2004; 42 (Suppl 1): 37-48.
2. Keith RA, Granger CV, Hamilton BB, Sherwin FS. The functional independence measure: A new tool for rehabilitation. In: Eisenberg MG, Grzesiak RC (Eds): Advances in Clinical Rehabilitation. New York, Springer Publishing Co; Vol. 1. p. 6-18; 1987.
3. Andrich D, Luo G. Conditional pairwise estimation in the Rasch model for ordered responses using principal components. J Applied Measurement 2003; 4:205-221.
4. Linacre JM. Investigating rating scale category utility. J Outcome Measurement 1999; 3:103-122.
5. Kucukdeveci AA, Yavuzer G, Ehan AH, Sonel B, Tennant A. Adaptation of the Functional Independence Measure for use in Turkey. Clinical Rehabil 2001; 15:311-319.
6. Grimby G, Andraon E, Holmgren E, Wright B, Linacre JM, Sundh V. Structure of a combination of Functional Independence Measure and Instrumental Activity Measure items in community-living persons: a study of individuals with cerebral palsy and spina bifida. Arch Phys Med Rehabil 1996; 77(11): 1109-14.
Disordered Thresholds: An example from the Functional Independence Measure, Tennant A. Rasch Measurement Transactions, 2004, 17:4 p.945-948
The URL of this page is www.rasch.org/rmt/rmt174a.htm