Old Rasch Forum - Rasch on the Run: 2013 January-June


82. Reading sub-dimension in a Listening test

GiantCorn March 27th, 2013, 8:08am: Hi Mike,

Hoping you can advise me (yet again) on the following: -

I'm interested in measuring the size of a potential "Reading" sub-dimension in English listening tests when the instructions/items are in the L2 (English) as opposed to the L1 (mother tongue). My hypothesis is that L2-format listening tests are a less valid measure of "listening ability/comprehension" because students have to read in English (not necessarily a bad thing if we are prepared to call the test something other than a purely listening test!). However, as a mental exercise: -

I'd like to know how large the sub-dimension is so as to gauge whether it would be better to make test items in the L1 (mother tongue) if scores are to be used for measurement/gains purposes (of a specifically - Listening Construct).

I'd like to know whether certain item types are more likely to display this "L2 reading" sub-dimension, so as to comment on how this could influence test design.

I'd like to know how the difficulty of the item is affected by any possible reading subdimension. Does a "hard item" suddenly become much easier when the item and distractors are in the L1 and if yes by how much?

I'd like to know whether a comparison of two test forms (one in L1, one in L2) shows that students might need more reading time when items are in L2.

So my initial thought is to have a single bank of 30 or 40 items (some old TOEFL items, say), then get the instructions and items translated into L1, and get that checked, re-translated etc. for errors.

Then create 2 "treatment groups" (probably 100 plus for each group) and allocate an even spread of English-ability students to each group (based on entry test scores). Administer the L1 test to group 1 and the L2 version to group 2.

So with this background in mind, what would be the best analysis approach? Create 2 separate control files, run R-PCA on each, and compare the contrast values and dimensionality plots of the 2 groups' results? DIF?

Any thoughts would be most welcome. Thank Mike!


Mike.Linacre: Thank you for your question, GC.

Your approach is good.

A. Randomly split your students into two groups: 1 and 2.
B. Administer group 1 with L1 test. Administer group 2 with L2 version.
C. Rasch analyze both groups together with a code in each student's data record, "1" or "2"
D. Compute the mean ability of group 1 = M1, and mean ability of group 2 = M2
E. Effect = M2 - M1.
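
A minimal Python sketch of steps D and E, assuming the person measures have already been exported from the joint analysis as (group code, measure in logits) pairs. The function name `group_effect` and the sample values are made up for illustration; they are not Winsteps output.

```python
from statistics import mean

def group_effect(records):
    """Return (M1, M2, M2 - M1) from a list of (group_code, logit_measure) pairs.

    Assumes group codes are the strings "1" and "2", as in step C.
    """
    m1 = mean(m for g, m in records if g == "1")
    m2 = mean(m for g, m in records if g == "2")
    return m1, m2, m2 - m1

# Hypothetical person measures in logits (illustration only).
records = [
    ("1", 0.50), ("1", 0.20), ("1", 0.35),
    ("2", 0.10), ("2", -0.05), ("2", 0.19),
]
m1, m2, effect = group_effect(records)
print(f"M1={m1:.2f}  M2={m2:.2f}  effect={effect:.2f} logits")
```

The difference is left in logits, with no rescaling.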

GiantCorn: Brilliant! glad to know my thinking is sound.

Just a quick check regarding D:

"Mean ability" would be based on Rasch-adjusted person measures, correct? Presumably I would also need to re-scale from logits.

Mike.Linacre: GC, please do the calculation in logits. There is no need to rescale.

harmony: This is probably obvious, but it would be a good idea to give the two groups some measure of listening ability common to both so that you know that they are not at different levels of ability, which would result in a difference between the native language instructions test and L2 instructions test.

GiantCorn: Hi Harmony,

Thank you for your input. Actually, yes, I had already considered this. The placement test we use has a listening section. In my OP I mentioned this will be used to control across groups for this variance.

But thank you for calling this out, as it might not have been clear from my choice of words (placement test)! If you have any further input, please feel free to advise.


Mike.Linacre: Good, GC. It looks like you will succeed :-)

GiantCorn: Mike,

Im at the analysis stage of this study finally. Would you mind checking that what I propose below is logical?

We created 40 MCQ test items and distractors in English, and translated them to Japanese. I followed your procedures above for building the data set. The items were 'bespoke' and not from any established popular test. I therefore must prove that the items tap into the construct of listening from my literature review, and that the person measures (derived from such a "bespoke" test) are reliable and do not distort measurement. So I propose a section to report the following: -

A) variable map and how the items fit the sample (the top 60 students were not challenged by the test, for example)
B) item outfit MNSQs/ZSTD - I will eliminate any items >1.5 (the highest MNSQ was 1.45, so I don't have to eliminate any)
C) ICC - I will comment on any ICCs that suggest conflict with the latent variable (I have a few items with the dreaded blue V)
D) R-PCA to argue unidimensionality and hence construct (content?) validity (the 1st contrast is 2.3 eigenvalue units, for example)

Is this sufficient in a report to convince readers that the items are "valid" measures?

As always Mike, thank you for your guidance!


Mike.Linacre: GC, content validity is determined by the intended content of the test, not psychometrically. For instance, if question 2 is "2+2=?", then it doesn't matter how good all its statistics are, it is not a "reading" item. It fails content validity.

When we have content validity, then the items should have an expected order along the latent variable from easy to hard. If the empirical difficulty order matches the expected order then we have construct validity. If the expected order is not obvious, then you may need to get some experts to order the items (but do not tell them the test statistics).

Predictive validity: are there external indicators, such as Grade Level, that give us a rough expected order to the students? If so, we expect a high correlation between the external indicators and the ability measures. This confirms "predictive validity".

GiantCorn: Hi Mike,

Thank you for the continued prodding in the right direction. OK, let me see if I understand this clearly: -


1) I want to investigate the effect size that L1 test items and distractors has over L2 items and distractors. kind of trait purity study.

2) I'd also like, if possible, to examine how the level of the student may modify the effect size. I would expect variance in scores to be greater at lower levels of L2 ability, for example.


However, I created items from scratch due to copyright worries and institutional constraint. Therefore, I need to justify those items to my readers enough that they accept that the measures represent the finite aspect of listening comprehension that we define. According to your advice above I should: -

1) Argue content validity. We built the items following advice from a leading expert in listening and the test was accepted by an assessment panel at our institution as meeting the goals of the courses we provide.

2) If readers accept 1, then provide evidence of construct validity. How about looking at the empirical difficulty order vs an expert ranking of the test items into 3 levels (say, easy, medium, difficult)? Split the variable map into thirds and examine the degree to which the predictions and data align?

3) If readers then accept 2, provide predictive validity. We have an entrance test with a listening component, which is from a well-known international test. The students are then streamed into classes of 4 levels of ability based on their score. They remain in these classes for 2 years. Could I correlate our test with this external score or their general streamed level?

Don't pull your punches Mike, I'm learning from this.

Thanks X 1 million.


Mike.Linacre: Yes, that is the idea.

"I need to justify those items to my readers enough"

This sounds like a case for "Virtual Equating" - https://www.rasch.org/rmt/rmt193a.htm

It may be easier for you to administer your items along with established items to a sample, and then demonstrate their equivalence, than to recruit experts.

In practice, the validity investigations proceed in parallel. Predictive validity is the easiest if you have collected demographics on your sample as you went along. Content validity may be the most difficult if there are special-interest groups pushing for inclusion of items addressing their pet topics.

GiantCorn: Hi Mike,

OK I did what you suggested. I also conducted a Pearson product-moment correlation on raw scores of this test vs a well known listening section placement test they took. Results indicate strong positive correlation between scores on the two tests r=0.56, p=0.0. Hopefully that along with construct and fit validity sections will suffice. Thank you.
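
For readers who want to reproduce this kind of check, here is a small sketch of a Pearson product-moment correlation written with the Python standard library only. The function name `pearson_r` is illustrative; the inputs would be the two sets of scores for the same students, in the same order.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Calling `pearson_r(bespoke_scores, placement_scores)` with the two raw-score lists would give the r value; the significance test is not included in this sketch.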

I'm onto the main analysis now. So far: -

1) L1 test Mean (0.35 logits) - L2 test Mean (0.08) = 0.27 logits. Two-tail indep samples T-test (3.65) p=0.000291. so statistically significant at p<0.01.

Before, you mentioned Effect size = M1 - M2, NOT Effect size = (M1 - M2)/pooled SD. Right?

GiantCorn: Hi again Mike,

Now I'm thinking about the best way to explain about the effect size above. I think many readers will not be familiar with rasch and logits. Many similar papers explain effect size in terms of R2 (% of variance explained). But I have been reading that these indices have their limitations too in terms of conveying the strength.

I remember reading in your literature about 1 logit = 1 year of study. Do you think explaining the effect size above (0.27 logits) in these terms would make sense? Or should I perhaps stick to convention? Any thoughts?

Thanks, GC

Mike.Linacre: Gc, please do report effect size = M1-M2/pooled SD if that makes sense in your application. For instance, it makes sense if we are trying to move a group normatively within the population.

The 1 logit = 1 year may apply to your situation. You would need to verify it by examining your own measures.

% of variance explained is useful if we are comparing descriptive models, but this does not apply where we are trying to construct additive measures. Often we want to maximize the unexplained variance (as in CAT tests), because maximizing explained variance leads to tests that are much too easy or much too difficult for the target population.
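
The standardized effect size discussed above, (M1 - M2)/pooled SD, can be sketched as follows. This assumes the usual pooled standard deviation weighted by degrees of freedom and a Student (equal-variance) t statistic; the function names are illustrative, and the inputs would be the two groups' person measures in logits.

```python
import math
from statistics import mean, variance

def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return math.sqrt(pooled_var)

def standardized_effect(a, b):
    """(M1 - M2) / pooled SD, where a is group 1 and b is group 2."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)

def student_t(a, b):
    """Independent-samples t statistic, assuming equal variances."""
    se = pooled_sd(a, b) * math.sqrt(1 / len(a) + 1 / len(b))
    return (mean(a) - mean(b)) / se
```

The raw difference M1 - M2 stays in logits; dividing by the pooled SD expresses the same difference relative to the spread of the sample.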

138. Pivot Anchoring

uve May 16th, 2013, 8:50pm: Mike,
I'm hoping you can provide some guidance to a study I've done that has attempted to replicate the work of Dr. Rita Bode, specifically her work with pivot anchoring: "Partial Credit Model and Pivot Anchoring", Introduction to Rasch Measurement, 279-295. I was presented a school climate survey consisting of 33 items and 36 respondents. Item #24 had a -.11 measure correlation. I determined the source came from person #20 who also had the 2nd highest person misfit. Deleting this person rectified the correlation. All items meet your fit guidelines. Items 1-32 follow the classic Likert format coded 1-4 but also offered an Undecided option. I coded this zero but only allowed Winsteps to calibrate the 1-4 options. Coding zero allows me to distinguish missing from undecided. Item 33 asked the respondents to rate the school overall with a letter grade A-F. No one chose F. This item was scored 1-5 with A being 5. I have included the control file with data.

The problem with the survey is that it had 10 negatively worded statements. I simply reverse scored them, but after reading Dr. Bode's paper I began to wonder if it wouldn't be prudent to pivot anchor the two groups. In case you're not familiar, her position is that it may not be equally difficult to report the presence of a positive trait as the absence of a negative one. So in addition to reverse scoring negative items I used the PCM to analyze the positive and negatively worded items, then plotted the thresholds and created an SFILE, which I've also included. I allowed item 33 to have its own threshold set but it is not part of the pivot process. The plot of the two group thresholds is also attached. If I understand Dr. Bode's work, a pivot point needs to be chosen which represents the point at which there is at least the lowest level of endorsement of the trait (Agree, preferably Strongly Agree) as represented by the positive items and no more than the lowest level of endorsement for negative items (Disagree, preferably Strongly Disagree). These two thresholds should have a very similar location, but if not it may be evidence that the categories function differently for the two item groups. Even though Agree essentially has the same meaning for positive items as Disagree now has for reverse scored negative items, the fact that these two points are not close together suggests there are two scales. Therefore, a common threshold location for this point should be chosen (usually zero) and both shifted to it.

"In essence the calibration produces separate estimates of item difficulty for each step. By selecting the item difficulty at a step other than the average step as the estimate of an item's difficulty level, it rearranges the item in terms of its difficulty level relative to the other items." Page 292

"Because pivot anchoring uses the estimate of item difficulty from a step other than the average step for illustrating the item hierarchy, it doesn't change the average calibration value for the set of item and therefore does not change the person measure." Page 293

Therefore, after calibration I generated an anchor file and added 1.46 to all the thresholds for the positive items, and .38 to all the thresholds of the negative items so the point of commonality would be centered at zero logits. The thresholds for item 33 were not anchored or shifted. The problem is that when I plot the person measures before and after anchoring, there is significant shift as seen in the plot also included. I deleted item 33 but this had no noticeable effect. I then realized that the initial calibration of the thresholds included person #20. I started over, but there were no noticeable changes in the thresholds, so I continued with my original anchor file. Why is the person measure shift so high? According to Bode, this shouldn't have happened.

I'm wondering if I anchored incorrectly, or is there something inherently wrong with applying this process to my data. I'd greatly appreciate any help you could provide.

Mike.Linacre: Uve, Rita Bode's procedure works well, and I have used it many times. The idea is to align the item difficulties so that the cut-point for each item (equivalent to the dichotomous item difficulty) is located at the reported item difficulty on the latent variable. So, we do an arithmetic sleight of hand. In the original analysis, we look at a Table such as Table 2.2 - https://www.winsteps.com/winman/index.htm?table2_2.htm - and see where along the line (row) for each item is the substantive cut-point (pass-fail point, benchmark, etc.) We note down its measure value on the latent variable (x-axis of Table 2.2).

Then, for each item, we compare the measure value with its reported item difficulty. The difference is the amount we need to shift the Andrich thresholds for that item. Since this can become confusing, it is usually easiest to:

1) output the SFILE= from the original analysis to Excel.
2) add (item difficulty - cut-point) to the threshold values for each item. Example: we want to subtract 1 logit from the item difficulty to move the item difficulty to the cut-point on the latent variable. We add 1 logit to all the thresholds for an item, then Winsteps will subtract 1 logit from the item's difficulty.
3) Copy-and-paste the Excel SFILE= values into the Winsteps control file between SAFILE=* and *
4) Since each item now has different threshold values: ISGROUPS=0
5) This procedure should make no change to the person measures.

uve: Mike

Worked great! Thanks so much.

Another question: without anchoring, a respondent who has a raw score of 72 is reported with a measure of .18, but in the score file a raw score of 72 equates to a measure of -1.23. Why would the two be so different?

Mike.Linacre: Uve, the Scorefile= and Table 20 are computed for respondents who respond to every item. Your respondent probably has many missing responses.

uve: Thanks again. Yes, this was the case.

uve: Mike,

Is there a specific location where SAFILE commands should be placed in the control file? I do not get the same item calibrations when I type the file name into Extra Specifications as when I include the SFILE data in the control file.

Also: ISGROUPS=0 in the final run preserves the item ordering and person measures, but in my case greatly expands the item measures, which is obviously more noticeable at the extremes. I also lose the ability to report how the two item groups function. When I keep the initial rating scale using the two groups (plus the third group for item #33), I can report the functioning of the two primary groups using Table 3.2 with what I feel is better clarity, the item measures seem more reasonable in spread though they are ordered slightly differently, and the person measures are only slightly shifted (.18 logits). I have no solid theoretical reasoning behind this latter method, only that intuitively it makes more sense given my purpose. Am I making a mistake in your opinion?

Mike.Linacre: Uve, if you type SAFILE= at Extra Specifications, please be sure to use the specified one-line format, see https://www.winsteps.com/winman/index.htm?extraspecifications.htm Example 4.
Extra Specifications?
SAFILE=* 23,1,0.5 17,1,2.3 *

ISGROUPS=0 should make no difference to the estimated dichotomous measures. The only difference should be in their reporting. As a test, please try ISGROUPS=0 with a dichotomous file. The item difficulty estimates for non-extreme items should not change.
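
The one-line Extra Specifications format shown above can be generated from a list of anchor values. This is only a string-formatting sketch that reproduces the example in this post; `safile_one_line` is a hypothetical helper, not part of Winsteps.

```python
def safile_one_line(anchors):
    """Build 'SAFILE=* item,category,value ... *' on a single line.

    anchors: iterable of (item_number, category, threshold_value) tuples.
    """
    parts = " ".join(f"{item},{category},{value}" for item, category, value in anchors)
    return f"SAFILE=* {parts} *"

# Reproduces the example line from the post above.
print(safile_one_line([(23, 1, 0.5), (17, 1, 2.3)]))
# SAFILE=* 23,1,0.5 17,1,2.3 *
```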

uve: Mike,

I've attached the output from the three control file runs. The first is merely the original file, the second is using the SAFILE with the same groups and the third is with no groups. Looking at item 15, measure is 3.67, then this jumps to 4.73 once the SAFILE is used with the same 3 groups, then to 7.34 if no groups are used.

In the last case, person measures are identical. In the second case, person measures are adjusted slightly by .18.

Not sure what's wrong.

Mike.Linacre: My bad, Uve ...

My apologies ... I thought we were discussing dichotomous data. Your data are polytomous, and already implement GROUPS= for different item types.

Much of my advice above is irrelevant.

Here is the approach:
1) from your original analysis, with your GROUPS= and no SAFILE=, output an SFILE=

2) build an SAFILE=

a) for each item, use the SFILE= value for its group

b) add the pivot anchor value to the SFILE= value

c) include the new set of values for the item in the SAFILE=. There must be entries in SAFILE= for every item. Use the SFILE= values directly if there is no change
For instance: SFILE= for the group with item 1:
1 0 .00
1 1 -.85
1 2 .85
We want to add one logit for item 1, two logits for item 2, no change for item 3
1 0 1.00 ; this is a placeholder, but is convenient to remind us of the pivot value
1 1 0.15
1 2 1.85
2 0 2.00
2 1 1.15
2 2 2.85
3 0 0.00
3 1 -.85
3 2 0.85

3) do the analysis again with ISGROUPS=0 and SAFILE=* ... *

4) The person measures will shift by the average of the pivot values.
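
Step 2 above (add each item's pivot value to its group's SFILE= thresholds) can be sketched in Python. The helper `build_safile` and its input layout are assumptions for illustration; the lines it returns use the "item category value" layout of the example and would be pasted between SAFILE=* and *.

```python
def build_safile(group_thresholds, pivots):
    """Return SAFILE lines: one 'item category value' line per item and category.

    group_thresholds: list of (category, value) pairs from the group's SFILE=.
    pivots: dict mapping item number -> logits to add (0.0 for no change).
    Every item needs an entry, so items without a shift get pivot 0.0.
    """
    lines = []
    for item in sorted(pivots):
        for category, value in group_thresholds:
            # The bottom category (0) carries the pivot value as a reminder.
            lines.append(f"{item} {category} {value + pivots[item]:.2f}")
    return lines

# SFILE= values for the group, and the shifts from the example:
# one logit for item 1, two logits for item 2, no change for item 3.
sfile_group = [(0, 0.00), (1, -0.85), (2, 0.85)]
pivots = {1: 1.0, 2: 2.0, 3: 0.0}
for line in build_safile(sfile_group, pivots):
    print(line)
```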

uve: Mike,

If I understand your instructions, it sounds like that is exactly what I did previously. For example, item 1 has a measure of -1.36. Below is the SFILE threshold data from the original run of the data with the 2 item groups plus item 33 as its own third group.

1 1 .00
1 2 -1.60
1 3 -1.46
1 4 3.05

Since I'm using zero as the pivot point, I added 1.36 to all the thresholds for item 1 and got the following:

1 1 0
1 2 -0.24
1 3 -0.1
1 4 4.41

I repeated this process by adding or subtracting (whichever direction leads to zero) each item's measure to/from its thresholds and this became the source for the SAFILE.

When I ran the control file again with the SAFILE and ISGROUPS=0 in the control file, item 1 changed to -2.72 but there were no changes to person measures. Oddly, the item ordering was identical to the original run. When I kept the original 3 groups in the control file and ran it again with the SAFILE, item 1 changed much less to -1.86 but there was a small shift in person measures of about .18. However, there was a slight reordering of some items, which as I understand it is what we want. After pivot anchoring, items should rearrange themselves slightly into a more logical order.

In Dr. Bode's paper, there is a scatterplot which does show a slight shift in person measures along with a slight reordering of items after anchoring, all of which she does mention. I'm sorry if there is something very obvious here I'm missing. The control file, original SFILE and converted SFILE along with the measures from the different runs are in the attachment. Thanks for your patience.

Mike.Linacre: Yes, this does need investigating, Uve.

Your attachment is not present, so please email it directly to me: mike \-/ winsteps.com

uve: Mike,

Thanks for taking the time to look into this further. I have the file on a different computer not readily available at the moment, so I will send it off later. However, you can access it now. It's available about four posts up. :)

Mike.Linacre: OK, Uve. My apologies for getting adding and subtracting reversed.

Here is the computation:

Old item difficulty + old average thresholds (excluding bottom "0")
= Provisional New item difficulty + new average thresholds (excluding bottom "0")

If the analysis is unanchored, Winsteps will maintain the average difficulty of the items:
New item difficulty = Provisional New item difficulty - Average(Provisional New item difficulty) + Average(Old item difficulty)

So, in your example,
-1.36 + 0.00 = ? + 1.36
? = -2.72 - as you report.

Person measures with complete response strings will usually have very small or no changes.
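
The computation above can be checked with a short Python sketch. The function names are illustrative; the thresholds passed in exclude the bottom-category "0" placeholder, as stated in the equation.

```python
from statistics import mean

def provisional_new_difficulty(old_difficulty, old_thresholds, new_thresholds):
    """Solve: old difficulty + mean(old thresholds)
            = provisional new difficulty + mean(new thresholds)."""
    return old_difficulty + mean(old_thresholds) - mean(new_thresholds)

def recenter(provisional, old_difficulties):
    """Unanchored case: keep the average item difficulty at its old value."""
    shift = mean(old_difficulties) - mean(provisional)
    return [d + shift for d in provisional]

# Item 1 from the posts above: measure -1.36, thresholds shifted by +1.36.
old_thresholds = [-1.60, -1.46, 3.05]
new_thresholds = [-0.24, -0.10, 4.41]
item1 = provisional_new_difficulty(-1.36, old_thresholds, new_thresholds)
print(f"{item1:.2f}")  # -2.72
```

The re-centering step needs the provisional difficulties of all items, so it is shown only as a function here.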

uve: Mike,

I thought I’d share one final discovery about pivoting that I feel worked best given my situation. As mentioned in my previous posts, in my attempt to align the positive and negative worded items by adding 1.46 to the positive item thresholds and adding .38 to the negative item thresholds, this did align the Disagree/Agree threshold of the two groups at zero but inflated the mean person measure by 1.12 logits.

When I followed your fixes, the mean person measure was rectified, but this unaligned the two item groups. This also had some dramatic item measure inflation effects that really did not make sense in terms of hierarchy and content. So modifying your suggestions a bit, I instead subtracted 1.12 logits from all item thresholds of both groups after adding 1.46 and .38. Essentially, I just shifted the positive item group by .34 and the negative item group by -.74 logits (1.46 - 1.12 and .38 - 1.12), which puts the Disagree/Agree point (threshold 2) for both item groups at -1.12, as can be roughly seen in the attachment (Group 1=positive, Group 2=negative).

Thus, threshold 2 for both groups is aligned, the person measures are at their original values and the item measures (not shown) are only slightly adjusted from their pre-pivoting hierarchy. This new hierarchy makes more sense, though I am always fearful of hunting down and selecting a process that makes intuitive sense (perhaps fitting the best model to the data), rather than allowing the strict measurement guidelines to control the process.

Whether valid or not, I thought you would be interested in my results. As always, I would be greatly interested in your thoughts on what I did.

Mike.Linacre: Thank you, Uve. Your approach is in line with Rasch (and scientific) philosophy. We control the data. The data do not control us. We only allow the data to dominate, as in the Michelson-Morley "speed of light" experiment, when our results cannot conform to our theory. When we allow the data to dominate, we are soon confused and overwhelmed by the prevailing randomness in the Universe. See "Fooled by Randomness" by Nassim Nicholas Taleb.

Ben Wright used to tell us about the time when he was a research assistant to two Nobel-prize-winning physicists. They would tell him to redo the physics experiments many, many times until Ben got the results they were expecting according to their theories. Their scientific theories dominated the data. Then they would report the successful (for them) experiment. All the other experiments were regarded as learning experiences, "how not to do the experiment".

142. Comparing person measures from two sets of items

helenC June 25th, 2013, 9:30pm: Hi! I'm looking at possible multidimensionality in my test, and read the following in winsteps online information:
"Ben Wright recommends that the analyst split the test into two halves, assigning the items, top vs. bottom of the first component in the residuals. Measure the persons on both halves of the test. Cross-plot the person measures. If the plot would lead you to different conclusions about the persons depending on test half, then there is a multidimensionality. If the plot is just a fuzzy straight line, then there is one, perhaps somewhat vague, dimension".
So that is what I am trying to do - but just not sure how I do that? I thought that maybe "compare statistics:scatterplot" would be an option...but getting totally confused by this instruction:
"One or both sets of statistics can be in a IFILE= or PFILE= file (red arrow). Since these files can have different formats, please check that the selected field number matches the correct field in your file by clicking on the Display button (blue arrow). This displays the file. Count across the numerical fields to your selected statistic. If your field number differs from the standard field number, please provide the correct details for your field in the selection box (orange box)."
What does it mean to 'count across the numerical fields'?
Any help - suggestions - thoughts....would be most appreciated!!
Thank you so much

Mike.Linacre: Helen, you want to cross-plot measures. These are nearly always the second field in the IFILE= or PFILE= which is what Winsteps assumes. The "counting columns" applies if you are plotting an obscure statistic (such as the expected value of a point-biserial) and you are using Field Selection in the IFILE= or PFILE=. Then we need to count across the columns of numbers in the IFILE= or PFILE= file in order to tell Winsteps which column contains the statistic we want to plot.

helenC: Hi Mike, that is great! I have now done the plot (I've tried to attach it but I am told that I have -kb remaining attachment space). It looks like a general straight line with a positive slope, with a lot of scattered points around the fit line - would you agree that this is generally showing that the two contrasts are part of the same dimension (i.e. not two separate scales)? Also, I was wondering if it would be valid to do an independent t-test of the measures to identify if the scores are significantly different? Is there an easy way to do that via Winsteps?

Mike.Linacre: Helen, in Winsteps, Table 23 is a more comprehensive investigation of dimensionality than a high-low split. We usually do high-low splits to test for invariance of the latent variable. A test for invariance across the two lists of the same items (or the same persons) would be Kendall Tau.

helenC: Hi Mike - that is great, thank you. I will have a go at understanding the Kendall Tau! Thank you so much for your fast reply! Very best wishes

helenC: Hi Mike - had a look at Table 23 (and realised that I've looked at this previously). My first contrast is higher than 2 eigenvalue units (2.7). And I can look at the wordings of items of the positive and negative contrast - but is there any other way of supporting/disproving that there is multidimensionality, rather than just subjectively looking at item content? I thought that the scatter plot or a statistical comparison of means between the two sets of items might support/disprove that? Not sure if this is making sense!! - sorry!
Also, how do I do the Kendall Tau - I looked in Table 23 and I can't see it?
Thank you again for all your help!!!

Mike.Linacre: Helen, recent versions of Table 23 show the disattenuated correlation between person measures on the upper and lower clusters of items. If a correlation is far from 1.0 then we have evidence that the two clusters of items are measuring different things.

Kendall Tau applies to the two sets of measures produced by the high-low split method, not to Table 23, but a disattenuated correlation between the two sets of measures is equally meaningful.
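
Both statistics mentioned above can be sketched in plain Python. The Kendall tau here is the simple tau-a form (tied pairs count as neither concordant nor discordant), and the disattenuated correlation uses the standard correction, the observed correlation divided by the square root of the product of the two reliabilities. Capping the result at 1.0 is an assumption of this sketch, not a statement about how Winsteps computes its table.

```python
import math

def kendall_tau_a(x, y):
    """Kendall tau-a between two equal-length lists of person measures."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def disattenuated(r, reliability_x, reliability_y):
    """Observed correlation corrected for measurement error, capped at 1.0."""
    return min(1.0, r / math.sqrt(reliability_x * reliability_y))
```

Here `x` and `y` would be the same persons' measures on the two item subsets from the high-low split, and the reliabilities would be the person reliabilities reported for each subset.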

helenC: Thank you so much for your e-mail... sorry for the slow reply.
I've had a look at the table of correlations - I'm just a bit confused about which 'item clusters' line I should be looking at. For contrast 1, if I look at item clusters '1-3' then the disattenuated correlation is 0.57, but for item clusters '1-2' it is 0.94 (which I'm assuming means less likelihood of two dimensions?).
Sorry to keep bothering you with this - just getting a bit confused!
Thank you for all your help!!!
This is the table:

Approximate relationships between the PERSON measures
PCA       ITEM      Pearson      Disattenuated  Pearson+Extr  Disattenuated+Extr
Contrast  Clusters  Correlation  Correlation    Correlation   Correlation
1         1 - 3     0.2289       0.5769         0.2844        0.7895
1         1 - 2     0.4281       0.9404         0.4840        1.0000
1         2 - 3     0.6363       1.0000         0.6783        1.0000
2         1 - 3     0.3438       1.0000         0.3822        1.0000
2         1 - 2     0.4500       1.0000         0.5118        1.0000
2         2 - 3     0.6241       1.0000         0.6183        1.0000
3         1 - 3     0.3855       1.0000         0.4182        1.0000
3         1 - 2     0.5159       1.0000         0.5563        1.0000
3         2 - 3     0.5326       1.0000         0.5494        1.0000
4         1 - 3     0.4489       1.0000         0.4732        1.0000
4         1 - 2     0.4909       0.9899         0.5736        1.0000
4         2 - 3     0.5330       1.0000         0.5432        1.0000
5         1 - 3     0.4579       1.0000         0.5168        1.0000
5         1 - 2     0.5350       1.0000         0.5889        1.0000
5         2 - 3     0.6359       1.0000         0.6869        1.0000

Mike.Linacre: Helen, that Table is roughly in descending order of expected multidimensionality. So the table is telling us that there is almost no multidimensionality in this dataset. The only contrast we need to look at more closely is between item clusters 1 and 3 of contrast 1. These are the top and bottom items in the plot of the first contrast. What is the difference between their item content? Typical differences are "physical" vs. "mental", "individual" vs. "social", "practical" vs. "theoretical", "intensity" vs. "duration".

helenC: Dear Mike - thank you so much for your reply (sorry for my slow response, I am just back from holiday).
I have looked at the item content for clusters 1 and 3 in contrast 1, and they map onto two main factors that have been found in the questionnaire in previous papers (but this was not done with Rasch analysis, rather, using traditional factor analysis). I suppose what I really want to ascertain is whether this will threaten the validity of my results. The two factors are part of the same concept (1: active behaviour, 2: risk-taking behaviour). Does Table 23 generally show that there is almost no multidimensionality, apart from contrast 1, clusters 1-3?
Sorry again to bother you with this!
Best wishes

Mike.Linacre: Helen, you ask "Does Winsteps Table 23 generally show that there is almost no multidimensionality, apart from contrast 1-3?"

Yes, all the other disattenuated correlations are close to 1.0, so the person measures are statistically equivalent across all the other item clusters within the contrasts. However the "active behavior" vs. "risk-taking" contrast is large. This confirms the conventional factor analysis. So the relative count of items for each behavior skews the overall person measures towards one behavior or the other. You probably want the number of items for each behavior to be the same, so that they balance out.

helenC: Dear Mike, that is great, thank you so much for all your help with this - you are very kind - and there is no way I would have worked this through without you! So thank you so much!!! Helen

151. Extremely low person location

Nettosky June 25th, 2013, 10:22am: Hello all,
I'm a student working on my dissertation and taking my first steps with Rasch measurement.
I am noticing some "inconsistencies" while working with my data. I have achieved complete item fit, a satisfactory item-trait interaction and a high PSI after adjusting for local dependency and item thresholds. However, I can't figure out why the person-item distribution map is totally asymmetric: in a sample of 130 subjects, there are almost 30 subjects (the more able) who are not differentiated by any items, with a mean person location of almost -3 (SD around 1) and a mean item location of 0 (SD around 2). The problem persists even if I exclude extreme persons from the sample. Why is this happening? How could I explain this?

Nettosky: Better yet, is that normal? I would interpret it as the items being "too difficult" for my sample, although I wonder whether something went wrong, since it looks like a targeting error or something else I can't figure out.

Mike.Linacre: Thank you for your questions, Nettosky. What type of analysis are you doing? For instance, if this is a multiple-choice test, we can see something like this when the scoring key does not match the data: https://www.rasch.org/rmt/rmt54j.htm - a typical problem is that the scoring key or the data are one column out of alignment.

Nettosky: Thanks a lot for your answer, Professor Linacre. It is a multiple choice test and unfortunately the scoring key matches the data! I have used an unrestricted partial credit polytomous model. The items are correctly aligned and distributed, but the "Gaussian" person distribution (excuse my terminology, I'm a statistics newbie) is misaligned around a mean location of 3. Items seem to cover only the less able persons (the analysis concerns a multiple choice disability questionnaire based on a 5-point Likert scale) and apparently there are no items discriminating the more able. Only a few persons are located above 0 in the person-item map.

Mike.Linacre: Nettosky, please do a standard dichotomous analysis on your data, with only the correct response scored "1". Does this analysis make sense?

In your PCM analysis, please look at Winsteps Table 14.3. Do the option statistics for each item make sense?

Nettosky: I'm sorry, I was inaccurate in describing the questionnaire: it is a polytomous survey which gives a total disability score. I am using RUMM and not Winsteps. I have set all the items with a maximum score of 5 (unadjusted scoring is: 1=0, 2=1, 3=2, 4=3, 5=4) and modified the thresholds, since the original scoring structure produces disordered thresholds. The problem is that the more I adjust the thresholds (for example a global rescore for all the items of 1=0, 2=1, 3=1, 4=1, 5=2), the more the person location goes up, although I get acceptable values of PSI and item-trait interaction. There is no answer key, since each item response produces a score that contributes to the total score.

Mike.Linacre: Thank you for the additional information, Nettosky.

There are disordered thresholds. This implies that some categories have relatively low frequency relative to their adjacent categories. Let's imagine a frequency pattern that produces the effects you are seeing:
1) No matter how much the categories are collapsed, thresholds are disordered
2) The more that categories are collapsed, the more that the person location goes up.

Here are category frequencies that could match that situation:
1 has 40 observations
2 has 10 observations
3 has 5 observations
4 has 2 observations
5 has 80 observations

If you see a pattern like this, then the item is really a dichotomy with some intermediate transitional categories.
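
The frequency pattern above can be checked with a rough calculation. The Python sketch below is an illustration only, not RUMM or Winsteps output. It assumes all respondents sit at one common location relative to the item, in which case the log-ratio of adjacent category frequencies equals the Rasch threshold up to a shared constant, so only the ordering of the values is meaningful. The function names and the collapsed rescoring (1 | 2-4 | 5) are hypothetical.

```python
import math

def rough_thresholds(freqs):
    """Return ln(lower / upper) for each pair of adjacent categories.

    Assumption: every respondent is at the same location, so these values
    differ from the Rasch thresholds only by one shared additive constant.
    """
    return [math.log(lo / hi) for lo, hi in zip(freqs, freqs[1:])]

def is_ordered(values):
    """True when every value is smaller than the one after it."""
    return all(a < b for a, b in zip(values, values[1:]))

# Category frequencies for categories 1..5 as listed in the post.
freqs = [40, 10, 5, 2, 80]
five_cat = rough_thresholds(freqs)
print("5 categories:", [round(x, 2) for x in five_cat], "ordered:", is_ordered(five_cat))

# Hypothetical collapse of the sparse middle categories into one category.
collapsed = [40, 10 + 5 + 2, 80]
three_cat = rough_thresholds(collapsed)
print("3 categories:", [round(x, 2) for x in three_cat], "ordered:", is_ordered(three_cat))
```

With these frequencies the rough thresholds stay disordered after collapsing, because the end categories dominate both before and after the rescoring. That is consistent with treating the item as a dichotomy.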

A parallel situation in physical science is:
1 = ice, 2 = super-cooled water, 3 = water
super-cooled water is an unstable intermediate state that only exists going from water -> ice, not from ice -> water. Depending on our purpose, we collapse super-cooled water with ice (because they are below freezing point) or collapse super-cooled water with water (because they are both liquid) or maintain three separate categories.

What is the purpose of your analysis? This decides how you collapse your categories. Don't let the data control you, Nettosky. You must control the data!

Nettosky: Thanks a lot, Professor Linacre. I have (almost) fixed the issues with the person location, although I think my room for manoeuvre is very limited due to a huge floor effect (the person location is now around -1.3, and I don't think I can do any better since I have tried almost every threshold combination). I am working on a generic disability questionnaire with a sample of patients with a specific pathology. I am trying to demonstrate the unidimensionality of the test and its fit to the Rasch model. I now have a good PSI, acceptable chi-square values on all items (even without the Bonferroni adjustment) and a non-significant chi-square for the item-trait interaction. I am confident in my analyses and I hope I am taking the last few steps. I am trying to calculate the percentage of individual t-tests outside the range ± 1.96 on the items with the highest positive and negative loadings on the first residual PC for unidimensionality testing. However, I do not understand the procedure: how many items do I have to choose for the t-test comparison for each PC? If I select only one item with a positive and one with a negative loading, RUMM stops me and tells me that I have to select items with a maximum score of at least 3. Why does that happen? Thanks again! I hope this is my last question.

Mike.Linacre: Nettosky, if the t-tests are ordered by size, then please use an appropriate statistical procedure, such as Benjamini Y. & Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57,1, 289-300.

If you are selecting an extreme t-test from a list of t-tests, then use the Bonferroni adjustment to the standard t-test.
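
The two adjustments mentioned above can be compared side by side. The following Python sketch is a minimal illustration, assuming the t-tests have already been converted to two-sided p-values. The p-values are made up for the example and the function names are hypothetical; nothing here reproduces RUMM's own procedure.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every test ranked at or below that cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

# Made-up p-values, for illustration only.
p = [0.001, 0.012, 0.025, 0.038, 0.20]
print("Bonferroni:        ", bonferroni(p))
print("Benjamini-Hochberg:", benjamini_hochberg(p))
```

Bonferroni guards against any false rejection at all, so it is the stricter of the two. Benjamini-Hochberg tolerates a controlled proportion of false discoveries and therefore usually flags more tests.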

The RUMM statistical procedure requires at least 3 t-tests because two degrees of freedom are lost in the computation.

For expert answers to RUMM-specific questions, please ask the Matilda Bay Club : http://www2.wu-wien.ac.at/marketing/mbc/mbc.html

Nettosky: Thank you!

152. Calibration versus Item Independence

uve June 27th, 2013, 3:54pm: Mike,

I'm hoping you can provide me with some resources and guidance on communicating to a general, non-technical audience the difference between the assumption of item independence and the dependence of calibration on which items are included. In other words, how is it that the choice of items used in calibration can change an item's logit value, and yet this is not the same issue as the assumption of item independence? I think for a general audience these are one and the same. I believe I can make a case for independence, but I'm having trouble coming up with a good conceptual framework or resource for explaining the difference.

Mike.Linacre: Uve, let's see what we can do:
1. Item local independence: we want the items all to be about the same latent variable. This is the dependence we want. We also want the items to be as different as possible about that variable. This is the independence we want. The combination is called "local independence."

2. The logit calibration of an item is determined in one of these ways:
a) relative to the average difficulty of the sample of items
b) relative to the average ability of the sample of persons
c) relative to a reference item, which is often the first item on the test
d) relative to a calibration standard, similar to "freezing point" or "boiling point".

Ultimately all calibrations will be based on (d). Currently (a) is most commonly used in Rasch analysis, and (b) is most commonly used in Item Response Theory.
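
Options (a) and (c) can be illustrated with a small Python sketch. It is a simplified picture under stated assumptions: the difficulties are made-up numbers, estimation error is ignored, and the function names are hypothetical. It shows that changing the set of items (or the reference item) shifts every reported logit value by a constant, while the distances between items stay the same.

```python
def center_on_item_mean(difficulties):
    """Option (a): report each difficulty relative to the mean item difficulty."""
    mean = sum(difficulties.values()) / len(difficulties)
    return {name: d - mean for name, d in difficulties.items()}

def center_on_reference(difficulties, reference):
    """Option (c): report each difficulty relative to one reference item."""
    origin = difficulties[reference]
    return {name: d - origin for name, d in difficulties.items()}

# Made-up difficulties on some arbitrary underlying scale.
three_items = {"A": -1.0, "B": 0.0, "C": 1.0}
four_items = dict(three_items, D=2.0)  # same items plus one harder item

short_form = center_on_item_mean(three_items)
long_form = center_on_item_mean(four_items)
by_reference = center_on_reference(four_items, "A")

print("A on the 3-item form:", short_form["A"])
print("A on the 4-item form:", long_form["A"])
print("B minus A, 3-item form:", short_form["B"] - short_form["A"])
print("B minus A, 4-item form:", long_form["B"] - long_form["A"])
print("B minus A, reference-item origin:", by_reference["B"] - by_reference["A"])
```

The reported value for item A changes when item D joins the calibration, but only because the zero point has moved. This is a statement about the origin of the scale, not about whether responses to one item depend on responses to another.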

153. Has connection algorithm changed or is it a bug?

ChrisMcM June 23rd, 2013, 9:30am: Hi -- I've been using FACETS 3.57 for a while, but thought it was time to get up to date, and so today bought 3.71.2.

I ran a large datafile through it which under 3.57 reported that there were two disjoint subsets (which was correct, female scenarios always used female actors, and vice-versa). The same dataset under 3.71.2 reports that there are 7465 disjoint subsets (which I don't think is anywhere near true).

Any thoughts please?



Mike.Linacre: Apologies, Chris. This definitely sounds like a bug in Facets.

Please zip up and email your Facets specification and data files to me. You can remove the Labels= section if it is confidential. mike \~/ winsteps.com

Mike.Linacre: Thanks, Chris. Bug detected - it was due to a subroutine I had omitted during testing and had forgotten to reinstate. All well now. If anyone else has this problem, please email me. Mike L.

ChrisMcM: Pleased to report that the modification has done the trick, the program is again reporting the two subsets which I know to be in the data, and when I run it on a very large file I have, it also finds the data are connected, but does it now in 45 minutes rather than the 6 and a half hours that the previous version took. So a real improvement overall.

Many thanks to Mike in particular for getting the problem sorted so well, so efficiently, and with such good humour!

best wishes


Mike.Linacre: Excellent, Chris. It is amazing how much impact minor tweaks to the connectivity algorithm can have on running time.
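
The idea of disjoint subsets can be pictured as graph connectivity. The Python sketch below is a simplified illustration using a union-find structure; it is not the algorithm Facets uses, and the element names and observations are invented. Each observation links the elements that took part in it, and elements that are never linked, directly or indirectly, fall into separate subsets.

```python
def count_subsets(observations):
    """Count groups of elements linked through shared observations (union-find)."""
    parent = {}

    def find(x):
        # Register unseen elements as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for obs in observations:
        first = obs[0]
        find(first)
        for other in obs[1:]:
            union(first, other)

    return len({find(x) for x in parent})

# Invented design: female actors only appear in scenario S1, male actors only in S2.
disjoint = [
    (("actor", "F1"), ("scenario", "S1")),
    (("actor", "F2"), ("scenario", "S1")),
    (("actor", "M1"), ("scenario", "S2")),
    (("actor", "M2"), ("scenario", "S2")),
]
print("subsets:", count_subsets(disjoint))

# One crossing observation is enough to link the two groups in this simple picture.
linked = disjoint + [(("actor", "F1"), ("scenario", "S2"))]
print("subsets after adding a link:", count_subsets(linked))
```

Union-find with path compression runs in close to linear time in the number of observations, which gives some feel for why the choice of connectivity procedure can matter so much on very large files.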

154. Collapsing rating scale

ybae3 June 20th, 2013, 1:33pm: Dear Dr.Linacre

Thank you for your advice on my earlier question about unidimensionality with dichotomous data. The 4-point rating scale worked well :)

I have another data set with a 5-point rating scale, but I had to collapse it from 5 to 4 points. After collapsing the scale, I ran a person analysis and deleted three misfitting persons. I then found that only 3 categories showed up, as below.

| 2 2 71 38| -4.43 -4.44| 1.05 1.07|| NONE |( -3.97)| 2
| 3 3 99 53| -.70 -.68| .96 .74|| -2.87 | .00 | 4
| 4 4 17 9| 2.82 2.75| .96 1.11|| 2.87 |( 3.97)| 5
OBSERVED AVERAGE is mean of measures in category. It is not a parameter estimate.

| 2 NONE |( -3.97) -INF -2.88| | 81% 74%| | 2
| 3 -2.87 .22 | .00 -2.88 2.88| -2.87 | 77% 86%| .98| 4
| 4 2.87 .34 |( 3.97) 2.88 +INF | 2.87 | 90% 58%| 1.05| 5
M->C = Does Measure imply Category?
C->M = Does Category imply Measure?

[Plot omitted: CATEGORY PROBABILITIES: MODES - Structure measures at intersections. Vertical axis: probability of response (.0 to 1.0); horizontal axis: -4 to 4. Curves are shown for categories 2, 3 and 4 only.]

Does this situation happen often?
Can I still use the 4-point rating scale?

Thank you very much for your advice in advance!


lmb2: Youlmi, your analysis is telling us that there are only three scored categories remaining.

Probably the only observations of category 1 were by the three misfitting persons.

When we omit misfitting persons, we omit both their unmodeled noise (which causes the misfit) and their modeled information. Since omitting them may have changed the structure of the data, please look again at the amount of misfit. If their mean-squares are less than 2.0, then keep them! Their information is greater than their noise.

ybae3: Thank you for your advice.
My misfitting persons have the following values:

1. outfit MNSQ=5.76
2. rpb= -.07
3. rpb= -.47

I am looking at my data again to find a better analysis. I will let you know how it goes. Thank you again

Mike.Linacre: Youlmi: that person is hugely misfitting. The negative point-biserial indicates that the person's behavior does not match the intended latent variable. These are the statistics we see when a person is guessing or using a response set. You are correct to omit this person.

155. Winsteps ICC computation

userPY June 16th, 2013, 5:31pm: Hi,

I am trying to understand how the ICC values are computed in Winsteps. I selected a few items, plugged their Rasch and step values into the formula, and computed the expected values (based on Rasch models). The numbers I came up with are different from the Winsteps ICC output. I am not sure what caused the difference.

For example, for an MC item with a Rasch value of -1.9097, using a theta value of -2.4397 (apologies for not choosing an easier number), I get an ICC value of 0.37, but the ICC value from Winsteps is 0.083.

icc = exp(-2.4397+1.9097)/[1+exp(-2.4397+1.9097)]

I used the dichotomous Rasch model for the MC items and the partial credit model for the CR items.

Any suggestions are greatly appreciated. Thanks in advance.

Mike.Linacre: UserPY, your computation is correct for the probability of success on a dichotomous item :-)

p = 0.37 = exp(-2.4397+1.9097)/[1+exp(-2.4397+1.9097)]

For an item difficulty of -1.9097 logits, the probability of 0.083 on a dichotomous item corresponds to a theta value of:
theta = item difficulty + ln (0.083 / (1-0.083)) = -1.9097 - 2.402 = -4.312 logits.
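
Both calculations in this exchange can be reproduced in a few lines. The Python sketch below uses the dichotomous Rasch formula quoted above; the function names are hypothetical and it does not reproduce how Winsteps produces its ICC output.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of success on a dichotomous item under the Rasch model."""
    return math.exp(theta - difficulty) / (1.0 + math.exp(theta - difficulty))

def theta_for_probability(p, difficulty):
    """Person measure at which the probability of success equals p."""
    return difficulty + math.log(p / (1.0 - p))

difficulty = -1.9097

# Forward direction: the poster's theta gives a probability of about 0.37.
p = rasch_probability(-2.4397, difficulty)
print("probability at theta = -2.4397:", round(p, 3))

# Reverse direction: the theta at which the probability would be 0.083.
theta = theta_for_probability(0.083, difficulty)
print("theta for p = 0.083:", round(theta, 3))
```

The two functions are inverses of each other, so feeding the second result back into the first returns 0.083.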

156. Multiple Rating Scales in Facets

uve June 12th, 2013, 11:09pm: Mike,

This is my first attempt at running Facets, and I couldn't find a similar example in the tutorials or the online help. This is a three-facet model with judges, examinees and essays. The essays are actually one or two paragraphs with a common theme. The essays were scored with the following maximum points: 10, 15, 25, 10, 15, 25.

I've attached the data and control file. The error I get is:

"Specification is: 1,Judges
Error F11 in line 11: All Model= statements must include 3 facet locations + type [+ weight]"

Also, I wanted to paste the data from the Excel file into the control file but was unable to do this in the format I see given in the examples. When I attempted to run the Build command, the message stated it was processing, but nothing appeared. There was no data file in the Edit screen for me to look at. Not sure what happened.

I've attached the Excel data file and the control file. Thanks for your help as always.

Mike.Linacre: Very close, Uve. Well done!

1. Please remove the ? at the end of the Models= lines.

2. Please add "Labels=" between Models= and 1,Judges
This is causing the error message. Facets thinks 1,Judges could be another Models= specification.

3. To copy the data from Excel:
3a. Comment out "; Data=TOYEssayResults2.xls"
3b. Add as the last line: "Data="
3c. In the Excel worksheet:
Select All (Ctrl+a)
Copy (Ctrl+c)
3d. In the Facets specification file
on the line after Data=
Paste (Ctrl+v)
Your Excel data should now appear in tab-separated format
3e. Save your revised specification file

4. Your specification file is now correct, Uve.
Analyze it with Facets.

uve: Mike,

Thanks again. All worked out well. I was shocked to see that for each of the models, except one, only three different point observations were assigned by the judges. I've attached two of the model examples.

Model = ?,?,1,R10 ; Items: Biography
|Category Counts Cum.| Avge Exp. OUTFIT| Thresholds | Measure at |PROBABLE| THURSTONE|PEAK|
| Score Used % % | Meas Meas MnSq |Measure S.E.|Category -0.5 | from |Thresholds|Prob|
| 4 0 4 8% 8%| .35 .48 .9 | |( -2.32) | low | low |100%|
| 5 | | | | | | |
| 6 | | | | | | |
| 7 1 25 50% 58%| .94 .81 1.3 | -1.19 .53| .00 -1.37| -1.19 | -1.27 | 62%|
| 8 | | | | | | |
| 9 | | | | | | |
| 10 2 21 42% 100%| 1.07 1.21 1.1 | 1.19 .30|( 2.33) 1.39| 1.19 | 1.26 |100%|

Model = ?,?,2,R15 ; Items: Philosophy
|Category Counts Cum.| Avge Exp. OUTFIT| Thresholds | Measure at |PROBABLE| THURSTONE|PEAK|
| Score Used % % | Meas Meas MnSq |Measure S.E.|Category -0.5 | from |Thresholds|Prob|
| 7 0 9 18% 18%| .26 .14 1.1 | |( -1.72) | low | low |100%|
| 8 | | | | | | |
| 9 | | | | | | |
| 10 | | | | | | |
| 11 1 18 36% 54%| .38 .45 .7 | -.40 .39| .00 -.96| -.40 | -.69 | 43%|
| 12 | | | | | | |
| 13 | | | | | | |
| 14 | | | | | | |
| 15 2 23 46% 100%| .86 .85 1.1 | .40 .31|( 1.72) .98| .40 | .68 |100%|

It seems it would be helpful to tell Facets to ignore unused points along the scale. How is this done? Also, if I wanted to collapse categories, how would this be done?

Thanks again.

Mike.Linacre: Uve, Facets has ignored unused rating scale points and collapsed the categories. The revised scoring is shown in the "Score" column.

If you do want to score the unused categories (to see what happens) then add the letter K (for Keep) at the end of each Models= specification, e.g.
Model = ?,?,1,R10K

uve: Mike,

The essays were actually given to two groups but the examinees and raters were different for the groups. My plan is to run two separate analyses. However, I'm wondering if it would be valid to combine them into one analysis and create a dummy "group" facet as a means to compare both. Or would the group anchoring process you outline be better? What do you think?

Mike.Linacre: Uve, the "group" facet would make no difference to the disjoint groups problem. It would be useful for investigating group x item interactions. So, either group-anchor the raters or group-anchor the examinees. If you have no other information, group-anchor the raters.

157. Tutorial 2 help

OrenC June 13th, 2013, 4:46pm: Good Day,

I am a Rasch amateur working through the tutorial you have on the web page... and managed to get stuck on tutorial 2 when opening the file :'(

When I am trying to input Example0.txt I am notified that there is a problem with ITEM1=0. However, when I open the actual file to edit it... ITEM1=1. I have attached screen shots of my problem.

I am sure this is something easy... but I am stuck never-the-less.

Thanks for any help,


Mike.Linacre: You are taking on a challenge, OrenC, but you will succeed :-)

In your analysis, Example0.txt starts:
; This is .....

The problem is that Ministep does not know where the TFILE=* sublist ends. We need to give Ministep that instruction too:
; This is .....

OrenC: Thank you!

It worked!

What a little * can do!


Mike.Linacre: Excellent, Oren :-)

158. controlling data file for a research

ahmetvolkan June 12th, 2013, 8:38pm: Thank you for your question, Everest.

Here is the last line of your data file:

Please compare it with the last lines in your Labels=
1,Deneyler (elements = 7)
7=Deney 7
2,Puanlayıcı (elements = 10)

Do you see that they do not agree?

Should your Labels= be this?

1,Puanlayıcı (elements = 10)
2,Deneyler (elements = 7)
7=Deney 7

ahmetvolkan: I deleted my question by mistake. The problem was that I confused the order: I had 1=Experiments, 2=Group, 3=Criterias, but it should be 1=Group, 2=Experiments, 3=Criterias. Thank you for your answer.

159. Testing for dimensionality in Winsteps

Val_T May 31st, 2013, 3:39pm: Dear all,

As a complete Rasch novice, I have a difficulty understanding Principal Component Analysis of residuals and the dimensionality output in Winsteps so I was hoping to get some advice.

I have developed a ‘quality of life’ 39-item scale for children who are visually impaired aged 10-15 years. The scale has been administered to a representative sample of c.165 children.

First, I carried out an exploratory factor analysis, which showed that all items loaded highly on the first factor. I then tested all the items in Rasch. Give or take the removal of up to 7 items, most of the scale is functioning well, with fit statistics within acceptable limits (.5 - 1.5), good measurement precision (as indicated by the person separation index and reliability) and good category ordering, with no notable DIF on age and gender. Targeting is not that great, with the difference between item and person means ranging between .8 and .95 logits (depending on how many items I remove).

To my understanding, my scale may have unidimensionality issues, but I have difficulty interpreting the meaning of the output (see below). These values don't change much at all with further removal of items (even if more stringent fit criteria of .7-1.3 are applied), but the more items I remove, the worse the targeting gets.

My specific questions are:

1. Is my scale multidimensional and which statistics ‘prove’ it?

2. How much of a problem is this, e.g. failing unidimensionality, can I still derive a summary score if all the other measurement properties are OK?

3. Are person-item targeting and dimensionality related (and how)?

4. Could my sample size of c.160 be affecting these results (it’s a rare clinical population)?

Many thanks for your help.

Kind regards,



Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                          -- Empirical --     Modeled
Total raw variance in observations     =  47.8  100.0%          100.0%
  Raw variance explained by measures   =  15.8   33.1%           33.5%
    Raw variance explained by persons  =   6.7   14.0%           14.2%
    Raw Variance explained by items    =   9.1   19.0%           19.3%
  Raw unexplained variance (total)     =  32.0   66.9%  100.0%   66.5%
    Unexplned variance in 1st contrast =   4.0    8.4%   12.5%
    Unexplned variance in 2nd contrast =   2.9    6.1%    9.2%
    Unexplned variance in 3rd contrast =   1.9    4.0%    5.9%
    Unexplned variance in 4th contrast =   1.9    3.9%    5.9%
    Unexplned variance in 5th contrast =   1.7    3.5%    5.2%


[Plot omitted: variance components on a log-scaled vertical axis (0.5% to 100%), marked T (total), U (unexplained), M (measures), I (items), P (persons) and 1-5 for the five contrasts.]

Approximate relationships between the PERSON measures
PCA       ITEM      Pearson      Disattenuated  Pearson+Extr  Disattenuated+Extr
Contrast  Clusters  Correlation  Correlation    Correlation   Correlation
1         1 - 3     0.4383       0.5600
1         1 - 2     0.6673       0.9660
1         2 - 3     0.6887       0.9033
2         1 - 3     0.4697       0.6163
2         1 - 2     0.6816       0.8383
2         2 - 3     0.4554       0.6044
3         1 - 3     0.5996       0.8573
3         1 - 2     0.7413       0.9780
3         2 - 3     0.5974       0.7727
4         1 - 3     0.6578       0.9103
4         1 - 2     0.7441       1.0000
4         2 - 3     0.7747       0.9645
5         1 - 3     0.6450       1.0000
5         1 - 2     0.7538       1.0000
5         2 - 3     0.7514       0.9307

TABLE 23.2 VQoL 39items 4th Dec 2012 - actual on ZOU361WS.TXTi May 31 15:05 2013into RUMM.s


[Plot omitted: Contrast 1 loading (vertical axis, -.5 to .7) against item measure (horizontal axis, -1 to 1), with COUNT and CLUSTER columns. From top to bottom the item letters run A, B, C, D, F, E, G (cluster 1), then H, I, J, M, K, L, N, O (cluster 2), then P, p, o, l, n, m, k, j, h, i, e, d, f, g, c, b, a (cluster 3). The horizontal positions were lost in extraction.]



| 5 1 0 | 0 0 6 | 18 160 2 2 2 1 1
| 6 0 0 | 0 3 3 | 105 658 1 1 1 1 1
| 5 1 0 | 1 0 5 | 144 1028 1 1 1 1 1
| 3 3 0 | 0 2 4 | 21 168 9 2 1 1 1
| 4 2 0 | 0 3 3 | 53 449 1 1 2 1 2
| 6 0 0 | 1 3 2 | 86 582 2 2 1 1 1
| 5 1 0 | 0 5 1 | 7 66 9 2 2 1 1
| 1 5 0 | 0 1 5 | 29 282 2 2 1 1 1
| 4 2 0 | 1 2 3 | 117 725 9 2 2 1 1
| 6 0 0 | 3 0 3 | 150 1035 1 2 1 1 1
| 0 1 5 | 5 1 0 | 125 1008 1 2 2 2 3
| 0 3 3 | 6 0 0 | 119 1002 1 1 1 1 1
| 0 2 4 | 5 1 0 | 147 1031 1 2 1 1 2
| 0 2 4 | 5 1 0 | 157 1042 1 1 1 1 2
| 0 0 6 | 4 0 2 | 72 530 1 1 2 1 2
| 0 2 4 | 4 1 1 | 19 162 1 2 1 1 1
| 0 2 4 | 3 3 0 | 98 642 1 1 2 2 4
| 1 1 4 | 4 2 0 | 104 655 9 1 1 2 4
| 0 3 3 | 4 2 0 | 111 683 2 1 1 1 1
| 0 1 5 | 2 4 0 | 122 1005 1 1 2 1 2
| 0 2 4 | 3 3 0 | 138 1021 1 1 2 1 2
| 1 0 5 | 4 1 1 | 141 1024 1 1 1 1 1

TABLE 23.5 VQoL 39items 4th Dec 2012 - actual on ZOU361WS.TXTi May 31 15:05 2013into RUMM.s


| 1 3 | -.28 | .81 1.26 1.30 |i 23 Q29A_PSY |
| 1 3 | -.10 | .81 1.48 1.49 |P 32 Q45A_FUN |
| 1 1 | .65 | .63 1.31 1.27 |B 15 Q18A_IND |
| 1 3 | -.34 | .47 .90 .87 |g 13 Q16A_SOC |
| 1 3 | -.18 | .33 1.24 1.31 |m 31 Q44A_FUN |
| 1 2 | -.07 | .28 1.36 1.48 |O 20 Q25A_IND |
| 1 2 | .07 | .27 .96 1.05 |N 2 Q2A_SOCI |
| 1 3 | -.14 | .24 1.03 1.17 |o 4 Q6A_SOCI |
| 1 3 | -.34 | .21 .69 .68 |f 7 Q10A_SOC |
| 1 3 | -.43 | .20 .72 .68 |b 10 Q13A_SOC |
| 1 3 | -.42 | .20 1.23 1.39 |c 8 Q11A_SOC |
| 1 3 | -.27 | .16 .99 1.07 |j 11 Q14A_SOC |
| 1 1 | .50 | .15 .90 1.10 |E 19 Q24A_IND |
| 1 1 | .68 | .11 1.07 1.11 |A 30 Q43A_FUN |
| 1 3 | -.35 | .10 1.19 1.32 |d 6 Q9A_SOCI |
| 1 3 | -.18 | .02 .74 .70 |n 3 Q5A_SOCI |
| 1 1 | .35 | -.03 .70 .67 |G 17 Q21A_IND |
| 1 2 | .11 | -.03 .93 .99 |L 27 Q37A_FUT |
| 1 2 | .11 | -.11 .77 .76 |K 26 Q32A_PSY |
| 1 1 | .60 | -.13 .82 .80 |D 16 Q20A_IND |
| 1 3 | -.34 | -.14 1.10 1.25 |e 12 Q15A_SOC |
| 1 1 | .62 | -.18 .86 .82 |C 14 Q17A_IND |
| 1 1 | .50 | -.20 .80 .77 |F 28 Q38A_FUT |
| 1 2 | .12 | -.30 .71 .67 |J 21 Q27A_PSY |
| 1 3 | -.48 | -.30 1.03 .99 |a 9 Q12A_SOC |
| 1 3 | -.22 | -.30 .71 .65 |l 25 Q31A_SOC |
| 1 2 | .10 | -.32 .87 .80 |M 22 Q28A_PSY |
| 1 2 | .28 | -.46 1.39 1.40 |I 18 Q23A_SOC |
| 1 3 | -.13 | -.48 1.46 1.43 |p 5 Q7A_SOCI |
| 1 3 | -.31 | -.62 .92 .79 |h 24 Q30A_PSY |
| 1 2 | .30 | -.64 .98 .87 |H 29 Q42A_FUN |
| 1 3 | -.24 | -.75 .89 .85 |k 1 Q1A_SOCI |

TABLE 23.6 VQoL 39items 4th Dec 2012 - actual on ZOU361WS.TXTi May 31 15:05 2013into RUMM.s


| 1 3 | -.24 | -.75 .89 .85 |k 1 Q1A_SOCI |
| 1 2 | .07 | .27 .96 1.05 |N 2 Q2A_SOCI |
| 1 3 | -.18 | .02 .74 .70 |n 3 Q5A_SOCI |
| 1 3 | -.14 | .24 1.03 1.17 |o 4 Q6A_SOCI |
| 1 3 | -.13 | -.48 1.46 1.43 |p 5 Q7A_SOCI |
| 1 3 | -.35 | .10 1.19 1.32 |d 6 Q9A_SOCI |
| 1 3 | -.34 | .21 .69 .68 |f 7 Q10A_SOC |
| 1 3 | -.42 | .20 1.23 1.39 |c 8 Q11A_SOC |
| 1 3 | -.48 | -.30 1.03 .99 |a 9 Q12A_SOC |
| 1 3 | -.43 | .20 .72 .68 |b 10 Q13A_SOC |
| 1 3 | -.27 | .16 .99 1.07 |j 11 Q14A_SOC |
| 1 3 | -.34 | -.14 1.10 1.25 |e 12 Q15A_SOC |
| 1 3 | -.34 | .47 .90 .87 |g 13 Q16A_SOC |
| 1 1 | .62 | -.18 .86 .82 |C 14 Q17A_IND |
| 1 1 | .65 | .63 1.31 1.27 |B 15 Q18A_IND |
| 1 1 | .60 | -.13 .82 .80 |D 16 Q20A_IND |
| 1 1 | .35 | -.03 .70 .67 |G 17 Q21A_IND |
| 1 2 | .28 | -.46 1.39 1.40 |I 18 Q23A_SOC |
| 1 1 | .50 | .15 .90 1.10 |E 19 Q24A_IND |
| 1 2 | -.07 | .28 1.36 1.48 |O 20 Q25A_IND |
| 1 2 | .12 | -.30 .71 .67 |J 21 Q27A_PSY |
| 1 2 | .10 | -.32 .87 .80 |M 22 Q28A_PSY |
| 1 3 | -.28 | .81 1.26 1.30 |i 23 Q29A_PSY |
| 1 3 | -.31 | -.62 .92 .79 |h 24 Q30A_PSY |
| 1 3 | -.22 | -.30 .71 .65 |l 25 Q31A_SOC |
| 1 2 | .11 | -.11 .77 .76 |K 26 Q32A_PSY |
| 1 2 | .11 | -.03 .93 .99 |L 27 Q37A_FUT |
| 1 1 | .50 | -.20 .80 .77 |F 28 Q38A_FUT |
| 1 2 | .30 | -.64 .98 .87 |H 29 Q42A_FUN |
| 1 1 | .68 | .11 1.07 1.11 |A 30 Q43A_FUN |
| 1 3 | -.18 | .33 1.24 1.31 |m 31 Q44A_FUN |
| 1 3 | -.10 | .81 1.48 1.49 |P 32 Q45A_FUN |


Mike.Linacre: Val, thank you for your questions.

1. Is my scale multidimensional and which statistics 'prove' it?

Look at your plot. The important direction is vertical. We can see outlying cumulative strata: A, A-D, A-F (identified by the clustering algorithm), A-I. The strongest secondary dimension is defined by one of those cumulative strata. A is Q43A_FUN, but most of the other items in the strata are _IND. At the bottom of the plot most of the items are _SOC. So the contrast is between Individual and Social activity, which is a well-known contrast.

Is that contrast big enough to matter? The contrast has the strength of 4 items (empirical eigenvalue = 4 in the first Table of numbers). This contrast is bigger than the 2 or so expected by chance. But is it big enough to make a difference? The disattenuated correlation of the person measures between item clusters 1 and 3 is only 0.56. They only share about 30% (=0.56^2) of the person measure variance in common. This is low. We definitely need to look at a cross-plot of person measures to see if the contrast is in low performers, high performers or across the performance range.
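
The arithmetic behind these figures can be sketched in Python. The sketch assumes the standard correction for attenuation (observed correlation divided by the square root of the product of the two reliabilities) and assumes results are capped at 1.0, as the values of 1.0000 in the table suggest. The cluster reliabilities are not shown in the posted output, so the code only derives what the posted numbers imply; the function name is hypothetical.

```python
import math

def disattenuate(r, reliability_a, reliability_b):
    """Correct an observed correlation for measurement error in both measures.

    Assumption: the result is capped at 1.0.
    """
    return min(1.0, r / math.sqrt(reliability_a * reliability_b))

# Contrast 1, item clusters 1 and 3, as posted.
pearson = 0.4383
disattenuated = 0.5600

# Proportion of person-measure variance the two clusters share.
shared_variance = disattenuated ** 2
print("shared variance:", round(shared_variance, 2))

# Geometric mean of the two cluster reliabilities implied by the posted values.
implied_reliability = pearson / disattenuated
print("implied mean reliability:", round(implied_reliability, 2))

# Check: applying the correction with that reliability recovers the posted value.
print("recovered:", round(disattenuate(pearson, implied_reliability, implied_reliability), 4))
```

The shared variance of roughly 0.31 is the "about 30%" mentioned above.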

2. How much of a problem is this, e.g. failing unidimensionality, can I still derive a summary score if all the other measurement properties are OK?

The Rasch measures are the best compromise between the different dimensions in the data. Multi-dimensionality is a substantive question more than a statistical question. Is an arithmetic test multidimensional? Yes, if you are diagnosing the learning difficulties of a child. No, if you are considering the grade-advancement of a child. A person's physique is multi-multi-dimensional, but we summarize it, for most purposes, by two "constant" numbers, height and weight, even though both those numbers actually vary during the day, and both numbers fail to reflect different body types. Long legs or long torsos? Muscles or fat? We need a few numbers (one number is best) that we can think with, rather than many numbers that describe our physique's situation precisely (and confuse us).

3. Are person-item targeting and dimensionality related (and how)?

Targeting does influence the "variance explained" in the data by the Rasch measures, so "variance explained" in the raw data is an exceedingly poor indicator of dimensionality. See https://www.rasch.org/rmt/rmt221j.htm . We focus on variance explained in the residuals.

4. Could my sample size of c.160 be affecting these results (it's a rare clinical population)?

Provided that your sample is representative of the population, your sample size is adequate. In fact, generous. As an experiment, you could reduce the sample size to see how small the sample must be to lose the Social/Individual contrast. I suspect it will still be there with a sample of 20 taken at random from your 160. If the rows in your data were entered randomly, then an analysis with PDELETE=+1-20 would be a test of that notion ....

Val_T: Dear Mike,

thank you very much for your response, which is generous as always. It has taken me a while to get my head around this! Our findings seem to reflect the general challenge in measuring accurately a ‘nebulous’ concept that is quality of life/ health-related quality of life. Whilst it is generally viewed as a multidimensional construct, its subdomains are likely to overlap somewhat. The only analogy I can think of is general cognitive ability ‘g’ or IQ, with its verbal and performance subdomains, as well as the subtests that make up those domains.

We have developed a completely separate (but complementary) instrument of functional vision to delineate psychosocial and functional impact of visual loss on children, given that there is so much conflation of functional impact and health related quality of life in the literature. So it is interesting that there is further possible multidimensionality within our quality of life scale, particularly as the items are framed to capture the overall impact of living with a visual impairment (albeit across various aspects of life e.g. social interactions, social participation, independence, autonomy, psychosocial adjustment and emotional wellbeing).

Unfortunately, further examination of the multidimensional structure of our instrument in Rasch (including running Rasch analyses on potential separate domains) impacts on other psychometric properties (such as targeting and measurement precision), so we are now thinking carefully about how to proceed next.

Again, many thanks for your generous advice.


Mike.Linacre: Yes, that is perceptive, Val. QOL and "g" do have analytical similarities.

160. Setting up my control file

helenC June 3rd, 2013, 2:03pm: Hi! I've used Winsteps for a number of years, but have just purchased and downloaded a newer version onto a new laptop. However, after getting the data from SPSS and making the control file, when I come to run the Rasch analysis I get a message that says: "none of your responses are scored 1 or above"...but they are! They are scored as 0 and 1, and I thought that would still work? Please...I would appreciate any help!! I've been working on this for hours!! THANK YOU!!

Mike.Linacre: HelenC, yes we can help. Did you use the Excel/RSSST menu of Winsteps to convert your SPSS file into Winsteps control and data file or a different method?

Also please look at your Winsteps control file. What is the CODES= statement? Look at your SPSS file. What are the data values for the response variables?

helenC: Hi Mike, thank you so much for your fast response! I did use the Excel/RSSST menu of winsteps. Also, I looked at the control file - and the CODES=statement is crazy! In my SPSS file, the codes are 0 and 1 for the response variables, but this is what the control file looks like:
SPSS Cases processed = 389
; SPSS Variables processed = 87
ITEM1 = 1 ; Starting column of item responses
NI = 1 ; Number of items
NAME1 = 7 ; Starting column for person label in data record
NAMLEN = 64 ; Length of person label
XWIDE = 5 ; Matches the widest data value observed
; GROUPS = 0 ; Partial Credit model: in case items have different rating scales
CODES = 1012610729101461047710078103991050410294104161008210563103841005810177+
+10967101511000610053 ; matches the data

Not sure whether this 'partial credit model' thing has something to do with it? I've run the Rasch on my old computer with an older version of winsteps - where I imported the data over from SPSS (the SAME SPSS file) and the control file is normal - and the Rasch works! So I'm a bit confused! Really wanted to use the new version I have purchased!
Thanks again for your help!
Best wishes

Mike.Linacre: HelenC, it looks like the person and item variables are reversed. NI=1, so there is one item (should that be the person label?) and NAMLEN=64, so the person label is 64 characters long (should that be 64 items?).

Please notice that, in the SPSS conversion,
item response variables
are listed before
person label variables

helenC: THANK YOU SO SO MUCH MIKE!! I really feel silly now! I've put the variables under the correct headings, and hey presto - it worked! Thank you again for your time. Helen

162. KR20 Change with IDELETE

rag June 8th, 2013, 9:24pm: I'm running WINSTEPS 3.75.1. I ran my data of 10 items and 400+ responses through WINSTEPS (all dichotomous items), and look at the KR20 statistic (0.62). After looking at fit statistics, I dropped an item using IDELETE=1 (the item number) in the specification drop down menu, and KR20 as reported in the summary statistics table 3.1 goes up to 0.70. That felt counterintuitive that KR20 would go up after reducing the number of items, though I know it is possible if the item is really poor fitting. So I ran the same statistics through Stata's KR20 command, and sure enough, reliability went down (to 0.56) after removing the same item. Interestingly, person-separation reliability didn't change.

I have done this in other cases, dropping multiple items, and KR20 consistently goes up in all of those cases.

Am I doing something wrong in WINSTEPS? Or is WINSTEPS somehow calculating KR20 reliability differently than I expect when IDELETE is invoked?


Mike.Linacre: Thank you for your question, rag.

Was IDELETE= done in the Control file or at the Extra Specifications prompt or in the "Specification" menu dialog box?

rag: IDELETE was done from the Specification menu dialog box.

Mike.Linacre: My apologies, rag.

IDELETE= from the Specification menu box does not recompute the basic statistics. It merely reports them selectively.

Please use IDELETE= in your control file or at the Extra Specifications prompt to compute basic statistics such as KR-20.

I will amend Winsteps to indicate this.
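For readers who want to cross-check KR-20 outside Winsteps, as rag did with Stata, here is a minimal Python sketch. It assumes a complete persons-by-items matrix of 0/1 responses. The function names `kr20` and `drop_item`, the toy data, and the use of the population variance (divisor N) are illustrative assumptions, not the Winsteps or Stata implementation; other software may use a different variance divisor and so report slightly different values.

```python
def kr20(responses):
    """KR-20 for a list of person rows, each a list of 0/1 item scores."""
    n_persons = len(responses)
    n_items = len(responses[0])
    # Sum of the item variances p*(1-p), where p is the proportion scoring 1.
    item_var_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_persons
        item_var_sum += p * (1 - p)
    # Population variance of the persons' total scores (assumed divisor: N).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    total_var = sum((t - mean_total) ** 2 for t in totals) / n_persons
    return (n_items / (n_items - 1)) * (1 - item_var_sum / total_var)


def drop_item(responses, item_index):
    """Return a copy of the data with one item column (0-based) removed."""
    return [row[:item_index] + row[item_index + 1:] for row in responses]


# Hypothetical toy data: 6 persons by 4 items.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print("KR-20, all items:       ", round(kr20(data), 3))
print("KR-20, last item dropped:", round(kr20(drop_item(data, 3)), 3))
```

The point of `drop_item` is that the statistic is recomputed from the reduced data, which is what happens when IDELETE= is placed in the control file or at the Extra Specifications prompt.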

163. DIF in multidimensional data

lillianchen May 28th, 2013, 3:43pm: Dear Dr. Linacre:
I am examining DIF items in multidimensional data. My findings suggested that ignoring the multidimensional structure does not influence DIF item detection. Is that reasonable?

Mike.Linacre: Lillianchen, yes, that is reasonable. DIF analysis compares the difficulties (p-values) of the individual items for different groups of persons. This is independent of whether the items are on the same dimension. For instance, we could compare the difficulty (p-value) of a "geography" item for boys and girls with the difficulty (p-value) of an "arithmetic" item for boys and girls. The fact that geography and arithmetic are on different dimensions is irrelevant.

lillianchen: Thanks, Dr. Linacre. I feel more confident with my results.
I have another DIF question, about a longitudinal study. Can I treat each time point as a group to run a DIF test? I read some articles proposing non-traditional ways to detect DIF items in multilevel data. Should I do that with my longitudinal data too?
Thanks again for your response.

Mike.Linacre: Lillianchen, yes. DIF by time-point is an easy way to investigate item drift across time.

164. Uniform Distribution Check

uve May 24th, 2013, 6:00pm: Mike,

One of the criteria you mention in ascertaining good category functioning is uniform distributions across categories. Could you explain how to go about checking for this in a polytomous instrument using something like table 3.2 in Winsteps?

Mike.Linacre: Uve, for a rating-scale for which inferences will be made at the category level, we like to see a uniform process at work. When applied to a sample, this would produce a smooth distribution of category frequencies. Unimodal without sharp peaks or troughs. No statistical test is intended, but merely Berkson's "inter-ocular traumatic test" (= what hits you between the eyes).

uve: Mike,

I'm assuming we are referring to categories between the extreme low and high. When I think of uniform, I think of equally likely. Would we also be checking for similar probability given some level of theta? For example, if there are five options, then would we want to see the peaks of the middle three at about the same height?

Mike.Linacre: Uve, please tell me where I used the word "uniform". Thank you.

uve: I'm seeing it on the first line. But perhaps I'm confusing process with distribution.

Mike.Linacre: Uve, you write "I'm seeing it on the first line"

Uve, the first line of what?

Thought: I was probably using the word "uniform" in a colloquial way, not in the sense of a statistical test of uniformity.

165. Cutlo and dimensionality

harmony May 21st, 2013, 7:01am: Hi all:

In an effort to remove guessing behavior from the data, I employed Cutlo (-1) and Cuthi (2) as described in Winsteps. This resulted in a great improvement of item fit and had little impact on test reliability. However, when I examine the dimensionality of the test there is a drop from over 60% of variance explained to just 18%.

This seems to be an extreme result. I understand that data is missing as a result of employing Cuthi and Cutlo, but is this a normal result of that? Also, when reporting on the dimensionality of this test, is it appropriate to refer to the dimensionality measures before this procedure was implemented?

Mike.Linacre: Harmony, "variance explained" is independent of dimensionality. See https://www.rasch.org/rmt/rmt221j.htm

When the outlying observations were trimmed by CUTLO= and CUTHI=, the data became more like coin-tosses (which is the ideal for a CAT test). Surprisingly, if we want to maximize "variance explained", we should do the reverse of CUTLO= and CUTHI=. If we keep only the outlying observations, and remove all the on-target observations, then the variance explained by the Rasch measures will be very high.

Using CUTLO= and CUTHI= also reduces Reliability for a similar reason.

When investigating multi-dimensionality, we are interested in the size of the first PCA contrast in the residuals. If this has a high eigenvalue (= strength of many items), then there may be a secondary dimension.

harmony: As always, thanks very much for your reply Mike. I understand how the change in variance happened. The eigenvalue of the first contrast has the strength of about 4 or 5 items on a test of 110 items, which doesn't seem to be a big problem.

I'm trying to get a better understanding of how to interpret "variance explained". In Winsteps you suggest that greater than 60% is good. Therefore, it seems obvious that 18% is not good...

Can you explain more clearly what is "good" about more than 60% of variance being explained and what the consequences -if any- are when it falls below that number?

Please forgive my novitiate questions. No doubt if I had taken more math courses when in college all of this would be obvious to me!

Mike.Linacre: Harmony, if the "variance explained" is a small percentage, then this indicates that the range of the person and/or item measures is small. This usually makes it difficult to have high Reliability and also to confirm the construct validity and predictive validity of the test.

Please tell me where in Winsteps I suggest that greater than 60% is good? Greater than 60% is definitely good, but 40% or greater is good enough for most purposes. And for some purposes, such as CAT tests, very close to zero% is good enough.

harmony: Hi Mike:

I truly appreciate your responses to help me clarify this. :)

You mention that greater than 60 % variance is good in the 2007 Winsteps Help menu, Special topics, Dimensionality: contrasts and variances:

"Rules of Thumb:

Variance explained by measures > 60% is good.

Unexplained variance explained by 1st contrast (size) < 3.0 is good.

Unexplained variance explained by 1st contrast < 5% is good.

But there are plenty of exceptions ...."

I've read the links you provided and more, and feel that I have a much better grip on this. I think that the notion of "explaining variance" is a bit confusing for someone with a non-statistical background because somehow the idea of explaining the variance gets confounded with the idea of explaining or defining the latent trait to be measured.

In this test, a more careful investigation suggests that one part required more time than was given, and editing misfitting responses on those items results in much better overall fit to the model, increases the separation, and results in 85 % of variance explained. This was more labor intensive, but yielded much better results than cuthi/cutlo.

Mike.Linacre: Thank you, Harmony, and my apologies.

Unfortunately, the 2007 Winsteps Help is ancient history for the topic of "Rasch Dimensionality". This has been a fast-developing area, and the Dimensionality-related topics in Winsteps Help are tweaked almost monthly.

What seems clear-cut from the perspective of exploratory factor analysis becomes murky when looked at from the perspective of empirical latent variables and parameter distributions. It is a reflection of the difference between statistics (the model explains the data) and metrology (the data support the model) that we encounter so often - in fact, every time we use a tape measure, bathroom scale, thermometer, etc.

In particular, https://www.rasch.org/rmt/rmt221j.htm in 2008 was motivated by many questioning responses to the Winsteps 2007 Help (which was based on EFA criteria). Since then, "variance explained" has been revealed to be a misleading indicator of dimensionality. Much more direct is the size of the eigenvalue in the first contrast of the PCA residuals. See, for instance, https://www.rasch.org/rmt/rmt233f.htm

166. Rasch-Andrich threshold - violation?

gradstudent23 May 20th, 2013, 11:15am: Hi,

I've been reading about the difference between the disordering between average measure and the Rasch-Andrich thresholds, and my understanding is that the average measure disordering, along with infit/outfit over +/- 1 may indicate model misfit.

My question is about keeping or deleting items with regards to Rasch-Andrich thresholds disordering. Would we ever want to drop an item based on Rasch-Andrich disordering?


Mike.Linacre: Gradstudent23, Rasch-Andrich threshold disordering does not cause misfit of the data to the model. It indicates that there are relatively low frequency intermediate categories. This may or may not be problematic, depending on the design of your rating scale.

For instance, if the rating scale for an item is "0=fail 1=acceptable 2=brilliant", and there are reversed thresholds, then category "1=acceptable" has a relatively low frequency. This may threaten the validity of the rating process. We may decide to collapse categories 1 and 2 together to become "1=pass", because it is clear that "brilliant" does not have the exceptional meaning we intended. Or we may drop the item entirely because the item does not have the meaning we intended.

But, in Olympic Figure Skating, there is a very long rating scale but only a few skaters. Many categories have 0, 1 ratings. We expect disordered thresholds. They are not a threat to the validity of the rating scale.

167. raw variance explained by measures

Akihiro May 16th, 2013, 8:19pm: Dear Dr. Linacre

It is a pleasure to communicate with you.
I applied the Rasch model to validate a 200-item standardized English language proficiency test and tried to link the results with the Messickian validity framework.

I have analyzed the data gathered with 136 participants. Only 10 items out of 200 items in the test are judged as misfit and all participants fitted the Rasch model.

However, in my process of judging the issue of unidimensionality of the test, as you will recognize in the attached file, I found out that only 26.0 % of raw variance is explained by the measures. In addition, the Eigenvalues of unexplained variance in 1st contrast to 5th contrast are from 5.7 to 4.3.

I am confused about how to understand these results and cannot judge the issue of unidimensionality of the test at this moment.

My question is the following:
Is there any situation where most test items and participants fit the model but the test does not seem to be unidimensional? Should I judge that the test is multi-dimensional?

I am still in the process of learning the Rasch model and its application for my research.
I would be grateful if you could kindly give me any suggestions on the interpretation of the results.


Mike.Linacre: Thank you for your question, Akihiro.

You ask "Is there any situation where most test items and participants fit the model but the test does not seem to be unidimensional?"

Reply: Yes, this is the usual situation. It is unusual that the multidimensionality in the data is strong enough to cause misfit in individual items or persons. Multidimensionality, such as the difference between "Addition" items and "Subtraction" items in an Arithmetic Test, is usually subtle. It requires an accumulation of Addition items and an accumulation of Subtraction items for the different dimensions to appear.

You ask: "Should I judge that the test is multi-dimensional?"

Reply: This depends on your purpose. Usually the difference between "Addition" items and "Subtraction" items is regarded as part of the natural variation of Arithmetic items, and is ignored. Addition and Subtraction are different "strands" of Arithmetic. However, for Arithmetic Tests that are used for investigating learning difficulties, the cognitive difference between Addition and Subtraction is important. In this situation, Addition and Subtraction are different dimensions.

So, first we look "Is there any statistical evidence of multidimensionality?" Then we ask "Does the content evidence indicate natural variation or a different dimension?"

a) "only 26.0 % of raw variance is explained by the measures"
- this is NOT evidence of multidimensionality. It is evidence that the person abilities and/or item difficulties have a narrow range: see https://www.rasch.org/rmt/rmt221j.htm

b) "The Eigenvalues of unexplained variance in 1st contrast to 5th contrast are from 5.7 to 4.3."
- this can be evidence of multidimensionality. Look at the first contrast. What is the difference in item content between the items at the top of the plot in Table 23.2 (item 70 etc.) and the items at the bottom of the plot in Table 23.2 (item 146 etc.)? Is the difference part of the natural variation in English Language items, or is it evidence for a different dimension, such as "Geography" or "History" or "Loan words from another language"?

Akihiro: Dear Dr Linacre

Many thanks for your prompt and kind reply. I will check the content of the items and try to explain the dimensionality of the test.


168. 3PL Warehouse - labor/equipment utilization

AuMi May 17th, 2013, 12:41am: Hi,

The company I am working at has currently outsourced its logistics operations to a 3PL. The logistics services are import handling, collection of containers from wharf, warehousing and distribution.
We are currently trying to set up KPI's to measure the performance and efficiency of the 3PL.
The warehouses are owned by the service provider. He has products of 1-2 other customers in his warehouses.
In order to measure the efficiency of the warehouse operations we would like to measure labor/equipment/storage space utilization.

The service provider indicated our contribution of the utilization of its labor and equipment as below:

Inbound labor: available: 2 people: our company's usage: 75%, other customers' usage: 25%
Forklifts: available: 3: our company's usage: 90%, other customers' usage: 10%

We are charged based on the percentages indicated. We would like to check, at a reasonable frequency, whether we do indeed use the indicated percentage of labor and equipment.

Do you have any advice on how to measure the labor and equipment utilization, how often to measure it, and what a reasonable target value would be?

Thank you for your help!

Mike.Linacre: AuMi, not our specialty. We are in Educational Testing, not Third-Party Logistics.

169. Polytomous Mean Interpretation

uve May 14th, 2013, 11:00pm: Mike,

If I understand correctly, the mean person measure of 2.00 for a dichotomous exam would be interpreted as an average of 88% correct. However, if we have a Likert 4 option instrument: Strongly Agree, Agree, Disagree, Strongly Disagree, then what could we say about the mean? Certainly not 88% correct of course. Could we say an 88% satisfaction rate? I am attempting to translate the mean into something more familiar to a general audience.

Mike.Linacre: Uve, we could talk about the averaged rating on an item of mean difficulty.

Suppose that 4=Strongly Agree, 3=Agree, 2=Disagree, 1=Strongly Disagree

Then 2 logits could be an average rating of 3.6 (trending to Strongly Agree).

For the expected score matching 2 logits, see the Winsteps GRFILE= or look at the Expected Score ICC in the Graphs window.
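As a rough illustration of how a logit measure translates into an expected score, the Python sketch below computes the dichotomous success probability and the expected rating under the Andrich rating scale model, both relative to an item of mean difficulty (0 logits). The threshold values (-1.5, 0.0, 1.5) are purely hypothetical; the real ones come from your own analysis, so the expected rating for your instrument should be read from GRFILE= or the Expected Score ICC rather than from this sketch.

```python
import math


def dichotomous_probability(measure, difficulty=0.0):
    """Rasch probability of success for a dichotomous item."""
    return 1.0 / (1.0 + math.exp(-(measure - difficulty)))


def expected_rating(measure, difficulty, thresholds, lowest_category=1):
    """Expected rating under the Andrich rating scale model.

    thresholds: Andrich thresholds, one per step between adjacent categories.
    """
    # Log-numerator of category k is the cumulative sum of
    # (measure - difficulty - threshold_j) for j = 1..k; category 0 has 0.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (measure - difficulty - tau))
    weights = [math.exp(v) for v in log_numerators]
    total = sum(weights)
    return sum((lowest_category + k) * w for k, w in enumerate(weights)) / total


# Hypothetical thresholds for a 4-category scale (1=Strongly Disagree ... 4=Strongly Agree).
hypothetical_thresholds = [-1.5, 0.0, 1.5]

print("P(correct) at 2 logits:      ", round(dichotomous_probability(2.0), 2))
print("Expected rating at 2 logits: ", round(expected_rating(2.0, 0.0, hypothetical_thresholds), 2))
```

With these made-up thresholds the expected rating at 2 logits falls between Agree and Strongly Agree, which is the kind of statement that can be reported to a general audience in place of "88% correct".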

170. CAT-based Rasch model - rules of thumb

rab7454 May 9th, 2013, 8:30pm: Rab7454, let's do some arithmetic with your numbers.

After the administration of 40 reasonably on-target items, the S.E. will be approximately:
1 / sqrt (0.16*40) = 0.4 logits, and the change in ability estimate will be approximately 1/(0.16*40) = 0.16 logits.

For rule 1, how many reasonably on-target items for an S.E. of .255?
.255 = 1 / sqrt(0.16 * L), so L = 1 /(0.16*.255*.255) = 96 items.
For perfectly on-target items, L = 1 /(0.25*.255*.255) = 61 items.
The observed average is 42 items. So this stopping rule is probably not active.

For rule 2, item information = .01. So if p is probability of success .01 = p*(1-p). Then p = .11, and (ability - difficulty) = ln(.11/.89) = -2 logits. So that is a reasonable rule: stop when the nearest item remaining to be administered is more than 2 logits away. With this small item bank, perhaps this should be increased to 2.5 logits.
If the item difficulties in the item bank have a wide range, then this is probably the most active stopping rule and explains why the mean count of 42 is so low.

For rule 3, theta change <.001, we would need to administer .001 = 1/(0.16 *L) items. L = 6,250 items. So rule 3 is problematic. Probably we would abandon rule 3.
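The arithmetic for the three stopping rules can be reproduced with a short Python sketch. The function names are illustrative, and the formulas are the approximations used above: S.E. = 1/sqrt(information per item * L), and change in estimate = 1/(information per item * L). For rule 2, note that the quoted p = .11 and -2 logits are what p*(1-p) gives for an information value of about 0.10; an information value of 0.01 would instead correspond to p of about .01 and a distance of roughly 4.6 logits. The sketch takes the information value as a parameter so that either reading can be checked against the actual software setting.

```python
import math


def items_needed_for_se(target_se, info_per_item):
    """Rule 1: items L such that target_se = 1 / sqrt(info_per_item * L)."""
    return 1.0 / (info_per_item * target_se ** 2)


def success_probability_for_information(info):
    """Rule 2: smaller root p of p*(1-p) = info (requires info <= 0.25)."""
    return (1.0 - math.sqrt(1.0 - 4.0 * info)) / 2.0


def logit(p):
    """Log-odds, i.e. (ability - difficulty) for success probability p."""
    return math.log(p / (1.0 - p))


def items_needed_for_theta_change(target_change, info_per_item):
    """Rule 3: items L such that target_change = 1 / (info_per_item * L)."""
    return 1.0 / (info_per_item * target_change)


print("Rule 1, reasonably on-target:", round(items_needed_for_se(0.255, 0.16), 1))
print("Rule 1, perfectly on-target: ", round(items_needed_for_se(0.255, 0.25), 1))
for info in (0.10, 0.01):
    p = success_probability_for_information(info)
    print("Rule 2, information", info, "-> p =", round(p, 3), ", logits =", round(logit(p), 2))
print("Rule 3, change < .001:       ", round(items_needed_for_theta_change(0.001, 0.16)))
```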

Suggestion: compute how many times each rule applied in your data.

1. 42 items - suggests a wide item-bank difficulty range.
2. correlation 0.9: does this match our expectations? We need to compute the expected correlation between a test of 70 items and its sub-test of 40 items. A simulation study with your item bank would be informative, but a correlation of 0.9 looks reasonable. We would be suspicious if the correlation was much higher, because that would say "What is the point of administering the last 30 items? 40 items have already told us the story."
3. Only 1.5% = my guess is that almost all tests were stopped by rule 2 except for those with abilities near the mean of the difficulties in the item bank.

Rab7454, this illustrates the thought-process required when setting stopping rules. OK?

rab7454: Mike,

This is EXTREMELY helpful. Thank you! You have given me much to digest. One small question, and then I may write back with additional questions.

This formula:

1 / sqrt (0.16*40)

How did you arrive at this formula, in general, and the value .16, specifically?

Thank you!


Mike.Linacre: Rab7454, the S.E. of an ability measure = 1 / square-root(statistical information in the response string).

The statistical "Fisher" information in one dichotomous response = p*(1-p) where p is the probability of success. So, if we say p = .8 (on average, for a reasonable administration of items on a CAT test) and there are 40 items:

S.E. = 1 / sqrt ( .8 * (1-.8) * 40) = 1 / sqrt ( .16 * 40)
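The same calculation as a small Python sketch, for trying other test lengths or targeting assumptions. The function name `approximate_se` is illustrative, and the calculation assumes every item has the same success probability, which is only an approximation for a real CAT administration.

```python
import math


def approximate_se(n_items, p_success=0.8):
    """Approximate S.E. of an ability measure from n_items dichotomous items.

    Assumes each response carries Fisher information p*(1-p).
    """
    information = p_success * (1.0 - p_success) * n_items
    return 1.0 / math.sqrt(information)


print("40 items, p = .8:", round(approximate_se(40), 3))
print("40 items, p = .5:", round(approximate_se(40, p_success=0.5), 3))
```

Perfectly targeted items (p = .5) carry the maximum information of 0.25 each, so they give a smaller standard error for the same number of items.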

rab7454: Hi Mike,

The second rule was used the vast majority of the time. The first and third rules were rarely used. I understand why this is the case, after reading your explanation. Thank you!

The software does not provide the expected correlation between the full bank and mean number of items used in the CAT. Three follow-up questions:

1. Can Winsteps employ a simulated CAT, along with the expected correlation?
2. In absolute terms, what would you consider to be the minimal acceptable correlation between the CAT thetas and full bank thetas? I would imagine that a correlation of .40, for example, would not be acceptable.
3. Any other factors I should consider when using a simulated CAT to inform the stop criteria in preparation for administering the CAT on new respondents? For example, I also have the correlation between the full bank theta SE and CAT theta, which is considerably low. I also have the difference between the full bank SE and CAT SE (approx .05). Not sure what to make of that, if at all.

Apologies for all of these general CAT questions, but as always, you are a wealth of information, so I turn to you to learn.

Thanks again,


Mike.Linacre: Rab7454, since the CAT tests are more than half of the items of the full test, then, even if measures on the unadministered items for each CAT test were uncorrelated with the CAT measures, the CAT-to-full-test correlation would be at least 0.7 (because at least half the variance in the full test measures would be explained by the CAT test measures).

Let's suppose the CAT test is about half the items in the full test, and the reliability of a test composed of the unadministered items is about the same as the reliability of the CAT test = R, then the correlation between the unadministered test and the CAT test would be R, and the CAT-to-full-test correlation would be something like sqrt ((1+R)/2).

Your CAT-full correlation is 0.9. The CAT test uses over half the items, and they are the better targeted items. This suggests that the reliability of the CAT test is around 0.5. Rab7454, is this close??

Answering your questions:
1. The best Winsteps can do is to simulate the full tests, and obtain the person measures.
Then use CUTLO= and CUTHI= to trim off-target responses, and obtain person measures
Correlate the two sets of measures.

2. See discussion above.

3. "bank theta SE and CAT theta, which is considerably low" - yes that is correct. Extreme measures have higher SEs, so the correlation would indicate a slight excess of high performers over low performers.

"difference between the full bank SE and CAT SE (approx .05)" - yes we can compute the expected value, assuming all items are equally informative. The difference is approximately
1/sqrt(.16*40) - 1/sqrt(.16*70) = .10.
Your difference is smaller, suggesting that the unadministered items are off-target, and so less efficient at reducing the size of the standard errors (as we would expect)

rab7454: Hi Mike,

Thank you so much for taking the time to respond to my questions. An estimated reliability of the CAT of .50 is nowhere near what I was hoping/expecting. It feels like there is a paradox here. Initially, when I mentioned that the CAT-to-Full Test theta correlation was .90, I believe you said that the CAT was probably acceptable. Then you explained that one would want to determine the expected correlation. With the information I provided, you estimated the reliability to be .50, which to me indicates, that the CAT may not be acceptable. You also said that a correlation greater than .90 could indicate that the items which were not used may not be worth including.

Did you arrive at .50 by isolating R in the formula you showed me?

corr = sqrt ((1+R)/2)
corr^2 = (1+R)/ 2
(corr^2)*2 = 1 + R
(corr^2)*2 -1 = R

corr = CAT-to-Theta correlation
2 = half of the test, on average, is being administered
R = reliability of CAT

Is my interpretation of the formula above correct? Might you have a reference for this formula, so that I could cite it when estimating reliability?

Anyway, I do not want to re-ask the same questions over and over, but it is surprising to me that a CAT with a correlation of .90 (using a little more than half of the items, on average) would produce a reliability coefficient of .50. Most of the CAT articles I have read usually boast about the fact that the CAT produced a CAT-to-Full Scale theta correlation at or greater than .90. I'm just struggling with this notion.

Of course any additional thoughts you have on the matter would be most welcome. I would also appreciate any textbooks and/or articles that you'd recommend for someone interested in learning how to evaluate a CAT.



Mike.Linacre: rab7454, this formula, corr = sqrt ((1+R)/2), assumes that (a) the subtest is half of the full test, and (b) the subtest items are as equally informative as the other test items.

The CAT test is more than half of the full test, and the CAT test items are more informative than the other test items, so I guesstimated the CAT reliability. You have the CAT data so you can compute its actual value.

Reliability: for high reliability we must have a test with many dichotomous items. For example, a reliability of 0.9 often requires 200 items. But CAT tests are designed to administer a minimal number of items. Therefore CAT reliability is lower than full test reliability.

We can use the Spearman-Brown Prophecy Formula to estimate the CAT reliability. If the reliability of a 200 item test is 0.9, then the reliability of a 40 item test of equally informative items is
R = 0.2*0.9 / (1+ (0.2 - 1)*0.9) = 0.64
This suggests that my estimate of the reliability of the CAT test, 0.5, may be low.
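Both reliability calculations in this thread can be checked with a few lines of Python. The function names are illustrative. The first function is the Spearman-Brown Prophecy Formula as used above; the second simply inverts corr = sqrt((1+R)/2), and so carries the same assumptions stated above (the subtest is half of the full test, and all items are equally informative).

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when test length is multiplied by length_ratio."""
    return length_ratio * reliability / (1.0 + (length_ratio - 1.0) * reliability)


def reliability_from_half_test_correlation(corr):
    """Invert corr = sqrt((1 + R) / 2) to get R = 2 * corr**2 - 1."""
    return 2.0 * corr ** 2 - 1.0


# 200-item test with reliability 0.9, shortened to 40 items (ratio 0.2).
print("Spearman-Brown, 40 of 200 items:", round(spearman_brown(0.9, 40 / 200), 2))
# CAT-to-full-test correlation of 0.9 under the half-test assumption.
print("R implied by corr = 0.9:        ", round(reliability_from_half_test_correlation(0.9), 2))
```

Under these simplifying assumptions both routes give a value somewhat above 0.6, which is consistent with the remark that 0.5 may be a low guess.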

Please tell us what is the reliability of the CAT test :-)

Mike.Linacre: Rab7454, computing Reliability is straight-forward.

1) Compute the mean error variances of the thetas:
a) square the S.E. of each theta = error variance of each theta
b) average the error variances across the thetas

2) Variance of thetas = S.D. of theta's squared

3) Reliability = (Variance of thetas - Average error variance of the thetas) / Variance of the thetas
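The three steps can be sketched in Python as follows. The function name and the toy values are hypothetical, and the sketch assumes the population variance (divisor N) for the thetas; using the sample variance instead would change the result slightly.

```python
from statistics import pvariance


def separation_reliability(thetas, standard_errors):
    """Reliability = (observed variance - mean error variance) / observed variance."""
    # Step 1: square each S.E. and average the error variances.
    error_variance = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    # Step 2: variance of the thetas (population variance assumed here).
    observed_variance = pvariance(thetas)
    # Step 3: proportion of the observed variance that is not error variance.
    return (observed_variance - error_variance) / observed_variance


# Hypothetical CAT output: four person measures and their standard errors.
toy_thetas = [-1.0, 0.0, 1.0, 2.0]
toy_ses = [0.5, 0.5, 0.5, 0.5]
print("Reliability:", round(separation_reliability(toy_thetas, toy_ses), 2))
```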

171. Item Difficulty - 1PL Rasch Polytomous model

gradstudent23 May 12th, 2013, 11:01pm: Hi -

I have 400 respondents answering 12 items (possible responses are 0,1,2). So, I'm working with a 1PL Rasch Polytomous model. I want to know how difficult each of the 400 respondents find each of the 12 items so that I can get a range of difficulty for each item. The more that I think about this, the less sense it makes to calculate item difficulty for each person, that's just not possible - right? What I can do is to calculate the ability of each person for each item.

My final goal is the following: I'm trying to test for equidistance of my categories. So, I was thinking of obtaining a difficulty range for each of my items. I thought I had found this in the xfile output file, in the L-Prob column, since I have a logit value for each of my 400 respondents, for each of the 12 items. However, all the values of the L-Prob are negative, and I remember that when I plotted the difficulties of my items I had positive and negative difficulties, so everything being negative doesn't make sense. I need the range so that I can calculate what my thresholds would be assuming equidistance and then compare those hypothetical equidistant thresholds to the Andrich thresholds. I was also going to use the range of my item difficulties for my sample and bootstrap them to get standard errors for my range.

Where can I get the abilities for each of my 400 respondents for each of the 12 items? Can I use the L-Prob column that I'm getting from the xfile output?

Thanks for your help!

Mike.Linacre: Gradstudent23, the ability of a student on an item is PPMEAS in XFILE=, see www.winsteps.com/winman/xfile.htm

For the threshold investigation, the partial-credit model ISGROUPS= together with the ISFILE= output would be useful www.winsteps.com/winman/isfile.htm - this would report the Andrich thresholds for each item.

The bootstrap technique sounds good, because the reported S.E.s of the thresholds are only approximate (due to dependency between the thresholds).

Equidistant Andrich thresholds do not mean equidistant categories. The categories cannot be equidistant, because, in Rasch theory, the top and bottom categories are infinitely wide. If you want the central categories to be equally wide, then please be sure that you have a precise definition of category "width". There are at least 3 definitions, producing different category widths for IRT/Rasch. In my experience, non-technical audiences interpret "category width" in accord with the "expected score on an item" definition. Psychometricians often interpret "category width" in accord with Thurstone's "cumulative probability" definition. www.rasch.org/rmt/rmt194f.htm

gradstudent23: Hi Dr. Linacre -

Thanks for the quick reply. I was actually debating on how to define category. I thought it would be interesting to use the different definitions of width to test the equidistance assumption for the central categories. At any rate, I will make sure to clearly define width.

Now, I know that by definition the extreme categories are infinitely wide - but could one use the end points of the PPMEAS value or the PIMEAS values to approximate the end points? I know that those values are sample specific, but one could bootstrap those values to create a big enough number of ranges - then, one could use the standard error of those ranges to construct confidence intervals around the hypothesized equidistant thresholds from the observed sample range (coming from PPMEAS or PIMEAS). Last, one will compare those CIs against the CIs from let’s say the Andrich thresholds (I’m thinking Andrich since I know what the standard errors are from the output).

Last question, is normality being assumed for the calculation of standard errors of the Andrich thresholds?


Mike.Linacre: Gradstudent23, the PPMEAS and PIMEAS values are the measures corresponding to (top rating - 0.25) or (bottom rating + 0.25). A simulation study would be the best way to determine the variances for any of the values, and so their standard errors.

Rasch estimation assumes normality in the randomness in the data (as does all maximum likelihood estimation). Most Rasch estimation methods do not assume normal distributions of the parameters. However, some estimation methods do, such as Marginal Maximum Likelihood Estimation.

gradstudent23: Thanks for the reply!

Dr. Linacre,

I ran the analysis with the PIMEAS. What is the difference between the PIMEAS and the ITMMEA? Is one the predicted value and the other the observed value?

Also, from reading posts and the help manual, sometimes there is syntax included. Where does one run syntax? I've been trying to type commands (specifically XFILE=?) but nothing happens. I'm looking for the analogue of an editor window in SAS.


172. Combining repeated measure data

PK May 9th, 2013, 12:22pm: I am using Rasch to describe the measurement properties of a questionnaire that measures pain-related activity limitation. There are two samples:
(i) One is formed by combining data from a primary care cohort and a secondary care cohort in the same country to examine any effects of care setting (DIF by setting).

(ii) The other is formed by combining two primary care cohorts from different countries to examine any effects of language/culture (DIF by country).

All cohorts contain questionnaire data completed at the baseline consultation but one cohort is a third of the size of the others. Uniquely, that cohort also completed follow-up questionnaires at 2 subsequent time periods. A colleague has suggested that for this analysis it is OK in Rasch to combine repeated measure data as if it were independent. Combining the baseline and outcome data would result in a cohort of similar size to the others.

My question is whether it is acceptable to combine baseline and follow-up data in this way and whether you know of a reference for this practice.

Mike.Linacre: PK, it is usual for different ethnic groups to have different sub-sample sizes in a Rasch analysis and in a DIF analysis, and no one is concerned about this, so different sub-sample sizes in your analysis should not be a matter for concern.

If there is a concern that repeated measures could introduce local dependency (in the same way that repeated measures of height could introduce local dependency) then use a technique such as https://www.rasch.org/rmt/rmt251b.htm

Artificially forcing sub-sample sizes to be the same is a matter for concern. If all sub-sample sizes must be about the same size, then stratified-random-sample the larger sub-samples down to an acceptable size. Do this several times to verify that the random-sampling did not accidentally bias your findings.
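
The repeated stratified-random down-sampling suggested above could be sketched as follows. This is a minimal Python illustration using only the standard library; the cohort, the "low"/"high" strata and the function name are hypothetical, not taken from PK's data.

```python
import random
from collections import defaultdict

def stratified_subsample(persons, stratum_of, target_size, seed):
    """Draw about target_size persons, keeping each stratum's share of the sample."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in persons:
        strata[stratum_of(person)].append(person)
    fraction = target_size / len(persons)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical cohort of 900 (person id, stratum) pairs: 300 "low", 600 "high".
cohort = [(i, "low" if i % 3 == 0 else "high") for i in range(900)]

# Repeat the draw with different seeds, then rerun the analysis on each draw
# to check that the findings do not hinge on one particular random sample.
draws = [stratified_subsample(cohort, lambda p: p[1], 300, seed) for seed in range(5)]

if __name__ == "__main__":
    for seed, draw in enumerate(draws):
        n_low = sum(1 for _, stratum in draw if stratum == "low")
        print(seed, len(draw), n_low)
```

Each draw would then be analyzed separately, and the DIF findings compared across draws.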

PK: Thanks Mike, that is very helpful.

173. Rescoring from ordinal to measure to 0-100

avandewater April 23rd, 2013, 3:59am: Hi all,

I'm close to finalising my first Rasch analysis looking at an outcome measure for shoulder function. What I would like to do is to rescore from ordinal counts, to a 0-100 scale for easy interpretation for clinicians (rather than having logits).
The HELP function in WINSTEPS is great, and I have checked Table 20 for the scorings; I can present a scoring table from ordinal - logits measure - 0-100 score.

HOWEVER: When calculating the scores by hand (i.e. checking the scores in Table 20), I find different values!?! Can someone explain why this is? To me the prediction equation seems not right... what can I do?

Example: a clinician finds a raw score of 14; this should give a measure of (14*0.4565) - 4.4476 = 1.9434. However, the Table gives the value of 1.74 logits! (For some scores, the hand-calculated measure is closer to the tabled measure for the next raw score!)

Below an equation and Table 20:

Predicting Score from Measure: Score = Measure * 2.1259 + 9.7657
Predicting Measure from Score: Measure = Score * .4565 + -4.4476

| 0 -5.91E 1.91 | 8 -.43 .60 | 16 2.44 .60 |
| 1 -4.48 1.15 | 9 -.07 .60 | 17 2.83 .64 |
| 2 -3.46 .92 | 10 .29 .60 | 18 3.27 .70 |
| 3 -2.72 .81 | 11 .66 .61 | 19 3.84 .82 |
| 4 -2.11 .74 | 12 1.03 .61 | 20 4.71 1.09 |
| 5 -1.61 .68 | 13 1.39 .60 | 21 6.06E 1.88 |
| 6 -1.17 .64 | 14 1.74 .59 | |
| 7 -.79 .61 | 15 2.09 .59 | |

Thanks for a reply!
Warm regards,

Mike.Linacre: Thank you for your question, Sander.

The relationship between raw scores and Rasch measures (logits) is curvilinear, as shown by the graph in Winsteps Table 20.1 - https://www.winsteps.com/winman/index.htm?table20_1.htm

Notice that the measure for 14 is 1.74. The approximation is 1.94. This is less than the measure for 15 which is 2.09. It is also within one S.E. (.59) of 1.74. So, statistically the approximation is correct.

If you want a more exact approximation, please fit a polynomial equation to the numbers in Table 20.1. This can be done with Excel, graph, trendline. A cubic approximation is:
y = 0.0023x^3 -0.0771x^2 + 1.142x - 5.6601 where y=measure, x=raw score
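
The curvilinear score-to-measure relationship can also be checked numerically. The following is a minimal Python sketch, assuming NumPy is available. The scores and measures are copied from the Table 20 posted above; the function names are illustrative and are not part of Winsteps. It refits a straight line and a cubic by least squares, so its coefficients need not match the Excel trendline exactly, and the 0-100 conversion shown is one simple linear choice among several.

```python
import numpy as np

# Raw scores 0-21 and their Rasch measures (logits), copied from Table 20 above.
scores = np.arange(22)
measures = np.array([
    -5.91, -4.48, -3.46, -2.72, -2.11, -1.61, -1.17, -0.79,
    -0.43, -0.07, 0.29, 0.66, 1.03, 1.39, 1.74, 2.09,
    2.44, 2.83, 3.27, 3.84, 4.71, 6.06,
])

def winsteps_linear(score):
    """Linear prediction equation reported with Table 20."""
    return score * 0.4565 - 4.4476

# Least-squares fits of degree 1 and degree 3 to the same table.
linear_fit = np.poly1d(np.polyfit(scores, measures, 1))
cubic_fit = np.poly1d(np.polyfit(scores, measures, 3))

def rmse(predicted):
    """Root-mean-square difference between predictions and the tabled measures."""
    return float(np.sqrt(np.mean((predicted - measures) ** 2)))

def to_0_100(measure):
    """Linear rescaling of logits so that the table's extremes map to 0 and 100."""
    lo, hi = measures[0], measures[-1]
    return 100.0 * (measure - lo) / (hi - lo)

if __name__ == "__main__":
    print("hand calculation for 14:", round(winsteps_linear(14), 4))
    print("RMSE linear:", round(rmse(linear_fit(scores)), 3))
    print("RMSE cubic: ", round(rmse(cubic_fit(scores)), 3))
    for s in (0, 14, 21):
        print(s, measures[s], round(to_0_100(measures[s]), 1))
```

Because the 0-100 values here are a linear function of the tabled logits, they keep the interval properties of the measures; only the unit and origin change.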

avandewater: Thanks Mike.
I will try the equation you've given, and the polynomial equation in Excel - to see what the outcomes are.
I did realise estimations are within one SE, but I was surprised, for a score of 14 for example, that the approximation was closer to a score of 15 than 14 (0.15 vs 0.20)!
When reporting, would you suggest to leave the equations out, and just provide a table with scores and approximations?
Cheers, Sander

Mike.Linacre: Sander, if you are reporting a score table, then use Table 20. For most purposes, you can omit the standard error column.

Cristina: Hi!
I have problems with raw scores and Table 20.
(1) I suppose that same raw score = same measure. But Table 18.1 shows two persons, 0012 and 0357, with the same raw score and different measures. What is the reason for that? This can be seen in the first output of Table 18. For this, data were processed with TOTALSCORE=N.
(2) When processing data with TOTALSCORE=Yes, I obtain raw scores similar to those obtained by hand, as shown in the second output of Table 18.1. The scores are very different from the first output, and persons 0012 and 0357 get raw scores different from each other.
(3) How can I get measures corresponding to raw scores with TOTALSCORE=Yes, given that Table 20 seems to yield raw scores and measures for TOTALSCORE=N? For example, person 0011 got a raw score of 66, but the highest score in Table 20.2 is 62.

Mike.Linacre: Thank you for your question, Cristina.

Table 20 is for persons who responded to all 24 items, and is for TOTALSCORE=No. Example:
| 1 29 24 -2.31 .72|1.57 1.4|1.02 .4| .51| 0001 |

| 29 -2.31 .72| 378 38 17 3.0 36 6.3 5 |

We see that person 0357 responded to 17 items
357 19 17 -3.49 1.00|1.13 .4| .41 .7| .50| 0357

The differences between TOTALSCORE=No and TOTALSCORE=Yes suggest that
1) there may be extreme items (Maximum or Minimum in Table 14)
2) items have been rescored using STKEEP=No, NEWSCORE=, IVALUES= or similar Winsteps control instructions.

If this answer does not explain the situation, Cristina, please email your Winsteps control and data file to me, and I will give you a more detailed explanation. mike \at/ winsteps.com - OK?

Cristina: Thank you Mike!!

You are right. I had the command stkeep = N. When removing that command, the raw scores by hand or in the table were similar, although the misfit of some items got worse.

174. software comparison

rab7454 May 3rd, 2013, 2:45pm: Hi Mike,

Quick question. I've been playing around with other Rasch software whose manual reports that item "easiness" is estimated first, with one item constrained to zero. As far as I can tell, they are fitting the model:

logit(pij) = theta_j + beta_i

and for statistical identification purposes, one of the betas is constrained to be zero.

Then they use a sum-to-zero constraint (mean-center, as I see it) and take the inverse of the sign of the logits to obtain difficulty estimates (rather than easiness estimates).

They calculate the standard errors for each beta using the delta method. Their outputted betas are mean-centered (given the sum-to-zero constraint), but they are NOT the same as those produced by Winsteps.

When I correlate the betas across the programs, I obtain a Pearson product moment correlation of 1.00000. I then regressed the Winsteps betas on the other software betas to obtain a scaling parameter. Basically, if I multiply their betas by X.XXX, I can predict Winsteps betas up to the 2nd to 3rd decimal place with very high precision (identical with very slight discrepancies that are few and far between).

Question: Why would the betas be different if they are mean-centering as does Winsteps? Aside from possible differences between estimation methods, their approach seems similar to Winsteps approach.

At the end of the day, I want to determine which software betas would be considered more accurate.


Mike.Linacre: Thank you for your question, rab7454. Since the correlation of the betas is 1.00, they are the same for most practical purposes. Here are some suggestions to make the numbers numerically identical:
1) Please verify that both programs are reporting in logits. Some software report in probits, and Winsteps can report in user-scaled units. In Winsteps, set USCALE=1
2) Adjust for estimation bias. If your dataset has only a few items, then, software implementing CMLE or PMLE (RUMM) has less estimation bias than Winsteps. In Winsteps, set STBIAS=Yes.
3) Verify that the estimation convergence criteria are much tighter than the exactness with which you want to compare estimates. In Winsteps, CONVERGE=B LCONV=.0001 RCONV=.01

Since we know the multiplier, X.XXX, we can force Winsteps to report the same values as the other software by setting USCALE=1/X.XXX. This technique is convenient when doing "Fahrenheit-to-Celsius" equating of estimates of the same item difficulties obtained from different datasets, such as high-stakes (high test discrimination) and low-stakes (low test discrimination) administrations of the same instrument.
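
The multiplier and the matching USCALE= value could be computed along these lines. This is a minimal Python sketch assuming NumPy and assuming both sets of difficulties are mean-centered, so that a regression through the origin is appropriate; the item values and the 0.8 compression factor are hypothetical.

```python
import numpy as np

# Hypothetical, mean-centered item difficulties from the two programs (illustrative values).
winsteps_betas = np.array([-1.20, -0.45, 0.00, 0.50, 1.15])
other_betas = winsteps_betas * 0.8  # pretend the other program reports a compressed scale

def scaling_multiplier(target, source):
    """Slope of a least-squares line through the origin: target ~ slope * source."""
    return float(np.dot(source, target) / np.dot(source, source))

multiplier = scaling_multiplier(winsteps_betas, other_betas)  # plays the role of X.XXX
uscale = 1.0 / multiplier  # value to use for USCALE=, as described above

if __name__ == "__main__":
    print("multiplier:", round(multiplier, 4))
    print("USCALE =", round(uscale, 4))
    # Winsteps logits multiplied by USCALE should reproduce the other program's values.
    print(np.allclose(winsteps_betas * uscale, other_betas))
```

With real estimates the match will not be exact, so the residuals after rescaling are worth inspecting item by item.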

rab7454: Thanks, Mike!

When I add


the actual estimates are MUCH closer. Slight discrepancies at the 2nd decimal place without any changes except for adding the statement above in the Winsteps file.

I think it's safe to say that the other software is estimating in logits. I'll look into the convergence criteria next, but I feel much more comfortable now.



rab7454: Hi Mike!

I have another question. I have now employed a full Rasch Rating (4-point) scale using Winsteps and the other program, and I would like to see how similar the results are. The three thresholds are nearly identical. Interestingly, the program does not provide just three thresholds. It provides thresholds for each item across the entire continuum. However, I know it is a full Rasch rating scale because (a) I specified a full rating scale and (b) the difference between threshold 1 and 2 for item 1 is equal to the difference between threshold 1 and 2 for each of the other items. The same goes for the other adjacent thresholds. I hope that makes sense.

The other program provides three "difficulty estimates per item (category)," since there are four categories. Is there a way to obtain three difficulty estimates per item from Winsteps, and have them output to SPSS? I'd like to see how correlated the two sets are. My guess is that it is 1.0, similar to the dichotomous Rasch model. I just don't see a place where I can output the three difficulty estimates per item from Winsteps. Again, there are three difficulty estimates because there are four categories per item.



Mike.Linacre: Thank you for your question, rab7454.

Yes, you are using the Andrich Rating-Scale model, so the thresholds of all the items are the same across items, relative to the difficulty of each item.

To see all those numbers in Winsteps, output the ISFILE=. If you do that from the Output Files menu, then you can output the thresholds to Excel. www.winsteps.com/winman/isfile.htm

175. Differential person functioning

gothwal May 2nd, 2013, 11:20am: Dear Dr. Linacre,

I have been able to perform DIF in my data set (and found none) but I am unsure of how to perform DPF in Winsteps. My questionnaire has 6 items (and these have 5-6 response categories) and 350 persons have responded to it. It has been interviewer administered in 3 languages. So I am interested in finding out if there was DPF between different language versions.

Would appreciate your valuable guidance on this.

Thanks very much.

Mike.Linacre: Thank you for your question, Gothwal.

Is each person tested in one language or three languages?

If each person is tested in one language, then we want DIF between the items and the language-code (in the person label). The DIF report will be for 3 languages x 6 items = 18 DIF effects (12 effects if one language is the reference language). Overall differences in language difficulty (Chinese is more difficult than Spanish) will be shown by differences in average ability for the persons being tested in each language.

If each person was tested in three languages, then we need a dataset with 18 items (6 items for each language) and then we can do DPF between each person and the item's language (in the item label). The DPF report will be for 3 languages x 350 persons = 1050 effects (or 700 effects if one language is the reference language). Overall differences in language difficulty will be shown by different average difficulties for the 3 sets of items (one set for each language).

gothwal: Dear Dr. Linacre,

Thanks very much for the valuable suggestions. Each person was tested in only one language.

Of the total sample, 53%, 15% and 32% of persons were tested using languages A, B and C respectively. Actually I would like to assess cross-linguistic validation using different language versions of the questionnaire. I used language A as the reference and performed DIF against language B/C (I coded language A as 1 and languages B and C (together) as 2 in the person label to perform the DIF). This way I could assess 6 DIF effects but I did not find any significant DIF and concluded that there was no DIF associated with language A versus B or C. However I did not understand when you mentioned in your reply that I should get 12 DIF effects if I used one language as the reference. I am assuming that if I compared language A versus B and then language A versus C (though the sample size would be different for both) in my DIF assessment, that way I would have 12 DIF effects. Please let me know if I am correct in my understanding.

Another question that I have is related to assessment of local response dependency. I am not sure how to go about it in Winsteps. Please suggest.

Thanks once again in advance for your time and patience,

Vijaya Gothwal.

Mike.Linacre: Gothwal, in your design of A vs. B/C there are 6 DIF effects, one for each item.

How do the average abilities of A, B and C groups compare (Table 28 in Winsteps)? If one group has average abilities higher than the other groups then perhaps that language is easier.

Local response dependency: please look at Winsteps Table 23. Dependent items will form their own "dimension".

gothwal: Dear Dr. Linacre,

Thanks for the suggestions.

The mean abilities are -0.76 and -0.90 for languages A vs. B/C (from Table 28). So it appears that languages B/C are relatively easier as compared to that of A (higher response categories indicates more difficulty on my questionnaire - so hopefully my interpretation is correct). I am surprised at this result because my DIF assessment did not show up this. How do I report this part of the result in my paper ? Please advise.

From my understanding Table 23 provides an assessment of dimensionality which I have already performed and there is no second dimension. So would it be safe to conclude there is no response dependency. Would this not be referred to as an assessment of "local item dependency"? Do "local item dependency" and "local response dependency" mean the same.


Vijaya Gothwal.

Mike.Linacre: Vijaya, "-0.76" for A is a higher score on the questionnaire than "-0.90" for B/C.

"I am surprised at this result because my DIF assessment did not show up this."
This would have shown as DIF if the design had been crossed (persons responding to more than one language). The design is nested (each person responds to only one language) so DIF can only report item effects within language group, not across language groups.

"local response dependency" is more general than "local item dependency" because the dependency could also be "local person dependency". If the dependent items do not form a cluster in Table 23, then it is unlikely that their dependency (if any) is impacting measurement.

gothwal: Dear Dr. Linacre,

Thanks very much (as always !) for teaching me.

Given that higher response category indicates greater difficulty with a task on my questionnaire wouldn't -0.76 indicate lower person ability (indicating relatively harder) for language A and -0.90 indicate higher person ability for languages B/C (so B/C are easier). Please clear this confusion for me.



Mike.Linacre: Vijaya, yes, that sounds correct, but your audience may have difficulty understanding your results. They may find it helpful if you reverse the scoring so that "lower score = lower performance level". In Winsteps this is done by:
(change the score ranges to match your questionnaire)
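
As a language-neutral illustration of the reversal itself (a Python sketch, not Winsteps control syntax; the 1-5 and 1-6 ranges are assumed examples and the function name is hypothetical):

```python
def reverse_score(response, lowest=1, highest=5):
    """Reverse a rating so that the top category becomes the bottom one."""
    if not lowest <= response <= highest:
        raise ValueError(f"response {response} is outside {lowest}-{highest}")
    return lowest + highest - response

if __name__ == "__main__":
    # A 5-category item: 1<->5, 2<->4, 3 stays 3.
    print([reverse_score(r) for r in range(1, 6)])
    # A 6-category item needs its own range.
    print([reverse_score(r, 1, 6) for r in range(1, 7)])
```

Items with different numbers of categories need their own ranges, which is why the score ranges have to match the questionnaire.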

gothwal: Dear Dr. Linacre,

Thanks ! I compared the mean abilities of participants between languages A and B/C using t test and found there was no statistically significant difference between them. Therefore my question is- will it be appropriate to conclude that there was no difference in the mean person abilities between the languages OR Can this be taken to imply that the questionnaire functions similarly across the three languages (so is linguistically valid at least across these 3 languages).

Given that my design was nested it would be incorrect to report DIF (Based on your earlier reply). So how should I be reporting the above finding of lack of difference between person abilities when it comes to having used different language versions.

Please suggest.


Vijaya Gothwal.

Mike.Linacre: Gothwal, please report your group-mean t-test. This supports the hypothesis that there is no mean difference in person abilities and item difficulties across languages. It is possible that both person abilities are lower and item difficulties are lower in one language. We would need another "crossed" study to eliminate this possibility.

gothwal: Dear Dr. Linacre,

Thanks a ton for such timely help !

I really appreciate it.



176. Rasch Regression Code

phloide May 3rd, 2013, 12:43pm: I am trying to see if someone has code to emulate the code used for http://mlrv.ua.edu/2000/Vol26(2)/wright6.pdf. Like the paper, I want to simulate multiple tests (test A, test B, test C) and then a result with a passing mark. I know how to anchor person abilities, but that is about where I get lost. How are 3 tests scaled simultaneously?

Mike.Linacre: Yes, Phloide, that was one of Ben Wright's experimental papers. He improved the method a little, see https://www.rasch.org/rmt/rmt143u.htm - perhaps this will help you.

phloide: oh... I've read them both... don't know why I didn't make the connection. So, if I am using a Rasch regression to evaluate multiple instruments of mixed formats (essays, labs, question/answer, etc.) for a set of persons with known abilities, this would be the procedure... right?

Mike.Linacre: Yes, phloide, that is the best we know. Unfortunately, Ben Wright was incapacitated in 2001 and so was not able to follow that line of research any further :-(

phloide: that is a shame. If you have any contact with him, tell him I appreciate his papers, and they are the crux of my dissertation (if I ever get it done)

178. subscores on each task and its reliabilities

hsun April 22nd, 2013, 3:50pm: Hello everyone,

I am using the Facets to do scoring for an oral examination. The exam consists of four different tasks. I am using the three-facet model, namely examinee, examiner, and task. Besides getting logit measures based on all four tasks, I also want to get the logit measure based on each task. So to get the logit measures based on four tasks, my code for the model is:

Model = ?,?,?,RS1,1
Rating (or partial credit) scale = RS1,R4,G,O

To get the logit measure on each task, I am running Facets separately using the following code:
Model = ?,?,1,RS1,1 ; (1 for task 1; then a separate run of ?,?,2,RS1,1 for task 2, etc.)
Rating (or partial credit) scale = RS1,R4,G,O

Is there any way to obtain the measures based on four tasks and based on each task (subscale scores) in one run? Also, how can I get the reliabilities for the candidate subscores based on each task?

Thank you very much for your help!


Mike.Linacre: Thank you for your questions, hsun.

Your method of obtaining task measures and reliabilities from separate analyses looks good. Two improvements might be:
1. Anchor the tasks at their difficulties from the combined 4-task analysis
2. Anchor the examiners at their leniences from the combined 4-task analysis
then the 4 subscore measures for each examinee are directly comparable.

In each subscore analysis, the examinee Table 7 reports the examinee reliability (="Test" reliability).

All 4 examinee subscore measures can be obtained in one analysis by doing examinee x task interactions. In Table 13 we would need to combine the Bias Size with the Examinee measure to obtain the Examinee's ability on the Task. We would also need to compute the reliabilities separately. So your method is easier overall.

hsun: Thank you for your quick and thoughtful reply, Dr. Linacre!

The exam was actually equated to the benchmark scale. We use the exams administered in 2010 and 2011 as the benchmark scale (did a standard setting study for that). Examiners from the benchmark scale and from any administration in between the benchmark scale and the current administration (theoretically have been calibrated to the benchmark scale in the previous analysis by getting the anchor values from the anchor files) potentially serve as the common examiners except those whose displacement value is greater than 0.50 (in either direction). We also anchor the task difficulties from the benchmark scale. I did not post all of the code. Under the labels, I also have

Labels =
examinee id1=examinee id 1
examinee id 2=examinee id 2

examiner id1=examiner id 1,1.36


There is no doubt that when the 4 tasks are combined, everything should work! However, now I am a little confused when using the model = ?, ?,1, RS1,1 whether the task equating made any difference because only one task is included in the analysis.

Thanks again!

Mike.Linacre: hsun, the task does make a difference because it has an anchor value. For instance, with exactly the same ratings by the same raters, the examinee ability on task 4 is about .52 logits higher than the examinee ability on task 1.

hsun: I see. Thanks for your help, Dr. Linacre!

179. variable maps

ted_redden April 26th, 2013, 6:19am: I am using Winsteps to analyze a set of multiple choice and partial credit questions as outlined in Example 14 "multiple rescorings, response structures and widths"
Is it possible to plot each of the steps in a partial credit score on the variable map? E.g., q25 (1 mark), q25 (2 marks), etc.

Mike.Linacre: Thank you for your question, Ted.
Winsteps Table 2 pictures the steps in several ways, as does Winsteps Table 12.

180. Questionnaire and Threshold

aurora11rose April 7th, 2013, 12:14am: Greetings Mike,

I am new to Winsteps and the Rasch model... so pardon me if my questions sound silly! :)

I am in the midst of developing a Likert scale questionnaire. I pilot tested the instrument (n=100).

1. I ran Rasch Analysis on my data set (Person separation 2.84, Person reliability .89, Item separation 3.65, Item reliability .93, Cronbach Alpha (KR-20) Person RAW Score test reliability = .90). Is it necessary to conduct a separate Cronbach Alpha test on SPSS to increase the reliability of my instrument?

2. My Andrich Threshold numbers are large. (NONE, -26.75, -6.66, 7.65, 25.76) Why is that? Are my test items problematic?

Mike.Linacre: Thank you for your questions, Aurora11rose

1. The Cronbach Alpha computation in Winsteps is the standard one: http://en.wikipedia.org/wiki/Cronbach%27s_alpha - The SPSS computation of Cronbach Alpha should report the same value.

The "test" reliability of your questionnaire (=person ability reliability) is already good (.9). If you want to increase it:
(a) obtain a sample with a wider range of attitudes, increasing the sample's variance
(b) add more items to your questionnaire, increasing the measurement precision

2. It looks like you are using USCALE=. Please set USCALE=1 to report the thresholds in logits.
The Andrich thresholds advance (most negative to most positive), so your thresholds are definitely not problematic. However, please look at the Average Measure for each category on each item (Winsteps Table 14.3). Do they also advance?
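
The separation and reliability figures quoted in the question are two views of the same quantity. A minimal Python sketch of the standard relationship R = G^2 / (1 + G^2), where G is the separation index and R the reliability (the function names are illustrative):

```python
def reliability_from_separation(separation):
    """Reliability R implied by a separation index G: R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)

def separation_from_reliability(reliability):
    """Inverse relationship: G = sqrt(R / (1 - R))."""
    return (reliability / (1.0 - reliability)) ** 0.5

if __name__ == "__main__":
    # Separations quoted in the question: 2.84 and 3.65.
    print(round(reliability_from_separation(2.84), 2))
    print(round(reliability_from_separation(3.65), 2))
```

Rounded to two decimals these give about .89 and .93, which agrees with the reliabilities reported in the question.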

aurora11rose: Thank you, Mike!

I did what you told me and it worked!

angkoo: Hi Mike what about the UMEAN? Do I set it at 1 as well? Does it matter if the decimals are set as 0 or 1?

Mike.Linacre: Angkoo, UMEAN= adds or subtracts a constant value from all Rasch measures (abilities, difficulties) except thresholds. We use it to make output easier to use. For instance, UMEAN=50, USCALE=10 puts most measures in the range 0-100.

UDECIMALS= is also set for ease of use. Most people are confused when many decimal places are reported. They think all those decimals are important. So, if only the nearest integer is important, UDECIMALS=0.

For most classroom tests, UMEAN=50, USCALE=10, UDECIMALS=0 is exact enough.
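
The rescaling described above can be sketched in Python. This assumes the simple linear form: user value = UMEAN + USCALE x logit for abilities and difficulties, with the logit origin at the point that UMEAN relocates, and thresholds multiplied by USCALE only. The function names are illustrative and are not Winsteps options.

```python
def user_scaled(logit, umean=50.0, uscale=10.0, udecimals=0):
    """Convert an ability or difficulty in logits to user units, then round."""
    return round(umean + uscale * logit, udecimals)

def user_scaled_threshold(logit_threshold, uscale=10.0, udecimals=0):
    """Thresholds are relative to the item difficulty, so only USCALE applies."""
    return round(uscale * logit_threshold, udecimals)

if __name__ == "__main__":
    # Logits from -5 to +5 land on 0-100 with UMEAN=50, USCALE=10.
    for logit in (-5.0, -1.0, 0.0, 1.74, 5.0):
        print(logit, user_scaled(logit))
    print(user_scaled_threshold(-0.87), user_scaled_threshold(1.01))
```

This also shows why large threshold values usually point to a USCALE= setting rather than to a problem with the items.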

181. future use of common person equated tests

harmony April 24th, 2013, 8:07am: Hi all:

I'm familiar with the linking and equating of tests using common items. In this framework, once the tests have been equated, they can be administered again to a new set of students, yielding similar results between forms (assuming no unexpected or unfixable misfit) because the common items remain in the tests as well as the item calibrations that put them on the same scale.

But what about future administrations of tests linked by common persons? In this method, what is common between the tests disappears after the students move on. Can one anchor persons on future tests based on raw scores to logits of the score file of persons from previous tests? Any recommendations?

Mike.Linacre: Harmony, common-person equating requires the tests to be administered almost simultaneously, and even then we need to balance the tests so that half the sample take one test first, and the other half of the sample take the other test first.

Beyond that, the best we can do is to match person distributions. This is what equipercentile equating does.
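
A bare-bones version of equipercentile equating can be sketched in Python as follows, assuming NumPy. It uses mid-percentile ranks and no smoothing, and the two score distributions are hypothetical; operational equating would normally add smoothing and standard errors.

```python
import numpy as np

def equipercentile(x, form_x_scores, form_y_scores):
    """Map score x on form X to the form Y score with the same percentile rank."""
    form_x_scores = np.asarray(form_x_scores, dtype=float)
    below = np.sum(form_x_scores < x)
    equal = np.sum(form_x_scores == x)
    rank = (below + 0.5 * equal) / form_x_scores.size  # mid-percentile rank in [0, 1]
    return float(np.quantile(form_y_scores, rank))

# Hypothetical score distributions from two groups assumed to be equally able.
form_x = np.arange(1, 101)  # scores 1..100 on form X
form_y = form_x + 5         # form Y scores run 5 points higher

if __name__ == "__main__":
    for x in (10, 50, 90):
        print(x, round(equipercentile(x, form_x, form_y), 2))
```

The method rests entirely on the assumption that the two person distributions are equivalent, which is why it is a fallback when no common items or common persons are available.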

harmony: As always, thanks so much for your reply Mike :).

182. Category threshold distances-what is recommended?

Val_T April 19th, 2013, 2:02pm: Hello,

I have developed a scale for children (a rare clinical sample) that has 4 response options. Thresholds seem to be ordered but the distances between them are less than recommended in the book by Bond and Fox (i.e. that thresholds should increase by at least 1.4 logits, but no more than 5, to show distinction between categories). The fit statistics for categories are within acceptable limits (see output below) so, should I be worried?

Thanks for advice!


| 1 1 428 7| -.13 -.17| 1.05 1.15|| NONE |( -2.24)| 1
| 2 2 1079 19| .27 .28| .96 .94|| -.87 | -.66 | 2
| 3 3 2082 36| .76 .77| .91 .90|| -.14 | .62 | 3
| 4 4 2242 38| 1.48 1.47| 1.03 1.03|| 1.01 |( 2.31)| 4
|MISSING 37 1| 1.13 | || | |
OBSERVED AVERAGE is mean of measures in category. It is not a parameter estimate.

| 1 NONE |( -2.24) -INF -1.50| | 88% 2% 1.5118| | 1
| 2 -.87 .05 | -.66 -1.50 -.03| -1.20 | 39% 33% .8440| .95| 2
| 3 -.14 .03 | .62 -.03 1.52| -.06 | 44% 77% .4224| 1.00| 3
| 4 1.01 .03 |( 2.31) 1.52 +INF | 1.26 | 77% 43% .7631| 1.03| 4
M->C = Does Measure imply Category?
C->M = Does Category imply Measure?

CATEGORY PROBABILITIES: MODES - Structure measures at intersections
P -+---------+---------+---------+---------+---------+---------+-
R 1.0 + +
O | |
B |11 |
A | 111 4444|
B .8 + 111 444 +
I | 111 444 |
L | 11 44 |
I | 1 44 |
T .6 + 11 4 +
Y | 11 44 |
.5 + 1 44 +
O | 11 3333333 4 |
F .4 + 2*222222 333 4*333 +
| 2222 11 **2 4 333 |
R | 222 1 33 222 44 333 |
E | 222 33*1 ** 333 |
S .2 + 2222 33 11 44 22 333 +
P | 2222 33 1*4 222 3333 |
O |22 3333 444 111 2222 3|
N | 3333333 444444 111111 2222222 |
S .0 +********4444444444444 111111111111*********+
E -+---------+---------+---------+---------+---------+---------+-
-3 -2 -1 0 1 2 3

Mike.Linacre: Val, these categories are functioning beautifully. Everything is advancing with good fit.

The 1.4 logit criterion applies if the rating scale is intended to act like a set of dichotomous items. Please look at the category definitions. Is this the case?

dichotomous advance: each additional step up the rating scale is an extra challenge accomplished.
ordinary advance: each additional step up the rating scale is a qualitatively different attitude.

Val_T: Dear Mike,

thank you for a rapid and reassuring response. My scale is intended to function as 'ordinary advance' so it's good to know it's working well.

Thanks again for your advice.


183. control file for essay item scores  

cardinal April 21st, 2013, 6:25pm: I have used WINSTEPS to calibrate essay items. The MC items were scored 0 or 1 and each essay domain was scored from 2 to 8 points.
One of the domains is now scored from 4 to 16 points (4, 8, 10, 12, 14, 16 [2-8 doubled]). I am not sure how to handle this in control file.
Data file example: for 12 item test first 10 are MC score 0/1, item 11 is scored from 4 to 16 points and item 12 is scored from 2 to 8.


This is how it was handled previously

Is there a way to specify in the control file that item column positions are 1-10 (1) 11(2) 12(1)?

Thank you

Mike.Linacre: Thank you for your email, Cardinal.

1-10 (1) 11(2) 12(1) can be specified in Winsteps, but it is much easier to have all responses occupy 2 columns.

One way to convert the data file to 2-columns is
1) copy the data into Excel
2) in Excel, "Data", "Text to Columns"
3) If there are data in row 1, insert a blank row at row 1
4) in Winsteps, "Excel/RSSST" menu, Excel, convert Excel file to Winsteps format.
This will also construct the correct CODES= statement.

cardinal: I read the item responses so that each mc and essay domain occupies 2-columns
I program the control file as such.
CODES= 0 1 2 3 4 5 6 7 8 10121416 ; VALID RESPONSES

the item parameters looked fine in the .if file but the step parameters looked very weird
60 1 .0000
61 0 .0000 (last MC item)
61 1 42.1528 (essay domain should be 4 to 16)
61 2 -44.776
61 3 42.1528
61 4 -41.253
61 5 42.1528
61 6 -41.238
61 7 42.1528
61 8 -41.345
62 2 .0000 (second essay domain: this looks as it usually does)
62 3 -.3111
62 4 -2.8646
62 5 -.0442
62 6 -.6469
62 7 2.1528
62 8 1.7139

Thank you for your help.

cardinal: I also tried reading in item responses so that each mc occupies 1-column and I divided the first essay domain by 2 and then read in both essay domains so that they each occupy 1-column in the data file (.dat).
Then I used the "iweight=*" option for the first essay domain, weighting it by 2. When I did this, the parameters in the .if file looked reasonable and the step parameters did too. Does this method make sense?

60 0 .0000
60 1 .0000 (last MC item)
61 2 .0000 (first essay domain do categories make sense)
61 3 -1.1919
61 4 -3.3417
61 5 -.4132
61 6 -.7603
61 7 2.5638
61 8 3.1434
62 2 .0000 (second essay domain)
62 3 -1.1032
62 4 -3.4337
62 5 -.4032
62 6 -.7202
62 7 2.5503
62 8 3.1100

Mike.Linacre: Cardinal, you have a decision to make ....

The reported observations are: 4, 8, 10, 12, 14, 16

Are 5,6,7 real, but unobserved, qualitative advances? Or are they mathematical fictions?

If they are qualitative advances, then analyze them as 4,8,12,..

If they are mathematical fictions, then analyze them as 1,2,3,.. with IWEIGHT of 4.

The two sets of estimated measures will definitely be different, because the observations have different meanings. For more discussion, see https://www.rasch.org/rmt/rmt84p.htm

184. IRTPRO and Rasch Model

v.ghasemi April 21st, 2013, 10:37am: Dear all,
I'm a beginner in IRT. I searched for software and found IRTPRO to be a good one, according to SSI (Scientific Software International). I want to begin with the Rasch model, but I couldn't find it anywhere in the software.

Mike.Linacre: v.ghasemi, please look at the IRTPRO Guide, page 125, 6.2.3 Unidimensional Rasch. It is in c:\Program Files\IRTPRO (or a similarly named folder).

185. FACETS reported chi-square= ??.?

gRascHopper April 17th, 2013, 1:54pm: Hi Rasch Pros,

I have a sample of approx. 3000 persons, which I weighted using the "Rweight" syntax in my DATA= statement. My FACETS output looks good overall, but the Model Fixed Chi-Square for the person facet is being reported as:
"Model, Fixed (all same) chi-square: ??.? d.f.: 2951 significance (probability): .00"
It sort of looks like the test statistic is too large for the space that is allotted for reporting it. Is this true? And if so, is there a way that I can obtain the actual chi-square value?

Thank you!


Mike.Linacre: Adrienne, ??.? is reported when the chi-square is greater than 1,000,000. Possibly an effect of the Rweight syntax is to make the person standard errors unreasonably small. Suggestion: downweight the Models= specifications to compensate.

The chi-square is a test of homogeneity. To compute it yourself,
1) Facets "Output Files" menu, "Scorefile=" to Excel.
2) In Excel, apply the formulas in box at https://www.rasch.org/rmt/rmt62b.htm
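
For readers who prefer a script to a spreadsheet, here is a minimal Python sketch of the same homogeneity test. It assumes the usual information-weighted form of the fixed (all same) chi-square, with weight w = 1/S.E.^2 for each element and d.f. = number of elements - 1. The function name and the example numbers are hypothetical; the box at the link above remains the authority on the formula.

```python
def fixed_chi_square(measures, std_errors):
    """Homogeneity chi-square: do all elements share one measure, apart from error?"""
    weights = [1.0 / (se * se) for se in std_errors]  # information weights
    sum_w = sum(weights)
    sum_wd = sum(w * d for w, d in zip(weights, measures))
    sum_wd2 = sum(w * d * d for w, d in zip(weights, measures))
    chi_square = sum_wd2 - (sum_wd * sum_wd) / sum_w
    degrees_of_freedom = len(measures) - 1
    return chi_square, degrees_of_freedom

# Hypothetical measures and standard errors copied from a Scorefile= worksheet
chi_sq, df = fixed_chi_square([0.5, -0.2, 1.1, 0.0], [0.30, 0.25, 0.40, 0.35])
print(f"chi-square = {chi_sq:.2f}, d.f. = {df}")
```

Very small standard errors make the weights very large, which is one way the statistic can exceed the space allotted for it.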

gRascHopper: Thank you, Mike, for the details and suggestions. I will try both!

Mike.Linacre: Adrienne, the next update to Facets will report the actual value of the chi-square instead of ??.? for very large values.

186. about data analysis

amrit April 18th, 2013, 10:53am: It's my great pleasure to join this Forum.
I am going to use the Rasch model to develop and validate a questionnaire. My problem is that the response option is first dichotomous (No/Yes), and if the respondent replies yes, then I have categorical data (from 1 to 10). Can you suggest how I can use the Rasch model in such a scenario? :(

Mike.Linacre: No problem, Amrit.

Here are two approaches:
A) combine the two items into one item, scored 0-10
0 = dichotomous no
1-10 = dichotomous yes and categorical 1-10.

or B) There are two items with responses:
Dichotomous = 0
Categorical = missing data (not administered)
Dichotomous = 1
Categorical = 1-10

The choice depends on how the categories relate to the dichotomies. If "no" on the dichotomy is less than category 1, then (A). If "no" on the dichotomy can correspond to any category, then (B).

For instance,
(A) is Dichotomy: "Do you smoke?". Categories: 1-10 are number of cigarettes each day.
(B) is Dichotomy: "Are you married?". Categories 1-10 are number of times you compliment your spouse each day.

The dichotomous items are modeled with the Rasch dichotomous model.
The categorical items are modeled with the Rasch rating-scale model, or if the 1-10 categories have different definitions for each item, then the Rasch partial-credit model.
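
The two codings can be sketched in Python as follows. This is only an illustration of the data preparation, not of the Rasch estimation; the function names and the use of None for "not administered" are assumptions made for the example.

```python
MISSING = None  # stands for "not administered"

def approach_a(said_yes, category):
    """One combined item scored 0-10: 0 = no, 1-10 = yes plus category."""
    return category if said_yes else 0

def approach_b(said_yes, category):
    """Two items: the dichotomy, and a category that is missing after a 'no'."""
    return (1, category) if said_yes else (0, MISSING)

# Hypothetical respondents: (answered yes?, category chosen if yes)
responses = [(False, None), (True, 3), (True, 10)]
for said_yes, category in responses:
    print(approach_a(said_yes, category), approach_b(said_yes, category))
```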

What software are you planning to use, Amrit? One beginning point could be Ministep, www.winsteps.com/ministep.htm

187. Subset connection shouldn't be O.K.

Gathercole April 7th, 2013, 3:01pm: Hi Mike,

I'm doing a Facets analysis that should be a problem, but isn't. :) I can't figure out why it's not producing disjoint subsets.

The judging plan was: Each student responds to 1 of 6 prompts and is rated by 2-3 raters.

Since each student is only responding to 1 of the six prompts, I would expect there to be six subsets of students; one for each prompt, since the group ability can't be separated from the prompt difficulty. However, the analysis reports Subset Connection O.K.

The only other unusual feature of my data is the large numbers: there are over 29,000 students and 220 raters.

Do you know what could be causing this unexpected success? :)

Mike.Linacre: Thank you for reporting this situation, Gathercole.

Unfortunately you are correct. This success could be failure ....

Suggestion: Use Facets to analyze your data with two active facets in each analysis:
Models=?,?,X, ... (students and raters)
Models=?,X,?, ... (students and prompts)
Models=X,?,?, ... (raters and prompts)

Does each analysis report the subsetting you expect?
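
As a rough cross-check outside Facets, the students-and-prompts design can be treated as a graph and its connected components counted. The Python sketch below does this with a small union-find. It is not the algorithm Facets uses, it looks at two facets only, and the function name and toy data are hypothetical.

```python
def count_subsets(pairs):
    """Count connected components in a students-by-prompts design."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for student, prompt in pairs:
        # Tag the ids so a student number never collides with a prompt number
        root_s = find(("student", student))
        root_p = find(("prompt", prompt))
        parent[root_s] = root_p
    return len({find(node) for node in list(parent)})

# Toy design: each of 12 students answers exactly one of six prompts
nested = [(s, s % 6) for s in range(12)]
print(count_subsets(nested))             # 6 disjoint subsets
print(count_subsets(nested + [(0, 1)]))  # one student answering two prompts links two subsets
```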

A possibility is that the data are too complex for Facets to detect subsetting. I am investigating ....

Here are some other possibilities ...
Are all the facets active in Models=?
Are the prompts anchored or group-anchored?
Are the students anchored or group-anchored?
Is a missing-data code, such as "9", being analyzed as a live rating?

Are student element numbers duplicative (same element number used for different students)? Please look at Facets Table 7. Do the counts look correct for all the elements? For the students, it may be easiest to "Output Files" menu, Scorefile=, to Excel, then sort the Excel worksheet on "Count". Inspect the top and bottom of the worksheet.

If there is still no explanation, then "Output Files" menu, "Winsteps".
Rows = Students
Columns = Prompts
Does the data matrix look correct?

Gathercole: Thanks for the advice and for looking into this mystery!

I ran the analysis with the models you suggested. The model with only students and prompts produced 6 subsets for the 6 prompts, just as I was hypothesizing should happen with the full dataset. The mystery is that it's not being detected with the full data; even if all raters had rated all students there should still be student subsetting by prompt.

I noticed something with the PROX iterations that are being used to assess subset connection; I'm not sure whether it's significant:

When I run just the students and prompts, it consolidates twice:
>>Validating subset connection
>>Consolidating 7 subsets
>>Validating subset connection
>>Consolidating 6 subsets
Warning (6) ! There may be 6 disjoint subsets

However, when I run with students, prompts, and raters, it doesn't attempt the second consolidation of 6 subsets:
>>Validating subset connection
>>Consolidating 7 subsets
Subset connection O.K.

To answer your other questions:
-All facets are active in Models=
-There is no anchoring of any elements or groups
-No missing data codes are specified
-No students are duplicated with additional prompts (verified in Excel and by the Facets run with only students and prompts that produced the expected 6 subsets)

Mike.Linacre: Thank you, Gathercole, for reporting this bug in Facets. Much appreciated :-)

Yes, there are definitely 6 subsets.

It looks like an internal buffer has overflowed during the subset detection, leading to the incorrect message. When a buffer overflow condition is about to occur, Facets is supposed to issue a warning message, not an O.K. message :-(

I have simulated data matching your design, and have the same problem with Facets. Am investigating ...

Mike.Linacre: Gathercole, the bug has been isolated that results in "Subsets O.K.". Am endeavoring to recode that part of Facets.

Gathercole: That's great, thanks for looking into it. This will probably be the first time I'll be relieved to see the subset warning :)

Mike.Linacre: Gathercole, there is now an amended beta-test version of Facets. Please email me directly for it. mike \at/ winsteps.com

188. Testlet Item Difficulty

MariaNelly April 16th, 2013, 8:12am: Hello!!

I am a grad-student interested in assessment. Thank you in advance for your help :)

I am working on a project on item difficulty and trying to interpret the output. I should say that in the facets I included Testlets (groups of items by reading passage), and I was wondering if the item difficulty by TESTLET is the one in the Measure column in the table by testlet.

Thank you

Mike.Linacre: Thank you for your request, MariaNelly.

Facets reports the overall difficulty of each testlet with the TESTLET measures.
The difficulties of the items relative to the overall testlet difficulties are reported with the ITEM measures.

In Labels=
1) Anchor the TESTLET elements at 0.
2) Give each item a group number. The group number is the testlet number.

Then each item has its difficulty relative to the latent variable, and the overall testlet difficulty is reported as the mean of the group of items.

MariaNelly: Thank you!!!

That was really helpful :)


Maria Nelly

189. 2 models for each performance in FACETS analysis

Imogene_Rothnie April 17th, 2013, 1:21am: I have a technical question about model specifications in FACETS.

I am analyzing a 12 station OSCE where students get a checklist score on each station (stations have checklist of various lengths, so I have a model specification for each station).

Students are also given a global performance score on each station on a rating scale of "not satisfactory, Borderline, Satisfactory".

I want to model this global part of the examiner scoring into the facets analysis too (have previously just been looking at checklist score).

Specifically, I want to be able to look at the relationship between global performance on a station and checklist score on the ability/difficulty scale.

I.e., there are 3 facets: Examiner/Station/Student. The ratings are: checklist score / global performance. So the models:

With just checklist items considered model statements are:
Models =
?, 1, ?,R8 ; max score of 12 for item 1
?, 2, ?,R12 ; max score of 16 for item 2 ..so on for 12 stations

How can I also include the global performance rating for each student on each station? (by the same examiner?) I can't see a similar example in the FACETS help guide, apologies.

Any help much appreciated!

Mike.Linacre: Thank you for your question, Imogene.

Your analysis has become 4-facets:

Examiner + Station + Student + Item -> Rating

This can be implemented as:

4, Item
1=Global ; all stations use the same global
2=Checklist ; each station has its own checklist

?,?,?,1,R3 ; Global: 1=not satisfactory, 2=Borderline, 3=Satisfactory
?,#,?,2,R9 ; Each station (facet 2) has its own checklist

Imogene_Rothnie: Many thanks Mike - I started off thinking about a 4th facet but got confused because these are still scores!

I will give it a go now!


Imogene_Rothnie: Thanks Mike,
That analysis gives me a really interesting perspective on item as a facet, but I have a conceptually different question I think.

I want to be able to see the rating scale for the global score (ie 1,2,3) against the rating scale of the checklist on the vertical ruler. And I want to be able to do this for each station.
Basically I am asking 2 things:

IS the global rating scale different for each station? In particular , does '2 - borderline' sit at a different logit value across stations?

Secondly, what is the relationship between the 2 scales on each station, e.g. what logit value on the checklist scale does borderline on the global scale sit? For each station.

This is some research about standard setting. I'm not sure if the above is possible in a single analysis? I guess it's as though each station were a 'testlet' with x checklist items and 1 rating scale item? Maybe there is something I could do in FACETS and then take it to WINSTEPS?


190. Facets spec file

miet1602 April 11th, 2013, 10:39am: Hi Mike,

I had a couple of questions about setting up the Facets spec file - really just to double check that I am doing this right. I attach my example.

There are three facets, markers, candidates and items.

The questions:

1. Regarding the non-center option: If the primary interest is markers, I suppose I should specify this facet as non-centred? And if I am interested in item measures, and then person measures, do I need to change the non-centering option to reflect this? Basically, do I need to run three separate analyses, changing the non-centering option each time to reflect the focus of interest (i.e. markers, then items, then candidates?) The attached file is for markers.

2. Regarding the model - I have used the PC rating scale model, and specified separate rating scales for each group of items which has a different max mark. Is this the right approach?

Thanks very much in advance for your help.


Mike.Linacre: Thank you for your questions, Milja.

1. In your design, we usually think of the markers and items combining together to measure the candidates, so noncenter the candidates.

If this had been a study of marker behaviour, then we would think of the items and candidates (who are usually carefully preselected, such as videotaped performances) combining together to challenge the markers, and so would noncenter the markers.

Also, if the mean of the candidates is to be a certain value (norm-referencing), we would non-center the markers or the items (if there are no markers).

2. Your Models approach would work fine.
It looks like this line should be excluded: ?,?,1,D ;

I would probably have been somewhat lazy. Instead of
?,?,#1-23,R3 ;
?,?,#24-26,R4 ;
?,?,#27-28,R5 ;
?,?,#29-30,R8 ;
?,?,#31-33,R10 ;
?,?,34,R15 ;

my specification would have been:
?,?,#,R15 ;
but then I would have to verify that the range of categories in each Table 8.? matches our intentions.

miet1602: Thanks, Mike.
I will try the shorter model spec. I just wasn't sure that if I'd specified the rating scale out of 15 max, what Facets might do with items that are maybe out of 8 max mark, but none of the candidates got 8. Would it treat it as out of 7?

The ?,?,1,D model was there because there is one item that is a dichotomy. Should I keep it in this case? I suppose this item would not be covered by the rating scale model?

Mike.Linacre: Milja, the R15 means "observations numbered higher than 15 are invalid".

Facets discovers the valid responses by inspecting the data.

Algebraically, dichotomies are the simplest case of the rating-scale model.

R15 includes all the models you specified, but is less exact in screening out invalid data.

Your specifications are more exact for detecting invalid data.

To avoid confusing yourself and others, change
?,?,#1-23,R3 ;
to
?,?,#2-23,R3 ;
This change will make no difference to the output, but clarifies that item 1 is D (=R1), not R3
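
The difference in screening can be illustrated outside Facets. The Python sketch below takes the item ranges and maxima from the model lines quoted in this thread (with item 1 as the dichotomy) and lists the ratings that a catch-all R15 would accept but the exact models would reject. The function names and the sample observations are hypothetical.

```python
# (first item, last item, highest valid rating), following the model lines in this thread
INTENDED_MAX = [(1, 1, 1), (2, 23, 3), (24, 26, 4), (27, 28, 5),
                (29, 30, 8), (31, 33, 10), (34, 34, 15)]

def max_for_item(item):
    """Look up the highest valid rating for an item number."""
    for first, last, top in INTENDED_MAX:
        if first <= item <= last:
            return top
    raise ValueError(f"item {item} has no model")

def invalid_ratings(observations):
    """Ratings within 0-15 (so R15 accepts them) that exceed the item's own maximum."""
    return [(item, rating) for item, rating in observations
            if 0 <= rating <= 15 and rating > max_for_item(item)]

# Hypothetical (item, rating) observations
print(invalid_ratings([(1, 2), (5, 3), (29, 9), (34, 15)]))
```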

191. Testing the Equidistance Assumption

gradstudent23 April 12th, 2013, 3:55am: Hi,

Is there any literature to test the equidistance assumption using Rasch? Is it sensible to get an idea of equidistance by looking at the Rasch-Thurstone threshold?


Mike.Linacre: Thank you for your question, gradstudent.

The equidistant model can be conceptualized as equal spacing of dichotomous items representing the steps up a rating scale. If so, the Rasch-Thurstone thresholds are good approximations to the dichotomies: https://www.rasch.org/rmt/rmt233e.htm

gradstudent23: Thanks for the quick response Dr. Linacre! I will look into the link you sent.

192. Combining AlphaNumeric Coding for Polytomies

uve April 9th, 2013, 11:15pm: Mike,

I was given a survey to analyze which has 30 items. The first 29 are scored the following:

1=Strongly Disagree
4=Strongly Agree

The last item asks respondents to answer with a grade: ABCDF

So valid codes would be 1234ABCDF

I wish to score item 30 as 54321 ( A=5, B=4, C=3, D=2, F=1) but have the grades appear in the legend while keeping the other legend labeling for items 1-29. The closest example I could find in Winsteps Help was #6 under IREFER. But this still won't work. Thanks for your assistance.

Mike.Linacre: Uve, "have the grades appear in the legend" - which Table, Graph or Plot?

uve: The graphs

Mike.Linacre: This is awkward, Uve.

I added an extra item column to Example0.txt containing ABC. This works for me. Does it match what you want?

CODES = 012ABC ; valid response codes (ratings) are 0, 1, 2
1-25 1
26 0
CFILE=* ; label the response categories
0 Dislike ; names of the response categories
1 Neutral
2 Like
26+0 A option
26+1 B option
26+2 C option

uve: This worked great. Thanks!

I just realized that some of the questions have negative wording and should be reverse scored. I've added what I think is the correct command. Is this right?

CODES = 1234ABCDF ; matches the data
NEWSCORE = 123454321
1-29 1
30 2
1 Strongly Disagree
2 Disagree
3 Agree
4 Strongly Agree
30+1 F
30+2 D
30+3 C
30+4 B
30+5 A

Mike.Linacre: Uve, a brave attempt!

NEWSCORE and IVALUE are alternatives, so

1) omit NEWSCORE

2) complete the IVALUE commands:
CODES = 1234ABCDF ; matches the data
IVALUEA= 123454321 ; forward
IVALUEB= 432101234 ; reversed
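
The rescoring pairs each character in CODES with the character in the same position of the IVALUE string. A small Python sketch of that pairing, using the three strings above (the function and variable names are hypothetical):

```python
CODES = "1234ABCDF"
IVALUEA = "123454321"  # forward
IVALUEB = "432101234"  # reversed

def rescoring_table(codes, values):
    """Pair each response code with the score in the same position."""
    return {code: int(value) for code, value in zip(codes, values)}

forward = rescoring_table(CODES, IVALUEA)
reversed_scores = rescoring_table(CODES, IVALUEB)
print(forward["A"], forward["F"], reversed_scores["1"], reversed_scores["4"])
```

Printing the whole table before running the analysis is a quick way to confirm that each code lands on the intended score.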

uve: That seems to work. Thanks again.

Advice: Since I've rescored the negative items, would it be prudent to add the word "Not" to the item questions? For example, change "Bullying is a problem at our school" to "Bullying is NOT a problem at our school".

Mike.Linacre: Definitely a good idea, Uve.

Include the word "not" in such a way that it can be distinguished from the original question, e.g., NOT, *not*

Some folks put the word NOT first, so that reversed items can be easily selected from forward items:

NOT: Bullying is a problem at our school

193. Where have the data gone?

ChrisMcM April 8th, 2013, 9:38pm: I'm a relatively new user, so forgive me if this is entirely obvious, but I have been stuck on it for several days.

I'm reading a very long thin file with one data point per line. FACETS replies:
Assigning models to "c:\facets\nPACESdiets1to2012\CandidateCentreExaminerEncounterSkillA.dat"
Total lines in data file = 181272
Responses matched to model: ?,?,?,?,?,R = 181272
Total non-blank responses found = 181272
Responses with unspecified elements = 0
Responses not matched to any model = 0
Valid responses used for estimation = 181272

And 181272 is the correct number of rows in the data.

However, when I get to Table 8.1 it says:
Table 8.1 Category Statistics.

Model = ?,?,?,?,?,R
| Category Counts Cum.| Avge Exp. OUTFIT|CALIBRATIONS | Measure at |PROBABLE|Probabil.|PEAK|Diagnostic|
|Score Used % % | Meas Meas MnSq |Measure S.E.|Category -0.5 | from | at |Prob| Residual |
| 0 25204 16% 16%| .07 .07 1.0 | |( -1.28) | low | low |100%| -.9 |
| 1 22120 14% 30%| .72 .72 1.0 | .52 .01| .00 -.71| | -.29 | 23%| |
| 2 108772 70% 100%| 1.44 1.44 1.0 | -.52 .01|( 1.29) .73| .00 | .29 |100%| .9 |

Now there are only 156096 values which have been read. Is there some reason why it is not reading them all? I have other cases in which it is more extreme than this. The values are 0, 1 and 2 as reported above.

Sorry, I'm sure this must be something very obvious!



Mike.Linacre: Thank you for your question, Chris.

The explanation is in the column heading "Used". These are the values that are used in estimating the rating scale. Values that involve elements with extreme scores (all 0s or all 2s) do not participate in estimating the rating scale.
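
To see how a "Used" count can fall below the number of responses read, here is a simplified Python sketch. It checks one facet only (persons) and drops the ratings of anyone whose ratings are all at the bottom or all at the top category; Facets applies the same idea to the elements of every facet. The function name and the data are hypothetical.

```python
from collections import defaultdict

def used_count(observations, low=0, high=2):
    """Count ratings belonging to persons whose scores are not all-low or all-high."""
    by_person = defaultdict(list)
    for person, rating in observations:
        by_person[person].append(rating)
    used = 0
    for ratings in by_person.values():
        extreme = all(r == low for r in ratings) or all(r == high for r in ratings)
        if not extreme:
            used += len(ratings)
    return used

# p1 is extreme high, p3 is extreme low, so only p2's two ratings are used
data = [("p1", 2), ("p1", 2), ("p2", 1), ("p2", 2), ("p3", 0), ("p3", 0)]
print(used_count(data), "of", len(data))
```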

Yes, these counts are confusing. I will amend this in the next Facets update.

ChrisMcM: Many thanks for that answer, which I have been digesting for a day or two... I now see that FACETS had indeed lost some data, and that the Used totals are correct for what it had analysed. The 'lost' data correspond to those at the very top end of the scale with all 'correct' responses.

The example was a less severe one of a more severe problem I have with a very skewed set of responses, where most people get a 2 (Satisfactory) about 2% get a 1 (Borderline) and about 1% get a 0 (Unsatisfactory). Each person is rated 16 times, and the result is that the modal category is to have 16 2s.

What worried me about these data was that even though there was clearly agreement across raters in the raw data, FACETS gave a separation reliability of precisely zero. When I dug in, I found that all of the cases with 16 2s had, in effect, been dropped from the calculation. That had forced the reliability to zero, in effect, as the top corner of the contingency table had been set at zero when it was actually very large (if you see what I mean).

I wriggled around the problem by creating an extra dummy response of '3' just for those who had scored 16 2s, putting it in as a 17th assessment. When I did that I got a separation reliability of 0.3 (not zero, but probably closer to the correct value).

I realise that logits go pear-shaped when they have to calculate 1/0 or whatever, and in effect my wriggle has put in some sort of continuity correction.

THE QUESTION: is there some better way of doing this in FACETS which is more principled and handles this situation? I'm sure there must be but I couldn't find it.

Thanks again for the help, which is really appreciated. I should say that apart from this the program really reaches the parts that the other programs cannot reach, and is doing a wonderful job.


Mike.Linacre: Apologies, Chris. I had not noticed that Facets has this behavior.

It is easy to compute a reliability yourself.
1) Facets "output files" menu
2) Scorefile= to Excel
3) In the Excel worksheet for the desired facet
4) Compute sample S.D. of the measures, then square it = SD^2
5) Put in an extra column next to the Error column
6) Square the error values in the extra column
7) Average the squared values= MSE
8) Reliability = (S.D.^2 - MSE) / S.D.^2
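
The same steps can be sketched in Python for anyone who exports the Scorefile= columns to a script instead of Excel. The function name and the example numbers are hypothetical; `statistics.variance` gives the sample variance, i.e. the sample S.D. squared.

```python
from statistics import variance

def separation_reliability(measures, std_errors):
    """(observed variance - mean-square error) / observed variance."""
    observed_variance = variance(measures)  # sample S.D. squared
    mse = sum(se * se for se in std_errors) / len(std_errors)
    return (observed_variance - mse) / observed_variance

# Hypothetical Measure and S.E. columns for four elements
r = separation_reliability([-1.0, 0.0, 1.0, 2.0], [0.5, 0.5, 0.5, 0.5])
print(f"reliability = {r:.2f}")
```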

ChrisMcM: Hi -- Thanks again for the speedy reply.

I've done what you suggested, and have now got a nice hefty reliability of .804, which is tempting. Having said that I am worried about the calculations because all of those guys with the 16 2s (who have the highest actual score of 32) have an estimate of their measure of 0 (as well as an SE of 0). Zero though is right in the middle of the range of measures, giving a very strange distribution which makes little sense, the other measures having a range of -.50 to +2.35. (SD^2=.798, MSE = .1557)

Scores of 27, 28, 29, 30 and 31 have measure estimates of +1.08, 1.26, 1.48, 1.80 and 2.35, so my intuition is that if the 16 2s = 32 had a measure it would be of the order of 3. Substituting that value for the measure gives a reliability of .66 (SD^2=.458, MSE = .1557).

However that still has the SE of those "32" measures at zero. Again, extrapolating from the actual values for 28 to 31 (.44, .51, .63, .91), a guess at the SE of 16 2s=32 is about 1.5. Doing the same calculations then gives a reliability which is negative as MSE is greater than SD^2 (SD^2=.458 MSE = 1.732); R=-2.78 (!)

Something doesn't quite seem principled here, and I am not sure what to do. Would you like me to send the data files to you (although the data file is large at 4.4 Mb)?

Best wishes and particular thanks


ChrisMcM: PS If I simply do the calculations with all the cases of score = 32 removed, then SD^2 = .259, MSE = .52, giving R = -1.007 (which is presumably equivalent to what FACETS did, and it substituted in precisely zero as R shouldn't be negative?).
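
The four reliabilities quoted in these posts can be re-derived from the reported SD^2 and MSE pairs with a few lines of Python. The scenario labels are paraphrases, and the function name is hypothetical.

```python
def reliability(sd_squared, mse):
    """Reliability = (S.D.^2 - MSE) / S.D.^2."""
    return (sd_squared - mse) / sd_squared

# (SD^2, MSE) pairs as reported in the posts above
scenarios = {
    "extremes reported as measure 0, S.E. 0": (0.798, 0.1557),
    "extremes guessed at measure 3, S.E. 0": (0.458, 0.1557),
    "extremes guessed at measure 3, S.E. 1.5": (0.458, 1.732),
    "extremes removed": (0.259, 0.52),
}
for label, (sd2, mse) in scenarios.items():
    print(f"{label}: R = {reliability(sd2, mse):.3f}")
```

Whenever MSE exceeds SD^2 the formula goes negative, which is the situation in the last two scenarios.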

Mike.Linacre: Chris, the non-extreme computation looks correct. The computation with extremes is definitely incorrect.

The measures corresponding to extreme scores should be the most extreme. Their standard errors should be the highest.

Example file lfs.txt. The rating scale is 0,1,2. There are 25 items. So the maximum score is 50:

1 Child
T.Score T.Count Obs.Avge FairMAvge Measure S.E. Status 1
30.00 25.00 1.20 1.27 .61 .34 -1 1
50.00 25.00 2.00 1.99 6.06 1.83 -3 2 <-Extreme, Status -3
34.00 25.00 1.36 1.47 1.09 .36 -1 3
27.00 25.00 1.08 1.12 .26 .34 -1 4

Perhaps no one in your data scored "2" except those with maximum scores. If so, please add a dummy data record mixing 1's and 2's, in order to make 2's estimable.

194. DIF Analysis

jblanden April 9th, 2013, 5:05am: Conundrum here. I am working with a data set from a two-group pretest-posttest design. The assessment is fairly challenging, as the item mean is higher than the person mean on both occasions; however, there is a small but nontrivial increase from the pretest to the posttest for both groups, which do not differ. I ran a DIF analysis on the stacked data using the treatment condition as the differentiating variable. I expected to see most of the item difficulty observations fall on or below the pre=post diagonal. On the contrary, the majority of items fall above the line, indicating a general increase in item difficulty from pretest to posttest despite the increase in total score points. Any thoughts about this finding would be appreciated.


Mike.Linacre: John, for stacked data, the average item difficulty is estimated for each item. So, we expect roughly half the items to be reported as easier and the other half to be reported as more difficult relative to the average difficulty.

Suppose that all the items became 1 logit more difficult. Then, in a stacked analysis, the DIF would report no change. That one logit shift would be reported as a one logit increase in person ability.

John, how about "racking" the data? Then shifts in item difficulty would be relative to the average ability of each person across the two time-points.

jblanden: Mike, Thank you for your clarification and suggestion. I may have some misconceptions about the DIF analysis computation/interpretation that require further clarification. For the DIF= control variable I specified time of test (pre, post) and treatment condition (E,C) thinking this approach anchored estimates of person ability to the stacked analysis and allowed for the separate estimation of item difficulties for each combination of the classification variables (same as if I racked the data by time and ran separate analyses for the E and C groups). The Excel Worksheet used to generate the DIF Plots (30.2) appears to contain the item difficulty measures (DIF Measure) for each group at each time point that I need to construct the pre vs post scatterplot of item difficulties. In advance, thank you for your time.

Best regards,

Mike.Linacre: John, the stacked DIF analysis may be doing exactly what you want it to do. However a general change in item difficulties (easier or harder) is seen as a general change in person abilities (higher or lower) in a stacked analysis.

In a racked analysis, where the ability of each person is held constant across time-points, general changes in item difficulties are seen as changes in item difficulty.
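
The two data layouts can be sketched in Python. This shows only how the response matrix is arranged, not the estimation; the person labels, responses, and function names are hypothetical.

```python
# responses[person][occasion] = list of item responses (hypothetical data)
responses = {
    "P1": {"pre": [0, 1, 0], "post": [1, 1, 0]},
    "P2": {"pre": [1, 0, 0], "post": [1, 1, 1]},
}

def stack(data):
    """One row per person-occasion; columns are the items."""
    return [(f"{person}-{time}", items)
            for person, times in data.items()
            for time, items in times.items()]

def rack(data, order=("pre", "post")):
    """One row per person; columns are the items at each occasion, side by side."""
    return [(person, [x for time in order for x in times[time]])
            for person, times in data.items()]

for row in stack(responses):
    print("stacked:", row)  # 4 rows x 3 item columns
for row in rack(responses):
    print("racked: ", row)  # 2 rows x 6 item columns
```

In the stacked layout each item has one difficulty and each person has two abilities; in the racked layout each person has one ability and each item has two difficulties.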

jblanden: Mike, It is the differential change in item difficulty between the two groups that I want to examine across time. So I am back on track with racking. Thank you for your guidance and patience. John

195. Common Item Equating for different Cohorts

CelesteSA April 4th, 2013, 2:11pm: Hi there,

I have two Science tests written by two different cohorts with most of the items being common in both tests with a few item differences between them (one group wrote in 2011 and the other in 2012).

I followed all the steps in the Winsteps manual to equate the tests based on the common items, but even after scaling test B on S.D. and Mean as per Step 3.a and removing outlier items, the best-fit slope is 0.58 which seems to be as good as I can get it.

Can I now compare the learners on the two tests or are their results still so different that I need to keep on eliminating items or should I assume the groups are so inherently different they cannot be compared?

Would appreciate your comments,

Kind Regards,


Mike.Linacre: Celeste, yes, this is a difficult situation.

Let's do a little more detective work. In the two separate (unequated) analyses, what are the means and S.D.s of the abilities of the two samples?

CelesteSA: Thanks for the reply Dr Linacre,

Based on the common items in the tests, below are the mean and Standard Deviation of the items which I used for the upscaling.

S.D. 1.33 1.16 1.146551724
MEAN 0.00 0.00 0.00

But from your comment it seems that the person abilities' S.D. and Mean should be used for the upscale. Is that correct, or was my using the item S.D. and Mean correct?

Mike.Linacre: Celeste, we need to investigate: Is this change in empirical performance due to a change in test discrimination or due to a change in the spread of person abilities?

In traditional equating, the spread of the person abilities is assumed to be the same (norm-referenced or equipercentile equating).

In Rasch, we usually consider the spread of the person abilities to be a finding, and prefer to assume the invariance of the latent variable (item difficulties).

Of course, neither assumption is exactly correct. If an assumption leads to dubious results, then we need to investigate it. You indicated that the "invariance of the latent variable" (equating item difficulties) assumption is producing dubious results. So we need to examine equating based on the "spread of the person abilities". Does this support or challenge the "invariance of the latent variable" assumption?

In your analysis, it appears that the ratio of the sample ability S.D.s is 1.15, but the ratio of item difficulty S.D.s (for the common items) is 1/0.58 = 1.72.

This suggests that it would be safer to equate the sample ability S.D.s, than the item difficulty S.D.s., unless we have other evidence that the "true" sample ability S.D.s have changed by 1.72/1.15 = 1.50. This is a 50% change in one year, which is an amazingly large change unless there has been considerable revision to the curriculum, change to the student population, etc.

CelesteSA: Hi Dr Linacre,

I have investigated the person abilities for the two tests, and found that:

Test A: Mean = -0.69 and S.D. = 0.55
Test B: Mean = -0.58 and S.D. = 0.58

When I place the two groups' person abilities on a scatter plot (sorted in terms of highest abilities for each group), I get an R2 of 0.913, which to me implies that the person abilities of the two groups are equivalent even though the items seem to function differently for the groups.

Would you then suggest I do person equating instead of item equating?

Mike.Linacre: Celeste, the similarity of the two person S.D.s certainly suggests that they are the same for the two person samples.

The challenge becomes the relationship between the cohort means. According to these numbers it appears that
either Cohort B is 0.11 logits more able (on average) than Cohort A
or the items have drifted so that Test B is 0.11 logits easier (on average) than Test A
or there has been a small change in cohort ability + a small change in Test easiness

There is no statistical way to decide between these alternatives. For example, if the teachers have been "teaching to the test", then a drift in item easiness is expected. If the teachers have been motivated by overall test results to improve their teaching, then an improvement in cohort ability is expected.

.11 logits is roughly equivalent to one month's gain in a typical educational setting: https://www.rasch.org/rmt/rmt62f.htm

CelesteSA: Thanks Dr Linacre, that does clarify the issue somewhat.

The cohort with the higher mean wrote the test later in the year, approximately 2 and a half months later. We had hoped that this fact would not make a difference but now it does seem that the extra months of school benefited the group that wrote later.

How can I adjust for this? Should I add 0.11 logits to the group that wrote the test earlier?

Mike.Linacre: Celeste, if the two cohorts are considered to be randomly equivalent, and they are to be equated, then a norm-referenced adjustment of 0.11 logits is indicated.

BTW, please double-check that 0.11 value, I may have misunderstood the numbers ...
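
The 0.11 value can be double-checked from the means Celeste reported, and the adjustment applied, with a short Python sketch. It assumes, as stated above, that the two cohorts are treated as randomly equivalent; the variable and function names and the sample measures are hypothetical.

```python
mean_a, sd_a = -0.69, 0.55  # Test A person abilities, as reported above
mean_b, sd_b = -0.58, 0.58  # Test B person abilities, as reported above

shift = round(mean_b - mean_a, 2)  # logits to add to Cohort A measures
sd_ratio = round(sd_b / sd_a, 2)   # close to 1 when the spreads agree

def adjust(measures, shift):
    """Norm-referenced adjustment: move every measure by the same amount."""
    return [round(m + shift, 2) for m in measures]

print(shift, sd_ratio)
print(adjust([-1.20, -0.69, 0.15], shift))  # hypothetical Cohort A measures
```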

196. confirmation of linking logic

harmony April 8th, 2013, 11:23am: Hi all:

Tests A, B, and C have been linked via common items in the framework of Test A. Then changes were made to Tests A, B, and C with new items being added and old items being removed, including the original linking items.

If the new test B is linked to the old test B via the measures of items that were not changed in the file that included the original linking items (though not the linking items themselves), is the new test still properly linked to test A? The logic is that since the measures of the unchanged items are reported in the context of test A, they are linked to that test and can thus be used to link the new test B to that test. Does this make sense?

Any insights will be appreciated.

Mike.Linacre: Harmony, it certainly looks like all the tests are linked. There is a network of shared (common) items that connects all the tests. It is not the network that the test developers originally intended, but it is there.

harmony: Thanks for your reply Mike. Out of curiosity, I'm going to do some virtual linking using items with similar content and objectives and see if there are any great differences between that and the current linking.

Gathercole: Hi harmony,

I was in charge of a project very similar to yours last year (linking 3 English proficiency tests in an English foundations program to a common vertical scale). Our design ended up working fairly well. If you're interested in comparing notes, my email is gathercole \at/ gmail.com.

197. Item weight question

Esther April 3rd, 2013, 10:18pm: That approach will work, Esther. All looks good.

If the skipped score-points in Item 21 are conceptual levels (unlikely), then the original scoring would be more accurate.

Esther: Hi Mike,

Thank you so much for your prompt reply. I have a couple of follow-up questions. The score for Item 21 is a domain score for an essay, and the original scores from the raters run from 2 to 8. The scores for this domain were doubled because this domain is weighted more, and that’s why we have the skipped score points for this item in the data file. The range of raw score points for the test is 6 to 44 (20+16+8). My follow-up questions are:
1. Is this a case of skipped score points at conceptual level? If not, could you give me an example of that?
2. Could you explain a little bit more about what you mean by “the original scoring would be more accurate”?

Mike.Linacre: Esther, your weighting is exactly correct.
Example of conceptual levels:

Item 1: 0=All Wrong, 1=Partially Correct, 2=Correct.

Item 2: 0=All Wrong, 2=Correct. There is no partially correct response. For this item, 1 is a conceptual level. We can imagine it, even though we cannot observe it.

Analyzing item 2 as 0,2 is more accurate than analyzing item 2 as 0,1 weighted 2. See https://www.rasch.org/rmt/rmt84p.htm

198. Progress over time

harmony April 2nd, 2013, 7:10am: Hi all:

I'm working on linking tests in a multi-level English language foundation program so that we can put student abilities on one scale. One question of interest that seems elusive in the literature is any research indicating expected language development over time in terms of logits of ability. Put simply, how much improvement -in terms of logits- is a student expected to make with about 4 months of instruction at 20 hours per week?

I remember reading somewhere -perhaps on the measurement transactions page- about a study that indicated about 1 logit of ability after a year of instruction, but I think this referred to mathematics.

Any thoughts or reference to research that has been done to answer such questions would be greatly appreciated.

Mike.Linacre: Yes, Harmony, Rasch Measurement Transactions is a good resource for this type of information:
https://www.rasch.org/rmt/rmt62f.htm - for grade equivalents and logits
https://www.rasch.org/rmt/rmt64a.htm - reading ability and age

199. Low reliability of separation but high chi square

windy March 29th, 2013, 8:52pm: Hi Rasch-stars,

The attached Facets output contains information about rater accuracy in writing assessment. The essay facet is scored dichotomous (accurate or not accurate) based on a match between operational and expert raters -- and measures for each essay on the logit scale describe the difficulty for a rater to be accurate on a particular essay. The raters are not centered because they are the object of measurement (in a sense, the raters are being "tested" on the essays).

My question is related to the output for the essay facet. The reliability of separation is very low (rel = 0.04), but the chi-square statistic is quite large (2671.2). This seems like a mismatch in information.

Any ideas about this finding?

- Stefanie

Mike.Linacre: Thank you for your question, Stefanie.

The chi-square is 2671.2 but its expectation = its d.f. = 2119, so the mean-square is 1.26 which is close to its expectation of 1.0. So the size of the misfit to the null hypothesis is small, but the power of the statistical test is so great that we can see clearly that 1.26 is different from 1.0. Consequently, the null hypothesis that "all persons are the same except for measurement error" is rejected.

Another way of looking at this is: Reliability is a function of (average measurement location, average measurement precision), but the chi-square is a function of (individual measurement location, individual measurement precision).
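The arithmetic in this reply can be checked with a few lines of Python. This is only a sketch: the chi-square and d.f. are the values quoted above, while the Wilson-Hilferty cube-root approximation used to turn the mean-square into a rough unit-normal deviate is an addition for illustration, not something taken from the Facets output.

```python
import math

chi_square = 2671.2  # fixed chi-square reported for the essay facet
df = 2119            # its degrees of freedom, which is also its expected value

# Mean-square: chi-square divided by its expectation (expected value 1.0).
mean_square = chi_square / df

# Wilson-Hilferty approximation (an assumption added here): converts a
# chi-square/df ratio into an approximate unit-normal deviate.
c = 2.0 / (9.0 * df)
z = (mean_square ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)

print(f"mean-square = {mean_square:.2f}")
print(f"approximate z = {z:.1f}")
```

A mean-square this close to 1.0 paired with a large z illustrates the point made above: the departure from the null hypothesis is small in size, but with more than 2000 degrees of freedom it is detected very clearly.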

windy: Thanks very much, Dr. Linacre!

200. DISCRIM is estimated but constrained across items

RaschModeler_2012 March 27th, 2013, 9:46pm: Hi Mike,

Please see the following equation

logit_ij = alpha * (theta_j - item_i)


alpha = the discrimination parameter constrained to be equal across all items, BUT estimated nonetheless

theta_j = person ability

item_i = item difficulty

Question 1: If the discrimination parameter, alpha, is constrained to be equal across all items but is actually being estimated in the model, is it still a Rasch model? If not, could you please explain why?

Question 2: Is there a mathematical way to convert the discrimination parameter that was estimated (e.g., alpha = 1.4) to 1.0, such that the item difficulties and person abilities can be interpreted as one would in a Rasch model?



Mike.Linacre: RM, is this Winsteps DISCRIM= ?

If so, the discrimination coefficient is not used in the estimation. It is computed as a fit statistic after the estimates have been made.

Please try DISCRIM=Yes and also DISCRIM=No. The Rasch measures do not change.

RaschModeler_2012: Hi Mike,

I'm actually talking about estimating this model in another software package that estimates DISCRIM. I am aware that Winsteps does not estimate DISCRIM in the model. Given that DISCRIM is actually being estimated, would you mind answering the questions I presented previously? Given your expertise, I'm curious what your thoughts are on this matter!

Thank you very much!


Mike.Linacre: OK, RM.

If the item discriminations are to be constrained to be equal, then discrimination is not a model parameter. The 2-PL model becomes a 1-PL Rasch model.

The constrained item discrimination coefficient can have any value. We usually set it to 1 for mathematical convenience and to report measures in logits. If it is set to 1.7, then the measures are in probits (approximately). If the discrimination is set to 1.4, and we want to convert the resulting estimates to logits, then multiply all the estimates by 1.4.

RaschModeler_2012: Thanks, Mike. Okay. One more question. Say the discrimination parameter is estimated to be 0.73 across all items. Then to convert the item difficulties into logits, would one simply multiply the item difficulties by 0.73? Also, since Winsteps mean centers the item difficulties, afterwards they should be mean centered. So, for comparable estimates, one would do the following:

1) .73*(each item difficulty)
2) calculate mean of estimates derived from step 1
3) subtract the mean derived from step 2 from each of the item difficulties derived from step 1
4) compare to item difficulties estimated from Winsteps




Mike.Linacre: RM, yes, that should do it. The computation would be the same whether the 0.73 was estimated (as a parameter) or imputed (as a coefficient).
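The four steps just confirmed can be sketched in Python. The function name and the difficulty values below are hypothetical, chosen only for illustration; the rescaling itself follows from logit_ij = alpha * (theta_j - item_i), which can be rewritten as (alpha * theta_j) - (alpha * item_i).

```python
def to_centered_logits(difficulties, alpha):
    """Rescale difficulties estimated under a common discrimination `alpha`
    to the alpha = 1 (logit) metric, then mean-center them."""
    scaled = [alpha * d for d in difficulties]  # step 1: multiply by alpha
    mean = sum(scaled) / len(scaled)            # step 2: mean of rescaled values
    return [s - mean for s in scaled]           # step 3: subtract that mean

# Hypothetical item difficulties from a package that estimated alpha = 0.73.
other_package = [-1.8, -0.4, 0.3, 2.5]
logits = to_centered_logits(other_package, alpha=0.73)

# Step 4 would be a cross-plot of `logits` against the Winsteps difficulties.
print([round(value, 3) for value in logits])
```

The same multiplier applies to the person abilities if they are to be compared too. Small differences from the Winsteps values should still be expected, because the estimation methods differ.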

Mike.Linacre: RM, the R module ltm uses Marginal Maximum Likelihood estimation (MMLE). Cross-plotting MMLE estimates and Winsteps JMLE estimates, we expect the relationship to be slightly curvilinear. The slope of the cross-plot should be close to 1 (when ltm reports in logits) or close to 1.7 (when ltm reports in probits).

For a discussion of different estimation methods, see my papers at:

Mike.Linacre: RM, "constant across items but free to vary" - this would produce estimates that are not "identified", so either
a) the R routine has its own constraint on discrimination
b) the R routine reports one discrimination value out of an infinite number of possible discrimination values.

RaschModeler_2012: Apologies. I misstated. When I said "constant across items but free to vary," I meant to say that R estimates a constant discrimination parameter across all items. That constant discrimination parameter need not be 1.0. In other words, it applies the equation that I showed above, where alpha is estimated (as opposed to being fixed at 1.0), but forced to be the same across all items. So, there is a single discrimination parameter estimate that applies to all items (e.g., 1.3)

Very sorry for my misstatement. At any rate, my point is that given the discrepancy between the estimation methods, the best I can do to make the results between R and WINSTEPS comparable would be to follow the steps I mentioned previously. From what I can tell, there is nothing else I can do to make the results comparable.

Would you agree?


Mike.Linacre: RM, alternative estimation methods for the same values always produce (slightly) different values. It is the same situation in physics as in psychometrics. Cross-plot the two sets of estimates to discover the conversion values.

It is the same situation in Winsteps if different analysts use different convergence criteria (LCONV=, RCONV=). The problem with estimates is that they are only estimates, never the true values :-(

RaschModeler_2012: Thanks for the reality check, Mike, Much Appreciated. I have a general question for you--related to this post. I know that the item difficulties are mean-centered, but are they standardized to have a SD of 1? I've seen other programs do this, and I'm trying to figure out a way to compare the results. I suppose I could standardize the WINSTEPS results to have a SD of 1.0 for the items, but then won't that change the scale? Not sure what to do...

As always, thank you for your patience and willingness to answer my questions, despite the painfully obvious answers...


Mike.Linacre: RM, in Winsteps the SD of the items is whatever it happens to be after each item is estimated. It is usual in IRT programs to constrain the S.D. of the person abilities to be 1.0 probits, and their mean to be 0.0 probit, implementing the assumption that the persons are a random sample from a unit-normal distribution.

An item constraint in many Rasch programs is that the difficulty of the first item is set at 0.0. This can produce distorted standard errors and fit statistics, so is not done in Winsteps. See "Reference Item" standard error in https://www.rasch.org/rmt/rmt92n.htm

RaschModeler_2012: There's this software called Xcalibre, which gives the following option when fitting a dichotomous logit "Rasch" model:

**Center the dichotomous item parameters on b (to have a mean of zero and sd of 1)

I found this to be very odd. Needless to say, the results are nowhere near the results I obtain from Winsteps. I have sent an email to Technical support for guidance.

But, if I'm understanding you correctly, you are in agreement with me that such practice is peculiar.



Mike.Linacre: RM, if the reported item SD is 1, then the item discrimination must be set at the logit item S.D.

So, multiply all the Xcalibre measures by the item discrimination to approximate the logit measures.

RaschModeler_2012: That's just it. Discrimination is fixed at 1.0. Is that impossible? If so, I apologize for wasting your time on this. Tech support will have to explain to me how it is possible for the item difficulties to be standardized to a mean of zero and sd of 1 and have discrim=1. How is this a Rasch model? There must be something I'm not understanding about the software...


Mike.Linacre: RM, yes, there must be something else going on .....

201. Stage probability (Revisited)

chong March 26th, 2013, 3:22am: Hi Mike,

I once analyzed the stage-like data using Rasch dichotomous model and I was told to use the following formula to compute the probability of a person of ability B being observed in a particular stage k overall:
P(X = k) = 1/(1 + exp(T(k-1)-B)) - 1/(1+ exp(T(k)-B))
for k = 1 to m

T(k) = the halfway point between the average difficulty of stage k items and the average difficulty of stage (k+1) items

and T(0) = -infinity; T(m) = +infinity

For the same set of data, I recently rescore some multiple-choice items into the partial credit ones. While I can compute the probability of a person belonging to each category for certain item directly from the PCM expression, I doubt if I could apply the above-mentioned formula and procedures to find, at the test level, the probability of person being in a particular stage. I ask this because I found the formula has usually been used (specifically?) for the graded response model, which is not a Rasch model, "even in the absence of a discrimination parameter" (Masters, 1982, pg. 155).



Mike.Linacre: Thank you for your question, Chong.

Let's start with the dichotomous stage data. You probably used a Binet (imputation) data design. For example, an item has 5 stages (k=0 to 4), so is expressed as 4 dichotomous items.

0000 = stage k=0
1000 = stage k=1
1100 = stage k=2
1110 = stage k=3
1111 = stage k=4

Then T(k) is the average of the difficulties of stage k across all the items.

Is this correct?
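The coding pattern listed in this reply, where stage k becomes k ones followed by zeros, can be written as a small helper. This is only a sketch of that data layout; the function name and its `n_steps` parameter are made up for illustration.

```python
def stage_to_dichotomies(stage, n_steps=4):
    """Express an observed stage as a string of imputed dichotomies:
    `stage` ones followed by zeros, one character per stage boundary."""
    if not 0 <= stage <= n_steps:
        raise ValueError("stage must lie between 0 and n_steps")
    return "1" * stage + "0" * (n_steps - stage)

for stage in range(5):
    print(f"stage k={stage}: {stage_to_dichotomies(stage)}")
```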

chong: My apologies, Mike, I should have further described what the instrument looks like.

The test originally consists of 25 multiple-choice items to be scored dichotomously each. There are a cluster of 5 items at each stage k, where k = 1 to 5. So T(k) has previously been conceptualized as the midpoint between average difficulties of stage k items and average difficulties of stage (k+1) items, as shown in the attachment below.

Am I right so far? Please correct me if I'm wrong somewhere.

Suppose some distractors of several items are 'more correct' than the others and hence partial credit should be given. After rescoring, the same test consists of a mixture of dichotomous and partial credit items. Now my question is whether the formula P(X = k) = 1/(1 + exp(T(k-1)-B)) - 1/(1+ exp(T(k)-B)) and the relevant computations are still applicable if the partial credit instead of dichotomous model is concerned?



Mike.Linacre: Thank you for your response, Chong.

Before thinking about the partial-credit items, I am trying to understand the dichotomous items:

a) Think about a "Stage 2" person. Conceptually, what is that person's expected score on the 5 Stage 1 items? on the 5 stage 2 items? on the 5 stage 3 items?

b) There are 25 dichotomous items. Suppose a person scores 2 on subtest 1 (Stage 1) , scores 3 on subtest 2 (Stage 2) and scores 2 on subtest 3 (stage 3), and 0 on subtests 4 and 5. At what Stage is the person?

chong: Thanks for your quick reply, Mike.

"Suppose someone scores 2 on subtest 1 (Stage 1) , scores 3 on subtest 2 (Stage 2) and scores 2 on subtest 3 (stage 3), and 0 on subtests 4 and 5."

In this case, the subtest scores for this person are (2, 3, 2, 0, 0). In theory, we hope to see more Guttman-like responses across the subtests, e.g., (3,2,2,0,0) or at least (3,3,1,0,0), in contrast to (2, 3, 2, 0, 0) or even worse, (2, 2, 3, 0, 0). In the strictest or ideal case, the subtest scores of 7 (2 + 3 + 2) correspond to (5, 2, 0, 0, 0), which may be empirically exhibited by only a small portion of respondents. But I understand the problem is that both (2,3,2,0,0) and (5,2,0,0,0) have the same Rasch measure, and hence the same probability of being classified at each stage k = 1 to 5 (according to the formula P(X = k) = 1/(1 + exp(T(k-1)-B)) - 1/(1+ exp(T(k)-B))), although fit statistics might help differentiate the quality of both responses.

Suppose I analyze with only the first four subtests and there are two persons (m & n) who have the same ability B = -0.86 but different subtest scores, (4,2,1,0) and (2,3,2,0). Given

T0 = -infinity
T1 = -1.4
T2 = -0.1
T3 = 1.4
T4 = +infinity

Using the formula, I get their probability of being at each of the four stages:
P(x = 1) = 0.37
P(x = 2) = 0.31
P(x = 3) = 0.22
P(x = 4) = 0.09
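These four probabilities can be reproduced from the formula P(X = k) = 1/(1 + exp(T(k-1)-B)) - 1/(1 + exp(T(k)-B)) with a short Python sketch. The function name is hypothetical; the ability and the boundaries T1 to T3 are the values given above.

```python
import math

def stage_probabilities(ability, boundaries):
    """Probability of being observed in each stage k = 1..m.

    `boundaries` holds the finite values T(1)..T(m-1);
    T(0) = -infinity and T(m) = +infinity are added here.
    """
    bounds = [-math.inf] + list(boundaries) + [math.inf]

    def above(t):
        # Probability that a person of this ability is above boundary t.
        return 1.0 / (1.0 + math.exp(t - ability))

    return [above(bounds[k - 1]) - above(bounds[k]) for k in range(1, len(bounds))]

probs = stage_probabilities(-0.86, [-1.4, -0.1, 1.4])
for k, p in enumerate(probs, start=1):
    print(f"P(x = {k}) = {p:.2f}")
```

The probabilities sum to 1 by construction, because each boundary term cancels with its neighbour.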

Thus, the probabilities seem to fit person m relatively better than person n. But what about a person with subtest scores (5,2,0,0)? Apparently, while we may infer that any persons of ability B = -0.86 (in this test) are expected to have a relatively better chance being at stage 1 (e.g., person m) than other stages, they could have made the transition to the higher stages in different ways (person n in particular).

Surprisingly, without computing the probabilities, I could have arrived at a different answer, i.e., stage 2, at face value because

T1 = -1.4 < -0.86 < -0.1 = T2, even though -0.86 is slightly closer to T1.

If cases of the kind (2,3,2,0,0) abound, the resulting item hierarchy may no longer appear as shown in the figure above, i.e., the stages might to some extent overlap. [Construct validity is in danger!]

Does this line of thought sound right to you?

Your help would be much appreciated!


chong: Mike, I need to go back to your questions. I did not see part (a) before posting my last response.

"Think about a "Stage 2" person. Conceptually, what is that person's expected score on the 5 Stage 1 items? on the 5 stage 2 items? on the 5 stage 3 items?"

'Conceptually' means theoretically?

In fact, the old theory suggests different grading criteria to deal with this: a person is deterministically assigned (a single) stage 2 if, out of 5 items per subtest:
(a) 3 (or 4) are answered correctly for each of both subtests 1 & 2, and
(b) less than 3 (or 4) are answered correctly for each of subtests 3, 4 & 5.

Thus, (5,5,0,0,0), (5,4,2,1,1), (4,5,1,2,0) and even (4,4,2,2,2) are a few examples of those assigned stage 2. But it would be more unclear and less effective for borderline cases, e.g., (5,3,3,2,0).
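The deterministic rule described here is stated only for stage 2. One possible reading, generalized to any stage, is sketched below in Python; the generalization, the function name, and the treatment of patterns that fit no stage are assumptions, not part of the old theory as quoted.

```python
def assign_stage(subtest_scores, cutoff=3):
    """Return stage k when subtests 1..k all reach `cutoff` and every later
    subtest falls below it. Return None when the pattern fits no stage.
    A result of 0 means that not even subtest 1 reached the cutoff."""
    passed = [score >= cutoff for score in subtest_scores]
    stage = 0
    while stage < len(passed) and passed[stage]:
        stage += 1
    if any(passed[stage:]):
        return None  # a later subtest was passed after an earlier one failed
    return stage

examples = [(5, 5, 0, 0, 0), (5, 4, 2, 1, 1), (4, 5, 1, 2, 0), (4, 4, 2, 2, 2)]
for scores in examples:
    print(scores, assign_stage(scores, cutoff=3), assign_stage(scores, cutoff=4))

# The borderline case depends on which cutoff (3 or 4) is applied.
print((5, 3, 3, 2, 0), assign_stage((5, 3, 3, 2, 0), 3), assign_stage((5, 3, 3, 2, 0), 4))
```

Under this reading the four listed patterns are stage 2 with either cutoff, while (5,3,3,2,0) moves between stages as the cutoff changes, which is one way to see why such cases are called borderline.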

For recent studies, I abandon the criteria on the basis of the revised theory and use Rasch analysis instead.

Mike.Linacre: Thank you for the details, Chong.

It is clear that
1) empirically, assigning students to stages is approximate
2) mathematically, computing the probability that a student of ability B is in stage k is also approximate

Accordingly, we can adjust the dichotomous formula for partial credit items while maintaining about the same overall level of approximation.

1) for dichotomous items at stage k, the item difficulty (measure) corresponds to a score of 0.5 on the item
2) for each partial credit item, compute the measures corresponding to a score 0.5, 1.5, 2.5, ... on the item
3) for each partial credit item, decide what stage corresponds to a score on the item of 0.5, then what stage corresponds to a score of 1.5, then what stage corresponds to a score of 2.5, ....
4) average all stage k measures = average (stage k dichotomies + partial-credit measures at stage k)
5) apply the dichotomous formula

Is this what you want, Chong?

chong: Yes, Mike, this coincides with what I first had in mind, except that I conceptualized in terms of Thurstone thresholds in place of expected score.

I think I'm almost done, there are only several things I need to confirm before progressing further:

(a) By "dichotomous formula", you mean P(X = k) = 1/(1 + exp(T(k-1)-B)) - 1/(1+ exp(T(k)-B))?

"for each partial credit item, compute the measures corresponding to a score 0.5, 1.5, 2.5, ... on the item"

(b) Do these measures refer to the "Rasch-half-point thresholds" ?

For illustration, I use one of my partial credit items, which assess only the first 3 stages:

-infinity to -3.12 --> score 0 region --> pre-stage 1
-3.12 to -1.02 --> score 1 region --> stage 1
-1.02 to 1.15 --> score 2 region --> stage 2
1.15 to +infinity --> score 3 region --> stage 3

Thus, (c) the "partial-credit measure" at stage 1 for this item is -3.12 ?

Many thanks,


Mike.Linacre: Chong, yes. The Rasch-Thurstone thresholds or the Rasch Half-Point thresholds. The choice depends on how you think about your partial-credit items. The difference is usually small.

chong: Thank you so much, Mike, for your priceless advice.

There is one more thing perplexing me so far: I found in the literature that people interpret the Rasch-Thurstone thresholds in a way differing from what I currently understand. For example, in Wilson's (2005) Constructing Measures: An Item Response Modeling Approach, he writes (pg. 106):

"The kth Thurstone threshold can be interpreted as the point at which the probability of the scores below k is equal to the probability of the scores k and above (and that probability is .5)"

and yet later,
"... the lowest intersection is the point at which Levels 1, 2, 3, and 4 together become more likely than Level 0..."

In another published document, Wilson (1982) also writes (pg. 225):

"The Thurstone thresholds can be interpreted as the crest of wave of predominance of successive dichotomous segments of the set of levels. For example, tau1 is the estimated point at which levels 1, 2, and 3 become more likely than level 0, tau2 is the estimated point at which levels 0 and 1 become more likely than levels 2 and 3, tau3 is the estimated point at which levels 0, 1, and 2 become more likely than level 3" (see attachment below).

My rigid understanding of Rasch-Thurstone threshold (T) is that it serves like the pass/fail point like the dichotomous item difficulty. For example, T1 is the point where the chance of scoring '0' = 0.5 = the chance of scoring '1' or above. Only beyond this point is a person more likely to score 1 or above than scoring '0'. Is this right?

Is there something I don't know or misunderstand?

Mike.Linacre: Thank you for your questions, Chong.

The two descriptions of R-T thresholds agree. One description considers the R-T threshold as a point on the latent variable (probability .5). The other description considers the R-T threshold as the transition point in a process of changing probabilities (from "less likely" to "more likely").

For the relationship between R-T thresholds and dichotomies, see https://www.rasch.org/rmt/rmt233e.htm
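Both views can be seen numerically. The Python sketch below finds, for a polytomous Rasch item, the measure at which the probability of scoring k or above equals .5; below that point "k or above" is less likely than "below k", and above it more likely. The function names and the Andrich threshold values are hypothetical, and bisection is simply a convenient search method, not a description of how any particular program computes these thresholds.

```python
import math

def category_probabilities(theta, andrich_thresholds, difficulty=0.0):
    """Rasch polytomous probabilities for scores 0..m at measure `theta`."""
    logits = [0.0]
    for f in andrich_thresholds:
        logits.append(logits[-1] + (theta - difficulty - f))
    biggest = max(logits)  # subtracted for numerical stability
    weights = [math.exp(v - biggest) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def thurstone_threshold(k, andrich_thresholds, difficulty=0.0):
    """Measure at which P(score >= k) = 0.5, located by bisection.
    P(score >= k) increases with theta, so the crossing point is unique."""
    low, high = -30.0, 30.0
    for _ in range(100):
        middle = (low + high) / 2
        p_k_or_above = sum(category_probabilities(middle, andrich_thresholds, difficulty)[k:])
        if p_k_or_above < 0.5:
            low = middle
        else:
            high = middle
    return (low + high) / 2

# Hypothetical 3-category item with Andrich thresholds -1 and +1.
thresholds = [-1.0, 1.0]
t1 = thurstone_threshold(1, thresholds)
t2 = thurstone_threshold(2, thresholds)
print(round(t1, 2), round(t2, 2))
```

For a dichotomous item there is only one threshold, and it coincides with the item difficulty, which matches the pass/fail reading described in the question.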

chong: Mike, I think I've seen the point of having both static & dynamic views of R-T thresholds.

Again, thank you so much!

203. Person-item map and partial credit model

kstrout1 March 27th, 2013, 8:27pm: Hello!

I attached the person-item map related to my question. I completed a Rasch analysis using the partial credit model. On my person-item map, I know that persons are located on the left and items are located on the right. The "most difficult" items are at the top of the map, and the "easiest" items are at the bottom of the map. What I don't understand is the location of the people. How can people be "above" or "below" items on this scale? The items are questions related to spiritual wellness. If the person had to provide a response, how are they "above" the response, or "below" the response?

Thank you!!

Mike.Linacre: Thank you for your question, Kstrout1.

Since these are partial-credit items, please choose an item map that better matches the item format.

See, for instance, Winsteps Table 12.6 https://www.winsteps.com/winman/index.htm?table12_6.htm
or Winsteps Table 2.2 https://www.winsteps.com/winman/index.htm?table2_2.htm

This shows the region on the latent variable roughly corresponding to each category of the partial-credit scale of each item.

204. unidimensionality with dichotomous data

ybae3 March 25th, 2013, 3:13pm: Hello,

I have experience with the Rasch rating scale model but none with the Rasch dichotomous model. I am currently piloting a screening tool to see whether it is valid for young Korean children. The tool was originally developed for young American children.

The tool consists of 16 items; the parents of 46 children answered the 16 items.

After running unidimensionality analysis, variance explained by measures=19.1% and unexplained variance(total)=80.9%
I deleted one item (outfit MNSQ=2.09) but variance explained by measures is still far from 60%. I attached the output related to this.

Could you let me know how I can solve this problem? Also, could you give me any advice how to learn Rasch dichotomous analysis including procedures? I am doing self-study by looking for books, websites, etc. But I want to know that I am on the right track.

I hope my questions make sense :)

Thank you,


Mike.Linacre: Thank you for your question, Youlmi.

Are you sure you have a problem? Please see https://www.rasch.org/rmt/rmt221j.htm
Look at the line corresponding to the S.D.s for your persons and items.

What is the eigenvalue of the first contrast? Is this greater than 2.0? If not, then there is no strong secondary dimension. https://www.rasch.org/rmt/rmt191h.htm

ybae3: Thank you for your quick response.

I think I have a problem because variance explained by measures is only 19.1%. The variance should be over 60% to be unidimensional measure. Am I right? I chose Diagnosis option in Winsteps and then chose Unidimensionality map to see if the variance is over 60%.
You told me to look at the line corresponding to the S.D.s for my persons and items. I tried to find them by choosing everything under the Diagnosis option, but I couldn't see the S.D.s for my persons and items. Could you help me understand how to analyze Rasch dichotomous data? Are there steps like the step-by-step Rasch analysis for rating scale data?

The eigenvalue of the first contrast is 2.8. Can I say this is greater than 2.0?

Thank you for your advice!


Mike.Linacre: Youlmi, what is the S.D. of your person ability measures and of your item difficulty measures?

You write: "The eigenvalue of the first contrast is 2.8. Can I say this is greater than 2.0?"

Reply: Yes, so please look at the plot in Winsteps Table 23.2.
Do you see 3 or 4 items separated away from the other items at the top or bottom of the plot? (items A,B,C or items a,b,c).
If yes, these items may be on a different dimension. Please look at the content of the items. Do they differ from the other items?
If no, then there may be a pervasive second dimension, such as readability of math items on a math test. Some math items require better reading skills than other items, but there are no "reading" items included in the math test.

ybae3: Thank you for your answers!

Your question: what is the S.D. of your person ability measures and of your item difficulty measures?

Reply: * In the case of Non-Extreme,
the S.D. of my person ability measures= .60;
the S.D. of my item difficulty measures= .96
* In the case of Extreme & Non-Extreme,
the S.D. of my person ability measures= 1.01;
the S.D. of my item difficulty measures= 1.11

By following your advice, I can see 4 items (A,B,C, &D) separated away from the other items at the top of the plot. However, I think that the 4 items do not differ from the other items. All items indicate a child's depressive symptoms including externalizing and internalizing behaviors. I attached the file for showing these items.

I appreciate your advice!



Mike.Linacre: Thank you for these details, Youlmi.

Look at https://www.rasch.org/rmt/rmt221j.htm with your non-extreme S.D.s. (Extreme scores do not participate in variance computations, because they have no unexplained variance and no dimensionality.) According to the nomogram, we expect the "variance explained by the Rasch measures" in your data to be about 20%. In your output, the variance explained by the Rasch measures is 19.1%. Almost the same. That is correct.

If you want the Rasch measures to explain more variance, then you need wider ranges of items difficulties and person abilities.

Yes, in your PCA plot, items ABCD are noticeably separate from the other items. The content of those items suggests the split is "Physical" and "Mental". This matches a similar split noticed in analysis of the MMPI-2 depression items ("Finding Two Dimensions in MMPI-2 Depression", Chih-Hung Chang, https://www.rasch.org/rmt/rmt94k.htm). But, you are correct, these are more like two strands than two dimensions.

ybae3: Thank you for your quick response!

So...I need to add more items to this screening tool and use larger sample size to investigate if the tool is a valid measure for young Korean children, am I right?
If yes, can I keep two dimensions, externalizing(physical) and internalizing(mental) behaviors, in a single measure?

Thank you for your advice! It's great helps :)


Mike.Linacre: Youlmi, your findings are not likely to change with more items and more children.

Add more children with very high depression and with very low depression to demonstrate that your instrument is valid over a wide range of depression.

You do not need more items, but your items are dichotomies (Yes/No). What about giving your items rating scales: (Never, Sometimes, Often, Always)?

ybae3: Thank you for your advice!

Then, I will do another pilot with a 4-point rating scale and will let you know how the screening tool works for young Korean children.

Again, I appreciate your advice :)



205. Calculate Rasch person Score

uvelayudam March 25th, 2013, 2:20pm: Hi All,

I'm new to Rasch measurement. I have been asked to calculate a Rasch personal score for the item, structure file generated by WINSTEPS using JMLE in psychometric package as follows:

Structure File:
";CATEGORY","Rasch-Andrich threshold MEASURE"

Item File:
" ",1,.56,1,37.0,88.0,.42,.88,-.55,.96,-.04,.00,.84,1.00,83.3,73.3,1.15,"1","R"

I would like to know how I can pass these parameters to the JMLE to compute personal score. Could someone help?


Mike.Linacre: Thank you for your participation, Udaya.

If you have Winsteps, then
SAFILE= your structure file
IAFILE= your item file

Here is the algebra: https://www.rasch.org/rmt/rmt122q.htm

You could also start from the Excel spreadsheet: https://www.rasch.org/poly.xls - remove the part where the item difficulties and thresholds are re-estimated. Instead use your fixed values.

uvelayudam: Thanks Mike! This is really helpful.

As per my study, most of the rasch softwares like WinSteps, jMetrik, are using http://java.net/projects/psychometrics java implementation. So, I'm looking at http://java.net/projects/psychometrics implementation of JMLE. I'm wondering how can I pass these SAFILE/IAFILE parameters to this JMLE.java and get the calculated rasch personal score?

Any suggestions would greatly help.


Mike.Linacre: Udaya, that java.net code is much, much more than you need. For your purposes, you do not need JMLE, because the items are already estimated. Simple MLE is enough. Modifying the java.net code to use anchor (preset) values will be much more work than coding https://www.rasch.org/rmt/rmt122q.htm into Java or Javascript. The Java or Javascript code will be similar to the BASIC code shown on that webpage.

uvelayudam: Thanks Mike! That is wonderful...!! I will work on writing that piece of code and post back for any queries.

A question -

I'm just kind of curious to know how jMetrik uses the JMLE for estimation. When should I go for psychometrics JMLE implementation? Do you have any insight on these? Since, I'm new to this area, this would be a good learning experience for me.

Thanks again,

Mike.Linacre: Udaya, all JMLE estimation is based on Wright & Panchapakesan (1967) - http://epm.sagepub.com/content/29/1/23.extract - further elaborated in Best Test Design - www.rasch.org/btd.htm and Rating Scale Analysis - www.rasch.org/rsa.htm

JMLE means "Joint Maximum Likelihood Estimation". It is designed to estimate person abilities and item difficulties (and Andrich thresholds for polytomies) simultaneously. It is not required when the item difficulties and Andrich thresholds are already known. Then we can use the simpler MLE "Maximum Likelihood Estimation".
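A minimal sketch of that simpler MLE in Python, for one person, with the item difficulties and Andrich thresholds held fixed. It is not the Winsteps code, nor a translation of the BASIC program at rmt122q.htm; it is a plain Newton-Raphson iteration written under these assumptions: every item shares one set of Andrich thresholds (rating-scale model), the person answered every item, and all function and variable names are made up for illustration.

```python
import math

def expected_and_variance(ability, difficulty, thresholds):
    """Expected score and model variance for one item, given fixed
    item difficulty and Andrich thresholds F_1..F_m (all in logits)."""
    logits = [0.0]
    for f in thresholds:
        logits.append(logits[-1] + (ability - difficulty - f))
    biggest = max(logits)  # subtracted for numerical stability
    weights = [math.exp(v - biggest) for v in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    expected = sum(x * p for x, p in enumerate(probs))
    variance = sum(x * x * p for x, p in enumerate(probs)) - expected ** 2
    return expected, variance

def estimate_ability(responses, difficulties, thresholds, tol=1e-6, max_iter=100):
    """Newton-Raphson maximum-likelihood estimate of one person's measure."""
    raw_score = sum(responses)
    max_score = len(thresholds) * len(responses)
    if raw_score <= 0 or raw_score >= max_score:
        raise ValueError("extreme score: no finite maximum-likelihood estimate")
    ability = 0.0
    variance = 1.0
    for _iteration in range(max_iter):
        pairs = [expected_and_variance(ability, d, thresholds) for d in difficulties]
        expected = sum(e for e, _v in pairs)
        variance = sum(v for _e, v in pairs)
        step = (raw_score - expected) / variance
        step = max(-1.0, min(1.0, step))  # damp large jumps
        ability += step
        if abs(step) < tol:
            break
    # Standard error from the model variance at the last evaluated ability.
    return ability, 1.0 / math.sqrt(variance)

# Hypothetical anchored values: three dichotomous items (one threshold at 0).
difficulties = [-1.0, 0.0, 1.0]
measure, standard_error = estimate_ability([1, 1, 0], difficulties, [0.0])
print(round(measure, 2), round(standard_error, 2))
```

In practice the difficulties and thresholds would be read from the item and structure files rather than typed in. Persons with zero or perfect scores need separate handling, because their maximum-likelihood estimates are infinite.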

uvelayudam: Thanks for the explanation! This is pretty clear for me now. I have another question on the algorithm you referred at https://www.rasch.org/rmt/rmt122q.htm. I came to know that average scores would be the default when Rasch score is not provided. Is this true? What is average score in this scenario?

Mike.Linacre: Udaya, https://www.rasch.org/rmt/rmt122q.htm must have Rasch item difficulties in logits and also the Rasch "step" difficulties (Andrich thresholds) in logits.

206. Cronbach's alpha vs. Rasch reliability indices

miet1602 March 9th, 2013, 11:04pm: Hi,

Any ideas if there might be a reasonable explanation (apart from errors in analysis) for a big difference between Cronbach's alpha coefficient (.65) calculated in SPSS and Item reliability and separation from a Facets analysis (Separation 6.15, Reliability .97) for the same data set (34 items, 30 persons, 6 markers, fully crossed)?


Mike.Linacre: Milja, Cronbach Alpha is a person-sample reliability. It is not an item-sample reliability. We expect the Cronbach Alpha "Test" (for this person sample on these items) Reliability to be slightly larger than the equivalent Rasch person-sample Reliability, see https://www.rasch.org/rmt/rmt113l.htm

CTT does not report an item Reliability, but it could. Organize the data so that the rows are the items, and everything else is the columns. Then compute Cronbach Alpha.

miet1602: Thanks, Mike.

The Rasch person-sample reliability for those data came out as Separation 4.32, Reliability .95, which is again much higher than the corresponding Cronbach Alpha of .65... Any ideas why this might be, as the expectation would be the opposite?

We did find that the raters are more or less behaving like rating machines though, with Exact agreements: 10588 = 69.2% Expected: 5795.8 = 37.9%.

Therefore, I calculated the adjusted reliability for persons based on the formula you gave in one of the reliability articles: Lower bound to Separation Reliability = R / ( R + N(1-R) ), and Rasch Person reliability came out .76, closer to Cronbach Alpha, but still higher...

Is this the right approach? What would be an explanation for the difference in person reliability according to these two indices going in the opposite direction from that expected?

Thanks again,

miet1602: Hi Mike,
Wasn't sure whether you had a chance to see my previous post, or whether it kind of slipped your notice. It would be a great help if you could take a look and just let me know your view.
Really appreciate it!

Mike.Linacre: Milja, sorry I did not reply.

The Spearman-Brown Prophecy formula may give us the answer:

Rasch Reliability with 6 markers = .95
Rasch Reliability with 1 marker = .95 / ( 6 + (1-6)*.95 ) = .76

Rasch reliability with 34 items = .95
Rasch Reliability with 1 item = .95 / ( 34 + (1-34)*.95 ) = .36

So, this suggests that SPSS Cronbach Alpha (which requires a rectangular, two-facet dataset) is organizing the data so that
rows = one person record for each marker = 6 records for each person
columns = items
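
The Spearman-Brown arithmetic above can be checked with a short Python sketch. The function name is hypothetical; the formula is the standard Spearman-Brown prophecy formula, which reduces to R / ( N + (1-N)*R ) when a test of length N is cut to length 1.

```python
def spearman_brown(reliability, current_length, target_length):
    """Predicted reliability when a test of current_length is changed to target_length."""
    ratio = target_length / current_length
    return ratio * reliability / (1.0 + (ratio - 1.0) * reliability)

one_marker = spearman_brown(0.95, 6, 1)    # 6 markers reduced to 1
one_item = spearman_brown(0.95, 34, 1)     # 34 items reduced to 1
print(round(one_marker, 2), round(one_item, 2))   # prints: 0.76 0.36
```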

Please email me the SPSS dataset, so that we can verify the Cronbach Alpha value.

miet1602: Thanks for your reply, Mike.

I attach the data that my colleague (who has done the Cronbach alpha analysis) used, but it is the original data rather than the SPSS file (which he did not save...). Is this helpful? That is the same data file I have used for the Facets analysis.

Thanks in advance for any insight!

miet1602: Seems that file I attached does not open properly, so just trying again...

miet1602: and again...

miet1602: Both of these open in a weird format in my browser, but maybe that won't be the case for you... Please let me know if you managed to open either of these... Not sure which other format might work...

Mike.Linacre: Milja: the files are correct. Please tell your browser that they are Excel files. Your browser does not know :-/

The Cronbach Alpha for your Excel file (all columns except Marker and Candidate) is 0.64.
As we suspected, this is the Cronbach Alpha for the rectangular dataset:
rows = one person record for each marker = 6 records for each person
columns = items

This is different from the cubic dataset in Facets which is:
rows = person
columns = items
3rd dimension (slices) = raters

The cubic dataset will reduce the error variance of the candidate scores to about 1/6 of that in the rectangular dataset, so we would expect Cronbach Alpha in Facets to be considerably higher than in Excel/SPSS.

miet1602: Thanks, Mike. Good to know that I did not do something wrong in the Facets analysis.

However, I am not sure how one is to decide which reliability coefficient to consider to be the 'right' one. I know that reliability is not necessarily an indicator of how good the test is, how well the items function etc., but low reliability is not generally considered to be a good thing.
So, in this particular case, if Cronbach alpha is .64, and the Facets person one is .95, should I report the lower bound of Facets person reliability (.76) to test designers as a sort of equivalent of alpha?
Or, should I do a two-facet analysis (items-persons) and report that reliability?

Mike.Linacre: Milja, here is a more accurate Cronbach Alpha with SPSS:

Rows = Persons (30)
Columns = Items + Raters (6 x 34 = 204)

Cronbach Alpha = .93
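
To make the two data layouts concrete, here is a minimal Python sketch. It uses simulated ratings as a stand-in for the real file, so the numbers it produces are not the .64 and .93 reported in this thread. The array names, the random seed and the noise level are assumptions for illustration only.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a 2-D array: rows are cases, columns are 'items'."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances / total_variance)

# Hypothetical stand-in for the real ratings: persons x raters x items.
rng = np.random.default_rng(0)
n_persons, n_raters, n_items = 30, 6, 34
ability = rng.normal(0.0, 1.0, size=(n_persons, 1, 1))
ratings = ability + rng.normal(0.0, 4.0, size=(n_persons, n_raters, n_items))

# Rows = one record per person per marker, columns = items (180 x 34).
long_layout = ratings.reshape(n_persons * n_raters, n_items)
# Rows = persons, columns = items x raters (30 x 204).
wide_layout = ratings.reshape(n_persons, n_raters * n_items)

alpha_long = cronbach_alpha(long_layout)
alpha_wide = cronbach_alpha(wide_layout)
```

With six times as many observations in each row, the wide layout gives the higher alpha, which is the direction of the difference discussed here.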

miet1602: Mike, thanks a lot for this. Of course, I have a follow up question - or several questions:
1. How can the two Cronbach alphas be so wildly different that according to one we have an indication that a test is very reliable, and according to the other, reliability is borderline acceptable?
2. We have done some g-theory analyses on the same data, exploring various methods for determining reliability, and the coefficients were as follows:
Phi = 0.51
Phi (lambda) = 0.94
Again - these two between them are wildly different, and Phi is obviously much lower than the Rasch person reliability (.95) or your correct version of Cronbach alpha (.93)... We know from some previous research that Phi tends to underestimate reliability, but it is difficult to decide in this case which one of these to go by if we want to estimate how reliable the test is...

Thanks yet again for all your help.

Mike.Linacre: Milja, you asked:

1. "How can the two Cronbach Alphas be so wildly different"?

Answer: the Cronbach Alpha reliability computation summarizes two statistics:

a) the variance of the person distribution.
This is probably about the same for "Persons" (30 rows) and "Persons + Raters" (30x6 = 180 rows).

b) the precision of measurement of each row. The more observations of a row, the higher the precision.
The precision of the two data designs is very different. When the columns are "Items + Raters", there are 34*6 = 204 observations of each row. When the columns are "Items", there are 34 observations of each row.

2. "G-Theory."

Answer: Reliability ("Generalizability") depends on how the variances are summarized. Please summarize the variances according to ("Persons" vs. "Items + Raters") and then according to ("Persons + Raters" vs. "Items"). There should be similar changes in Generalizability.

207. ICC: how to interpret a different shape?

avandewater March 20th, 2013, 1:15am: Hi Mike,

I'm looking in detail at my data and have plotted the "multiple Model ICCs" (see attached). One of the items, however, has a different ICC than all other 3-category items.
In the Category Probability Curve (also in the attachment) it can be seen that the second category ("1") has a large ability range in which it is the most probable.

How should I interpret the ICC of this item? Does it look like there are actually two related but dichotomous items in this one item? Or is this actually quite normal to see, just showing that it gives a lot of statistical information at two ranges of ability (between 0 and 1, and between 1 and 2) or so?

The fit statistics of this item are showing some overfit. (INFIT .73 -1.7|OUTFIT .62 -1.7)

Thanks again!

Mike.Linacre: Yes, Sander, this item has a wider middle category than the other items. We would need to look at the substantive definitions of the categories for all the items in order to decide whether this is good (the best item?) or bad (the worst item?).

In general, we like wide middle categories, because then the item gives us useful measurement information over a wide range.
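
A small Python sketch of what a "wide middle category" means numerically, assuming a three-category partial-credit item with ordered thresholds. The threshold values and function names are hypothetical; the sketch only scans a grid of measures and counts where the middle category is the most probable one.

```python
import math

def category_probabilities(measure, thresholds):
    """Probabilities of categories 0..m at a given (person - item) measure, in logits."""
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + measure - tau)
    top = max(logits)
    expd = [math.exp(v - top) for v in logits]
    total = sum(expd)
    return [e / total for e in expd]

def modal_width(thresholds, category=1, lo=-6.0, hi=6.0, step=0.01):
    """Approximate width (logits) of the range where `category` is the most probable."""
    count = 0
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        probs = category_probabilities(lo + i * step, thresholds)
        if probs.index(max(probs)) == category:
            count += 1
    return count * step

narrow = modal_width([-0.5, 0.5])   # hypothetical item with close thresholds
wide = modal_width([-2.0, 2.0])     # hypothetical item with widely spaced thresholds
```

Under these assumptions the middle category is modal over roughly the distance between the two thresholds, so thresholds that are further apart give the wider middle category.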

208. Unidimensionality: t-tests, binomial calculator...

avandewater March 19th, 2013, 4:09am: Hi all,

Mike, thank you for your previous replies to my posts - they are clear and very helpful!

Being further along in my analysis, and having read quite a bit, I have come to the 'issue' of checking for unidimensionality. I followed the methods covered in the "CORE TOPICS" course, and there might be a (however small) second dimension in my outcome measure (Eigenvalue 2.4 initially, 1.9 after item reduction).
I have read quite a bit about checking for uni/multidimensionality (e.g. Smith EV, 2002) and there are more ways to go about this.

1. To what extent should we rely on "Eigenvalue >=2.0" as a guideline? Or should we perform a Monte Carlo simulation (resulting in Eigenvalue 1.7) instead?

2. Also, RUMM has incorporated a "series of independent T-tests"; then some percentage is checked with a Binomial Proportions Calculator (http://stattrek.com/online-calculator/binomial.aspx) (as in https://www.rasch.org/rmt/rmt201c.htm)... I'm not sure how to perform these steps when using WINSTEPS (or what data/outputs to use when willing to calculate by hand for example).

I would appreciate any comments on this!


Mike.Linacre: Thank you for your questions, Sander.

Here is the fundamental question. What will you do if you discover that a test is multidimensional?

Your planned action determines what criteria and what methods are used to decide on multidimensionality.

For instance,
a) if you plan to omit off-dimensional items from the test, then the investigation will be one item at a time.

b) if you plan to split the test into two or more tests, so that there is one test for each dimension, then each test must have at least two items.

c) if you plan to abandon the unidimensional Rasch model, and use a different model, perhaps a multidimensional Rasch model, then the investigative criteria must match the dimension-selection criteria for the other model.

d) if your plan is merely to report that the test is (or is not) multidimensional at a certain level, then you are free to choose any criteria you like, provided that it is clear to your audience what your criteria are.

In detail,
Your 1.) - please see https://www.rasch.org/rmt/rmt233f.htm - where simulations have been done.

Comment: when you tell your non-specialist audience, "This test is multidimensional", they will ask, "What are the dimensions?".

For instance, if it is an elementary arithmetic test, and you answer "the dimensions are addition and subtraction", then the learning-difficulty specialists will say, "Yes, those two operations have important psychological differences", but the school principals will say, "No, those are merely two strands of arithmetic. They are the same thing for Grade advancement."

Your 2.) This needs expert knowledge of RUMM. Perhaps the RUMM discussion group can help you: http://www2.wu-wien.ac.at/marketing/mbc/mbc.html

Comment: https://www.rasch.org/rmt/rmt201c.htm concludes "an exploratory factor analysis used a priori, with parallel analysis to indicate significant eigenvalues, should give early indications of any dimensionality issues prior to exporting data to Winsteps or RUMM." This contradicts https://www.rasch.org/rmt/rmt114l.htm "Leaping first into a factor analysis of the original observations is prone to misleading results."

avandewater: Thank you so much for your advice Mike!

I will probably go for the method under a), since items loading on the first contrast might be locally dependent to some extent.

Also, I had not seen the "Rasch First or Factor First" response (and related articles) yet - thanks!


It was good to read the "Rasch First or Factor First" article, because factor analysis (I have been playing with the data for a while) showed 3 factors with Eigenvalue >1, and they do not all come back in the Rasch PCA of residuals as substantial contrasts.

209. Which items belong to second dimension?

RaschModeler_2012 March 15th, 2013, 1:17am: Hi Mike,

Which items, if any, do you believe might belong to a second dimension based on the output below?



TABLE 23.0

Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                              -- Empirical --      Modeled
Total raw variance in observations     =   39.2   100.0%            100.0%
  Raw variance explained by measures   =   17.2    43.9%             43.2%
    Raw variance explained by persons  =    7.8    19.9%             19.6%
    Raw Variance explained by items    =    9.4    24.0%             23.6%
  Raw unexplained variance (total)     =   22.0    56.1%   100.0%    56.8%
    Unexplned variance in 1st contrast =    2.6     6.6%    11.8%
    Unexplned variance in 2nd contrast =    2.2     5.7%    10.1%
    Unexplned variance in 3rd contrast =    1.8     4.5%     8.0%
    Unexplned variance in 4th contrast =    1.7     4.4%     7.8%
    Unexplned variance in 5th contrast =    1.6     4.1%     7.3%


[Variance component scree plot, vertical axis "VARIANCE LOG SCALED" from 0.5% to 100%: T = total, U = unexplained, M = measures, I = items, P = persons, 1-5 = contrasts. The plotted values are those listed in the variance table above.]

Approximate relationships between the PERSON measures
PCA       ITEM      Pearson      Disattenuated  Pearson+Extr  Disattenuated+Extr
Contrast  Clusters  Correlation  Correlation    Correlation   Correlation
1         1 - 3     0.4526       0.7517         0.5637        0.9294
1         1 - 2     0.5986       0.9681         0.7332        1.0000
1         2 - 3     0.5743       0.9357         0.6546        1.0000

TABLE 23.2

[Plot of contrast 1 loading (vertical axis, about -.5 to .6) against item measure (horizontal axis, -5 to 3). Items are plotted by letter: A-K and k have loadings from .63 down to .00, and a-j have loadings from -.51 up to -.08. The COUNT and CLUSTER columns at the right of the plot give the number of items on each row and their cluster number (1, 2 or 3).]

TABLE 23.3


|CONTRAST CLUSTER|LOADING|MEASURE  INFIT MNSQ  OUTFIT MNSQ|LETTER ENTRY ITEM   |
|----------------+-------+--------------------------------+--------------------|
|    1       1   |   .63 |   -.84     1.04        .93     |  A     12   ITEM12 |
|    1       1   |   .55 |  -1.28      .96        .91     |  B     20   ITEM20 |
|    1       1   |   .38 |  -4.06     1.41       1.63     |  C      1   ITEM1  |
|    1       1   |   .35 |    .61      .65        .51     |  D     22   ITEM22 |
|    1       1   |   .33 |   1.15      .80        .65     |  E     21   ITEM21 |
|    1       2   |   .28 |    .36     1.42       1.50     |  F      6   ITEM6  |
|    1       2   |   .25 |   -.03     1.16       1.20     |  G      8   ITEM8  |
|    1       2   |   .16 |   1.68     1.38       2.64     |  H      9   ITEM9  |
|    1       2   |   .16 |  -1.59     1.15       1.11     |  I     10   ITEM10 |
|    1       2   |   .10 |   -.77      .97        .85     |  J      7   ITEM7  |
|    1       2   |   .09 |  -1.52      .93        .75     |  K     19   ITEM19 |
|    1       2   |   .00 |  -1.21      .85        .71     |  k     11   ITEM11 |
|----------------+-------+--------------------------------+--------------------|
|    1       3   |  -.51 |   1.57      .93        .89     |  a     15   ITEM15 |
|    1       3   |  -.50 |    .96     1.13       1.05     |  b     14   ITEM14 |
|    1       3   |  -.43 |   2.53      .98       1.04     |  c     17   ITEM17 |
|    1       3   |  -.42 |   -.99      .77        .63     |  d      5   ITEM5  |
|    1       3   |  -.40 |   1.05      .82        .57     |  e     16   ITEM16 |
|    1       3   |  -.39 |   -.18      .94        .83     |  f      4   ITEM4  |
|    1       3   |  -.32 |    .52     1.04        .85     |  g      3   ITEM3  |
|    1       3   |  -.20 |   -.11      .80        .77     |  h     13   ITEM13 |
|    1       2   |  -.11 |   -.77      .81        .95     |  i     18   ITEM18 |
|    1       2   |  -.08 |   2.92     1.42       1.43     |  j      2   ITEM2  |

Mike.Linacre: RM: The eigenvalue indicates a strength of about 2.6 items. The plot indicates A,B and maybe C,D,E,F,G - we would need to look at the text of the items. The cluster analysis of the contrasts indicates ABCDE form cluster 1,

210. ISFILE derived scoring table

chrisoptom March 11th, 2013, 11:45pm: Hi folks
Can anyone help with my confusion over the following:

I'm having trouble understanding why the Rasch scoring table for a questionnaire (derived from the ISFILE output) doesn't give a zero score if all the lowest categories are chosen by the respondent. For instance, if the options for each item are scored 0,1,2,3,4 then in conventional Likert scoring, a respondent selecting zero for all items would have an overall score of 0%. My scoring table and others I've looked at would produce a min score of ~25%.


Mike.Linacre: Thank you for your question, Chris.

Theoretically, a score of 0 on the questionnaire corresponds to a Rasch measure of -infinity. In practice, -infinity is not a useful number to report, so we substitute the measure for a small fractional score on the questionnaire. Similarly for each item, when the items are considered separately.

However, this Bayesian adjustment to extreme scores would not produce a minimum score ~25%. In your example, suppose we have 10 items, so that the score range of the questionnaire is 0-40. Then, for a person scoring 0, the ISFILE would indicate that the score on each item is 0.25, so that the score on the questionnaire would be 2.5 out of 40, noticeably higher than the score of about 0.3 out of 40 which corresponds to the reported Rasch measure on the entire questionnaire.

This is the first time I have heard about building scoring tables from the ISFILE=. Please tell us how it is done. There may be some way the output of Winsteps can be adjusted to make the construction of scoring tables easier and more accurate.

chrisoptom: Mike
Thanks for your quick reply. I admit that I really struggled to understand how winsteps could be used to generate a scoring table for my questionnaire. I followed the advice of previous users (The quality of life impact of refractive correction (QIRC) questionnaire by Garamendi & Elliott) who suggested use of the ISFILE.

Specifically in my questionnaire, there are three possible categories 0,1,2 for each item and it was suggested that I use the values CAT+.25, ATCAT and CAT-.25 respectively to score each item. Thus the final Rasch score for each questionnaire was obtained from the addition of these scores and then dividing by the number of items answered in that questionnaire.

I'm worried now as to whether this was the correct approach?


Mike.Linacre: Chris, it sounds like respondents answer only some of the questions on the questionnaire, and we need to estimate a measure from those responses. If the respondent answers only one question, then the CAT measures are reasonable. Unfortunately, if the respondent answers two or more questions, then averaging the CAT values will produce measures that can be much too central. We need to apply an expansion formula to obtain a good approximation to genuine Rasch estimates obtained by, for instance, https://www.rasch.org/rmt/rmt122q.htm - If the respondent answers all questions, then use the score table in Winsteps Table 20.

I will work on an expansion factor over the next few days.

chrisoptom: Thanks again for your help Mike

You're right some respondents don't answer all the items on the questionnaire.

Am I right in thinking that if all items are answered then I simply sum the scores (0,1,2) and then use Table 20.1 to convert to a Rasch score?

Best Wishes

Mike.Linacre: Chris, yes, Table 20.1 is the scoring table for answers to all the questions.
If there are some usual sets of questions answered, then Winsteps can calculate scoring tables for those.
For instance, suppose this set of questions is commonly answered: 1, 4-8, 13-17
Then, at the end of an analysis of all the questions:
Winsteps menu bar, Specification dialog box
Output Tables, Table 20.
This will be the scoring table for items: 1, 4-8, 13-17

Mike.Linacre: Chris: I am working on a simple method of estimating Rasch measures from incomplete data. It will require some last-minute computations. How were you planning to implement the ISFILE method?

chrisoptom: Hi Mike
Please can you be more specific with regard to ISFILE implementation? Originally I was taking the ISFILE values (from the columns previously mentioned) for the questions answered and then inputting them into a spreadsheet to replace the answered questions.

Thanks again for your invaluable help

Mike.Linacre: Chris: if you are using an Excel spreadsheet, then more exact estimation would be done by a VBA macro that uses the actual responses in each row and a second worksheet that contains the "MEASURES" from the ISFILE=. The VBA macro would implement https://www.rasch.org/rmt/rmt122q.htm

Alternatively, instead of a VBA macro, the algorithm could be implemented on the second worksheet in a way similar to www.rasch.org/poly.xls - with the item and threshold measures pre-set.

chrisoptom: Thanks once again Mike, I'll take a look at the links suggested and have a go at modifying my data.


211. Properly Reporting 3.2

uve March 12th, 2013, 7:46pm: Mike,

Below is Table 3.2 for an RSM. My question has to do with explaining coherence. If this were an individual item, had PCM been used, I would have no problem, but in terms of the RSM it's a bit harder. So referring to category 3, I would say that 36% of the respondents with measures between -0.04 and .43 were observed to be:

1) using category 3 for all items
2) using category 3 on average for all items
3) had an average category rating of . . .
4) used category 3 as expected for all the items

I'm guessing it's none of the above, though I'm leaning to 4, but I'd greatly appreciate your input.

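¦CATEGORY STRUCTURE ¦ SCORE-TO-MEASURE     ¦ 50% CUM. ¦ COHERENCE         ¦ESTIM¦
¦LABEL MEASURE S.E. ¦ AT CAT. ----ZONE---- ¦PROBABLTY ¦ M->C  C->M  RMSR  ¦DISCR¦
¦  1    NONE        ¦( -1.71)  -INF  -1.13 ¦          ¦  79%   13% 1.3606 ¦     ¦
¦  2    -.07   .01  ¦   -.50  -1.13   -.04 ¦    -.72  ¦  27%   43%  .7630 ¦  .92¦
¦  3    -.36   .01  ¦    .43   -.04   1.13 ¦    -.11  ¦  36%   65%  .5351 ¦  .99¦
¦  4     .43   .01  ¦(  1.81)  1.13  +INF  ¦     .80  ¦  77%   39%  .8845 ¦ 1.04¦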

Mike.Linacre: Uve, with RSM, everything is relative to each target item in turn.

So, the 36% is summarized across the persons with abilities in the range, "zone", from -.04 to 1.13 logits relative to each item difficulty.

So the 36% looks at all the observations where (-.04 ≤ (Bn-Di) ≤ 1.13). Of these, 36% are in category 3. Depending on the targeting of the persons and the items, there could be many eligible observations or only a few.
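
The counting described above can be sketched in a few lines of Python. This is only an illustration of the zone logic, with hypothetical names and toy data; it is not how Winsteps itself is coded.

```python
import numpy as np

def coherence_m_to_c(observations, person_measures, item_difficulties, zone, category):
    """Percent of observations whose (Bn - Di) lies in `zone` that are in `category`."""
    observations = np.asarray(observations)
    b = np.asarray(person_measures, dtype=float)
    d = np.asarray(item_difficulties, dtype=float)
    relative = np.subtract.outer(b, d)           # Bn - Di for every person-item pair
    low, high = zone
    in_zone = (relative >= low) & (relative <= high)
    if not in_zone.any():
        return float("nan")                      # no eligible observations
    return 100.0 * np.mean(observations[in_zone] == category)

# Toy data: 2 persons x 2 items, categories 1-4.
obs = [[3, 2],
       [4, 3]]
pct = coherence_m_to_c(obs, [0.0, 1.0], [0.0, 0.5], zone=(-0.04, 1.13), category=3)
```

In the toy data, three of the four person-item pairs fall in the zone and two of those three are observed in category 3, so the percentage is about 67%.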

212. Nonconvergence when fitting Rasch model

RaschModeler_2012 March 10th, 2013, 10:31pm: Hi Mike,

Sorry for the flood of questions from me these days. As you can tell, I've been spending a lot of time (more than usual) in measurement these days! I've been trying to fit a standard dichotomous Rasch model using a LOGISTIC regression procedure in a standard software package and I keep getting the error "quasi-separation" and that the "maximum likelihood estimation failed."

These data are simulated from WINSTEPS, and when fit in WINSTEPS I do not receive any convergence issues, nor do I receive any oddities in the output.

Question: Is there a way to avoid this issue by randomly sorting individuals and then adding another item which has the response pattern:

1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 etc.



Mike.Linacre: RM, no idea why there is a problem, but we can always add two dummy person records:
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 ...
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 ...

These will have the effect of making all the parameter estimates slightly too central, but rarely change the inferences that will be drawn from the analysis.
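
A minimal sketch of adding the two dummy person records, assuming the responses are held as a persons-by-items array of 0/1 values. The function name is hypothetical.

```python
import numpy as np

def add_dummy_persons(data):
    """Append two alternating 1/0 person records to a persons-by-items 0/1 array."""
    data = np.asarray(data)
    n_items = data.shape[1]
    pattern = np.arange(n_items) % 2              # 0 1 0 1 ...
    dummies = np.vstack([1 - pattern, pattern])   # 1 0 1 0 ... then 0 1 0 1 ...
    return np.vstack([data, dummies])

# Toy data in which every person succeeds on every item.
augmented = add_dummy_persons(np.ones((3, 4), dtype=int))
```

After the two records are appended, every item has at least one success and one failure.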

RaschModeler_2012: Hi Mike,

Thank you for the suggestion. I realize what's going on. I simulated the data such that person abilities range from very low (-4.5 logits) to very high (+4.5 logits). As a result, there are some cases where 99% of the items were answered correctly ("1") and other cases where 99% of the items were answered incorrectly ("0"). At the last iteration, these cases are associated with person ability estimates that are clearly way off (e.g., -23.0 logits or +23 logits), and are resulting in nonconvergence.

All the other cases seem to yield reasonable person abilities.

Any suggestions?



Mike.Linacre: RM, trim the very low and very high scorers from the data. They are almost uninformative about the estimation of the item difficulties.

When the range of the Rasch measures exceeds 40 logits, the precision of the floating-point hardware in a typical PC degrades. Perhaps the LOGISTIC regression software also degrades. Winsteps and Facets have special code to handle this situation (because some Rasch applications have measurement ranges of 100+ logits).
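
One way to see this kind of degradation is the following Python sketch, which uses ordinary double-precision arithmetic. It is an illustration only; the thread does not say exactly how any particular package fails. At large logit differences the computed probability of success rounds to exactly 1, so the observation no longer carries any usable information.

```python
import math

def success_probability(logit):
    """Dichotomous Rasch probability of success for a given (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-logit))

for logit in (10.0, 30.0, 40.0):
    p = success_probability(logit)
    # 1 - p is the probability of failure; it is lost once p rounds to 1.0
    print(logit, p, 1.0 - p)
```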

213. Invariance not enough?

uve March 10th, 2013, 11:20pm: Mike,

After reviewing the article on racking or stacking data: https://www.rasch.org/rmt/rmt171a.htm, I came across another paper in the most recent volume of JAM (14, No.1). Quoting from page 61:

"Item invariance implies that the relative endorsements and location of the items do not change or are independent of the sample responding; in kind, the relative item endorsements should behave as expected across different samples. In CTT, the difficulty of a test is sample dependent, making it unnecessarily problematic to measure change. When items are invariant, the Rasch model is particularly discerning in differentiating between high and low scorers on a measurement scale as it places persons and items on a common scale. This makes it ideal for measuring change on a construct."

Perhaps I am misunderstanding, but it seems to me that invariance negates the need to rack or stack data if one is attempting to measure change on a construct. I know this must be incorrect, but I guess I'm not seeing the advantage of invariance if one still needs to calibrate different administrations of an identical instrument in the rack or stack manner.

Is the author correct in his statement about change? Why isn't invariance enough so that one merely needs to compare measures of an identical instrument given to identical respondents over the course of a period of time when calibrated separately?

Mike.Linacre: Uve, invariance is an ideal that is never realized in practice. Item drift across time or ability levels is inevitable. The focus becomes on how we identify it and control it.

Racking data is a means to identify item drift.

When we have data collected for the same sample at two time points, there are three analytical alternatives in common use:
1) the item difficulties at the first time point dominate, because that is when decisions are made (e.g., medical patient testing) or because the time 1 difficulties are now the benchmark. Action: Anchor items in the time 2 analysis at their time 1 difficulties.
2) the item difficulties at the last time point dominate, because that is when decisions are made (e.g., certification, graduation). Action: Anchor items in the time 1 analysis at their time 2 difficulties.
3) item difficulties are averaged across all time points, because each is equally important (e.g., tracking a student cohort across grade levels). Action: Stack the data.
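
The difference between stacking and racking is only a difference in how the two sets of responses are laid out before analysis. A minimal Python sketch with hypothetical toy data:

```python
import numpy as np

# Hypothetical responses: 4 persons x 3 items at each time point.
time1 = np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
time2 = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 1]])

# Stacked: each person appears twice (once per time point),
# so each item gets one difficulty across both time points.
stacked = np.vstack([time1, time2])    # 8 person records x 3 items

# Racked: each item appears twice (once per time point),
# so each person gets one measure and item drift shows up
# as a difference between the two difficulties of an item.
racked = np.hstack([time1, time2])     # 4 persons x 6 item columns
```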

Of course, as you indicate, we hope that the item estimates are invariant enough that all 3 approaches yield the same results. However, items may have drifted so much that, by time 2, some items have become new items and so must be given new calibrations. This happened to the item "count backwards from 10". One hundred years ago this item was a difficult task in elementary numeration. Then came the Space Program. Now the item is child's play :-)

214. Discrimination parameter

RaschModeler_2012 March 10th, 2013, 1:36pm: Mike,

As I understand the standard dichotomous Rasch model, the equation is as follows:

logit_ij = ln[ p_ij / (1 - p_ij) ] = theta_j - beta_i


theta_j = a person's ability (in logits)
beta_i = an item's difficulty (in logits)

Mathematically, if one were to incorporate an item discrimination parameter, I believe the equation would change as follows:

logit_ij = alpha_i*(theta_j - beta_i)

According to the Rasch model, the alpha_i's must be set to 1.0 for all items. I have read that this is because an item discrimination of 1.0 implies that it [the item] is discriminating as it should be, given its level of difficulty.

Question: Why a value of 1.0, however? Why is 1.0 the "magic" number? How do I connect the equation to the conceptual rationale for setting the discrimination parameter to 1.0? Why does 1.0 imply that the item is discriminating as it should be, given its level of difficulty? Why not 0.5 or 2.0?

Any thoughts would be really appreciated.


Mike.Linacre: RM, 1.0 is mathematically convenient, because it produces estimates in logits. But Winsteps allows any value (by means of USCALE=) that is the same for all the items. A commonly used preset discrimination is 1.7 which approximates the units of the estimates to probits.
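
The role of 1.7 can be checked numerically. The sketch below compares the logistic curve, with discrimination 1.0 and 1.7, against the cumulative normal curve on a grid of values. The function names and the grid are assumptions for illustration.

```python
import math

def logistic(x, discrimination=1.0):
    """Logistic ogive with a common discrimination (scaling) constant."""
    return 1.0 / (1.0 + math.exp(-discrimination * x))

def normal_cdf(x):
    """Cumulative standard normal distribution (the probit ogive)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

grid = [i / 100.0 for i in range(-500, 501)]
max_gap_1_0 = max(abs(logistic(x) - normal_cdf(x)) for x in grid)
max_gap_1_7 = max(abs(logistic(x, 1.7) - normal_cdf(x)) for x in grid)
```

With a discrimination of 1.7 the two curves differ by less than 0.01 everywhere on the grid, whereas with 1.0 the largest difference exceeds 0.1. Either constant defines an equally valid Rasch scale, provided it is the same for every item.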

215. Chi-Square Test between Rasch and 2-Parameter?

RaschModeler_2012 March 10th, 2013, 12:52am: Hi Mike,

While I am aware that WINSTEPS is intended to fit Rasch models only, I also know that WINSTEPS offers the feature of providing estimated discriminations had a 2-parameter model been employed ("DISCRIM=Y"). My question is whether WINSTEPS also offers a formal statistical test between the Rasch model and 2-parameter model? I ask because comparing the Chi-Square derived from fitting a Rasch model using WINSTEPS and the Chi-square derived from a 2-parameter model derived from different software (e.g., Xcalibre) is fraught with problems due to various differences between software (e.g., estimation methods are different, convergence criteria are different).

I am hoping WINSTEPS offers a formal Chi-Square test between the Rasch model and a 2-parameter model. If not, I am curious if this has ever been considered, and the rationale for not including such a test, if in fact it is not included.

I am absolutely thrilled with WINSTEPS, but it's becoming increasingly required by the field to demonstrate that the Rasch model yields a better model-to-data fit than a 2-parameter model. Consequently, I'd like to be able to perform formal Chi-Square tests without worrying that the reason for the difference in Chi-Squares is due to software differences, and not goodness of fit.



Mike.Linacre: RM: model-to-data fit is a poor criterion for selecting or rejecting the Rasch model. See "Which Model Works Best?" https://www.rasch.org/rmt/rmt61a.htm - A prime reason for choosing Rasch is to avoid fitting accidents in the data. For a parallel situation in conventional statistical modeling see "Taguchi and Rasch" - https://www.rasch.org/rmt/rmt72j.htm - which explains that the data-fitting philosophy of American quality-control statistics was a failure when compared with the model-dominant philosophy of Japanese quality-control statistics.

However, if you do want to compare model-fit, use a 2-PL program, then for Rasch use its 1-PL (uniform 2-PL discrimination) option.

RaschModeler_2012: Mike: The articles you sent me are quite compelling. It raises several questions for me, but if I may, I'd like to go off on a tangent for a moment:

Suppose we wanted to assess the degree of self-reported depression among adult individuals residing in the U.S. We develop items based on feedback from content experts throughout the U.S. We also have end-users help us refine the items and response options to ensure that each item is measuring what we intend it to measure.

Next, we obtain a large sample of individuals who range from not at all depressed to extremely depressed. We find that most people are lower on the measure, while only a small percentage have moderate to severe levels of depression. When we fit the Rasch model using data from everyone, the item difficulty mean (centered at zero) is substantially higher than the person ability mean. In fact, most of the items are more difficult than most of the people's ability levels. Due to poor person-item targeting, we remove individuals who are not in treatment for clinical depression, after which the Rasch model is refitted, and lo and behold the person and item means and distributions are nearly the same. The standard errors associated with the items, the person reliability, etc. have substantially improved by removing those who have little to no depression. The item-person map looks like it came out of a simulated experiment.

Curious...what would you do under such a scenario, assuming the two objectives are to:

1. sensitively measure self-report depression
2. differentiate "normals" from "depressed"

It seems to me that removing the non-depressed sample improves item-person targeting, but hurts the possibility of differentiating "normals" from the "clinical population." The reverse is true as well: keeping all subjects potentially allows one to see the place on the ruler at which normals separate from the depressed, but now one has sacrificed good person-item targeting.

Have you ever encountered this scenario before? What's your general rule of thumb? Might you calibrate the items twice, and use one measure for one purpose (e.g., to differentiate "normals" from "depressed") and the other measure for the other purpose (to measure level of depression sensitively, with little error, among the depressed population)?

To me, this gets at the heart of the articles you sent me. Do we remove subjects who are well below the lowest point of the ruler, or do we keep them (and allow for poor item-person targeting)?

Sorry if there is an obvious answer to my question.


Mike.Linacre: RM, your description sounds familiar. It is what Chih-Hung Chang encountered when he reanalyzed the data for the Depression Scale of the MMPI-2 in 1994. https://www.rasch.org/rmt/rmt81f.htm was one of his findings. Later, we discovered that the authors of the MMPI-2 knew of the scoring fault he identified. One of the MMPI team told me that their thinking was so focused on optimizing the "Normal" vs. "Depressed" classification that they ignored the actual meaning of the latent variable!

In analysis of other clinical variables, we discovered that including almost symptomless individuals was misleading. It would be like including Math Professors in the sample for a Grade School arithmetic test. Any wrong answers made by the Professors would be idiosyncratic or accidental. They would not help in constructing the latent variable. After the latent variable is constructed, we could administer the test to the Professors to discover what their arithmetic level actually is.

218. Likert vs Rasch scoring

chrisoptom March 5th, 2013, 6:48pm: Hi folks
I'm new to this forum and wonder if anyone could offer help with the following?

I'm a PhD student using Winsteps to help me develop a questionnaire about the spectacle adaptation symptoms people experience when they obtain new spectacles. Essentially, the items describe symptoms, and the respondent has 5 frequency response categories, from "none" and "occasionally" up to "all the time".

With conventional Likert scoring, a respondent with no symptoms at all would score zero and a respondent with every symptom all of the time would score 100%. However using the Rasch model and the Winsteps generated scoring table for the questionnaire, the same 2 respondents would have scores of approx 30% and 70% respectively. How do I explain this compression in range of possible questionnaire results?

Sorry if this seems a basic question but even my Prof seems unsure how to answer this one?

Best Wishes

Mike.Linacre: Chris, thank you for your courageous explorations. Something is awry. Please look at Winsteps Table 20. The total raw score for someone with all "none" should be the lowest score in Table 20.1, and the total raw score for someone with all "all the time" should be the highest score in Table 20.1. If not, there is a problem with the data or with the Winsteps CODES= instruction.

For instance, if "9" in the data means "missing data", and it is included in CODES=, Winsteps will think that is a score of "9". The solution is to omit 9 from CODES=

Please look at Table 14.3. - Do the frequencies of each category for each item make sense?
Please look at Table 3.2 - Does the overall structure of the rating scale make sense?

chrisoptom: Hi Mike

Thanks for your advice. I've checked Table 20.1 and all looks okay: all "none" is the lowest score and all "all the time" is the highest. I think the issue I am struggling to understand is that the range of possible total Rasch scores for each of my questionnaires is compressed, from a score of ~30 (none for all items) to ~70 (maximal symptoms), i.e. a spread of about 40, whereas with Likert scoring the range would be zero to about 80 (as no respondent selected "all the time" for all items/symptoms).

Subsequently I'm trying to relate the Rasch score (or amount of spectacle symptoms experienced) to other respondent characteristics such as age and I wonder if this compression is "hiding" a potential correlation.

I've looked at other Rasch-developed questionnaires in my area, such as QIRC, and their scoring tables (derived from the Winsteps ISFILE) exhibit the same effect: no symptoms at all still results in a questionnaire score of ~25.

I assume that this is a characteristic of using Rasch but I am still struggling to understand it properly.

Any help you could offer would be very appreciated


219. Sufficiency

uve March 10th, 2013, 1:55am: Mike,

The issue of raw score sufficiency appears quite often in the readings. However, if in my example we have two persons who take an exam of 10 items and the first person gets the five hardest items correct but the five easiest items incorrect and the opposite holds true of the second person, then both have scored the same. However, wouldn't the Rasch model report a higher ability level for the first (though with extreme model misfit)?

Please correct me if I am wrong, but I was under the impression that JMLE is the process by which this takes place using the variance in the data to adjust the measures until convergence occurs. So what is really needed is not only the raw score, but the variance that has occurred by examining the responses to each of the items, or response string. So is the raw score really sufficient? It seems not. It seems sufficiency = raw score + response string.

Mike.Linacre: Uve, in Rasch, the raw score (on the same set of items) is the sufficient statistic. So, "equal raw scores = equal estimated Rasch person measures". Different response patterns produce different fit statistics.

See "Who is awarded First Prize when the raw scores are the same?" - https://www.rasch.org/rmt/rmt122d.htm
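A minimal Python sketch of this point, assuming dichotomous items with known (here hypothetical) difficulties and a simple Newton-Raphson estimate of one person's ability. It is not the JMLE routine in Winsteps, which also estimates the items, but it shows the same property: the response string enters the estimate only through its raw score, while the fit statistic depends on the pattern.

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch probability of success on a dichotomous item (logit units)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def estimate_ability(responses, difficulties, iterations=100):
    """Newton-Raphson maximum-likelihood ability, item difficulties held fixed.

    The data enter only through the raw score, sum(responses).
    """
    raw_score = sum(responses)
    ability = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(ability, d) for d in difficulties]
        expected_score = sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        ability += (raw_score - expected_score) / information
    return ability

def outfit_mean_square(responses, difficulties, ability):
    """Mean of the squared standardized residuals (an outfit-style statistic)."""
    total = 0.0
    for x, d in zip(responses, difficulties):
        p = rasch_probability(ability, d)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)

# Hypothetical item difficulties, easiest to hardest.
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
person_1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # five hardest right, five easiest wrong
person_2 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # the opposite pattern

ability_1 = estimate_ability(person_1, difficulties)
ability_2 = estimate_ability(person_2, difficulties)
outfit_1 = outfit_mean_square(person_1, difficulties, ability_1)
outfit_2 = outfit_mean_square(person_2, difficulties, ability_2)

print(ability_1 == ability_2)  # same raw score, same estimated measure
print(outfit_1 > outfit_2)     # the unexpected pattern misfits far more
```

Both persons receive the same measure; only the misfit of the first response string flags it as improbable.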

220. Reliability of Difference Scores

RaschModeler_2012 March 8th, 2013, 2:05am: Hi Mike,

According to CTT, reliability of difference scores is a function of the score reliability of each measure (Rxx, Ryy) and the correlation between the measures (rxy):

[.5(Rxx + Ryy) - rxy] / (1 - rxy)

Does this formula hold in the context of Rasch modeling? Two scenarios:

1. Suppose we want to calculate the difference (discrepancy) score between a Rasch measure of intelligence and a Rasch measure of reading ability. (This is often done to identify children with reading disabilities; e.g., high IQ but average reading ability score)

2. Suppose we develop a Rasch measure of depression, and administer it to individuals before and after an intervention, and perform a paired t-test to determine if there is a statistically significant reduction in the mean scores from t1 to t2.

According to CTT, both scenarios above are potentially problematic because the correlation between the measures (or the correlation of the same measure administered at two time points) is likely high, thereby reducing reliability of difference scores.

Given your expertise in psychometrics and Rasch modeling, I was wondering if you have any thoughts on this matter.

Thank you,


Mike.Linacre: RM, Reliability is a summary of the precision of the measures. Reliability formulas generally apply better to Rasch measures than to raw scores.

The "reliability of difference scores" formula looks like Lord (1963) with the assumption that the variances of the two sets of scores or measures are the same. Lord's formula is at http://lilt.ilstu.edu/staylor/csdcb/articles/Volume9/Chiou%20et%20al%201996.pdf - this paper critiques the paradox implicit in this formula. If we want a high reliability of difference scores, then we also want a low correlation between the two sets of scores. But a low correlation between the two sets of scores implies that the scores are for unrelated attributes, so that the differences are meaningless.

In Rasch measurement of differences, we are usually working from the distribution of the pairs of measures. I have never adjusted difference measures for their correlations, and don't recall seeing a Paper that has. Probably, if the differences are small enough to report low reliability, they are also so small that we cannot reject the null hypothesis of "no difference".

But you may be onto something important that we have all overlooked ....
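A small Python sketch of the CTT formula quoted in the question, under its equal-variance assumption. The reliabilities and correlations used are hypothetical; they only illustrate the paradox: holding the two reliabilities fixed, a higher correlation between the measures drives the reliability of the differences down.

```python
def difference_score_reliability(rxx, ryy, rxy):
    """Reliability of difference scores: [0.5*(Rxx + Ryy) - rxy] / (1 - rxy).

    Assumes the two sets of scores have equal variances.
    """
    if rxy >= 1.0:
        raise ValueError("rxy must be below 1 for the formula to be defined")
    return (0.5 * (rxx + ryy) - rxy) / (1.0 - rxy)

# Hypothetical values: both measures have reliability 0.90.
highly_correlated = difference_score_reliability(0.90, 0.90, 0.80)  # about 0.50
weakly_correlated = difference_score_reliability(0.90, 0.90, 0.20)  # about 0.88
print(round(highly_correlated, 2), round(weakly_correlated, 2))
```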

RaschModeler_2012: Mike,

Thank you for sharing your thoughts. Out of curiosity, suppose you had a Rasch measure of intelligence and a Rasch measure of reading aptitude. While both would presumably be on the logit scale, they are not standardized to have the same mean and standard deviation. So, how would one calculate a difference (a.k.a. discrepancy score)?

It is common practice to report that if someone scores at an intelligence level that is 1 sd above an aptitude test (e.g., reading test), that this could be indicative of a disability.



Mike.Linacre: RM: there is the same problem in physics. When we have children's weights and heights, how do we calculate a discrepancy score in order to identify children who are obese or malnourished? Nutritionists do it by plotting typical weights, smoothing and approximating. Here it is from http://www.nhs.uk/Livewell/healthy-living/Pages/height-weight-chart.aspx - Can psychometricians really do better than metrologists, or are we fooling ourselves with our sophisticated algebra?

RaschModeler_2012: Thank you, Mike. You've given me much to think about. I'll write back if I have further questions. Intriguing, indeed!


221. Please help! Person reliability and missing data

kstrout1 March 8th, 2013, 5:20pm: Hello,

I am working toward the completion of my PhD in nursing, and I had to do a Rasch analysis with Masters' partial credit model as part of my dissertation.

I have two questions that I do not understand, and I hope someone can help me!

1) My item reliability for each of my five scales was high ("1"). However, my person reliabilities are low, ranging from .19 to .30. I have a large sample size, n=5,706. I also have a homogeneous sample (60% earning a Bachelor's degree or higher).

I know that sample size influences item reliability and that small variance influences person reliability, but I still don't understand. I thought that the logit values are generated from ability and item difficulty; therefore, if the ability does not vary, how can the item reliability remain high?

2) Why would you enter "0" for data from missing items versus dropping that person?

Thank you so much!!!

Mike.Linacre: Thank you for your questions, kstrout.

1). "Reliability" means "Reproducibility of the score or measure hierarchy". We don't need differences in person ability to detect differences in item difficulty (which items are easy and which items are hard). We don't need differences in item difficulty to detect differences in person ability (which persons are experts and which persons are novices). All we need to be certain about the item difficulty hierarchy (high item reliability) is a large sample of persons. All we need to be certain about the person ability hierarchy (high person "test" reliability) is a large sample of items.

For example, the US Navy had an arithmetic test of many equally difficult items. They identified "better" and "worse" recruits by their scores on the test. Equally difficult items produced high person "test" reliability.

2) Here are four ways of processing missing data, depending on the situation:
a) omit all persons with missing data. This is done for statistical methods (such as factor analysis) that require complete data.
b) impute preset values for missing data. On an MCQ test, missing data is coded "wrong".
c) impute reasonable values for missing data. This is done when there are only occasional missing data values, but the entire person sample is to be reported.
d) treat missing data as "not administered". This is done in computer-adaptive tests.

kstrout1: Dr. Linacre,

Thank you SOOOO much for your help! You saved me. Do you think my low reliabilities are related to the fact that I only used 5 items? Ideally, how many items do you recommend using for high reliability?

I used Rasch analysis to examine wellness in a cohort of aging adults. I divided wellness into five dimensions (social, physical, intellectual, spiritual, and emotional). I combined the logit values for five items representing each dimension to create a "wellness score" for each dimension. I ran a regression analysis to determine which dimension contributes to cognitive ability among older adults. I just thought I would let you know how I was using the data.

I want to do another Rasch analysis for my first study after I graduate with my PhD. I am just curious about your recommendation regarding the number of items to include in the analysis.

Thank you SO much!!

Mike.Linacre: Kstrout, in educational testing, usually 200 dichotomous (MCQ) items are needed for high reliability. We can use a nomogram to estimate the number of polytomous (rating scale) items for the required reliability. See: https://www.rasch.org/rmt/rmt71h.htm

For example, in the nomogram, if we want a reliability of 0.8 (reasonable for observational instruments) and the items have 5 categories and the true person S.D. is about 1 logit then we need about 7 items.
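The nomogram itself is not reproduced here. As a hedged companion, the Python sketch below uses only the general definition of Rasch reliability as true variance divided by (true variance + error variance). The function names and the 0.3-logit and 0.5-logit figures are illustrative assumptions, not values taken from the nomogram or from the dataset discussed.

```python
import math

def rasch_reliability(true_sd, rmse):
    """Reliability = true variance / (true variance + error variance)."""
    return true_sd ** 2 / (true_sd ** 2 + rmse ** 2)

def rmse_needed(true_sd, target_reliability):
    """Average measurement error (logits) that yields the target reliability."""
    return true_sd * math.sqrt((1.0 - target_reliability) / target_reliability)

# A true person S.D. of 1 logit needs an average error of about 0.5 logits
# for a reliability of 0.8.
print(round(rmse_needed(1.0, 0.8), 2))

# Hypothetical narrow sample: the same 0.5-logit error with a true S.D. of
# only 0.3 logits gives a low reliability, however large the sample is.
print(round(rasch_reliability(0.3, 0.5), 2))
```

More items (or more categories per item) shrink the measurement error of each person; a larger person sample does not, which is why a big but homogeneous sample can still report low person reliability.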

kstrout1: Dr. Linacre,

Thank you so much!! I am forever grateful for your support!!

I referenced you several times in my manuscript, and I am so excited that you are the person replying!

Thank you! Thank you! Enjoy your weekend!!


222. sample size & research design

denise80 March 7th, 2013, 9:38am: Dear Dr. Linacre,

A master's student of mine is working on a project to explore whether rater bias exists in oral-interview assessments. What is different in her study is that she has a pre- and post-test design in which she manipulates the variable "knowledge of students' proficiency level": in the post-test, the raters will be given this information and will rate the same set of data again after a certain time interval.

She has 20 raters rating the performance of 12 students (6 pairs) from 3 different proficiency levels. The rubric has 5 components and a total score. Do you think the sample size is enough for Rasch?

I have seen studies that ran Rasch twice for pre and post-test data, so I think it is possible to do so with this study, but is it also OK to run Wilcoxon t-test to see whether there are any significant changes between the pre and post test scores?

I would really appreciate your help in this.


Mike.Linacre: Thank you for your questions, Denise.

This dataset size is often encountered in exploratory and pilot studies, and also in small-scale clinical studies. For a master's project in which the finding is indicative, rather than definitive, this sample size is certainly big enough.

Since there is obvious dependency in the data (same set of data rated twice), the analysis probably needs to balance these in the first stage, before doing separate pre- and post- test analyses. This suggests an analytical design similar to https://www.rasch.org/rmt/rmt251b.htm (replacing "patient" with "student").

And, yes, when the Rasch measures have been obtained, conventional statistical techniques can be applied to them.

denise80: Thank you so much, Dr. Linacre for your prompt response!

I looked at the link, yes, that's what we would like to do, but I couldn't figure out the analysis, I am more of a sociolinguistics person and got involved with this study because of my master's student. I can interpret the tables but I still need to work on how to run the analysis itself. I am looking at the links on this webpage but if there's anything you could suggest for a beginner, that'd be great!

I also have a couple of questions. When you say "when the Rasch measures have been obtained", are you suggesting that I do the t-test with the Rasch results or with the raw data? I was asking about the raw data.

Also, we have another set of data consisting of 20-30 raters rating approximately 300 papers. As far as I know, no two raters rated the same performance. Is it still possible to conduct Rasch measurement to examine raters' experience, age, tasks, and students' proficiency level?

Again, thank you so much for your invaluable feedback.


Mike.Linacre: Denise, there are introductory Rasch books. Please see "Rasch Publications" on www.rasch.org

Do the t-tests with the Rasch estimates (which are interval-scaled).

Question: "no two raters rated the same performance. Is it still possible to conduct Rasch measurement to examine raters' experience, age, tasks, and students' proficiency level?"

Reply: Model the raters to be representatives of their classes, rather than individuals.
So, instead of rater "Jane", we have rater "Experienced" + "Older", etc.

denise80: Thanks!

One last question: can I do multimodel Rasch with Ministep, or do I need Facets for that?

Mike.Linacre: Denise: "multimodel Rasch" -
1) analysis with multiple Rasch models simultaneously (dichotomous, Andrich rating-scale, partial-credit): Ministep, Winsteps, Minifac, Facets.
2) multidimensional Rasch analysis: ConQuest
3) multifaceted Rasch analysis: Minifac, Facets. This can be approximated with RUMM.

denise80: Dr. Linacre,

Is it OK if I email you?

Mike.Linacre: Denise, certainly you can email me. mike \at/ winsteps.com

224. Hofstede's Cultural Dimensions (Constructs)

Emil_Lundell March 7th, 2013, 10:08pm: Hello,

I have some questions that I hope that somebody could answer.

Geert Hofstede and his fellow researchers claim to have constructs for measuring attitudes toward power distance, individualism vs. collectivism, masculinity vs. femininity and uncertainty avoidance.

Is there any analysis of the indexes' unidimensionality, ordering of the data/categorization of items, local independence and DIF? Hofstede's constructs are widely used in Political Science, but are they really successful measurements? If there isn't any published analysis, could somebody take the time and effort needed to conduct one?

Best regards,


Info about dimensions or "constructs": http://geert-hofstede.com/national-culture.html

Items of Values Survey Module 2008 in different languages: http://www.geerthofstede.nl/vsm-08

Mike.Linacre: Thank you for asking, Emil.

Rasch dimensionality analysis of the VSM-08 would be straightforward, but no one seems to have done it.

There are Rasch researchers who are eager to conduct this type of work. One is Rense Lange http://www.facebook.com/rense.lange

Emil_Lundell: Thanks :)

226. ZEMP formula

uve March 6th, 2013, 11:51pm: Mike,

I was attempting to replicate the ZEMP values. I use this almost exclusively now over ZSTD due to the fact that it is not uncommon to have 1,500 to 5,000 scores for any given instrument. I was hoping you could clarify the half adjustment mentioned. Both equations look identical. From Winsteps help:

To avoid the ZEMP values contradicting the mean-square values, the ZEMP computation is:

Accordingly, Winsteps does two half adjustments:
for all k items where ZSTD(i) >0,
ZEMP(i) = ZSTD(i)/(S), where S = sqrt[ (1/k) Sum( ZSTD(i)²) ]
for all k items where ZSTD(i) <0,
ZEMP(i) = ZSTD(i)/(S), where S = sqrt[ (1/k) Sum( ZSTD(i)²) ]

The second equation is identical to me. Are you saying that other than any item with a measure of zero, the stdev is adjusted differently?

Mike.Linacre: Uve, yes, the equations are identical.

ZEMP takes ZSTD=0 as the baseline, and then linearly adjusts the positive and negative halves of the ZSTD distribution independently, giving each half an average sum-of-squares of 1.0 away from 0. When the two halves are put together, the model distribution of ZEMP is N[0,1], and the empirical distribution of ZEMP approximates a mean of 0 and a standard deviation of 1. Usually there is no ZSTD with value exactly 0.000.

A statistically more precise transformation could be done, but the outliers would be the same.

The "k" notation in Winsteps Help is obscure, so here is the revised wording:

The ZEMP transformed standardized fit statistics report how unlikely each original standardized fit statistic ZSTD is to be observed, if those original standardized fit statistics ZSTD were to conform to a random normal distribution with the same variance as that observed for the original standardized fit statistics.

To avoid the ZEMP values contradicting the mean-square values, Winsteps does separate adjustments to the two halves of the ZSTD distribution. ZEMP takes ZSTD=0 as the baseline, and then linearly adjusts the positive and negative halves of the ZSTD distribution independently, giving each half an average sum-of-squares of 1.0 away from 0. When the two halves are put together, the model distribution of ZEMP is N[0,1], and the empirical distribution of ZEMP approximates a mean of 0 and a standard deviation of 1. Usually there is no ZSTD with value exactly 0.000. Algebraically:

for all k_positive items where ZSTD(i) > 0, with i = 1 to test length:
ZEMP(i) = ZSTD(i) / S_positive, where S_positive = sqrt[ (1/k_positive) Sum( ZSTD(i)² for the k_positive items ) ]

for all k_negative items where ZSTD(i) < 0, with i = 1 to test length:
ZEMP(i) = ZSTD(i) / S_negative, where S_negative = sqrt[ (1/k_negative) Sum( ZSTD(i)² for the k_negative items ) ]
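The two half adjustments can be sketched in Python as follows. This is an illustration of the algebra above, not Winsteps code; the function name and the ZSTD values are made up for the example, and a ZSTD of exactly 0 is simply left at 0.

```python
import math

def zemp(zstd_values):
    """Rescale the positive and negative ZSTD halves separately.

    Each half is divided by its own root-mean-square distance from 0,
    so each half ends up with an average sum-of-squares of 1.0.
    """
    positive = [z for z in zstd_values if z > 0]
    negative = [z for z in zstd_values if z < 0]
    s_positive = math.sqrt(sum(z * z for z in positive) / len(positive)) if positive else 1.0
    s_negative = math.sqrt(sum(z * z for z in negative) / len(negative)) if negative else 1.0
    result = []
    for z in zstd_values:
        if z > 0:
            result.append(z / s_positive)
        elif z < 0:
            result.append(z / s_negative)
        else:
            result.append(0.0)
    return result

# Hypothetical ZSTD values for four items.
zstd_example = [4.0, 2.0, -1.0, -3.0]
zemp_example = zemp(zstd_example)
print([round(z, 3) for z in zemp_example])
```

Because each half is divided by a positive constant, the sign and rank order of the fit statistics are unchanged, so the outliers stay the same.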

227. Calculating the variance in PCA on residuals

jacobxu March 6th, 2013, 10:41pm: Hi Mike,

I have been trying to calculate the "Total raw variance in observations" and "Raw unexplained variance" using the instructions from the Winsteps Manual (page 493 of section "Dimensionality: contrasts & variances"), but still cannot figure it out. I was wondering whether you could kindly give me a hint.

I take the following steps to calculate the "Total raw variance in observations":
1. export the Response
2. export the Person Measures and calculate the mean of them
3. export the Item Measures and calculate the mean of them
4. calculate Ebd (expected response value) by using the exp(b-d)/(1+exp(b-d))
5. Response matrix - Ebd matrix
6. square the results from step 5
7. sum the numbers on the matrix from step 6

The result was in the thousands (1000 persons and 18 items) rather than the number in the output file. Which step(s) in my procedure was/were wrong?

Many thanks!

Mike.Linacre: That looks good so far, Jacobxu.
The big number from 7. is not easily interpretable, so we need to standardize it.
8. Sum the unexplained variance in the observations: (4.-7. using the person and item measures for each response)
9. The unexplained variance -> eigenvalue of PCA of residuals = number of items (18)
10. Total raw variance = 18 * (7.) / (8.)

We now have the total raw variance expressed in eigenvalue units, which can be interpreted.
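One reading of steps 1-10, sketched in Python for dichotomous data. It assumes that step 7 uses a single expected value computed from the mean person and mean item measures, and that step 8 repeats the sum using each response's own person and item measures. The simulated data, the seed and the function names are illustrative assumptions; the generating values stand in for estimated measures, and Winsteps output is not reproduced.

```python
import math
import random

def expected_score(ability, difficulty):
    """Dichotomous Rasch expected value: exp(b-d) / (1 + exp(b-d))."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def total_variance_in_eigenvalue_units(responses, abilities, difficulties):
    """responses[n][i] is the 0/1 response of person n to item i."""
    n_items = len(difficulties)
    mean_ability = sum(abilities) / len(abilities)
    mean_difficulty = sum(difficulties) / n_items
    e_mean = expected_score(mean_ability, mean_difficulty)

    total = 0.0        # steps 4-7, using the mean measures
    unexplained = 0.0  # step 8, using each response's own measures
    for ability, row in zip(abilities, responses):
        for difficulty, x in zip(difficulties, row):
            total += (x - e_mean) ** 2
            unexplained += (x - expected_score(ability, difficulty)) ** 2
    # Steps 9-10: the unexplained variance is set to the number of items.
    return n_items * total / unexplained

# Simulated illustration: 1000 persons, 18 items.
rng = random.Random(1)
abilities = [rng.gauss(0.0, 1.5) for _ in range(1000)]
difficulties = [-2.0 + 4.0 * i / 17 for i in range(18)]
responses = [[1 if rng.random() < expected_score(b, d) else 0 for d in difficulties]
             for b in abilities]

total_units = total_variance_in_eigenvalue_units(responses, abilities, difficulties)
print(round(total_units, 1))  # total raw variance, with unexplained variance = 18
```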

jacobxu: Thanks, Mike!

Actually, I am running a simulation study to investigate the approach you proposed in the manual: using simulated data to verify the first eigenvalue in the PCA of residuals from the "empirical data".

So far, most of the conditions that violate unidimensionality can be identified using your method (please check the enclosed Excel file). For the unidimensional conditions, there is a 50:50 false-positive rate, simply because I crudely compare the first eigenvalues of the simulated data and the "empirical data".

However, some of the conclusions so far might contradict the suggestions in the manual:
1. 2.0 is not a cut-point.
2. The magnitude of the first eigenvalue might not be directly related to the number of items that violate unidimensionality. For example, in a condition with 1000 persons, 18 items and a 0.7 correlation between the two dimensions, the first eigenvalue is 1.5, but half of the items (9 items) belong to another dimension. Almost all the other conditions show the same pattern.
3. Following from 2, if the violation can be confirmed using the simulated data, the items with negative or positive contrast loadings can be clustered into different dimensions.
4. I am wondering whether taking into account the change in "total raw variance" between the empirical and simulated data (that is why I want to work out how to calculate the variance) could help us interpret the difference between the first eigenvalues and so confirm a violation of unidimensionality.
5. Based on the difference between the first eigenvalues of the simulated and empirical data, it might be possible to develop a Cronbach's-alpha-like indicator of departure from unidimensionality for practitioners.

Mike.Linacre: Excellent, Jacobxu. Please compare your work with https://www.rasch.org/rmt/rmt191h.htm

Yes, agreed. 2.0 is not a statistical cut-point. It is a substantive cut-point. When we declare that an instrument is "multidimensional", our non-technical audience says "OK, what are the dimensions? Which items are on which dimension?"

David Andrich's RUMM program contains a useful investigative device that is now implemented in Winsteps: the correlation of the person measures when the instrument is split into dimensions. Here is an example from Winsteps Table 23.3 - copied from Winsteps Help: https://www.winsteps.com/winman/index.htm?table23_2.htm

Approximate relationships between the KID measures
PCA       ACT       Pearson      Disattenuated  Pearson+Extr  Disattenuated+Extr
Contrast  Clusters  Correlation  Correlation    Correlation   Correlation
1         1 - 3     0.1404       0.2175         0.1951        0.2923
1         1 - 2     0.2950       0.4675         0.3589        0.5485
1         2 - 3     0.8065       1.0000         0.8123        1.0000

In this example, notice the low correlation between item clusters 1 and 3. We would almost certainly say this correlation indicates multidimensionality.
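For readers unfamiliar with the "Disattenuated" columns, here is a minimal Python sketch of the standard correction for attenuation: the observed correlation divided by the square root of the product of the two reliabilities, capped at 1.0. The reliabilities below are hypothetical, since the table does not report them, and this is not claimed to be the exact computation behind Table 23.

```python
import math

def disattenuated_correlation(observed_r, reliability_a, reliability_b):
    """Correct an observed correlation for measurement error in both measures."""
    corrected = observed_r / math.sqrt(reliability_a * reliability_b)
    # Values beyond +/-1 can arise from estimation error; report them as +/-1.
    return max(-1.0, min(1.0, corrected))

# Hypothetical reliabilities of 0.8 for both item clusters.
print(round(disattenuated_correlation(0.5, 0.8, 0.8), 3))  # about 0.625
print(disattenuated_correlation(0.9, 0.8, 0.8))            # capped at 1.0
```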

Please do refine your findings. Then publish an immediate Research Note in Rasch Measurement Transactions, followed by a more thorough paper in a psychometric Journal (APM, JAM, etc.)

228. Dummy subset warnings; rater severity over time

valerie March 5th, 2013, 4:15pm: Hi Mike,

First off, thanks for maintaining this forum...I've been browsing through the postings and have been impressed with the speed & detail with which you reply!

Ok, here's my situation: I am using FACETS to investigate a placement exam in which 2 raters score one essay topic on the basis of 5 rubric categories. My full data set consists of scores from 543 test takers spread out over 7 administrations and ratings from four raters. One rater participated in all 7 admins; the other three participated in 4, 2, and 1 admins, respectively. The essay prompt and the scoring rubric have been the same for all administrations.

So, I wanted to double-check that I've done the following right:

1) I coded test administration as a dummy facet (i.e. Labels= 2,administrations,D) to investigate possible interactions between it and the rubric categories. I received a warning that there were seven disjoint subsets (i.e. one for each admin). I think it is safe to ignore this warning, though, as I am only interested in the interactions? (Subset connections were no problem when I ran the data without administration info).

2) In order to investigate changes in rater severity across time, I first ran the full data set and produced an anchor file; I then ran a separate FACETS analysis for each of the 7 individual administrations by using the anchor values for rubric categories and persons but commenting out the anchor values for raters and allowing them to float. Sound reasonable?



Mike.Linacre: Thank you for your participation, Valerie.

1) Did you anchor the elements of the dummy facet? It sounds like this is what is needed, something like:
4, Administration, A
1-7, Administration, 0 ; anchor all the administrations at 0

2) The anchoring approach will work. Alternatively, obtain the values from the Administration x Rater interactions in Table 13 of the initial analysis: ("measr" for the rater element + "Bias Size") or ("measr" for the rater element - "Bias Size"), depending on whether you are modeling severity or leniency.

valerie: Hi Mike,

I thought I had anchored my dummy items, but I guess I hadn't done it right...I did what you suggested and it worked perfectly.

Thank you!


229. questions about bias analysis

apple March 6th, 2013, 4:23am: Dear Mike,
I have been researching the interaction between candidates (40) and raters (4) in a spoken English proficiency test, but I have not managed to run the Facets software successfully. Could you lend me a hand? The attached file is the data of four raters rating 40 candidates. Could you show me the bias between raters and candidates? The scores on the spoken test range from 10 to 60.
Looking forward to your reply. Thanks !

Mike.Linacre: Apple, delighted to lend you a hand.

This dataset appears to have only two facets: candidates and raters. The usual interaction/bias analysis requires at least 3 facets. So we will do what we can. Your Facets specification and data file are below. Your data are scored so that 5 observed score-points = 1 ordinal score-point. The Bias Size is in logits. The "Obs-Exp Average" (observed-expected average) is in ordinal score-points.

Here is some of the output:
Table 13.1
|Observd  Expctd  Observd  Obs-Exp|  Bias  Model                     |Infit  Outfit| Candidate           Rater             |
|  Score   Score    Count  Average|  Size   S.E.      t  d.f.  Prob. | MnSq    MnSq|  Sq  Nu  Ca  measr  N  Rater    measr |
|      6    4.65        1     1.35|  3.51   1.69   2.08     1  .2852 |   .0      .0| 115  35  35   3.66  3  Rater 3  -1.56 |
|      4    3.02        1      .98|  3.40   1.65   2.06     1  .2872 |   .0      .0| 116  36  36  -1.43  3  Rater 3  -1.56 |
|      7    5.81        1     1.19|  3.35   1.72   1.94     1  .3024 |   .0      .0| 135  15  15   2.33  4  Rater 4   2.75 |
|      7    5.81        1     1.19|  3.35   1.72   1.94     1  .3024 |   .0      .0| 159  39  39   2.33  4  Rater 4   2.75 |
|      5    3.88        1     1.12|  2.87   1.59   1.80     1  .3223 |   .0      .0| 110  30  30   1.66  3  Rater 3  -1.56 |
|      5    3.88        1     1.12|  2.87   1.59   1.80     1  .3223 |   .0      .0| 111  31  31   1.66  3  Rater 3  -1.56 |

Here is the Facets specification and data file:

Title="Essay Analysis"
Facets = 2 ; candidates and raters
Positive = 1,2 ; candidate ability and rater leniency
Center = 2 ; the mean leniency of the raters is the measurement origin
Model = ?B,?B, MyScale ; the ratings are modeled to accord with MyScale rating-scale
Rating Scale = MyScale, R60 ; highest observable rating is 60
0 = 10 , , , 10 ; rescore because intermediate categories are structural zeroes.
1 = 15 , , , 15
2 = 20 , , , 20
3 = 25 , , , 25
4 = 30 , , , 30
5 = 35 , , , 35
6 = 40 , , , 40
7 = 45 , , , 45
8 = 50 , , , 50
9 = 55 , , , 55
10= 60 , , , 60
*
Labels=
1, Candidate
1-40
*
2, Rater
1 = Rater 1
2 = Rater 2
3 = Rater 3
4 = Rater 4
*
Data= ; copied from Excel spreadsheet so that we can see the format:
; element element observations
;candidate raters rater1 rater 2 rater 3 rater4
1 1-4 35 40 30 45
2 1-4 10 20 10 15
3 1-4 30 35 25 30
4 1-4 10 15 10 15
5 1-4 20 30 25 35
6 1-4 35 35 30 40
7 1-4 30 40 25 40
8 1-4 30 40 30 40
9 1-4 45 45 40 50
10 1-4 30 35 30 40
11 1-4 30 40 35 35
12 1-4 35 40 35 35
13 1-4 40 50 45 50
14 1-4 10 10 10 15
15 1-4 25 30 35 45
16 1-4 35 35 40 45
17 1-4 30 40 35 40
18 1-4 30 30 25 35
19 1-4 30 40 30 45
20 1-4 35 40 30 35
21 1-4 30 35 25 40
22 1-4 30 40 25 45
23 1-4 30 35 25 40
24 1-4 30 45 40 50
25 1-4 30 40 35 40
26 1-4 35 45 40 45
27 1-4 30 35 30 40
28 1-4 35 40 35 45
29 1-4 35 40 40 45
30 1-4 25 30 35 40
31 1-4 25 35 35 35
32 1-4 40 50 45 50
33 1-4 30 35 35 40
34 1-4 25 30 25 35
35 1-4 30 35 40 40
36 1-4 25 25 30 30
37 1-4 35 40 30 40
38 1-4 30 35 30 35
39 1-4 25 35 30 45
40 1-4 25 30 25 25

230. Problems with Winsteps

CelesteSA March 5th, 2013, 7:15am: Hi There,

I have been using Winsteps for about a year and a half. On my old PC I got output that made sense to me. When I got a new PC and installed Winsteps, it gave me output that did not match the output I had gotten before. I asked a colleague to run the same data set through her Winsteps and she got completely different output than mine (item map looks very different, item statistics don't match).

This is very worrying to me: how do we know which output is correct if the same data run through Winsteps on two different PCs produces completely different output? I have tried various things to correct this (re-installing Winsteps, running it in compatibility mode), but nothing works.

Please help! I rely on the output I get from Winsteps to make decisions about test refinement, and if the program cannot give me consistent results for the same data set then my decisions could be incorrect.


Mike.Linacre: CelesteSA, this situation is perplexing. The results of each new version of Winsteps are compared with the results of previous versions. Changes are documented in https://www.winsteps.com/wingood.htm - the Rasch estimates produced by Winsteps have not changed in many years. Winsteps is running on many versions of Windows. Currently supported versions of Windows are XP, Vista, 7 and 8, but earlier versions of Winsteps, producing the same estimates, are operating on earlier versions of Windows.

The changes between Winsteps versions that do alter estimates are changes to the default values of certain control variables. For instance, are you analyzing rating-scales with unobserved intermediate categories? The default setting used to be STKEEP=No, so that unobserved intermediate categories are squeezed out of the rating scale. So, a rating scale with categories 2,4,6 was analyzed as 2,3,4. The current default is STKEEP=Yes. Every intermediate numerical value in a rating scale is a category: 2,4,6 is analyzed as 2,3,4,5,6.
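
The effect of the two settings on category numbering can be sketched in a few lines of Python. This is only an illustration of the recoding described in the post, not Winsteps code; the function name and argument are hypothetical.

```python
def analyzed_categories(observed, keep_unobserved):
    """Return the category values that would enter the analysis.

    observed: category values seen in the data, e.g. [2, 4, 6].
    keep_unobserved: True mimics the STKEEP=Yes behaviour described
    above, False mimics STKEEP=No (illustration only).
    """
    cats = sorted(set(observed))
    low, high = cats[0], cats[-1]
    if keep_unobserved:
        # every intermediate numerical value counts as a category
        return list(range(low, high + 1))
    # unobserved intermediate categories are squeezed out
    return [low + i for i in range(len(cats))]


print(analyzed_categories([2, 4, 6], keep_unobserved=False))
print(analyzed_categories([2, 4, 6], keep_unobserved=True))
```

With the example from the post, the first call gives 2,3,4 and the second gives 2,3,4,5,6, so the same data define a 3-category or a 5-category scale depending on the default.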

Please email to me your Winsteps control and data file, the two different outputs, and the Windows versions of the two PCs. I will investigate immediately. Email to mike \at/ winsteps.com

Mike.Linacre: CelesteSA, thank you for your files. The differences were caused by different default settings of GROUPS=.
In one analysis, there is the Rating Scale model (items share the same rating-scale structure): GROUPS=
In the other analysis, there is the Partial Credit model (each item defines its own rating-scale structure): GROUPS=0

231. Weighting items and persons

Letao March 5th, 2013, 8:51pm: Hi Mike,

The data set I am working with now comes with a recommendation to apply weights in all analyses. I read the Winsteps manual section on "weighting items and persons", https://www.winsteps.com/winman/index.htm?weighting.htm, but I am still somewhat confused.
1) My data set has a person-weight variable that is not normalized. According to the manual, normalized weights are suggested for the analyses. Does this mean I have to write PWEIGHT=(item # of weight)*N/(sum of all weights) in the control file, and choose "include all with specified weights" for person weights in the weight selection? My data has no item weights. Does that mean I have to choose "report all and weight 1." for item weights?
2) The data set I am working with is longitudinal. To accurately measure the persons at different waves, which weights should I use, longitudinal weights or cross-sectional weights?

Thank you,


Mike.Linacre: Thank you for your questions, Letao.

First do unweighted analyses to verify that everything is correct. Weighting will skew the results and can hide problems in the data. Also, the Reliability of this unweighted analysis is the true Reliability for these data.

Then, apply the weights using PWEIGHT=. There is no need to normalize weights or do other adjustments unless you want the reliability of the weighted analysis to match the true Reliability.
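
If normalized weights are wanted so that the weighted reliability matches the unweighted one, the rescaling mentioned in the question (weight × N / sum of all weights) can be sketched as follows. This is a minimal Python illustration with made-up weights; it runs outside Winsteps, and the function name is hypothetical.

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to the number of persons N."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]


raw = [2.0, 1.0, 1.0]           # hypothetical census weights
normalized = normalize_weights(raw)
print(normalized)                # relative sizes are unchanged
print(sum(normalized))           # equals N, the number of persons
```

The normalized weights keep the same ratios to each other, so the relative influence of each person is unchanged; only the total amount of "data" is brought back to N.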

Weighting: cross-sectional or longitudinal?
If the weighting is census weighting because each person represents a group in the population, then use cross-sectional weighting.
If the weighting is to make each time-point in the longitudinal data equally influential, then use longitudinal weighting.

232. questions about my specification

eagledanny March 2nd, 2013, 1:43am: Dear Prof. Linacre,
Good evening!
Recently I wrote a specification for 20 raters rating 20 students' essays on a 15-level holistic scale. Each student was scored by two raters. However, the Facets run halted in the middle. I checked your 2010 Facets manual, but cannot spot the reason. My specification and data are attached. Would you please lend me a hand? Thank you so much!
Best regards

Mike.Linacre: Eagledanny, you have few observations (40), but a long rating scale (16 categories) and 40 elements (20 raters and 20 students). There are not enough observations to estimate the structure of the rating scale meaningfully.

Suggestion: replace R15 with B15. This specifies a standard "binomial trials" rating-scale structure for the data:

Model = ?,?, MyScale
Rating scale = MyScale, B15
0 = lowest
15 = highest

The estimation will now report reasonable results.

Table 6.0 All Facet Vertical "Rulers".

|Measr|-raters     |+students|MYSCA|
|   3 +            +         +(15) |
|     |            |         | --- |
|     |            |         |     |
|     |            |         |     |
|     |            |         | 13  |
|     |            |         |     |
|     |            | 1       |     |
|     |            |         | --- |
|   2 +            +         +     |
|     |            |         | 12  |
|     |            |         |     |
|     |            |         | --- |
|     |            |         |     |
|     |            | 11      | 11  |
|     |            |         |     |
|     |            | 12      | --- |
|   1 +            +         + 10  |
|     | r10 r17    | 20      |     |
|     | r12        |         | --- |
|     | r1 r11 r15 | 2 17    |     |
|     | r7         | 7       |  9  |
|     | r2         |         | --- |
|     | r6 r8      |         |     |
|     |            | 19      |  8  |
*   0 *            * 3       * --- *
|     | r16        | 5 6     |  7  |
|     | r3 r9      | 9 18    |     |
|     | r13 r4 r5  |         | --- |
|     |            | 14      |  6  |
|     | r14 r20    |         |     |
|     |            | 4       | --- |
|     |            |         |     |
|  -1 +            +         +  5  |
|     |            | 15      | --- |
|     |            |         |     |
|     | r18 r19    | 8       |  4  |
|     |            |         |     |
|     |            |         | --- |
|     |            | 13      |     |
|     |            |         |  3  |
|  -2 +            +         +     |
|     |            | 10      | --- |
|     |            |         |     |
|     |            |         |     |
|     |            | 16      |  2  |
|     |            |         |     |
|     |            |         |     |
|     |            |         | --- |
|  -3 +            +         + (0) |
|Measr|-raters     |+students|MYSCA|

Notice that the range of the raters is about half the range of the students. This accords with a remark by F.Y. Edgeworth in his 1890 paper "The Element of Chance in Competitive Examinations"!

And also look at Table 8 to see the frequency of the categories:

Table 8.1 Category Statistics.

|         DATA          |   QUALITY CONTROL   | Obsd-Expd|Response|
| Category Counts  Cum. |  Avge   Exp. OUTFIT |Diagnostic|Category|
|Score  Used    %     % |  Meas   Meas   MnSq | Residual |  Name  |
|    0     0   0%    0% |                     |    -.7   | lowest |
|    1     2   5%    5% | -3.22  -2.77     .1 |          |        |
|    2     2   5%   10% | -2.65  -2.22     .1 |          |        |
|    3     3   8%   18% | -1.59  -1.59     .2 |     .6   |        |
|    4     0   0%   18% |                     |   -2.8   |        |
|    5     5  13%   30% | -1.13   -.79     .2 |    1.8   |        |
|    6     3   8%   38% |  -.50   -.42     .1 |    -.5   |        |
|    7     6  15%   53% |  -.15   -.09     .1 |    2.1   |        |
|    8     5  13%   65% |   .33    .20     .1 |     .9   |        |
|    9     4  10%   75% |   .57    .47     .1 |          |        |
|   10     3   8%   83% |   .91    .78     .1 |    -.7   |        |
|   11     1   3%   85% |  1.18   1.12     .0 |   -2.2   |        |
|   12     6  15%  100% |  1.72   1.42     .3 |    3.5   |        |
|   13     0   0%  100% |                     |   -1.5   |        |
|   14     0   0%  100% |                     |    -.5   |        |
|   15     0   0%  100% |                     |          | highest|
Binomial trials discrimination: 1.39 S.E. .02
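
For readers unfamiliar with the "binomial trials" structure requested by B15, one common formulation treats each rating as the number of successes in 15 independent trials, with the success probability set by the difference between student ability and rater severity. The Python sketch below follows that textbook form under stated assumptions: it ignores the discrimination parameter reported under Table 8.1, it is not the Facets estimation code, and the function and argument names are hypothetical.

```python
import math


def binomial_trials_probs(ability, severity, trials=15):
    """Probability of each rating 0..trials under a binomial-trials model.

    ability and severity are in logits; discrimination is assumed to be 1.
    """
    p = 1.0 / (1.0 + math.exp(-(ability - severity)))
    return [
        math.comb(trials, x) * p**x * (1.0 - p) ** (trials - x)
        for x in range(trials + 1)
    ]


probs = binomial_trials_probs(ability=0.0, severity=0.0)
expected = sum(x * pr for x, pr in enumerate(probs))
print(round(sum(probs), 6), round(expected, 2))
```

Because the whole 16-category structure depends on a single success probability, no separate threshold has to be estimated for each category, which is why this model can cope with only 40 observations.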

eagledanny: Hi Mike,
I was so surprised to get your immediate reply. I suspected that the number of observations was not up to standard, but had no idea how to deal with it. Thank you so much.
Best regards

233. Misfitting item - blank responses to solve?!

avandewater February 27th, 2013, 3:51am: Hi Mike,

Using clinical observations to estimate a person's ability, we have some items that are significantly misfitting (e.g. OUTFIT MNSQ 5.16 ZSTD 2.9). On investigation, this is due to 3 (out of 109) responses (found in Winsteps Table 10.4).
The misfitting responses to these items are from the main misfitting people (Table 6.1 and further).
During the Rasch course I took last year, in these cases we mainly discussed "deleting persons". Since it seems (to me) a bit radical to delete an entire response string, especially with a smaller sample size, I was wondering:
Would blanking out (creating missing data for) one response of one person to one item at a time be better when looking at fit and outliers, rather than removing all responses of a person?

Also, other than "https://www.rasch.org/rmt/rmt234g.htm", are there any good references to support the deletion of persons or responses from a data set when looking at fit? (Some non-Rasch people might find it "dodgy" to just delete data...)

Thanks in advance!

Mike.Linacre: Sander, deleting a few misfitting observations in a Rasch analysis (using, say, EDFILE= in Winsteps) may make the overall fit look better, but its impact on measurement will be negligible.

The problem of deleting obviously aberrant responses (called "outliers" in most of statistics) is often discussed by statisticians. In the psychometric tradition, L.L. Thurstone was among the first to advocate quality-control of the data. In modern data mining, it is called "data cleaning". A practical example is https://www.rasch.org/rmt/rmt62a.htm where situations in which guessing is likely are screened out of the data. Obviously eliminating bad data must not only be defensible, but also be seen to be the intelligent thing to do.

234. Multiple measurements on persons into one data set

avandewater February 27th, 2013, 2:46am: Hi Mike and other Rasch people!

Being a 'baby' here on the forum with my first post, and having been involved in Rasch analyses for only about half a year, I'm afraid I'm not too familiar with how to search the forum accurately. Therefore I have some questions, the first about the use of multiple measurements of persons in a data set.

My situation: three clinical measurements taken:
1 (baseline)
2 (6 weeks later)
3 (one week after 2; 7 weeks after 1).

Sample size n=92 (at time point 1).

To be consistent with recommendations (https://www.rasch.org/rmt/rmt74m.htm) I would prefer to have a data set of n>108. (Time-wise, it was not possible to keep collecting data.)

I have read about stacking data from different time points (e.g. https://www.rasch.org/rmt/rmt101f.htm; and https://www.rasch.org/rmt/rmt171a.htm). However, I also read (I can't remember where at the moment) that adding data from the same people should be avoided since it violates the requirement of 'independent measurements', which I think is fair enough.

Is there a good reference that supports adding (any) second measurements, i.e. using multiple measurements on persons in a data set?

I could imagine that data points of people who have improved or deteriorated to a clinically meaningful degree can be included (although the data set might get skewed towards higher ability). However, if people have changed little or not at all, what then? Should they be left out of the data set?

It would be great if anyone could help me with this!

Thanks in advance!

Mike.Linacre: Welcome, Sander.

The effect of dependency on Rasch analysis is usually minimal, but referees are often concerned about it. One technique is described at https://www.rasch.org/rmt/rmt251b.htm - This uses a sample of the records (one for each patient) to construct the measurement framework, which is then applied to all the patient records.
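
The first step of that technique, drawing one record per patient, can be sketched in Python as below. Picking the record at random is an assumption made for this illustration (the linked note may select records differently), and the record layout and function name are hypothetical.

```python
import random


def one_record_per_patient(records, seed=0):
    """Pick one record at random for each patient.

    records: list of (patient_id, time_point) tuples.
    """
    rng = random.Random(seed)   # fixed seed so the sample is reproducible
    by_patient = {}
    for rec in records:
        by_patient.setdefault(rec[0], []).append(rec)
    return [rng.choice(recs) for recs in by_patient.values()]


records = [(pid, t) for pid in ("p01", "p02", "p03") for t in (1, 2, 3)]
sample = one_record_per_patient(records)
print(sample)
```

The sampled records would be analyzed first to obtain the item calibrations, and those calibrations would then be applied to the full stacked data set.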

235. Polytomous Outfit MNSQ Calculations

uve February 25th, 2013, 9:45pm: Mike,

When I square the Z-Scores for dichotomous, sum them and divide by degrees of freedom, I get the same outfit MNSQ as reported in Winsteps. However, when I attempt this for polytomous data, it does not come out the same. Is there an additional step(s) needed?

Mike.Linacre: Uve, https://www.rasch.org/rmt/rmt34e.htm shows the actual formulas used in Winsteps. They are from Rating Scale Analysis (Wright & Masters) p. 100. (A classical page in Rasch mythology.)

uve: Mike,

Here's an excerpt from your link:

Outfit is based on a sum of squared standardized residuals. Standardized residuals are modeled to approximate a unit normal distribution. Their sum of squares approximates a χ² distribution. Dividing this sum by its degrees of freedom yields a mean-square value, OUTFIT MEANSQ, with expectation 1.0 and range 0 to infinity.

The equation given in the link appears to be what I did. My assumption is that I should be able to square the z-scores from the XFILE output, sum them, divide by the total number of respondents to that item, and that this should produce the MNSQ found in the Winsteps tables. This worked for various dichotomous items. But when I apply the same process to polytomous data, it does not work. In the polytomous data, I coded missing responses as -1 in Winsteps. I filtered out these responses, but this did not work either.

I must be missing something very basic.

Mike.Linacre: Perplexing, Uve. Polytomous or dichotomous should not make any difference. The computation is the same for all z's.
For the mean-square computation, missing data and extreme scores are skipped. So, in the XFILE, exclude those. This is most easily done from the "Output Files" menu. Uncheck "Missing" and "Extreme".
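
The computation discussed in this thread can be sketched in Python as follows: square the standardized residuals, skip observations that are missing or belong to extreme scores, and divide by the number of observations actually used. The input lists are hypothetical stand-ins for columns exported from the XFILE; the actual column names are not assumed here.

```python
def outfit_mnsq(z_residuals, skip_flags):
    """Outfit mean-square: mean of squared standardized residuals.

    z_residuals: standardized residuals for one item (or one person).
    skip_flags: True where the observation is missing or comes from an
    extreme score, so that it is left out of the computation.
    """
    used = [z for z, skip in zip(z_residuals, skip_flags) if not skip]
    return sum(z * z for z in used) / len(used)


z = [0.5, -1.0, 2.0, 3.0]
skip = [False, False, False, True]   # last response is from an extreme score
print(outfit_mnsq(z, skip))
```

Including the flagged observation in both the sum and the count changes the result, which is the kind of discrepancy reported above for the polytomous data.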

uve: Success! Thanks Mike. I knew it had to be something simple.

I’m finding the XFILE invaluable in helping build a more detailed picture of who comprises the respondent groups found using incorrect choices or highly unexpected categories, as summarized in something like Table 10.3. I was trying to decide whether I should use z-scores or chi-square to classify students when I ran into the issue. I guess my dichotomous data just happened not to have any extreme scores, which is why it worked and the polytomous data didn’t. I thought that using chi-square would produce a better method of grouping because I could use a single cut point of sorts, but settled on z-scores filtering for less than -.2 and greater than .2. I can then find the average measures for each z-score grouping beyond this range by each option but with added filters for, say, gender, ethnicity, school site, etc. to see just exactly where most of the deviance is coming from.

This brings up an additional question. If we have access to z-scores, why would we need MNSQ’s? Z-scores provide us with ranges with which we are familiar and for which we would have a better sense of what counts as significant deviance: one number hopefully does the trick. Values beyond +/- 2 are more familiar, but there does not seem to be a consistent cut-point range for MNSQ.

Mike.Linacre: Uve, yes, the choice of convenient statistics is ultimately personal. For instance, a century ago they were choosing between "mean absolute deviation" (MAD) and "standard deviation" (SD). SD became the convention, but MAD is easier to explain to a non-technical audience.

BTW, this simple mean-square guide is generally useful:

Interpretation of parameter-level mean-square fit statistics:

>2.0 Distorts or degrades the measurement system.
1.5 - 2.0 Unproductive for construction of measurement, but not degrading.
0.5 - 1.5 Productive for measurement.
<0.5 Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.

236. Sample size for one-parameter Rasch Model

lw357 February 25th, 2013, 11:04pm: Hi Everyone,

I am a Ph.D. student of Applied Linguistics at NAU. Suppose I have 50 students who performed three different speaking tasks, and their responses (3*50) were rated by two teachers on a five-point scale (each task has a different holistic rubric). Is it feasible to run a FACETS analysis?

Looking forward to hearing back from you

Mike.Linacre: Sure, lw357. The usual minimum for robust results is around 30 students, but Facets can analyze any sample size from 2 upwards. For sample-size criteria, please see https://www.rasch.org/rmt/rmt74m.htm.

Your dataset is small enough that it can be analyzed with the free student/evaluation version of Facets, called Minifac, www.winsteps.com/minifac.htm - which has all the capabilities of Facets, except for restrictions on the size of the dataset.

lw357: Dear Dr.Linacre,

Thank you for your reply. So your suggestion would be to use Minifac instead of the regular Facets, right?

Sure, we will download Minifac and try to run it.

Thank you very much and have a nice night!


237. item anchoring for survey items

Letao February 21st, 2013, 11:48pm: Dear Mike,

I have longitudinal survey data and I want to anchor the item calibrations in the 1st year, so that I can track the changes of person calibrations at different waves. Anchoring items for test items makes sense to me, but I am not sure if I can do this for survey items. Is item/person anchoring in survey research logically sound?

Another question:

Most of my survey items are dichotomous (e.g., yes/no), but a few of them are polytomous (e.g., from never to almost every day). How should I deal with items with different scales in a Rasch model? What I intended to do is to recode the polytomous variables into dummy variables and use Winsteps to estimate the parameters. Mathematically, it will work, because all the data are dummy coded. But conceptually, I am not quite sure, since some items are yes/no, and some items are less/more.

Many thanks,


Mike.Linacre: Letao,
1) Please do anchor the survey items in later analyses so that all results can be directly compared.
2) Please analyze the dichotomous and polytomous items together. There is no need to recode the polytomous items. Dichotomous and polytomous items can be anchored in later analyses.

Letao: Hi Mike,

Thank you very much for the quick response.
Which model in Winsteps should I use if I want to analyze the dichotomous and polytomous items together? I used the Partial Credit Model to analyze polytomous data before, so should I still use the PC model when analyzing these two types of data together?

Thank you,


Mike.Linacre: Letao: yes, you can use PCM

but it may make more sense to use a grouped rating-scale model, specified with a GROUPS= string
in which there is a D for each of the dichotomous items, and a P for each of the polytomous less/more items.

PCM is for when each item has a unique response structure. The grouped rating-scale model is for when subsets of items share the same response structure.
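
The string of D's and P's has one character per item, in item order. A small Python sketch of building it from a list of item types is shown below. The item list is made up, and the control-variable name GROUPS= is taken from the earlier thread in this archive; the Winsteps manual should be checked for the exact syntax in a given version.

```python
def groups_string(item_types):
    """One letter per item: D for dichotomous, P for polytomous."""
    codes = {"dichotomous": "D", "polytomous": "P"}
    return "".join(codes[t] for t in item_types)


# hypothetical survey: three yes/no items followed by two less/more items
items = ["dichotomous"] * 3 + ["polytomous"] * 2
print("GROUPS=" + groups_string(items))
```

Items that share a letter share one response structure, so all the yes/no items are modeled together and all the less/more items are modeled together.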

Letao: That's fantastic! Thanks a lot and warm regards!!


238. Concurrent Validity and Predictive validity

anita February 21st, 2013, 2:15am: Good morning Prof. Linacre
Good morning Rasch-people

I want to ask how to analyse concurrent validity.
(I am using an established instrument with dichotomous data, whereas the instrument that I am building uses a Likert scale.)
Is it necessary to use an established instrument that also has a Likert scale?

I also want to know how to analyse predictive validity. Is it the same procedure?

Thanks for your time and help.

Mike.Linacre: Anita, concurrent validity is usually evaluated with the correlation between the total scores or measures on two instruments. The format of the items on the instruments is usually irrelevant.

Predictive validity is usually evaluated with the correlation (or similar) between the total scores or measures on an instrument and an indicator variable (such as graduation from College or Grade level.)

anita: Thank you for the answer.
As I am a Rasch novice, may I have a link that shows step by step how to analyse concurrent validity and predictive validity?

How to interpret correlation between the total scores or measures on two instruments?

Many thanks for your help in advance,

Mike.Linacre: Anita:

For both concurrent and predictive validity.

1) Perform a Winsteps analysis

2) Output Files menu. PFILE= to Excel

3) In Excel, the important column from the Rasch analysis is called "Measure"

4) In Excel, place your other variable in another column, aligned by rows with the "Measure" column. Your other variable is .....
Concurrent validity: the raw score or measure on another instrument of known validity
Predictive validity: the value to be predicted by the Rasch measure, such as Grade level

5) In Excel, correlate the Rasch "Measure" column with the other column to discover Concurrent or Predictive validity.

OK :-)
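
For anyone doing step 5 outside Excel, the same Pearson correlation can be sketched in plain Python. The two lists are made-up stand-ins for the "Measure" column and the other variable, aligned by person; the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


measures = [-1.2, -0.4, 0.1, 0.8, 1.5]   # hypothetical Rasch measures
other = [11, 14, 13, 18, 21]             # hypothetical scores on another instrument
print(round(pearson_r(measures, other), 3))
```

A value near 1 indicates that the two instruments rank and space the persons in much the same way; a value near 0 indicates little relationship.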

anita: Thank you. Very helpful.

Warmest regards,

239. facets for PCM and RSM

Juwon,J February 21st, 2013, 6:05am: Hi, Mike.

I'm researching something about PCM and RSM using facets program.

What is the difference between PCM and RSM in code?

==> Models = ?,?,#,R5 for PCM

==> Models = ?,?,?,R5 for RSM

The only difference is # or ?, right?

If yes, I have a problem when comparing the output with the SAS output, specifically the step difficulties of the PCM. They are quite different from the output of SAS. That's why I'm not sure now that the code below is right.

And if the examinee facet is set to be non-centered, do the other facets (rater, item) become centered automatically?
Which specification is the right one? I'm confused.
Do I need to add a code for PCM?


; facet spec_facets3_spec.spec
Title = Ratings of PCM
Output = output_pcm1000facets3_log.log
;Output file report by facets
Score file=pcm1000facets3.SC ; score files
;Score file=pcm1000facets3.csv ; score files, csv file or excel

Arrange = N ; report all facets, sorted by entry number order, ascending

Facets = 3 ; examinee, rater, item
Positive = 1 ; facet 1, examinees, higher score=higher measure
Inter-rater = 2 ; facet 2 is the rater facet
Non-centered = 1 ; facet 1
Models = ?,?,#,R5 ; each element of facet 3 (item) has its own rating scale, i.e., a partial credit model

Labels= ;to name the components
1,examinee ;name of first facet
1-1000 ;1000 anonymous examinees

2,rater ;name of second facet
1=rater1 ;names of elements within facet
2=rater2 ;these must be named, or they are treated as missing




Ratings of PCM 2013-02-21 1:29:45 PM
Table 8.1 Category Statistics.

Model = ?,?,1,R5 ; itemno: item1
| Category Counts Cum.| Avge Exp. OUTFIT| Thresholds | Measure at |PROBABLE| THURSTONE|PEAK|
|Score Used % % | Meas Meas MnSq |Measure S.E.|Category -0.5 | from |Thresholds|Prob|
| 0 34 2% 2%| -1.53 -1.53 1.0 | |( -3.51) | low | low |100%|
| 1 114 6% 7%| -.63 -.63 1.0 | -2.26 .20| -1.78 -2.71| -2.26 | -2.48 | 46%|
| 2 238 12% 19%| .05 .07 .9 | -1.01 .11| -.52 -1.10| -1.01 | -1.06 | 42%|
| 3 348 17% 37%| .94 .88 1.0 | .08 .08| .54 .01| .08 | .03 | 39%|
| 4 550 28% 64%| 1.90 1.92 1.1 | .92 .07| 1.77 1.10| .92 | 1.03 | 47%|
| 5 716 36% 100%| 3.15 3.16 1.0 | 2.27 .06|( 3.51) 2.71| 2.27 | 2.46 |100%|

---------------------sas output-----------------------

Item 1.






Mike.Linacre: Thank you for your questions, Juwon.

You wrote: What Is the difference between PCM and RSM in code?
==> Models = ?,?,#,R5 for PCM
==> Models = ?,?,?,R5 for RSM
The only difference is # or ?, right?

Reply: This is correct. # specifies that each element of facet 3 defines its own rating-scale structure. This is the "partial-credit model".

You wrote: is it right for examinee to set up to be non-centered?

Reply: One facet must be non-centered in order for the estimates to be identifiable (and not overconstrained). We can choose which facet. Usually we choose the facet that is the "object of measurement". Usually this is the examinee facet, unless we are doing a study of rater behavior.

You wrote: If so, I got a problem in output, in detail, step difficulty of PCM. It is different a lot from the output by SAS.

Reply: Facets estimation in Table 8 is self-checking. Please look at Table 8, column "Observed-Expected Diagnostic Residual". If this column is not shown, or the values in it are small, then Facets has verified that the Rasch-Andrich thresholds correspond to the "Used" frequency counts for the estimation method implemented in Facets.

| Category Counts Cum.| Avge Exp. OUTFIT| Thresholds | Measure at |PROBABLE| THURSTONE|PEAK|Diagnostic|
|Score Used % % | Meas Meas MnSq |Measure S.E.|Category -0.5 | from |Thresholds|Prob| Residual |
| 0 34 2% 2%| -1.53 -1.53 1.0 | |( -3.51) | low | low |100%| -.9 |
| 1 114 6% 7%| -.63 -.63 1.0 | -2.26 .20| -1.78 -2.71| -2.26 | -2.48 | 46%| |
| 2 238 12% 19%| .05 .07 .9 | -1.01 .11| -.52 -1.10| -1.01 | -1.06 | 42%| |
| 3 348 17% 37%| .94 .88 1.0 | .08 .08| .54 .01| .08 | .03 | 39%| |
| 4 550 28% 64%| 1.90 1.92 1.1 | .92 .07| 1.77 1.10| .92 | 1.03 | 47%| |
| 5 716 36% 100%| 3.15 3.16 1.0 | 2.27 .06|( 3.51) 2.71| 2.27 | 2.46 |100%| |

The SAS thresholds are more central than the Facets thresholds. The S.D. of the thresholds for Facets is 1.56. The S.D. of the thresholds for SAS is 1.32. Their ratio is 1.56/1.32 = 1.18
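
The Facets figure quoted here can be reproduced from the threshold measures in Table 8.1. The Python sketch below uses the population standard deviation, which matches the reported 1.56; the SAS value of 1.32 is simply copied from the post because the SAS thresholds themselves are not shown in this archive.

```python
import statistics

# Rasch-Andrich thresholds for item 1, from the "Thresholds Measure" column of Table 8.1
facets_thresholds = [-2.26, -1.01, 0.08, 0.92, 2.27]

facets_sd = statistics.pstdev(facets_thresholds)   # population S.D.
sas_sd = 1.32                                      # quoted in the post, not recomputed here
ratio = facets_sd / sas_sd

print(round(facets_sd, 2), round(ratio, 2))
```

The same two lines applied to the examinee measures from each program would give the ratio asked about in the next paragraph.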

What are the S.D.s of the examinee distributions for Facets and SAS? What is their ratio? If the ratio is close to 1.18 then this suggests that SAS is modeling the examinee distribution to be more central than Facets.

What estimation method is used by SAS? Is there a webpage describing the SAS algorithm?

Research suggestion:
Use the Rating-scale model in Facets and SAS with Facet 1 noncentered and the equivalent specification in SAS. Save their outputs.
Then do the same with Facet 2 noncentered and the equivalent specification in SAS. Save their outputs.
Then do the same with Facet 3 noncentered and the equivalent specification in SAS. Save their outputs.

In all 3 Facets runs, the Andrich Thresholds in Table 8 should be the same. Are they always the same in SAS? If not, this suggests that SAS is imposing constraints on the distributions of the elements in each facet that are interacting with the threshold estimates.

240. rating scale diagnostics

Li_Jiuliang February 18th, 2013, 4:40am: Hi Mike, I have a question about rating scale diagnostics information. My rating scale has four components, i.e., MIC, INT, LU, and SU, each of which has six categories, i.e., 0 - 5. It is said the simplest way to assess category functioning is to examine category use statistics (i.e., category frequencies and average measures) for each response option. The shape of the distribution is an important feature of the category frequencies, and regular distributions such as the normal distribution are preferable to those that are irregular (Bond & Fox, 2007). The FACETS output of category frequencies (i.e., the observed counts) for my study is shown in the attachment. My question is: do the category frequencies in my study show a normal or near-normal distribution? And how can I tell? Thank you!

Mike.Linacre: Li Jiuliang, thank you for your questions.

Normal distributions of observed Counts are not necessary, but we do like to see unimodal distributions.
We also like to see Average Measures advancing.

We can see that there are few observations of category 5. This is a problem if we need to infer rating-scale structures from these data to future data for all the components. Category 0 also has few observations, even though they have low average measures (which is what we want). This suggests that we need more very high and very low performers in our analysis in order to make inferences based on these counts more robust. We need at least 10 observations of every category of every component, if the rating scales for the items are to be analyzed independently (the Partial Credit model).

MIC, INT, LU, SU are all unimodal and have strongly advancing Average Measures. This is good.

The patterns for the components look similar (after adjusting for the modes of the observed counts, and ignoring categories 0 and 5). Therefore, my recommendation would be to use the "Rating Scale Model" (MIC, INT, LU, SU share the same rating scale structure). We would have 15 observations of category 0, so its estimation should be robust. We would have 5 observations of category 5, shared by MIC, INT, LU, SU, so that the top category of all the components would be defined (though only roughly estimated), but "rough" is better than nothing.

dachengruoque: Forgive me if my understanding is superficial. Can we infer from the observed counts across the 6 categories that raters seldom apply category 0 in the four components, and category 5 even less? The rated performances then seem to differ little, since most of the applied categories are 1, 2, or 3 in the four components; category 4 also has few observed counts, though more than categories 0 and 5.

Mike.Linacre: Dachengruoque, your inferences appear to be correct.

Li_Jiuliang: Thank you very much, Mike and Dachengruoque!

241. Break point value measurement for grouping data

rony000 February 19th, 2013, 2:28pm: Dear friends,
I have 9 parameters with binary data (0/1) and a sample of 460. Now I need to categorize the respondents into 4 groups. How do I get the break-point value for each group?

I have attached my data and a link to a report. Please see Exhibit 3.2 (page 31) and Exhibit 3.3 (pages 34, 70) of the report, available here: www.fns.usda.gov/fsec/files/fsguide.pdf

Please tell me how to do this measurement in the same way as in the report. I have the Winsteps software.


Mike.Linacre: Mahmud, we can stratify the respondents into 4 groups in two ways:
1) Normatively:
either (a) one quarter of the sample in each range, starting from the bottom
or (b) divide the measurement range of the sample into quarters, starting from the bottom

2) By criterion:
Look at the item map and choose 3 cut-points which indicate transitions from one qualitative status to the next higher qualitative status. These 3 cut-points define 4 groups.

The choice of stratification methods depends on the message you want to communicate to your audience, and how the audience is intended to use your findings.
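
The two normative options, 1(a) and 1(b), can be sketched in Python as below. The person measures are made up, the function names are hypothetical, and ties at a cut-point can make the groups in 1(a) slightly unequal. Criterion-based cut-points (option 2) come from reading the item map and cannot be computed this way.

```python
def cuts_by_count(measures):
    """1(a): three cut-points putting about a quarter of the sample in each group."""
    s = sorted(measures)
    n = len(s)
    return [s[(n * k) // 4] for k in (1, 2, 3)]


def cuts_by_range(measures):
    """1(b): three cut-points dividing the measurement range into quarters."""
    lo, hi = min(measures), max(measures)
    step = (hi - lo) / 4
    return [lo + step * k for k in (1, 2, 3)]


def assign_group(measure, cuts):
    """Group 1 is the lowest, group 4 the highest."""
    return 1 + sum(measure >= c for c in cuts)


measures = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 6.0]
print(cuts_by_count(measures))
print(cuts_by_range(measures))
print([assign_group(m, cuts_by_count(measures)) for m in measures])
```

With one very high measure in the sample, the two methods give quite different cut-points, which illustrates why the choice depends on the message intended for the audience.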

243. Ordered Thresholds Versus Summary Data

uve February 14th, 2013, 6:18pm: Mike,

Upon examining the probability curves of a survey, I made the decision that perhaps categories 2 and 3 should be combined. This rescoring provides better probability curves and what appears to be maybe a slightly better fit to the model, though the visual is hard to interpret. However, once I dig into the summary data and the item-person map, a different picture emerges. Reliability drops slightly as does separation and error increases slightly. The mean increases for the rescored version so the targeting has decreased a bit. From the map you can see that the rescored model has a much stronger positive skew than the original.

In your opinion, are the resultant non-disordered category thresholds produced by collapsing some of them too high a price to pay given the changes in the summary data in this situation?

Raschmad: Recently there were some debates about disordered thresholds and what we should do in case they occur between David Andrich and Ray Adams/Margaret Wu in the EPM Journal.
I couldn't follow their arguments. What is the final verdict, Mike? Are disordered thresholds problematic, so that we should collapse categories, or are they OK, as Adams and Wu said (I think they said; I'm not sure I understood them correctly)?


Mike.Linacre: Uve and Anthony, the central issue is "What is the purpose of the rating scale?"

1. If we want the rating-scale categories to represent unequivocal advances in the underlying variable, then we need ordered empirical categories (category average measures), ordered modeled categories (Andrich thresholds) and categories that fit.

2. If we want the rating-scale categories to increase the useful measurement information in the items, then disordered modal categories are OK - see, for instance, David Andrich at https://www.rasch.org/rmt/rmt251c.htm - but disordered empirical categories are problematic and so is category misfit.

In practice, we usually want a mixture of 1. and 2. and category functioning is somewhat imperfect. Accordingly, we must make a choice about our priorities. Which is more important? Category-level, item-level or test-level function? If test-level function is most important, then category-level function is secondary.

uve: Thanks for the links. I've read through them and as always, it will take time to digest. The quote below comes from an additional link: https://www.rasch.org/rmt/rmt202a.htm

These values [referring to 4 symmetrical item thresholds] have nothing to do with the distribution of persons, or the relative difficulties of the items - they characterize the response structure among scores, given the location of the person and the item.

This seems to provide another take on your comments. I have to wonder in what situation category-level function could ever take precedence over test-level function.

Mike.Linacre: Uve, you ask "category-level function could ever take precedence over test-level function."

Yes, this is the situation in medical use of rating-scales. Inferences for treatment are based on both overall results on the instrument and the ratings in specific categories. The specific categories of individual items may decide the treatment plan.

244. Comparing point-measure correlations

Raschmad February 14th, 2013, 7:17pm: Dear Mike,
You always advise us not to compute the average of point-measure correlations. If we want to compare the point-measure correlations of two tests, what should we do? It seems that computing means and using a t-test doesn't work.


Mike.Linacre: Anthony, since correlations are non-linear, the first step is to linearize them with Fisher z transformations.

My guess is that comparing sets of item correlations is equivalent to comparing test reliabilities. KR-20 is derived from summarizing inter-item correlations.
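
The linearizing step can be sketched as follows. This is a minimal illustration, not a Winsteps feature: the correlations in `test_a` and `test_b` are made-up values, and summarizing each test by its mean on the z scale is an assumption about how one might proceed after the transformation.

```python
import math
from statistics import mean

def fisher_z(r):
    """Fisher z transformation: maps a correlation in (-1, 1) onto a linear scale."""
    return math.atanh(r)

def inverse_fisher_z(z):
    """Map a Fisher z value back to the correlation metric."""
    return math.tanh(z)

# Hypothetical point-measure correlations for the items of two tests
test_a = [0.35, 0.42, 0.51, 0.38, 0.47]
test_b = [0.22, 0.31, 0.28, 0.40, 0.25]

# Summarize on the linear z scale, not on the raw correlations
mean_z_a = mean([fisher_z(r) for r in test_a])
mean_z_b = mean([fisher_z(r) for r in test_b])

print("mean z, test A:", round(mean_z_a, 3))
print("mean z, test B:", round(mean_z_b, 3))
print("back-transformed:", round(inverse_fisher_z(mean_z_a), 3),
      round(inverse_fisher_z(mean_z_b), 3))
```

Any further comparison (for example a t-test) would then be run on the z values rather than on the correlations themselves.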

245. Retrieving the "prn" files after RUMM analysis

a0013774 February 12th, 2013, 6:35am: Dear Mike

How do I retrieve files that are on 'prn' such as summary statistics, individual fit, item fit etc.
I have saved them in a folder, but I'm failing to open them. I need to include them in my thesis write-up. Is there a way I could save them all as PDF so that I can easily copy and crop?


Mike.Linacre: Phihlo, .prn files are usually text files. Right-click on them and open with NotePad.

a0013774: Thank you, Mike. :)

246. Creating subtests using Winsteps

a0013774 February 3rd, 2013, 10:21am: I have 9 scored items. for example LP1, LP2, CS1, CS2, CS3, D1, R1, R2, TE1. I would like to combine the following into one subtest (LP1 and LP2) and (CS1,CS2,CS3) and (R1 and R2) using Winsteps. D1 and TE1 will remain unchanged. This means I will ultimately have 5 items LP, CS, D, R, and TE.
Thanks Phihlo

Mike.Linacre: Thank you for your question, Phihlo.

Winsteps cannot combine items automatically. So here is a suggestion if your data file is already in Winsteps format:
1. Launch Winsteps
2. Analyze your 9-item file
3. Winsteps menu bar, "Output Files", Output the RFILE= to Excel
4. In Excel, combine your items into subtests
5. Winsteps menu bar, "Excel/RSSST", convert your Excel file into a new Winsteps control file
6. Analyze the new Winsteps control file.

a0013774: Dear Mike

Thank you very much for assisting. I have managed to get as far as step 3 (output to Excel).
However, I got stuck in Excel:

- How do I combine my items into subtests using Excel? e.g. combining LP1 & LP2, and CS1, CS2, CS3, CS4, and R1 & R2 (SEE EXCEL FILE ATTACHED)

- I normally type my data directly into the Winsteps control file setup (Grid), not from Excel. How do I convert the Excel file into a new Winsteps control file?


Mike.Linacre: Phihlo:
1) Please make 3 new columns (LP, CS, R) in Excel. Add LP1 and LP2 into LP; CS1, CS2, CS3, CS4 into CS; R1 and R2 into R
2) Save the Excel file.
3) In Winsteps: Winsteps menu bar, "Excel/RSSST", create the new control file
4) "Launch Winsteps"

a0013774: Dear Mike

Thank you so much for all your help.
I have now managed to convert Excel into Winsteps through the link you have attached earlier.

May I just request clarity on ONE thing



Mike.Linacre: Phihlo: do you want a subtest score? If so, please add columns LP1+LP2 into LP.
If you don't want a subtest score, what do you want?

a0013774: Morning Mike.

I am really confused now. Are you saying that if I have scores of LP1=2 and LP2=1, I should then create a column LP=3? Is this what you mean by saying "add columns LP1+LP2 into LP"?


Mike.Linacre: Phihlo, please begin again ...

You wrote: "I would like to combine the following into one subtest (LP1 and LP2)"

Do you mean:
a) I want a two item test containing LP1 and LP2
b) I want one item which contains the score on LP1+LP2
c) ????

a0013774: Thanks Mike

(b) I want one "Super Item" (LP) which contains the score on LP1 + LP2. Ultimately, in my Rasch analysis (item difficulty, bubble chart etc.), I would like to have a Super Item (LP), not LP1 & LP2

I am quoting from the RUMM 2030 Manual on creating subtests: "with subtests the original items are now regrouped.." This is what I need
N.B.: I prefer using Winsteps, not RUMM

Mike.Linacre: OK, Phihlo, we can now proceed in two ways:

A) the individual items are decisive:
1) analyze all the original items (LP1, LP2, CS1, ....)
2) obtain ability estimates for all the persons based on the original items
3) save the ability estimates: PFILE=hold.txt
4) add scores on LP1 and scores on LP2 into scores on combined super-item LP and similarly for CS, R using the Excel method described above
5) analyze the new control file with items LP, CS, R, D, TE1 and person ability estimates anchored at the old values: PAFILE=hold.txt
6) produce the output for the super-items

B) the superitems are decisive:
1) add scores on LP1 and scores on LP2 into scores on combined super-item LP and similarly for CS, R using the Excel method
2) analyze the new control file with items LP, CS, R, D, TE1.
The person estimates are obtained from the super-items.
3) produce the output for the super-items

a0013774: Mike: I am interested in A-
1. Ability estimates - are you referring to Person Measure under Output Tables?
2. "add"-are you referring to adding 'mathematically' scores obtained from LP1 and LP2?
3. Could you please unpack/elaborate on step 5 for me


Mike.Linacre: Phihlo, good, we are making progress.

1. and 3. please see the Winsteps Help File that explains anchoring of person measures using the PFILE= output file from the original analysis and the PAFILE= input file into the new analysis.

2. Yes. The scores on the superitem are the score on the subtest. We are analyzing the subtest as though it is one "partial credit" item with scored categories.
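
The adding step can also be done outside Excel. The sketch below is only an illustration of the arithmetic: the response records are invented, the grouping follows the item labels in this thread, and it assumes there are no missing responses (a missing score would need a rule of its own).

```python
# Which original items feed each super-item (labels taken from this thread)
GROUPS = {
    "LP": ["LP1", "LP2"],
    "CS": ["CS1", "CS2", "CS3"],
    "D": ["D1"],
    "R": ["R1", "R2"],
    "TE": ["TE1"],
}

def combine(person_scores, groups=GROUPS):
    """Sum the original item scores into one score per super-item."""
    return {name: sum(person_scores[item] for item in items)
            for name, items in groups.items()}

# Hypothetical scored responses, one dict per person
persons = [
    {"LP1": 2, "LP2": 1, "CS1": 1, "CS2": 0, "CS3": 1, "D1": 1, "R1": 0, "R2": 1, "TE1": 1},
    {"LP1": 0, "LP2": 1, "CS1": 1, "CS2": 1, "CS3": 1, "D1": 0, "R1": 1, "R2": 1, "TE1": 0},
]

combined = [combine(p) for p in persons]
for row in combined:
    print(row)
```

With LP1=2 and LP2=1, the first person's LP super-item score is 3, which is then analyzed as one category of a partial-credit item.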

a0013774: Dear Mike

I have tried to create subtests using Winsteps taking into account all your input BUT still found it complicated to do. Anyway, thanks a lot.

I then decided to create subtests using RUMM 2030 (which is easy). It worked, but the problem is that the threshold map for 4 out of 5 items came out disordered. CAN I CONTINUE WITH THE ANALYSIS OR NOT? What exactly does it mean?

Mike.Linacre: Phihlo: you have combined items into "testlets". In these circumstances, disordered thresholds are expected because the "rating scale" is artificial. See, for instance, https://www.rasch.org/rmt/rmt251c.htm

Please be sure that you are anchoring the person measures at their uncombined values (A above) so that the item results between the uncombined and combined analyses are directly comparable.

Also, thank you for "BUT still found it complicated to do" - I will include a tutorial in the next Winsteps Help file. Currently the explanation of how to combine items is too brief: https://www.winsteps.com/winman/testlet.htm

a0013774: Dear Mike

Thank you so much.

I am now clearer and understand better


Mike.Linacre: "BUT still found it complicated to do" -

There is now a Tutorial about combining items using Winsteps and Excel at www.winsteps.com/a/CombiningItems.pdf

a0013774: Thank you so much, Mike. I will look at it - hence I prefer Winsteps.

Question - what exactly is the acceptable range for reliability for Rasch? I know that for fit, according to Bond and Fox, it is +2, -2


Mike.Linacre: Phihlo, person "test" reliability usually needs to be .8 or more for low-stakes tests and .9 or more for high stakes tests (same as in Classical Test Theory).

a0013774: Mike, thank you!

247. Structure data in ISFILE

uve February 11th, 2013, 10:07pm: Mike,

Below is from Winsteps:

"STRU" (structure calibration) or step measure is a Rasch model parameter estimate (Rasch-Andrich thresholds), also the point at which adjacent categories are equally probable.

I ran the ISFILE hoping to find the Rasch-Andrich thresholds so I could plot them. However I see no such measures. The data below is the 2nd category in a 4 category survey:

2 2 0.05 0.06 -1.16 -0.55 -0.73
2 2 -0.01 0.05 -1.08 -0.44 -0.67
2 2 0.13 0.05 -0.97 -0.34 -0.56
2 2 0.72 0.05 -0.62 -0.1 -0.17
2 2 -0.61 0.1 -1.95 -1.43 -1.5
2 2 -0.06 0.06 -1.14 -0.55 -0.73
2 2 -0.12 0.05 -1.08 -0.42 -0.69
2 2 0.62 0.05 -0.5 0.07 -0.08
2 2 -0.34 0.06 -1.39 -0.75 -0.99
2 2 -0.12 0.06 -1.3 -0.73 -0.87
2 2 -0.99 0.07 -1.65 -0.81 -1.34
2 2 0.21 0.05 -0.9 -0.29 -0.48
2 2 0.07 0.05 -0.99 -0.39 -0.58
2 2 0.25 0.05 -0.86 -0.26 -0.44
2 2 -0.37 0.06 -1.21 -0.48 -0.85
2 2 -0.1 0.06 -1.36 -0.8 -0.92

I expected to see the thresholds in the second column, but it doesn't look like this is correct. Below is an example of item 1 from Table 3.2. Notice that the .02 threshold in Table 3.2 is not found in ISFILE.

¦ 1 1 519 21¦ -.49 -.49¦ 1.04 1.17¦¦ NONE A ¦( -1.72)¦
¦ 2 2 375 15¦ -.16 -.06¦ .73 .66¦¦ .02A¦ -.55 ¦
¦ 3 3 849 34¦ .26 .29¦ .87 .66¦¦ -.75A¦ .45 ¦
¦ 4 4 757 30¦ 1.28 1.20¦ .89 .92¦¦ .73A¦( 2.03)¦
¦MISSING 26 1¦ .40 ¦ ¦¦ ¦ ¦
OBSERVED AVERAGE is mean of measures in category. It is not a parameter estimate.

¦ 1 NONE ¦( -1.72) -INF -1.16¦ ¦ 85% 16% 1.3108¦ ¦ 1.1% 5.7 ¦
¦ 2 .05 .06 ¦ -.55 -1.16 -.07¦ -.73 ¦ 27% 54% .6068¦ .97¦ -.7% -2.5 ¦
¦ 3 -.72 .05 ¦ .45 -.07 1.25¦ -.21 ¦ 44% 71% .4986¦ 1.09¦ -1.4% -11.8 ¦
¦ 4 .76 .05 ¦( 2.03) 1.25 +INF ¦ .99 ¦ 87% 33% .9022¦ 1.24¦ 1.1% 8.6 ¦

Mike.Linacre: Uve, the .02 Andrich Threshold is relative to item difficulty of .03 on the latent variable. The .05 "Structure Measure" on the latent variable = Item Difficulty +Andrich threshold = .03 + .02 = .05. This is also shown in the ISFILE in which all values are relative to the latent variable.

Yes, terminology is confusing. Terminology has never been standardized across the Rasch world, and ambiguities abound. For instance, is a "threshold" relative to an item difficulty or relative to the latent variable?

uve: Thanks Mike. If I might, I'd like to make a suggestion. Perhaps the ISFILE Structure column could be modified to house these data. Imagine a 50-item survey with 6 responses. In a PCM run, you would be pasting 250 calculations. Then if you were comparing structures between two DIF groups, you would have to repeat.

Mike.Linacre: Uve, sorry, I am not following .....

item difficulty is .03
thresholds without item difficulties are in SFILE= (.02, -.75, .73)
thresholds with item difficulties are in ISFILE= (.05, -.72, .76)

What are the 250 calculations?

uve: Sorry Mike. I was referring to the Andrich thresholds given at the top portion of Table 3.2. Unless I am mistaken, these are the thresholds relative to the mean of the item being set at zero (relative to item difficulty) and the structure measures are the Andrich thresholds relative to the measures. If I wanted to plot the Andrich thresholds, not the structure measures, I would have to look up the item difficulty found at the top of 3.2 and subtract it from each of the structure measures to come up with the Andrich thresholds relative to item difficulty. If I had a 50-item survey with 6 categories, this would be 5 thresholds for each item in a PCM run. I would be subtracting each item's measure from each item's 5 structure measures = 250 calculations. It wouldn't take that much time in Excel, but if you're trying to compare between two or more groups in a DIF run, then it would be time consuming. So I was thinking it might perhaps be convenient to have the Andrich thresholds given in the top portion of 3.2 appear in the ISFILE, perhaps in the Structure column. Just a thought.

Mike.Linacre: Uve, the numbers you want are in the SFILE= where these are the thresholds relative to item difficulty.
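
The relationship between the two files is a simple subtraction, sketched below with the item 1 values quoted in this thread. The function name is hypothetical, and the code does not read actual SFILE= or ISFILE= output; it only shows the arithmetic.

```python
def andrich_thresholds(item_difficulty, structure_measures):
    """Thresholds relative to item difficulty = structure measure - item difficulty."""
    return [round(m - item_difficulty, 2) for m in structure_measures]

# Item 1 from the thread: difficulty .03, ISFILE= structure measures .05, -.72, .76
item_difficulty = 0.03
structure_measures = [0.05, -0.72, 0.76]

thresholds = andrich_thresholds(item_difficulty, structure_measures)
print(thresholds)
```

For item 1 this recovers the .02, -.75, .73 values that Table 3.2 and SFILE= report, so the subtraction is only needed when SFILE= is not available.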

248. Raw Variance explained by items

NothingFancy February 5th, 2013, 7:59pm: I am examining the dimensionality of a test and want to make sure I'm understanding some of the results from table 23.0

Is the Raw Variance explained by items related to the range of item difficulty? Such that a small variance here means there isn't very much coverage in the items?

I've seen for the other measure, Raw Variance explained by persons, that even if this is small, that's not necessarily bad, such as in samples of nurses at the end of a rotation, where you expect the entire sample to be of relatively high ability.

Mike.Linacre: NothingFancy, the more the dispersion of the item difficulties, the more variance in the data will be explained by the items. If the items all had the same difficulty, then they would explain none of the variance in the data.

The item difficulties are shown beneath the plot in your Table 23.2. We can see that their range is only about 1.5 logits. This is narrow. We don't expect their difficulty range to explain much variance.

NothingFancy: Does applying a Partial Credit Model modify anything I should look at?

Mike.Linacre: NothingFancy, polytomous scales (partial credit, etc.) explain variance even if all the items have the same difficulty because each scale operationalizes a range of scores.

The variance explained by the polytomous scales is partitioned between the persons and the items in proportion to the variance they would explain without the scales. This suggests that Table 23 should display a "variance explained by polytomous scales" line.

NothingFancy: I guess my confusion is the item difficulty part. The chart shows a relatively small item difficulty range, but if I look in the partial credit tables, I see a much larger spread of item difficulties (structure calibration, category measure, etc.)

Mike.Linacre: Yes, Nothing Fancy. The partitioning of variance depends on how the data are conceptualized. For instance, is the rating scale an attribute of the item (so that each item becomes its own testlet) or attributed to the "style" of the person? We also have the conundrum that changing the definition of "item difficulty" changes the partitioning of the variance.

249. Misfit

LyndaCochrane February 10th, 2013, 5:18pm: I'm a Rasch novice so please forgive me if my question is stupid. My data are taken from an interview process with 10 different stations each contributing to an overall score 0 to 100. When I ran the Rasch analysis, more than 75% of the person infit and outfit mean squares are > 2. Does this indicate the stations are not assessing a unique attribute? That there are disparate groups (e.g. due to ethnicity, experience)? That we need to redesign the stations?

Mike.Linacre: Thank you for your question, Lynda.
"more than 75% of the person infit and outfit mean squares are > 2."
Something has gone wrong with the analysis, Lynda. The average mean-squares (bottom of the Tables) are usually forced to be near 1.0. It sounds like yours are much higher.

If you are using Facets, please look at Table 8. Do the reported counts in each category match what you expect?
If you are using Winsteps, please look at Table 14.3. Do the reported counts in each category match what you expect?

Please look at the Convergence Table, are the numbers at the bottom very small?

LyndaCochrane: Your timely and detailed reply are greatly appreciated, Mike. I've decided I need to start at Square 1 with this and am now working through your tutorials. Thank you for your help.

250. Anchoring

Gloria February 3rd, 2013, 9:03am: Hello there
I applied the same Maths and Reading tests to a group of students at the beginning and end of the school year, and want to compare the results.
For the reading test (dichotomous items), I anchored the final test items using the measure of the entrance exam items.
But since the Math test items have partial credit, I am not sure if I should do the same.
Any help or comment?
Thanks in advance,

Mike.Linacre: Gloria, yes, please do anchor the partial-credit items so that the students' measurements can be compared.
In Winsteps, anchoring dichotomous items requires their measures to be in IAFILE= (which is the IFILE= of the beginning tests)
Anchoring partial-credit items requires their measures to be in IAFILE= (which is the IFILE= of the beginning tests) and SAFILE= (which is the SFILE= of the beginning tests)

Gloria: Hi Mike
Thanks so much for your prompt reply and your priceless help.
I checked the help menu (help SAFILE=) and found Example 5 really helpful. Applying this example to my Math test, I am planning to include this in my control file (the example only uses 6 items to keep it simple; my test actually has more items), but I still have some doubts that I detail below:


p1 -.51
p2 -.22
p3 -1.49
p4 -1.27
p5 -.86

p1 1 -1.51
p1 2 -.73

p2 1 -2.53
p2 2 -1.85

But my doubts are:
(1) The Winsteps Example 5 (SAFILE= help) confused me a bit, because in that example only SAFILE= is used to anchor a partial-credit item. Since you mentioned in your previous note that I have to use both IAFILE= and SAFILE= to anchor partial-credit items, I'd appreciate it if you could let me know whether I am on the right track in the example described above.
(2) The measures of categories 1 & 2 for items 1 & 2 came from Winsteps Table 3.2 (Structure measure), but there is no value for category 0. Should I anchor category zero at zero in the SAFILE=? (I saw some examples in the help menu that did this, but it is not clear to me why.)

Thanks in advance for your help

Mike.Linacre: Gloria, you are doing well :-)

https://www.winsteps.com/winman/index.htm?iafile.htm Example 5 is clearer.

1. Please use the item entry numbers, not the item labels. 1 instead of p1.

2. the SAFILE= values are relative to the IAFILE= values, so usually IAFILE= are required (unless the item difficulties are being re-estimated).

3. the lowest category is anchored at 0 as a place-holder, so that Winsteps knows that it is part of the anchored partial credit structure.

1 -.51
2 -.22
3 -1.49
4 -1.27
5 -.86

1 0 0
1 1 -1.51
1 2 -.73

2 0 0
2 1 -2.53
2 2 -1.85

Gloria: Hi Mike
Thanks so much for all your priceless help.
I was discussing with a friend who told me that, after anchoring, some items might misfit (it seems that happened to his data but luckily not to mine). It is not clear for me why the items might misfit after anchoring, but if this happens, should I leave these items without anchoring? should I delete them?
Thanks again

Mike.Linacre: Gloria, item difficulties are always optimized to the original data. We expect the fit to be worse when we use anchor values with new data.

If anchor values produce very big misfit, then we must think about why we are anchoring, and why the misfit has become large. There is no one answer about what to do.

If the misfit has become large, because the item has become bad, then we delete it.
If the misfit has become large, because the item difficulty has changed ("drifted"), then we unanchor the item.
If the misfit has become large, because the new sample of persons are misbehaving (for instance, guessing), then we leave the anchor value unchanged.
If the misfit has become large, because ...... (and so on)

Gloria: Got it ! Just a final question, when you say "very big misfit" , you mean mnsqINFIT less than .5 and mnsqOUTFIT greater than 2?
Thanks again

Mike.Linacre: Gloria: mean-squares less than 1.0 are no problem with anchoring (and rarely with anything else). They indicate the data fit too well. They indicate redundancy in the items, but not mismeasurement.

Mean-squares greater than 2.0 indicate that there is more unmodeled noise than useful information in the responses to the item.

Gloria: Thanks so much Mike!

251. index development based on rasch

mizzary February 8th, 2013, 10:29pm: Dear Mike, really need your assistance as I am new in this field. ;D
Let's say that I have done the validation process 3 times and have managed to revise the whole instrument to the extent that it reaches the accepted level. I need to interpret the results for the public, and my supervisor needs them to be in a simple index like the ones we normally see.
1. Can you please give me some guidance on the next steps I need to take to reach the single index, or can you please recommend some readings on this topic based on Rasch? (happiness index; human development index)
2. I am not familiar with programming, but I have asked a friend to digitize the instrument. Unfortunately, he is not sure how to incorporate Rasch analysis in the whole process. Thus, we are stuck and have reverted to using scores and averages :'(. I am seriously in a very sticky situation... please please help
Thank you so much in advance for your time and support ;)

Mike.Linacre: Congratulations, Mizzary, on your progress.

1. Index: a good range is 0-100.
Please see Winsteps Table 20. It has the values of UMEAN= and USCALE= so that your instrument will report in the range of 0-100
Example: TO SET MEASURE RANGE AS 0-100, UMEAN=46.5329 USCALE=7.3389
The definition of your index depends on the content of your items. It is helpful to your public if you can produce a picture like: www.winsteps.com/a/Linacre-measure.pdf page 4.

2. Digitizing? Perhaps all your friend needs is a score to measure table. This is Winsteps Table 20.
If your friend needs to estimate Rasch measures, then https://www.rasch.org/rmt/rmt102t.htm or https://www.rasch.org/rmt/rmt122q.htm
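
For the 0-100 index in point 1, the rescaling can be sketched as below. It assumes the usual linear relationship, user-scaled measure = UMEAN + USCALE × logit measure, and uses the UMEAN= and USCALE= values from the Table 20 example above; the function names are hypothetical and not part of Winsteps.

```python
# Values from the Table 20 example line quoted above
UMEAN = 46.5329   # user-scale value of the logit origin (assumed meaning)
USCALE = 7.3389   # user-scale units per logit (assumed meaning)

def to_index(logit_measure):
    """Convert a logit measure to the rescaled 0-100 index."""
    return UMEAN + USCALE * logit_measure

def to_logits(index_value):
    """Convert a value on the rescaled index back to logits."""
    return (index_value - UMEAN) / USCALE

for logit in (-2.0, 0.0, 1.5):
    print(f"{logit:+.1f} logits -> index {to_index(logit):.1f}")

print("index 0 and 100 correspond to logits:",
      round(to_logits(0), 2), round(to_logits(100), 2))
```

A digitized instrument that only needs to report the index could store the score-to-measure table from Table 20 and apply this conversion, without re-estimating anything.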

252. Guessing and Fit

uve February 6th, 2013, 5:44pm: Mike,

I found a rather puzzling but interesting oddity with regard to fit for an item. As you can see below, the average ability for the students choosing the correct answer is lower than those choosing the most common incorrect answer. I exported the XFILE and discovered that 84 of the 161 who chose the correct option had standardized residuals of 2.00 or higher. Expected probabilities for a correct choice for them ranged from 3-19%. Their average measure was -.25 with a range from -1.33 to .51. My initial reaction was that these 84 lower performing students were able to guess the item correctly. However, when I looked at the lower asymptote, the data did not confirm this. In fact the upper asymptote seems to suggest there was significant carelessness occurring but little guessing. Why would this be?

¦ A C 0 ¦ 85 12 ¦ .02 .08 .7 -.19 ¦
¦ B 0 ¦ 142 20 ¦ .32 .08 1.1 -.09 ¦
¦ A 0 ¦ 319 45 ¦ .65 .05 1.4 .16 ¦
¦ D 1 ¦ 161 23 ¦ .55* .08 1.8 .04 ¦

¦ 161 712 1.92 .10¦1.25 2.0¦1.63 2.6¦A .04 .35¦ 75.7 78.6¦ .57¦ .09 .54¦

Mike.Linacre: Uve, when we see results like this, our first suspicion is "incorrect scoring key" or "badly written item". The next suspicion is "false friend": students answering the question without fully understanding its implications. This causes some of the more careful, but less able students to succeed while high-ability students breeze through incorrectly.

253. Misfits Statistics and assumption of Normality

marlon February 1st, 2013, 3:12pm: Dear Prof. Linacre, Dear Rasch Users,

I would like to ask about your opinion in the problem of misfit statistics in the Rasch Model for 0-1 items.

As stated in Wright & Stone (1979) and Bond & Fox (2001), the Rasch Outfit and Infit measures are computed on the basis of standardized residuals. The residual is the difference between the observed score and the expected score. Let us say it is:
y = x - E(x)
where x: observed score; E(x): expected score (derived from model parameters)
This residual y can be standardized (as is known: E(y)=0 and Std.Dev=sqrt[E(x)*(1-E(x))]).

Finally we can state that the standardized residual of the outcome is:
z = (x - E(x)) / sqrt[E(x)*(1-E(x))]

It is stated in one of the above-mentioned books that if our data fit the Rasch model these standardized residuals should be normally distributed.

As a further step all these z's are squared and summed and then considered as a chi-square distribution while producing OUTFIT and INFIT measures.

So, my questions are:
1. It seems that normality is an absolutely necessary assumption for OUTFIT and INFIT. Am I right?
2. perhaps, before proceeding to OUTFIT INFIT statistics we should check the normality of the residuals?
3. How to proceed when we had to reject the hypothesis of the normality of residuals?

Thank you for your precious help and answers in advance!

Mike.Linacre: Marlon, when the data fit a Rasch model perfectly, then the standardized residuals are expected to be normally distributed. Of course, empirical data never fit a Rasch model perfectly, so it is unlikely that the standardized residuals are normally distributed. The reason for computing the INFIT and OUTFIT statistics is to identify where the data depart from their modeled ideal form.

So, the usual process is:
1. Compute the maximum likelihood (or similar) estimates
2. Compute the INFIT and OUTFIT statistics
3. Discover that the values of the INFIT and OUTFIT statistics indicate that the standardized residuals are not normally distributed everywhere in the data.
4. Use the values of the INFIT and OUTFIT statistics to assist with the diagnosis (and hopefully remediation) of the deficiencies in the data.

A global test of "normality of standardized residuals" can be performed. We expect it to reject the hypothesis that they are normally distributed, but the test is unlikely to provide the diagnostic information provided by INFIT and OUTFIT.
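
Step 2 of the process above can be sketched for one dichotomous item. This is a minimal illustration using the conventional definitions (OUTFIT mean-square as the average squared standardized residual, INFIT mean-square as the sum of squared residuals divided by the sum of model variances); the abilities and responses are invented, and the estimates are taken as given, not computed.

```python
import math

def expected_score(theta, b):
    """Rasch probability of success for ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    sum_z2 = 0.0        # sum of squared standardized residuals
    sum_resid2 = 0.0    # sum of squared raw residuals
    sum_var = 0.0       # sum of model variances
    for x, theta in zip(responses, abilities):
        p = expected_score(theta, difficulty)
        var = p * (1.0 - p)
        resid = x - p
        sum_z2 += resid ** 2 / var
        sum_resid2 += resid ** 2
        sum_var += var
    outfit = sum_z2 / len(responses)
    infit = sum_resid2 / sum_var
    return infit, outfit

# Hypothetical data: the last, low-ability person unexpectedly succeeds
abilities = [2.0, 1.0, 0.0, -1.0, -3.0]
responses = [1, 1, 0, 0, 1]

infit, outfit = item_fit(responses, abilities, difficulty=0.0)
print(f"infit = {infit:.2f}, outfit = {outfit:.2f}")
```

The single surprising success far from the item's difficulty inflates the outfit mean-square much more than the infit mean-square, which is the kind of localized diagnosis a global normality test would not give.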

OK, Marlon?

marlon: Mike,
Thank you very much for your reply. I still wonder: do we have to assume that the residuals are normal or not?

Also, how should we proceed if a test of normality (let us say, Kolmogorov-Smirnov or Shapiro-Wilk) says that the residuals are not normal?

Thank you for your interest in the problem.


Mike.Linacre: Marlon, the maximum likelihood estimation used by most Rasch software assumes that the randomness in the data is normally distributed in order to produce its estimates.

But we know that the randomness in empirical data is never exactly normally distributed, so the MLE estimates will be wrong. But how severely wrong? INFIT and OUTFIT tell us. If the distortion in the estimates is small, then we can usually ignore it. If the distortion is large, then we may take remedial action.

A global test of normality is not meaningful for Rasch data. We need to test the normality of the data for each parameter separately. Noise in Rasch data is often localized to a few parameters (such as students who guess, or items with incorrect scoring keys.)

marlon: Dear Mike,
Thank you very much for your answer. It is clear and convincing.

I'm thinking of relying on INFIT and OUTFIT, but there are a number of opinions in the literature criticizing them for being too sensitive to sample size. My settings are: 70,000 cases and 18 items with the same rating scale.
Do you think I should adopt any special critical values for those statistics in my case?
In one of the papers* I read, I found the following formulas proposed by B. Wright for computing critical values for outfit and infit, given the sample size:

Unfortunately, I could not find any paper with a justification for them. Maybe someone knows the argument and could tell me why these numbers?

Thank you in advance,

*Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using Item Mean Squares to Evaluate Fit to the Rasch Model. Journal of Outcome Measurement, 2(1), 66-78

Mike.Linacre: Marlon, there are two versions of each INFIT and OUTFIT statistic: the mean-square and the standardized.
The relationship between them is shown below. The curved lines are different values of the mean-squares. The standardized statistic (y-axis) is highly sample-size dependent (x-axis). The mean-square is effectively sample-size independent. The functional ranges of the mean-squares are:

Mean-Square Range   Interpretation of parameter-level mean-square fit statistics
>2.0                Distorts or degrades the measurement system.
1.5 - 2.0           Unproductive for construction of measurement, but not degrading.
0.5 - 1.5           Productive for measurement.
<0.5                Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.

254. 2nd question

eagledanny February 3rd, 2013, 2:13pm: Dear Prof. Linacre,
I have one more question. There are two groups of 17 raters, and I want to compare the consistency at the group level. Can I perform T-test on raters' Infit Mnsq directly?

Mike.Linacre: Danny, what hypothesis are you testing? The standardized residuals in the Facets Residualfile= may be what are needed.

eagledanny: Dear Prof.Linacre,
My hypothesis is that there is no significant difference between Group 1 and Group 2 on the consistency(Measured by Infit Mnsq) on their ratings for 14 essays. Then can I use T-test?

I am sorry, I can't really understand your advice. In the output file, I cannot find the Residualfile= which you mentioned. Could you please tell me where I can find it?

Thank you so much.

Mike.Linacre: Danny, is this your hypothesis? "The average mean-square of Group 1 = The average mean-square of Group 2"

Mean-squares are non-linear, so linearize the mean-squares by taking their logarithms. Then an "Unequal sample sizes, unequal variances" t-test could be used: http://en.wikipedia.org/wiki/Student%27s_t-test
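
A minimal sketch of that procedure, using only the Python standard library: the mean-square values are invented (and fewer than the 17 raters per group in the question), and the `welch_t` helper is a hand-written illustration of the unequal-variances t statistic with its approximate degrees of freedom, not a Facets output.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's (unequal variances) t statistic and approximate degrees of freedom."""
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical rater infit mean-squares for the two groups
group1_mnsq = [0.85, 1.10, 0.95, 1.30, 0.78, 1.05]
group2_mnsq = [1.20, 1.45, 0.90, 1.60, 1.15, 1.35]

# Linearize the mean-squares by taking logarithms before testing
log1 = [math.log(m) for m in group1_mnsq]
log2 = [math.log(m) for m in group2_mnsq]

t, df = welch_t(log1, log2)
print(f"t = {t:.2f} with about {df:.1f} degrees of freedom")
```

The t value would then be referred to a t distribution with the reported degrees of freedom in the usual way.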

eagledanny: Dear Prof. Linacre,
Thank you so much for your help!
Best regards,

255. Are item difficulties different from Angoff ratings?

Seanswf January 31st, 2013, 1:28pm: Hi,
I have an interesting problem. I am conducting an evaluation of a training course. We developed a multiple-choice test to measure learning. I have Angoff ratings for every item we developed. We want to know if the population who took the test achieved at least the Angoff rating on each question. For example, if the Angoff rating on an item was 80%, then we would like to see that the p-value of that item is at least 80%.
My problem is determining when the p-value is lower than the Angoff rating on an item. When is it low enough for us to determine there is an issue that needs to be addressed? For example if the Angoff is 80% and the p-value is 78% then that may not be a problem but if the p-value is 60% then it may be a problem.
Is there a way using Rasch to determine a confidence interval around the average performance on an item to see if the Angoff rating is within that range?

Thanks for your help!

Mike.Linacre: This is a challenge, Seanswf :-)

It would seem that a standard statistical interval is not suitable here, because "the bigger the sample, the smaller the interval". Rather, we need a substantively-based interval. For instance, in many academic situations, 1 logit corresponds to roughly one year's growth. We could say "if the group is more than 3 months behind, then they are in trouble". This would correspond to 0.25 logits.

So then we would surmise:
item p-value of 80% = 1.4 logits relative to the item difficulty
behind = -.25 logits = 1.15 logits relative to the item difficulty = 76% p-value.
So that p-values of 75% or below for that item are problematic.

Seanswf: Thanks Mike,
I understand your point about why the statistical interval would not be appropriate. But I have a few questions on your summary.

So then we would surmise:
item p-value of 80% = 1.4 logits relative to the item difficulty

Where did you get 80% = 1.4 logits? How can I get this logit value for all my other Angoff ratings?

behind = -.25 logits = 1.15 logits relative to the item difficulty = 76% p-value.

-.25 sounds fair to me, but still somewhat arbitrary which is my problem, I need to convince stakeholders that this threshold is appropriate. why not -.26 or -.30?

How do you then convert 1.15 logits back to 76%?

Thanks again for your help.

Mike.Linacre: Seanswf,

It is easier to do these calculations in linear logits rather than non-linear p-values. So, in logits, the baseline is a p-value of 50% = 0 logits.

A p-value of x% = ln ( x% / (100% - x%) )

For a p-value of 80%, ln ( 80% / (100% - 80%) ) = 1.4 logits.

Backwards for y logits, p-value = 100 * exp(y) / (1 +exp(y))
For 1.15 logits, 100 *exp(1.15) / (1 + exp(1.15)) = 76%
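
These two conversions can be applied to every Angoff rating at once. The sketch below is an illustration only: the function names are hypothetical, the 0.25-logit tolerance is the rough "3 months behind" figure discussed above, and the list of Angoff ratings is invented.

```python
import math

def p_to_logit(p_percent):
    """Convert a p-value in percent to logits: ln(p / (100 - p))."""
    return math.log(p_percent / (100.0 - p_percent))

def logit_to_p(y):
    """Convert logits back to a p-value in percent."""
    return 100.0 * math.exp(y) / (1.0 + math.exp(y))

TOLERANCE = 0.25  # logits; the substantive allowance discussed above

def lowest_acceptable_p(angoff_percent, tolerance=TOLERANCE):
    """P-value (percent) lying `tolerance` logits below the Angoff rating."""
    return logit_to_p(p_to_logit(angoff_percent) - tolerance)

# Hypothetical Angoff ratings for a few items
for angoff in (60, 70, 80, 90):
    print(f"Angoff {angoff}% -> flag p-values below {lowest_acceptable_p(angoff):.1f}%")
```

Because the 80% rating is not rounded to 1.4 logits first, the cutoff for that item comes out slightly below the 76% quoted above; the difference is only rounding.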

You wrote: "-.25 sounds fair to me, but still somewhat arbitrary which is my problem"

Reply: That is true. The "one logit = one year" is a rough approximation, that may be a long way out in your situation. However, statistics are like that. Why are significance levels set at .050 instead of .053 or .048? The answer is that Ronald Fisher was counting fields of potatoes. 1 in 20 is easier to communicate than 1 in 19 or 1 in 21.

Seanswf: Thanks Mike - you are wise.

256. Ask for help with my specification

eagledanny February 3rd, 2013, 2:08pm: Dear Prof. Linacre,
I have to bother you again, please help me with my specification. Thanks a million.
This time, I have invited 19 raters to score 14 essays with a 15-level holistic scale. I want to anchor the severity measure of the expert rater at 0 (because the expert is treated as the norm). I wrote a specification, but I am not sure if it is correct. Could you please help me check it?
Many thanks

Mike.Linacre: Danny, your anchoring looks correct, but here are two other changes:

Non-centered = 2 ; one facet must be unanchored and non-centered

Model = ?,?,MyScale
Rating scale = MyScale,R15 ; The rating scale name must match the Model= name
0 = lowest
15 = highest

eagledanny: It does work. Thank you so much.

257. Multiple Choice Distractor Analysis

NothingFancy January 30th, 2013, 5:22pm: I have a multiple choice test and would like to see how the distractors are operating. I want to make sure I go about it in the most appropriate way. I've brought in the data with the item responses (A,B,C,D) and am using the KEY= option to score them.

The exam is scored in a dichotomous manner, but is there an appropriate option for looking at the distractors, such as the Diagnosis / Empirical Item-Category Measures or Graphs / Empirical Option Curves?

Are these appropriate to use if you aren't using a Rating Scale or Partial Credit analysis?

Mike.Linacre: NothingFancy, yes those are appropriate for investigating options/distractors. Start with Winsteps Table 14.3.
The empirical curves are useful if you have a very large sample, otherwise they tend to be too fragmented.

dachengruoque: Mike.Linacre wrote: "The empirical curves are useful if you have a very large sample, otherwise they tend to be too fragmented."

Very large sample size? How large is it? Is there any rule of thumb about it? Thanks a lot, Dr Linacre!

Mike.Linacre: Dachengruoque, let's think about this.
We would want empirical response curves that are plotted from at least 10 points along the latent variable, and each point represents at least 20 people = 200 people for each distractor curve.
There are usually 4 or 5 curves for each item, so that implies at least 1000 persons responding to each item.
But those 1000 will not be equally distributed across the distractors, or across ability levels, so, for meaningful distractor curves, we need samples of thousands of persons.
Of course, smaller samples may produce informative empirical curves, but the smaller the sample, the more influential will be the "luck of the draw".
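The arithmetic behind this rule of thumb can be written out as a small sketch. The function name and its defaults are hypothetical; the numbers are the ones given in this post, and the result is a floor rather than a recommendation, since responses are never spread evenly.

```python
def minimum_sample(points_per_curve: int = 10,
                   persons_per_point: int = 20,
                   n_curves: int = 5) -> int:
    """Smallest sample if responses were spread perfectly evenly
    across distractor curves and across points on the latent variable."""
    return points_per_curve * persons_per_point * n_curves

for n_curves in (4, 5):
    print(n_curves, "curves:", minimum_sample(n_curves=n_curves), "persons at the very least")
```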

dachengruoque: I see. Thanks a lot, Dr Linacre!

258. Guidance on how to do an item analysis using Rasch

dewmott February 1st, 2013, 12:09am: There is a Rasch Measurement group on LinkedIn. The following was an initial post to the group:

"Does anyone have any guidance on how to do an item analysis using Rasch?

"I work with a lot of educational assessments of low to moderate stakes. We use CTT stats like P-value and item total correlations to guide item revisions. When taking a purely Rasch approach to test development should you look only at fit statistics or include P-value and item-measure correlations. Sometimes I notice that an item may fit the Rasch model but may not have great CTT statistics and vice versa. Any references would be appreciated."

There are now 18 postings. Some are good; some are nonsense. Perhaps someone in this group who is a LinkedIn member may want to add their two cents. I have already put in mine.

David Mott

259. Effects of rating scale type on person separation

kbutkas January 29th, 2013, 6:06pm: Hi,

I am a research specialist working on a Rasch rating scale instrument that measures teachers' self-perceived abilities in incorporating technology into their teaching practice. I am looking for any possible references on the effect of different types of Likert scales (e.g., ones that use a disagree to agree continuum versus ones that use a novice to expert continuum) on person separation. We have found that the typical 5 or 6 point Likert scale, ranging from strongly disagree to strongly agree, tends to result in a highly skewed distribution of responses toward the "Agree" side even after we have added more difficult items. While category combinations help improve rating scale function, we are contemplating the possible improvements in person separation that could come from changing the scale as well. Does anyone have any sources on this matter they could point me toward or any practical advice?

Mike.Linacre: Katrina, this is a substantive problem, rather than a statistical problem. We need a set of rating scale categories that will strongly discriminate between different levels of "self-perceived abilities". As you have discovered, it is unlikely that a Likert Agreement scale will do this, because of the natural human proclivity to be agreeable. Perhaps a frequency scale will work better: never, sometimes, often, always. Or an intensity scale: none, a little, some, a lot, complete.

kbutkas: Thank you! Your response confirms our latest idea which is to go with a scale of novice to expert.

260. Partial credit for multiple-choice

uve January 29th, 2013, 9:03pm: Mike,

I am exploring the possibility of giving partial credit to certain items where a distractor may exhibit properties of a partially correct answer. What I want to be able to do is to have the multiple choice items that do not have this characteristic remain scored 1 right, 0 wrong, but have certain items be scored such that the correct answer is 2, the partially correct answer is 1 and the other two distractors are 0.

Below is an example of what I have but am getting stuck on how to score the two wrong distractors 0.

Currently, below is how we are combining MC and CR items. That is, for item 1, if the student chooses A, then it is scored 1, 0 otherwise. However, item 9 is a CR item which receives a 1 if A is used, 2 for B, 3 C, 4 D and 5 for E.

I thought I could modify this somewhat to work with the partially correct MC items scenario I just mentioned. The problem is that while A might be a partially correct distractor scored 1 for one item, with the other two incorrect distractors scored zero, for another item the partially correct distractor might be C, with the remaining two incorrect distractors scored 0.

First, I want to make sure that my MC and CR combination methodology works.

Second, I'm trying to find the best way to modify it to work with partially correct MC items, or perhaps do something completely different.

KEY1 = ABCDABCDABC ; KEY1 states that ABCDABCDABC are correct answers for MC items
KEY2 = ********B** ; Key2 states that B is an alternative correct answer for CR questions
KEY3 = ********C** ; Key3 states that C is an alternative correct answer for CR questions
KEY4 = ********D** ; Key4 states that D is an alternative correct answer for CR questions
KEY5 = ********E** ; Key5 states that E is an alternative correct answer for CR questions
KEYSCR = 12345 ; scores MC questions as 1 point, and CR questions as A=1, B=2, C=3, D=4, E=5

Mike.Linacre: That looks fine, Uve. You may find that using IREFER= and IVALUE= is simpler when the scoring keys become complicated. Here it is for your example:
IVALUE5=12345 ; A=1, B=2, C=3, D=4, E=5
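The underlying idea, that each item refers to a scoring group and each group maps response codes to scores, can be sketched outside Winsteps. This is a Python illustration only, not Winsteps syntax; the group names and the three-item test are hypothetical.

```python
# Hypothetical scoring groups: each maps a response code to a score.
# Any response code not listed in a group scores 0.
SCORING_GROUPS = {
    "MC_key_A": {"A": 1},                      # ordinary MC item, key A
    "MC_key_B_partial_C": {"B": 2, "C": 1},    # key B = 2, partially correct C = 1
    "CR": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5},
}

def score_responses(responses, item_groups):
    """Score one person's response string, one character per item."""
    if len(responses) != len(item_groups):
        raise ValueError("need exactly one scoring group per response")
    return [SCORING_GROUPS[group].get(code, 0)
            for code, group in zip(responses, item_groups)]

# Hypothetical three-item test: plain MC, partial-credit MC, constructed response.
item_groups = ["MC_key_A", "MC_key_B_partial_C", "CR"]
print(score_responses("ACD", item_groups))
```

Because the partially correct option is defined per group, one item can give partial credit to A and another to C without the keys interfering with each other.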

261. Guttman Scaling

John_D January 28th, 2013, 11:17pm: Dear Mike,

Thank you again for helping me with the anchoring questions. I was also wondering if Winsteps provides Guttman's index of reproducibility and index of consistency coefficients? I see that output Table 22 provides "Guttman Scalogram of 'Zoned' and 'Original' Responses:"; however, I do not see either index.

John D

Mike.Linacre: Sorry, John. Guttman indices are not reported by Winsteps. I considered them, but then did a simulation study: https://www.rasch.org/rmt/rmt142e.htm - the finding was that the Guttman Indices are not productive in a Rasch context.

262. How to anchor an item

John_D January 25th, 2013, 5:18pm: Dear Mike,

I am new to Rasch modelling and WINSTEPS. I am trying to anchor the first item and set the difficulty of that item to 0 (it is the easiest item in the scale) in a series of 5 items. I have read through the "Many-Facet Rasch Measurement : Facets Tutorial" but have still not been able to figure out how to anchor this item. What are the specific steps to anchor an item for a data set uploaded from SPSS?

John D

Mike.Linacre: John, if you are using Winsteps:
IAFILE=*
1 0
*

If you are using Facets:
1, Items, A ; your item facet
1= Item one, 0

John_D: Mike,

Thank you. However, I'm still getting an error message. Using Winsteps, where do I put:

IAFILE=*
1 0
*

I've tried in IAFILE= after opening and running my data, but to no avail.

Mike.Linacre: John D.

There are two ways in Winsteps.

1. Edit your Winsteps control file (top of Edit menu)
2. At the Extra Specifications prompt:
IAFILE=* 1,0 *

These happen after selecting your control file, but before running your data.

John_D: Dear Mike,

Thank you! Very helpful. I was able to anchor the first item and run the program.

Warmest regards,
John D

263. Choosing a cut point

uve January 22nd, 2013, 8:33pm: Mike,

I am trying to make a better distinction in my mind between the comments from Winsteps help:

Most standard setting is based on "average" or "frequency" considerations. For instance, "If we observed 1000 candidates whose measures are known to be exactly at the pass-fail point, ...

..., we would expect their average score to be the pass-fail score." If this is how you think, then the Table you want is Table 2.2 (matches Table 12.5)

..., we would expect 50% to pass and 50% to fail the pass-fail score." If this is how you think, then the Table you want is Table 2.3 (matches 12.6)

..., we would expect more to be in the criterion pass-fail category of each item than any other category." If this is how you think, then the Table you want is Table 2.1 (no matching 12.)

My interpretations of Table 2.0 based on a recent survey focusing on item 14 attached:

1) For the first figure, Table 2.11, the most probable response between, roughly, logit 0 and .75 would be category 3. At about .75 we would see an equal probability of both options 3 and 4. Using the graphing function (not shown), this equates to about a 38% probability for both adjacent categories. Past this and the most probable response would be 4.

2) The second figure, Table 2.13, at about logit 1.0 there is a 50% probability of being observed in category 4 or any below.

3) For the third figure, Table 2.12, if I were to randomly sample from all respondents who score a logit at about 2.0, I would expect the average category chosen to be 4 and if I were to sample at about 1.3, I would expect to be in the transition point between 3 and 4, or an average category value of 3.5.

So now the tough part is choosing which figure best represents the cut score. Figure 2.13 seems fairly straightforward, but I'm having trouble distinguishing conceptually between the transition point of 2.12 and the equal adjacent point of 2.11. I am looking for the minimum point at which a respondent would exhibit the first signs of proficiency, which I believe likely to be category 4 for item 14.

What would you suggest?

Mike.Linacre: Uve, your inferences are correct except for 3)

The "mean" figure, Table 2.12, says almost nothing about the category chosen. For instance, consider a rating scale for which there were observations of categories 1,3,4 but not of 2. Then, in your example, "Most Probable" would not change (category 2 is never Most Probable). "50% Cumulative" would not show "2" because it has zero probability. "Expected Score" would show 2, halfway between 1 and 3 as the average of the sample ascends.

If you are looking for performance by an individual on a particular item, then it is defined either by a pass-fail category, or by a pass-fail "half" of the rating scale. If you are looking at performance by a sample or on a test, then the "MEAN" is what you want.

In educational settings, there is rarely huge interest in individual items for individual students, but in medical settings, there is often great interest in individual items for individual patients.

uve: Mike,

Your example for #3 makes sense.

:-/ Why then your statement:

we would expect their average score to be the pass-fail score

So I thought that if 2.12 shows option 4 at logit 1, and we sampled all persons who have a logit value of 1, we expect their average score to be 4 for the item in question, the pass-fail score. Or if we wanted to use the transition point, and that was logit 0, then we would expect the average score to be 3.5 for all persons sample who had an ability level of 0.

I must have some meanings switched around.

Mike.Linacre: Uve, aren't we saying the same thing? The "mean" plot shows the expected average score on each item for every ability level.

uve: Yes, I think we are. It's just the comment below threw me a bit:

The "mean" figure, Table 2.12, says almost nothing about the category chosen.

So, with multiple-choice summative semi-high-stakes exams in public K-12 education, which method do you typically see used to set cut points?

Mike.Linacre: Uve, for dichotomous items, such as most MCQ, mean, median and mode are the same. They are different for polytomous items.

Often for dichotomous items, the original pass-fail point is set at a certain raw score on the test. This standard can then be converted into a logit value after the first test is administered. The logit standard is then maintained on subsequent tests by equating/linking the tests.

uve: Mike,

:B My apologies. I was confusing two of many different projects!

Let me rephrase. Based on our discussion so far, what do you see used primarily to set cut points using selected response instruments in the public K-12 environment? This one in particular is measuring liking for the arts among grades 2-6. It reminds me somewhat of the Liking for Science instrument you often refer to.

I want to be able to set a "proficiency" point beyond which we would interpret a student as having sufficient liking for arts who would not require extra encouragement or intervention but enrichment. Those below the threshold would be tagged as perhaps needing mentoring or the utilizing of a different approach by the teacher towards confidence building or developing stronger connections to the visual and performing arts. The theory is that students who have these stronger connections in a classroom where the teacher effectively uses art in lesson delivery for English content and standards will enhance student learning and proficiency of English. The survey addresses part of this by ascertaining student opinion about art in various forms. If a student measures below a cut point threshold, there would likely need to be a change in strategy by the teacher to make the arts connection effective for the student.

The Rasch polytomous model provides several different methodologies for selecting the point, each of which radically changes the interpretation, in my opinion. So it would be helpful for me to know if there is some common method used in a situation like this from which I could start.

Thanks again as always.

Mike.Linacre: Uve, practical standard-setting is outside my expertise. Please contact Gregory Stone - http://www.utoledo.edu/eduhshs/depts/efl/faculty/stone/index.html

264. Resits/retakes

miet1602 January 23rd, 2013, 12:51pm: Hi,

I am returning to my operational data, from which I am trying to figure out how 'reliable' the raters are, and have encountered the issue of how to deal with the data from candidates who retake an exam.

My data currently include some candidates who have retaken the exam on at least one occasion - the exam is on-demand. It is also quite possible that the first time they take the exam they do it when they are not sufficiently prepared, just to have a go, and so get a 0 or other very low result. Then the next time they take it, they might even get a max mark.

If I am interested in assessor behaviour (rather than candidate ability), surely this will distort the results, as different assessors might be marking the same candidate on these several occasions, and sometimes the same assessor will mark on both occasions. So one rater might look extremely severe giving the candidate a 0, whereas the next time, the same candidate will get the max mark from another rater (or sometimes the same rater marking on both occasions might appear to be erratic).

Generally, candidates will retake the exam doing a different version of the test, but sometimes they might even get the same version (totally aware of implications this has for the world, universe and everything assessment-wise, but assessment standards are not the strong point of vocational assessment in England).

Is there a way to get around this problem of resits distorting the rater behaviour parameters in Facets analysis while still retaining the data on resits?

Thanks for any advice!

Mike.Linacre: Thank you for this question, Milja. It sounds like Resits have become new examinees from the perspective of Facets. Please assign them new element numbers. Also, please specify a "Resit" facet with elements 1=first time, 2= resit, so that you can investigate resits separately from first timers.

miet1602: Hi Mike,

Thanks very much for your advice!

266. Different Cheating Criteria

uve January 23rd, 2013, 12:15am: Mike,

I’m revisiting another issue, but from a different angle. I recently ran the person paired agreement Table 35. For one pair the TWOLOW was 12. The test was a 47-item multiple-choice exam with 4 options. 728 students took the exam. Using a formula provided by you I ran the odds calculation in Excel as: =(4*(0.25^2))^12, which came out to .00000006. So the probability of this happening by chance is obviously extremely low, but the two persons in this pair do not have the same teacher. If we assume no electronic means of sharing information occurred, we have something that borders on winning the lottery without playing. Another interesting fact is that there were 611 out of 31,164 observations, or about 2%, having less than 1/1million odds of choosing the same wrong options by chance (<0.0000001 if I set my criterion correctly) but not having the same teacher, which, if I did my math correctly again, involves around 300 respondents. The highest TWOLOW value for this group was 31, which represents 94% of the total wrong choices made by this pair (TWOLOW/ONELOW, if I did my math correctly again), whereas the highest match for students who did have the same teacher was 13, or 31% of the total incorrectly scored non-missing items. What I’m beginning to take away is that with so many items, 4 options, and hundreds of students taking the exam, the chances of being struck by lightning several times suddenly don’t seem as unreasonable as we might think.

My question is: in your opinion, should I choose a different set of criteria for flagging possible cheating?

Mike.Linacre: Uve, this sounds like the same problem as cheating at the Casino. It is possible to guess the roulette wheel correctly 10 times in a row, but when it happens, the Casino management will examine the situation very, very closely. For instance, in one instance of suspected cheating, careful investigation revealed that the answer bubblesheets were not being collated and scanned correctly.

In the situation you describe, it is unlikely that all 4 options of all items are equally probable. So the flagging criteria would include outliers on the plots
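The equal-probability calculation quoted from Excel, and the effect of options that are not equally attractive, can be sketched as follows. The function names and the unequal option shares are hypothetical, chosen only to illustrate the point; they are not taken from the data in this thread.

```python
def same_option_probability(shares):
    """Probability that two independent respondents pick the same option,
    given the share of respondents choosing each option."""
    return sum(p * p for p in shares)

def chance_agreement(n_items, shares):
    """Probability of agreeing on every one of n_items, if items are independent."""
    return same_option_probability(shares) ** n_items

equal = [0.25, 0.25, 0.25, 0.25]      # the Excel formula: 4 * 0.25^2 per item
unequal = [0.55, 0.25, 0.15, 0.05]    # hypothetical: one option far more popular

print(chance_agreement(12, equal))    # same as (4*(0.25^2))^12
print(chance_agreement(12, unequal))  # much larger when options are unequal
```

When the options are unequally attractive, the per-item agreement probability is always at least as large as under equal shares, so the equal-probability figure understates how often matching responses occur by chance.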

267. Reconciling RSM, PCM, graphs & Tables 12.6, 3.2

uve January 19th, 2013, 10:36pm: Mike,

I apologize for all the many recent questions. As you can probably tell, I'm in the middle of many different projects and am feeling quite overwhelmed at the moment. So much to do and so many questions.

I wish to set some criteria cut points for a survey. I want to give the department that wrote and administered it several options on how to establish proper cut points. I was considering using cumulative thresholds, or median points, and wanted to demonstrate how this would apply.

When I ran the RSM, the graph relative to item difficulty (which as expected, remained the same no matter which item chosen) and the data for cumulative, or median points found in Table 3.2, appeared to match this graph. If I wanted to set a cut point based on the median point for the highest category on the easiest item (#5), Table 3.2 would do me no good and I would need to refer to something like Table 12.6. But viewing 12.6 only provided me a visual and not any precise data as seen in Table 3.2. Also, the spacing of the scale changes so it becomes harder to determine logit values beyond the -1 and 1 logit points. So, I thought what I could do was use 12.6 to get a sense of which item and category best represents the critical point, then go back to the graph but choose to have it just represent the measure and not relative to item difficulty (which as expected changed depending on the item chosen). Then I could find the item and category and draw a line from the .5 intersection line to the x-axis to gain a more accurate idea of what the logit value is. This worked and is in the attachment. I inserted in Table 12.6 what I thought these logit values might be if I were to use item #5 and its categories but I was still wondering if there was a better way.

I decided to run a PCM instead knowing full well this would likely change the interpretation but I was more curious as to understand how the two graphs and tables would reconcile with each other in this model. This time Tables 3.2, 12.6, and the Measure graph all reconcile. However, the Relative to Item Difficulty graph does not match the tables.

So here are my questions:

1) For RSM, the Measure graph appears to be the visual for Table 12.6, and the Relative graph the visual for Table 3.2, but what is the interpretation of these latter two?

2) For PCM, the Measure graph appears to be the visual for both Tables 12.6 and 3.2, and the Relative graph has no data output counterpart. Again, I'm not sure how to interpret this graph.

3) Finally, unrelated, all cumulative graphs are labeling the first three of the four categories. Shouldn't this be the last three? My understanding of the cumulative thresholds is that each marks the point at which there is a 50% probability of being observed in that category or above, versus any category below. If the first category is represented as we see in my graphs, where is the point at which a respondent has a 50% probability of being in category 1 or above, versus below? There is no category below 1.

All other graphs seem to report the categories accurately, so I'm puzzled by this.

Thanks again for your patience and guidance.

Mike.Linacre: Uve, for precise locations on the latent variable for all the threshold types for all the items, please output ISFILE=.

1) and 2) Please ignore Winsteps graphs that are not clear to you. Winsteps produces many graphs and tables that were originally requested by researchers for specific projects. They are now available to everyone, but they are not necessarily meaningful or useful to everyone. When thinking about rating scales and items, my own preference is to start with the subtables of Table 2, identifying the subtable which best communicates the functioning of the items to my audience.

3) Labeling of the thresholds is arbitrary. We can label a set of thresholds by the category above the threshold, or by the category below the threshold. Generally in Winsteps, the labeling is by the category above the threshold, but it looks like you have discovered an anomaly :-(

uve: Mike,

I forgot about ISFILE. Thanks for reminding me. Your strategy about using Table 2 makes more sense. I'll go that way from now on.

Follow-up questions for ISFILE:

1) Does CAT-.5 refer to the transition point--the ":" found in 2.2?
2) Do BOT+.25, AT CAT, and TOP-.25 refer to the category measure--the category number/label also found in 2.2?
3) Does Measure refer to the structure measure found in 2.4?
4) And 50%PRB I assume is the cumulative probability found in 2.5

Mike.Linacre: Uve, your follow-up questions for ISFILE:

Yes - 1) Does CAT-.5 refer to the transition point--the ":" found in 2.2?
Yes - 2) Do BOT+.25, AT CAT, and TOP-.25 refer to the category measure--the category number/label also found in 2.2?
Yes - 3) Does Measure refer to the structure measure found in 2.4?
In 2.3 - 4) And 50%PRB I assume is the cumulative probability found in 2.5

268. Expected Category Clarification

uve January 18th, 2013, 11:47pm: Mike,

As I read more research papers, sometimes it can get confusing reconciling interpretations. I am trying to be general in the following statements, but perhaps I am being too much so. Please correct me where I am wrong.

1) If categories are truly modal, then the intersection of any two is where there is equal probability to be observed in both.

2) If we use the median approach then the .5 probability intersection with a category determines the point where there is 50% probability of being observed in that given category and higher, or below.

3) If we use the mean approach, then mid way between two categories is the transition point. Here is my primary clarification question: we can't say this is the point where there is equal probability of being observed in both, nor can we say there is a 50% probability of being observed in that category and higher, or below. We could only use the word transition.

Using the mean approach seems to replace probability with an average category value, not a probability for choosing one category or the other, though we can always determine the probability of choosing a given category for a respondent given polytomous Rasch model. Would that be a fair statement?

Mike.Linacre: Thank you for your request, Uve.

1. More generally, at the intersection of the probability curves for any two categories, there is equal probability to be observed in both. That is forced by the definition of the probability curves. In particular, the intersection of adjacent categories is also the Rasch-Andrich threshold. If these are ordered along the latent variable, then each category is modal in turn.

2. Yes. at the place where the cumulative probability for a given category or higher is .5, there is equal probability of being observed in the given category or any higher category versus any lower category.

3. The "mean" approach is based on the expected score on an item. This multiplies the category values by their probabilities. It says nothing about the probability of being observed in a particular category. Your "fair statement" is correct.

In general, modal thresholds are more central than median thresholds, which in turn are more central than mean thresholds.

Choice of thresholds depends on the inferences that are to be made from them. To infer about an individual category: modal. To contrast top and bottom of the rating scale: median. To make summary statements about the whole rating scale: mean.
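The three kinds of threshold can be compared numerically with a short sketch. It assumes an Andrich rating scale model with the item difficulty at 0, categories numbered from 0, and made-up Rasch-Andrich thresholds of -1, 0 and 1 logits; the function names are illustrative only.

```python
import math

def category_probabilities(theta, thresholds):
    """Category probabilities for categories 0..m at ability theta (item difficulty 0)."""
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + (theta - tau))
    top = max(log_num)                      # subtract the maximum for numerical stability
    num = [math.exp(v - top) for v in log_num]
    total = sum(num)
    return [n / total for n in num]

def solve_increasing(f, target, lo=-20.0, hi=20.0):
    """Bisection: find x where the increasing function f crosses target."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def median_threshold(k, thresholds):
    """Where P(category k or higher) = .5 (cumulative, 50% threshold)."""
    return solve_increasing(
        lambda t: sum(category_probabilities(t, thresholds)[k:]), 0.5)

def mean_threshold(k, thresholds):
    """Where the expected score equals k - 0.5 (the half-point transition)."""
    return solve_increasing(
        lambda t: sum(j * p for j, p in enumerate(category_probabilities(t, thresholds))),
        k - 0.5)

taus = [-1.0, 0.0, 1.0]   # hypothetical Rasch-Andrich (modal) thresholds
for k in range(1, len(taus) + 1):
    print(k, "modal", taus[k - 1],
          "median", round(median_threshold(k, taus), 2),
          "mean", round(mean_threshold(k, taus), 2))
```

With these made-up thresholds, the lowest median threshold falls below the lowest modal threshold and the lowest mean threshold falls lower still, with the mirror-image pattern at the top of the scale, which is the ordering described above.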

269. Missing Persons?

uve January 18th, 2013, 8:07pm: Mike,

I created a scatterplot using close to 250 persons. I calibrated their scores based on two different sets of items and should be seeing far more markers than I do. Any suggestions?

Mike.Linacre: Uve, each marker represents a combination of scores on the two tests. It looks like there are about 5 dichotomous items on the y-axis and about 35 dichotomous items on the x-axis. If this is right, then the markers are reasonable.

uve: Very close: 5 items and 45. I was expecting to see 250 markers since each person has two abilities calibrated based on the 5 or 45 items.

Mike.Linacre: Uve, the raw score is the sufficient statistic for a Rasch measure. So, if everyone responded to every item, there are 6 possible measures on the 5 items, and 46 possible measures on the 45 items. Thus the maximum possible number of different markers is 276. We could have seen 250 markers if no two persons had the same pair of scores on the two tests.
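This count can be checked with a small sketch. It assumes complete data on dichotomous items, so that each raw score corresponds to exactly one measure; the function names and the example score pairs are hypothetical.

```python
def max_distinct_markers(n_items_x, n_items_y):
    """Each test of n dichotomous items has n + 1 possible raw scores (0..n)."""
    return (n_items_x + 1) * (n_items_y + 1)

def observed_markers(score_pairs):
    """Persons with the same pair of raw scores share one marker on the plot."""
    return len(set(score_pairs))

print(max_distinct_markers(45, 5))

# Hypothetical (x-score, y-score) pairs for five persons: two pairs repeat.
pairs = [(30, 4), (30, 4), (12, 2), (44, 5), (12, 2)]
print(observed_markers(pairs))
```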

270. DIF Criteria

uve January 18th, 2013, 12:09am: Mike,
The situation is a typical one I encounter when running DIF. That is, I get some significant results but when I review the item, I can't seem to find a reasonable explanation as to the source. The item below is an example, but I've made up the pairs.

Find the range of the set of ordered pairs:

(-1, -3) (1, 7) (-10, 1) (-4, 6)

On the actual test, four choices appear below the pairs. There is only one correct answer. As you can see, the wording of the item is minimal. I ran a DIF based on student classification into three groups: English learner yes (ELY), no (ELN), or reclassified (ELR) which is a learner that has been determined to be as proficient as a non-learner. Below is Winsteps output for Table 30.2:

CLASS  COUNT  AVERAGE  EXPECT  BASELINE  DIF    DIF      DIF   DIF   DIF    Prob.
                               MEASURE   SCORE  MEASURE  SIZE  S.E.  t
ELN    151    .55      .50     -.25      .05    -.47     -.22  .17   -1.25  .2137
ELR     45    .67      .62     -.25      .04    -.47     -.22  .34    -.64  .5266
ELY     47    .34      .53     -.25     -.19     .68      .93  .33    2.81  .0073

I have a feeling that this is likely a Type I error but I can't be sure. I was hoping to find some simulation studies based on sample sizes. I did find: http://www.jclinepi.com/article/S0895-4356%2808%2900158-3/abstract

The price is $31 for this seven page document. I was wondering if you knew of any other free resources that may help me out. Again, DIF appears quite common in much of my data, but the items rarely identify any cause, so much so that I think my data may not fall under the criteria in which DIF should be legitimately done.

Thanks again as always.
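As a reading aid for the three output rows above, here is a sketch of how the columns appear to relate. It assumes the fifth, seventh, eighth, ninth and tenth values in each row are the baseline measure, DIF measure, DIF size, DIF S.E. and DIF t, with size = DIF measure minus baseline measure and t roughly equal to size divided by S.E. The function name is hypothetical, and small differences from the printed values are expected because the inputs are rounded to two decimals.

```python
def dif_size_and_t(baseline_measure, dif_measure, dif_se):
    """Recompute DIF size and an approximate t from the rounded table values."""
    size = dif_measure - baseline_measure
    return size, size / dif_se

# (class, baseline measure, DIF measure, DIF S.E., reported size, reported t)
rows = [
    ("ELN", -0.25, -0.47, 0.17, -0.22, -1.25),
    ("ELR", -0.25, -0.47, 0.34, -0.22, -0.64),
    ("ELY", -0.25, 0.68, 0.33, 0.93, 2.81),
]

for cls, baseline, measure, se, reported_size, reported_t in rows:
    size, t = dif_size_and_t(baseline, measure, se)
    print(cls, round(size, 2), reported_size, round(t, 2), reported_t)
```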

Mike.Linacre: Uve, your experience matches mine (and many other researchers). Most DIF is inexplicable. In fact, when content experts are told an item is reported to have DIF, but they are not told its direction, they often predict the wrong direction! Simulation studies don't help with this situation.

So, the statistical criteria are merely a screen. They point us to which items may have substantive DIF. Then the real work starts. Of course, for situations in which there are aggressive lawyers, the conservative option is to eliminate every item that might have DIF against the lawyer's clientele (whether we know the reason or not).

Here's a free paper that may be helpful: http://www.ets.org/Media/Research/pdf/RR-12-08.pdf

uve: Yes, I guess much depends on the application and situation. Over the past three years, I have looked at hundreds of tests, each with between 30-50 items and have not found to date any reasonable explanation for any DIF reported among the hundreds of various groups being compared. The return on investment in this is very low for me given my situation and applications, but that's not to say I won't keep looking. I just need to put my limited time and resources into other aspects of Rasch measurement that provide more meaningful information. Thanks again!

271. Combined Category Label Wrong

uve January 14th, 2013, 5:00pm: Mike,

The following options are coded on a survey:

A Not At All
B Hardly Ever
C Sometimes
D A Lot

Unfortunately, the bubble sheets had an additional option E printed. Some students used this. I had Winsteps rescore it using:

Codes = ABCDE
Newscore= 12344

This worked fine, and both E and D are under the option D ‘A Lot’. However, the first two options were not used much. Looking at Table 2.5, the distance between these two categories was small, so I wanted to combine them using the following:

Codes = ABCDE
Newscore= 22344

This seemed to work; however, the category option label assigned is ‘Not At All’ or option A and not option B. This only happens when I try to combine the first two options. Do I need to do something else?

Mike.Linacre: Uve, to label the rescored options in Winsteps, please try
instead of

uve: Mike,

This did work in the sense that now there are only 3 categories numbered 2,3 and 4 which is what I wanted. But there are no labels attached to them. I'm assuming that's the difference between CLFILE and CFILE.

Mike.Linacre: Uve, does this work for you? It should ...

2 label for 2
3 label for 3
4 label for 4

uve: Mike,

The survey has only 4 options, but the scan sheets that were printed had 5. Some students chose this option, E. So I did:

A = Not At All
B= Hardly Ever
C= Sometimes
D= A Lot

This worked as you can see in the first graph. However, you'll notice category 2 is not modal, so I decided to collapse 1 with 2.

A = Not At All
B= Hardly Ever
C= Sometimes
D= A Lot

As you can see from the second graph, the numbers are correct but the labels are missing. If I use CLFILE instead, then I get the third graph, which now has the labels but the wrong ones.

Mike.Linacre: Uve, CFILE= uses the scored responses, see https://www.winsteps.com/winman/index.htm?cfile.htm Example 5.

2 Rarely or Never
3 Sometimes
4 A Lot
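
To illustrate why the labels have to be keyed to the scored values, here is a minimal Python sketch of what the CODES= / NEWSCORE= rescoring above does. It is not Winsteps code; `rescore`, `labels` and `label_for` are hypothetical names used only for this example.

```python
# Hypothetical illustration of CODES= / NEWSCORE= rescoring; not Winsteps code.
CODES = "ABCDE"
NEWSCORE = "22344"   # A and B collapse to 2; D and E collapse to 4

# Map each observed code to its rescored category.
rescore = dict(zip(CODES, NEWSCORE))

# Labels are attached to the scored categories (2, 3, 4), not to the
# original codes (A-E), mirroring the three label lines above.
labels = {"2": "Rarely or Never", "3": "Sometimes", "4": "A Lot"}

def label_for(code):
    """Return the category label that a raw response code ends up with."""
    return labels[rescore[code]]

scored = [rescore[c] for c in "ABCDE"]
print(scored)
print(label_for("A"), "|", label_for("E"))
```

Because the lookup goes through the rescored value, codes A and B share one label, and so do D and E.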

uve: Thanks. Worked great.

272. Cluster

uve January 15th, 2013, 9:56pm: Mike,

What is the criterion for assigning cluster groupings to the loadings in Table 23?

Mike.Linacre: Uve, the Fisher-linearized loadings in Table 23 are assigned to three (or fewer) clusters based on the centroids of the clusters.

273. Probability curves

Li_Jiuliang January 9th, 2013, 1:05am: Hi Mike, this is the probability curve for a component in my rating scale. How do I interpret this figure? Are there any problems with the categories? I'm not sure if there might be problems with categories 2, 3 and 4, as they are very close to each other, and it seems that category 3 does not have a distinct peak. Would you please help me with that? Thanks!

Mike.Linacre: Thank you for your questions, Li Jiulang.

Your probability curves indicate that categories 2 and 3 define narrow intervals on the latent variable. Please look at the definitions (wording) of these categories. Do the words define narrow or wide intervals?

Also, please look at the "Average Measures" for each category. Do they have ascending values?

dachengruoque: Do the narrow intervals indicate that categories 2 and 3 should be collapsed into one category? Thanks a lot, Dr. Linacre.

Mike.Linacre: Dachengruoque, please look at the "Observed Average" Measures (OBSVD AVRGE) in Winsteps Table 3.2 of the persons scored in categories 2 and 3. If they are almost the same or disordered, then collapse them. If they are noticeably different, then categories 2 and 3 represent different levels of performance.

| 0 0 3 4| -.96 -.44| .74 .73|| NONE |( -3.66)| 0 Dislike
| 1 1 35 47| .12 .30| .70 .52|| -1.64 | -.89 | 1 Neutral
| 2 2 37 49| 1.60 1.38| .75 .77|| 1.64 |( 1.88)| 2 Like
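
As a rough illustration of the check described above, the sketch below computes an observed average for each category and tests whether the averages ascend. It assumes the observed average is the mean of (person measure minus item difficulty) over the observations in a category; the data are made up, and the function names are hypothetical.

```python
from collections import defaultdict

def observed_averages(observations):
    """observations: iterable of (category, person_measure, item_difficulty)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, ability, difficulty in observations:
        totals[category] += ability - difficulty
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in sorted(counts)}

def is_ascending(averages):
    """True when every category's average exceeds that of the category below."""
    values = [averages[c] for c in sorted(averages)]
    return all(a < b for a, b in zip(values, values[1:]))

# Made-up observations, for illustration only.
data = [
    (0, -1.0, 0.0), (0, -0.5, 0.5),
    (1, 0.0, 0.0), (1, 0.5, 0.1),
    (2, 1.5, 0.0), (2, 2.0, 0.3),
]
averages = observed_averages(data)
print(averages, is_ascending(averages))
```

Averages that are nearly equal or disordered for two adjacent categories would point toward collapsing them; clearly separated averages would not.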

dachengruoque: thanks a lot, dr linacre!

274. Help with Item Difficulties

Mantas January 4th, 2013, 8:55pm: Hi guys,

I'm new to Rasch analysis and not the best mathematician. I'm trying to figure out exactly how to compute an individual's score or measure. I have been trying to understand this: https://www.rasch.org/rmt/rmt122q.htm

So I've created a control file with 3 questions and an answer string of "ABC". I have an anchor file (IAFILE) with the following difficulties for each question:
1 1.2
2 1.4
3 1.8

My data file contains 4 people's answer string.

555555555FABC // all right
666666666FABA // 2/3
777777777FAAA // 1/3
888888888FCCA // 0/3

each got a score or measure of

; 1 3.65
2 2.17
3 .76
; 4 -.72

I'm assuming the 'measure' here is the Rasch score of an individual's test, based on the predefined item/question difficulties.

How can I compute these manually? How did winsteps make these measure values?

Mantas: I've edited original post. must have not saved over the IAFILE.

Current values posted above are what winsteps returns. How Can I compute them manually to get those measures?

Mike.Linacre: Mantas, the Excel spreadsheet at www.rasch.org/moulton.htm imitates Winsteps.

Mantas: Which table should I look at for calculating predefined item difficulties, in relations to a persons score?

Mantas: I changed the difficulties to:

1 1
2 1
3 1
4 1

and person answer data:


and I get measurements :

; 1 3.47 0 4.0 4.0
2 2.10 1 4.0 3.0
3 1.00 1 4.0 2.0
4 -.10 1 4.0 1.0
; 5 -1.16 -1 3.0 .0

I've confused myself even more now....

Mantas: So I'm using C# to successfully compute a person's ability:

//He got 3/4
double P = .75;


double logit = Math.Log(P / (1.00 - P));
Console.WriteLine("Ability logit : " + logit);

double M2 = logit;

now im trying to get his score like this

double top = Math.Pow(2.7183, (M2 - 1.00));
double bot = (1.00 + Math.Pow(2.7183, (M2 - 1.00)));

Console.WriteLine("FINAL RASCH SCORE : " + (top/bot));

but this is incorrect

Mantas: So if I multiply my final score (from the post above) by the total number of questions, I get a correct measurement for the person who got 3/4 right. But the rest don't work. :/

Mike.Linacre: Mantas: please explain: "but the rest don't work."

What is "the rest"?

Mantas: I meant the other people's scores don't match Winsteps.

Mantas: I'm struggling to program this equation for score: https://www.rasch.org/rmt/gifs/g16.gif

I calculate persons ability like this

double logit = Math.Log(P / (1.00 - P));

where P = (Right / Total);
and this matches the Excel file you provided in a response above.

Then for score I'm doing this, summing over the questions:

( e ^ ( (person's ability) - (Question 1 Difficulty) ) ) / ( 1 + e ^ ( (person's ability) - (Question 1 Difficulty) ) )
+
( e ^ ( (person's ability) - (Question 2 Difficulty) ) ) / ( 1 + e ^ ( (person's ability) - (Question 2 Difficulty) ) )
+
( e ^ ( (person's ability) - (Question 3 Difficulty) ) ) / ( 1 + e ^ ( (person's ability) - (Question 3 Difficulty) ) )

= score?

Edit: removed "^2"; that's for variance, not score.

Mantas: but with question difficulties of :
1 1.0
2 1.0
3 1.5
4 2.5

and a person who gets 3 right and 4 wrong I get :


and winsteps tells me its 2.69

Mantas: btw : the measures I am trying to recreate are in the second column in


Mantas: What am I missing! (Going crazy over here)

Mike.Linacre: Mantas, this equation
double logit = Math.Log(P / (1.00 - P));
is correct when all the items have the same difficulty.

When items have different difficulties, then we usually need to iterate, as is demonstrated in https://www.rasch.org/moulton.htm
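
A minimal sketch of such an iteration, written in Python rather than the C# used earlier in the thread: with the item difficulties held at their anchored values, Newton-Raphson adjusts the ability until the expected score (the sum of the Rasch probabilities) equals the observed raw score. The function names are hypothetical, and this is not the Winsteps algorithm itself, only a sketch that gives matching values for the non-extreme scores quoted in this thread.

```python
import math

def expected_score(ability, difficulties):
    """Sum of Rasch success probabilities = expected raw score."""
    return sum(1.0 / (1.0 + math.exp(d - ability)) for d in difficulties)

def estimate_ability(raw_score, difficulties, tol=1e-6, max_iter=100):
    """Newton-Raphson maximum-likelihood ability for a non-extreme raw score."""
    n = len(difficulties)
    if not 0 < raw_score < n:
        raise ValueError("extreme scores need a separate adjustment")
    # Start from the equal-difficulty formula, shifted by the mean difficulty.
    ability = sum(difficulties) / n + math.log(raw_score / (n - raw_score))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - ability)) for d in difficulties]
        expected = sum(probs)
        variance = sum(p * (1.0 - p) for p in probs)  # model variance of the score
        step = (raw_score - expected) / variance
        ability += step
        if abs(step) < tol:
            break
    return ability

print(round(estimate_ability(2, [1.2, 1.4, 1.8]), 2))
print(round(estimate_ability(1, [1.2, 1.4, 1.8]), 2))
print(round(estimate_ability(3, [1.0, 1.0, 1.5, 2.5]), 2))
```

Scores of zero or all correct have no finite estimate, so the sketch rejects them; the measures Winsteps reports for those persons come from an extreme-score adjustment that is not reproduced here.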

Mantas: Hey, thanks a lot, Mike. I'll try to continue to figure this out tomorrow. That Excel sheet is not the friendliest thing :).

If you could, can you please give me a small example of how the ability equation would change over different item difficulties? Would it be the "1.00" that changes? I will try to play around with the values tomorrow.

Thanks for your help.

Mike.Linacre: Mantas, how about starting from the PROX approximate estimation equation:

Ability of a person = (item spread)*Math.Log(right/wrong by the person)

Difficulty of an item = Mean Person Ability + (person spread)*Math.Log(wrong/right on the item)

Spread = square-root ( 1 + variance/2.9)

These formulae are iterated until the ability and difficulty do not change.

For derivation, see https://www.rasch.org/rmt/rmt83g.htm
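
As a hedged sketch of the ability half of PROX for the anchored case: when the item difficulties are fixed, nothing on the item side changes, so a single pass gives the approximation. The sketch assumes the usual PROX form in which the mean item difficulty is added (the anchored items in this thread are not centered at zero) and uses the sample variance of the item difficulties; `prox_ability` is a hypothetical name.

```python
import math

def prox_ability(right, difficulties):
    """One-pass PROX approximation of ability on anchored items."""
    n = len(difficulties)
    wrong = n - right
    if right <= 0 or wrong <= 0:
        raise ValueError("PROX needs at least one right and one wrong answer")
    mean = sum(difficulties) / n
    # Sample variance of the item difficulties (0 when there is only one item).
    if n > 1:
        variance = sum((d - mean) ** 2 for d in difficulties) / (n - 1)
    else:
        variance = 0.0
    item_spread = math.sqrt(1.0 + variance / 2.9)
    return mean + item_spread * math.log(right / wrong)

print(round(prox_ability(2, [1.2, 1.4, 1.8]), 2))
print(round(prox_ability(1, [1.2, 1.4, 1.8]), 2))
```

For the three anchored difficulties posted earlier (1.2, 1.4, 1.8), this lands close to the measures reported for the persons with 2 and 1 correct, although PROX remains an approximation.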

thanks again for your help.

Mantas: I have written a C# program that mimics everything that happens in the Excel file https://www.rasch.org/moulton.htm ... I originally assumed that within the "OBSERVED RAW VALUES", Difficulty Logit is where I would be able to substitute in my predefined difficulties for each item. However, I now realize that that is not the case, and that Winsteps does something different in calculating the person's measure when the questions have predefined difficulties. Is there maybe another Excel link which shows the Winsteps calculations with all questions anchored at predefined values? Is the person's measurement calculated much differently with anchored questions than it is in the Moulton Excel file?
I am having much trouble understanding the math in https://www.rasch.org/rmt/rmt83g.htm .

In the equations you posted above, can you explain how 'variance' is obtained and what 'item spread' is? Any help or a nudge in the right direction would be greatly appreciated.


Mantas: Here is an image of my winsteps files

Mike.Linacre: Mantas: the moulton.htm Excel spreadsheet shows the process. If the item difficulties are anchored, then do not change the item estimates from their fixed values. Thus
in the spreadsheet: all these rows of item difficulties contain the anchor values: 35,35, 65, 66, 95,96, 125, 126, and so on ....

person variance = person measure standard deviation squared
item variance = item measure standard deviation squared

person spread = square-root(1 + person variance/2.9)
item spread = square-root(1 + item variance/2.9)

"When run with 1 person" - please be sure to set the convergence criterion very tight, and to report many decimal places.
In Winsteps

Mantas: SUCCESS!!!!

Thank you very much for helping me through this Mike Linacre

275. Identity equating

timstoeckel January 15th, 2013, 6:17am: Hi Mike,
I am trying to use Rasch analysis to balance four versions of a second-language vocabulary test so that raw scores can be used in reporting. The purpose in using raw scores is that most students and teachers find raw scores in the form of percentages to be more meaningful than scaled scores or person measures. There is also a history of raw score reporting in the well-known second-language vocab tests.

The items were piloted and initially calibrated. They were then distributed across the four test forms, so that each included 60 unique and 30 potential anchor items. These four test forms were then field tested, and the data from field testing was used to simultaneously calibrate the item bank. These item calibrations were then "fixed" in each of the four separate test forms (using the IAFILE command in Winsteps) and score tables (20.1) for each form were generated. The table below shows a partial compilation of results, comparing Rasch person measures on each of the test forms for raw scores between 37 and 42 (which were around the mean).

The Handbook of Test Development (pp 516-517) explains that one way to decide between statistical vs identity equating is to compare (a) an estimate of "the bias that would be injected by not equating" and (b) the estimated standard error of equating.

I have two questions:

1) Am I correct to understand that (b) is the model error reported in Winsteps Table 3.1 or in Table 20.1, as shown below?

2) Does Winsteps give information pertaining to (a)? The Handbook gives a citation for finding one method for calculating (a), but maybe I can glean what I am looking for from something Winsteps already produces(?).

Table. Comparison of raw scores with Rasch person measures (S.E.) across four test forms

Raw score | Form A | Form B | Form C | Form D
37 | .82 (.33) | .58 (.34) | .78 (.34) | .90 (.34)
38 | .94 (.34) | .69 (.34) | .89 (.34) | 1.01 (.34)
39 | 1.05 (.34) | .81 (.34) | 1.01 (.34) | 1.13 (.35)
40 | 1.16 (.34) | .93 (.35) | 1.12 (.34) | 1.25 (.35)
41 | 1.28 (.34) | 1.05 (.35) | 1.24 (.34) | 1.37 (.35)
42 | 1.40 (.35) | 1.17 (.35) | 1.35 (.34) | 1.50 (.36)

You can see that at each raw score (row), the person measures are within one SE of each other. For test forms A and C, they are much closer. It looks to me that perhaps the distribution of items could be tweaked by exchanging some items between forms B and D, but I am not sure that that is even needed.

As always, thank you for your help.


Mike.Linacre: Thank you for your questions, Tim.

Looking across your Test Form table, we see that a measure of 1.16 logits corresponds closely to raw scores of 40, 42, 40, 39. There is definitely a score bias by Form if we report the raw scores strictly by Form (regardless of the standard errors).

If a raw score or raw-score-based percentage must be reported, then it would be fairer and more accurate to report the equivalent Form A raw score for everyone, regardless of which Form they answered. Another useful reporting approach is to convert logits into percentage-like numbers, such as reported-number = logit*10 + 50.
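
Both reporting approaches can be sketched in a few lines of Python. The measures are copied from the partial table in this thread (raw scores 37 to 42 only); the function names are hypothetical, and "equivalent" here simply means the Form A raw score whose measure is nearest.

```python
# Person measures (logits) for raw scores 37-42, copied from the table above.
MEASURES = {
    "A": {37: 0.82, 38: 0.94, 39: 1.05, 40: 1.16, 41: 1.28, 42: 1.40},
    "B": {37: 0.58, 38: 0.69, 39: 0.81, 40: 0.93, 41: 1.05, 42: 1.17},
    "C": {37: 0.78, 38: 0.89, 39: 1.01, 40: 1.12, 41: 1.24, 42: 1.35},
    "D": {37: 0.90, 38: 1.01, 39: 1.13, 40: 1.25, 41: 1.37, 42: 1.50},
}

def nearest_raw_score(form, measure):
    """Raw score on `form` whose person measure is closest to `measure`."""
    table = MEASURES[form]
    return min(table, key=lambda raw: abs(table[raw] - measure))

def form_a_equivalent(form, raw_score):
    """Form A raw score that best matches a raw score earned on `form`."""
    return nearest_raw_score("A", MEASURES[form][raw_score])

def reported_number(logit):
    """Percentage-like rescaling of a logit measure: logit*10 + 50."""
    return logit * 10 + 50

print([nearest_raw_score(form, 1.16) for form in "ABCD"])
print(form_a_equivalent("B", 40), form_a_equivalent("D", 40))
print(round(reported_number(1.16), 1))
```

Because only six raw scores per form are included, lookups near the ends of the range are clamped; a full application would use the complete Table 20.1 score table for each form.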

Sorry, "statistical vs identity equating" is new to me. Perhaps someone else reading this knows the answers to those questions.

timstoeckel: Hi Mike,

Thanks very much for your reply. I am waiting for some articles on this topic via interlibrary loan, and I will update this thread with what I find out.


276. Loadings Switched?

uve January 14th, 2013, 6:42pm: Mike,

I recently finished my dimensionality simulation using 250 persons and 30 items. 30 data sets were created total. The first simulates 5 items among the 30 total to be on the second dimension. The five items were simulated to have no correlation with the primary dimension. The second set is the same but simulates 10% common variance. The third set simulates 20% common variance and this repeats until there is 90%. The next group of ten data sets repeats but this time 10 items are substituted for the second dimension. And finally, another group of 10 sets is simulated but substituting 15 items. With two exceptions, Winsteps successfully identifies the second dimension items and assigns the positive loadings to them. There appears to be some blur or overlap as more common variance is added to the 2nd dimension, but that was expected. The two that failed were for the 15 item substitution using 0 and 20% common variance.

I've attached an example of one of the successful runs (5 items, no common variance) along with one of the two that "failed." The first 15 items were simulated to be on the second dimension, so the positive loadings should be attributed to these but are instead attributed to the items representing the primary dimension. So there appears to be a switch. I remember a previous post of yours in which you stated that the loadings are arbitrary to some extent. My apologies if I am not paraphrasing you correctly.

My question is should I simply reverse the signs on the items, or switch the item numbers? Put another way, for the second dimension, should I switch the loading range so it reads -.20 to -.50 or should I say items 1-15 are in the range .20 to .50 instead of items 16-30?

As always, I greatly appreciate your help.

Mike.Linacre: Uve, the convention in Factor Analysis (PCA, Common Factor) is that the bigger loading is reported as positive. Since the loadings are correlations with a hypothetical dimensional variable, and we cannot know the direction of that variable, the choice of which "end" is positive is arbitrary. Please reverse the signs of the loadings to suit your purpose.

277. Reliability & Variance

uve January 14th, 2013, 1:35am: Mike,

You were quoted recently in an article. I found the reference in Winsteps relating to PCA:

Tentative guidelines: "Reliability" (= Reproducibility) is "True" variance divided by Observed variance. If an acceptable, "test reliability" (i.e., reproducibility of this sample of person measures on these items) is 0.8, then an acceptable Rasch "data reliability" is also 0.8, i.e., "variance explained by measures" is 4 times "total unexplained variance".

I was curious as to the final line of the comment. I have analyzed many instruments with model reliabilities greater than .80 but none of them had ratios of explained to unexplained of 4 times. Can you provide further explanation?

Mike.Linacre: Thank you for your question, Uve.

That sentence of mine is incorrect. I am correcting it.

Reliability is based on observed variance (of the relevant measures) and the true variance (of the relevant measures).
Reliability = True variance / Observed Variance
Observed variance = True Variance + Error Variance
So, if Reliability = 0.8
Then True Variance = 4 * Error Variance
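
The arithmetic behind that last step can be checked with a short Python sketch (the function names are hypothetical): substituting Observed = True + Error into the reliability definition gives True / Error = Reliability / (1 - Reliability).

```python
def true_to_error_ratio(reliability):
    """True variance / error variance implied by a reliability coefficient."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be in [0, 1)")
    return reliability / (1.0 - reliability)

def reliability_from_variances(true_variance, error_variance):
    """Reliability = True variance / (True variance + Error variance)."""
    return true_variance / (true_variance + error_variance)

print(round(true_to_error_ratio(0.8), 6))
print(reliability_from_variances(4.0, 1.0))
```

So a reliability of 0.8 corresponds to true variance being 4 times the error variance, and a reliability of 0.9 to 9 times.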

But this definitely does not apply to "data reliability". That sentence must have been written before https://www.rasch.org/rmt/rmt221j.htm - we can see that 0.8 (=80%) is almost never reached in practice. Here is the nomogram.

278. seeding

uve January 9th, 2013, 7:46pm: Mike,

When simulating data in Winsteps I want to be able to create a random data set but be able to reproduce it later or have someone else using Winsteps to be able to reproduce the same random set. It seems that this is where choosing a seed number comes in handy.

I will be creating three sets and thought I would choose a different seed number for each but record them so if I needed to replicate these three random sets again later, I could. Is this a viable procedure?

Mike.Linacre: Yes, that is viable, Uve. SISEED=
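
The idea of recording a seed can be illustrated with a small Python sketch of dichotomous Rasch data simulation. This is only an analogy: it uses Python's own random generator, not the one behind SISEED=, so it will not reproduce Winsteps data sets. `simulate_responses` and the example values are hypothetical.

```python
import math
import random

def simulate_responses(abilities, difficulties, seed):
    """Simulate 0/1 Rasch responses; the same seed gives the same data."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    data = []
    for b in abilities:
        row = []
        for d in difficulties:
            p = 1.0 / (1.0 + math.exp(d - b))  # Rasch probability of success
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

abilities = [-1.0, 0.0, 1.0]
difficulties = [-0.5, 0.0, 0.5]
first = simulate_responses(abilities, difficulties, seed=12345)
again = simulate_responses(abilities, difficulties, seed=12345)
print(first == again)
```

Recording one seed per simulated data set, as proposed above, is what makes each set reproducible later.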

279. rater measurement report

Li_Jiuliang January 9th, 2013, 1:09am: Hi Mike, the following is a FACETS output of the rater measurement report. Am I right to interpret it as follows: FACETS provides a chi-square test of the hypothesis that all raters participating in the scoring session exercised the same degree of leniency when rating examinees. As reported in the Table, the resulting chi-square test is statistically significant (X2=43.7, d.f.=2, p<.01). In other words, the probability is <.01 that the raters who participated in the scoring session exercised equal leniency.

Mike.Linacre: You are correct, Li Jiulang. There is almost zero probability that the rater measures are the same, except for measurement error.
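
To see how small that probability is, the sketch below assumes a standard information-weighted homogeneity chi-square (weights = 1/SE squared, d.f. = number of raters minus 1), which may differ in detail from the FACETS computation. With 2 degrees of freedom the upper-tail probability has the closed form exp(-chi-square/2). The function names are hypothetical.

```python
import math

def homogeneity_chi_square(measures, standard_errors):
    """Weighted chi-square for 'all measures are the same apart from error'."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    weighted_mean = sum(w * m for w, m in zip(weights, measures)) / sum(weights)
    chi_square = sum(w * (m - weighted_mean) ** 2
                     for w, m in zip(weights, measures))
    return chi_square, len(measures) - 1

def p_value_df2(chi_square):
    """Upper-tail probability of a chi-square with exactly 2 degrees of freedom."""
    return math.exp(-chi_square / 2.0)

p = p_value_df2(43.7)
print(p < 0.01, p)
```

For the reported chi-square of 43.7 with 2 d.f., the probability is far below .01.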

281. Local Independence with Polytomous Items

alicihan January 5th, 2013, 11:01am: Hello.

I have a question about local independence. How can we investigate local independence with polytomous items? Which techniques should we use? Thanks for your interest.


Mike.Linacre: Thank you for your question, Alicihan.

Local independence between items produces close-to-zero correlation between the response residuals. (Residual = observed value - Rasch-expected value).

In Winsteps, this is reported as "Dimensionality" in Table 23. This is because items that are locally dependent generate their own dimension.

For the correlation between every pair of items, look at the Winsteps ICORFILE= output file.
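
As a rough sketch of the underlying idea, the Python below correlates the raw residuals (observed minus Rasch-expected) of two items across persons. The observed and expected values are made up; in practice the expected values would come from the estimated polytomous Rasch model, and Winsteps may use standardized rather than raw residuals. The function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residual_correlation(observed, expected, item_a, item_b):
    """Correlate the residuals of two items; rows are persons, columns are items."""
    res_a = [o[item_a] - e[item_a] for o, e in zip(observed, expected)]
    res_b = [o[item_b] - e[item_b] for o, e in zip(observed, expected)]
    return pearson(res_a, res_b)

# Made-up ratings (0-4) and model-expected values for 4 persons on 3 items.
observed = [[3, 3, 1], [1, 1, 2], [4, 4, 0], [0, 0, 3]]
expected = [[2.5, 2.5, 1.5], [1.5, 1.5, 1.5], [3.0, 3.0, 1.5], [1.0, 1.0, 1.5]]

r01 = residual_correlation(observed, expected, 0, 1)
r02 = residual_correlation(observed, expected, 0, 2)
print(round(r01, 3), round(r02, 3))
```

In this toy data, items 1 and 2 have identical residuals, so their correlation is 1, the pattern expected from strongly locally dependent items; near-zero correlations are what local independence predicts.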

282. items differential functioning

albalucia January 4th, 2013, 11:39pm: Hello guys, I have problems changing the instructions for the attached file.
Please help me with the possible answers.

Mike.Linacre: That file looks correct, Albalucia.

Do you want to do Differential Item Functioning for Sexo?

1. Analyze that file with Winsteps.
2. Winsteps menu bar: Output Tables
3. Click on Table 30. DIF
5. Click on: Table and Plot
6. Click on: OK

The Winsteps DIF Table, and the Excel DIF plot should appear.

283. Simulating dimensionality

uve January 3rd, 2013, 9:25pm: Mike,
I’ve recently been attempting to model my own simulated dimensional analysis patterned after the work of Dr. Everett V. Smith Jr. presented in his paper, “Detecting and Evaluating the Impact of Multidimensionality using Item Fit Statistics and Principal Component Analysis of Residuals” in Introduction to Rasch Measurement, pages 575-600. Three data sets were created, each with varying degrees of common variance from 0-90%. The three sets were based on a ratio of how many items represented each of the two dimensions of a 30-item 5-point instrument. These were 25:5, 20:10, and 15:15 as the primary and secondary dimensions respectively. So the first data set was simulated to have 25 items represent the primary dimension and 5 the secondary.

I’ve managed to simulate the person ability levels for 500 respondents using the N(0,1) distribution for each level of common variance between the two dimensions at 0, 10, 20. . .90% . The problem is how to generate the uniform item distributions. Here is the excerpt from the paper, page 582.

“The item difficulties used in the simulations were uniformly distributed in sets of five items (-1, -.5, 0, .5, and 1) so the number of items representing each component would not have an influence on the mean or distribution of the item difficulties for that data set (Smith, 1996). The step values were fixed at -1, -.33, .33, and 1, thus representing a 5-point rating scale.”

The 1996 article he refers to is, “A comparison of methods for determining dimensionality in Rasch measurement.” Structural Equation Modeling, 3, 25-40. I suspect this provides more detail on how the items were generated, but I’ve failed to get a copy thus far.

So I am still unclear how the 5 item sets were generated. A uniform distribution would require lower and upper bound parameters to be provided which I don’t see presented, but perhaps I’m missing something very basic in the meaning of the numbers provided in the parentheses.

Mike.Linacre: Uve, are you using a simulation method similar to https://www.rasch.org/rmt/rmt213a.htm ?

We usually don't do a random sample from a uniform distribution. Instead we generate item difficulties with uniform spacing; then we know that we will get exactly what we want.

uve: Close. I generated the ability levels in Excel. I am stuck at the item stage. Once I get these, I will use PAFILE, IAFILE, EDFILE and SAFILE commands in Winsteps to generate the responses.

It sounds then that Dr. Smith generated five identical item sets each with the following measures for each of the five items in each set (-1, -.5, 0, .5 and 1). Would that be correct? He says that they were "uniformly" distributed so that's why I was thinking of a random uniform distribution.

Mike.Linacre: Yes, that is correct, Uve.

5 equally spread-out (= uniformly spread) items.

uve: Mike,

I'm assuming then that the item distribution could have looked something like the list below:

1 -1
2 -0.5
3 0
4 0.5
5 1
6 -1
7 -0.5
8 0
9 0.5
10 1
11 -1
12 -0.5
13 0
14 0.5
15 1
16 -1
17 -0.5
18 0
19 0.5
20 1
21 -1
22 -0.5
23 0
24 0.5
25 1
26 -1
27 -0.5
28 0
29 0.5
30 1

Mike.Linacre: Looks reasonable, Uve. They could have been generated as 6 datasets of 5 items, and then rectangular-pasted together.
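
A short Python sketch of that construction: repeat the equally spaced set (-1, -.5, 0, .5, 1) six times and write the lines as item number and difficulty, in the same layout as the list above. The names are hypothetical; the output could be pasted into an item anchor file.

```python
SET_OF_FIVE = [-1.0, -0.5, 0.0, 0.5, 1.0]

def build_item_difficulties(n_sets):
    """Repeat the equally spaced set of five difficulties n_sets times."""
    return [d for _ in range(n_sets) for d in SET_OF_FIVE]

items = build_item_difficulties(6)

# One "item-number difficulty" line per item, as in the list above.
anchor_lines = [f"{number} {difficulty:g}"
                for number, difficulty in enumerate(items, start=1)]
print("\n".join(anchor_lines[:5]))

# Any split made in whole sets of five keeps the same mean difficulty.
print(sum(items[:25]) / 25, sum(items[25:]) / 5)
```

Because each set of five is centered at zero, the 25:5, 20:10 and 15:15 splits all leave both dimensions with the same mean item difficulty, which is the property described in the quoted passage.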
