Old Rasch Forum - Rasch on the Run: 2012

65. Some question on equating and linking method

iyliajamil December 14th, 2012, 3:07am: Hi,
I'm Iylia, a PhD student from University Science of Malaysia.
My research is on developing a Computer Adaptive Test (CAT) for Malaysian Polytechnic diploma-level students. One requirement for developing a CAT is a calibrated item bank. I am interested in the Rasch model and intend to use it to equate and link the questions in the item bank.

First, about the item bank: is it necessary to come up with more than 1,000 items to build the item bank? I have read somewhere that it needs at least 12 times more questions than the real test set. Let's say we want 1 set of items containing 30 questions, so we must prepare 360 questions for the item bank.

Second, when we assign our items to a few sets of questions, what is the minimum number of respondents needed to answer each set if we want to equate it using the Rasch model? Are 30 respondents enough?

Thirdly, I have read some journal articles on equating and linking methods saying that a set of anchor items is needed to equate a few sets of test items.
The question is:
Is there a rule stating the minimum number of anchor items?
While reading I came across some who say 2 items, some who say 5 items, and some who say 10 items. What is your opinion as an expert in Rasch?

That's all for now. I really hope to get feedback from you.

Mike.Linacre: Thank you for your questions, iyliajamil.

There is no minimum size to an item bank, and there is no minimum test length, and there is no need for anchor items.

You are developing a CAT test. There will be no anchor items. You don't want anchor items, because those items will be over-used, and so are likely to become exposed.

If you are administering a high-stakes CAT test, then persons with abilities near crucial pass-fail points will be administered 200 items. Those far from pass-fail points (usually very high or very low ability) will be administered about 30 items, at least enough for complete coverage of the relevant content areas.

If you are administering a high-stakes CAT test with 200 items administered, and this is administered to 10,000 students, then you will need a very large item bank (thousands of items) in order to avoid over-exposure of items with difficulties near pass-fail points.

Or, if your test is low-stakes, and item exposure is not a problem, and low measurement precision is acceptable, then the item bank can be small, and only a few test items need to be administered. See https://www.rasch.org/rmt/rmt52b.htm

You may also want to look at www.rasch.org/memo69.pdf

iyliajamil: Thank you for your swift reply.

If we don't have anchor items, how do we link one test to another?
To use CAT we need an item bank in which each item has a difficulty level.

From what I know, in order to get the difficulty level of each question using Rasch, students must answer the question. Let's say I have 500 questions to put in an item bank. It is impossible to get a student to answer all 500 questions, so I need to break them into a few sets containing fewer items, for example 50 questions to 1 set. So I have 10 sets of questions that will be answered by 10 different groups of students.

Can you explain this to me?
Thank you.

Mike.Linacre: There are three stages in this process, iyliajamil.

1) Constructing the starting item bank. This is done once.

You have 500 questions you want to put into your bank. You can only administer 50 questions to one person.

So construct 11 tests with this type of linked design:

Test 1: 5 items in common with Test 11 + 40 items + 5 items in common with Test 2
Test 2: 5 items in common with Test 1 + 40 items + 5 items in common with Test 3
Test 3: 5 items in common with Test 2 + 40 items + 5 items in common with Test 4
...
Test 10: 5 items in common with Test 9 + 40 items + 5 items in common with Test 11
Test 11: 5 items in common with Test 10 + 40 items + 5 items in common with Test 1

These 11 tests have 495 items, so adjust the numbers if there are exactly 500 items.

Administer the 11 tests to at least 30 on-target persons each.

Concurrently calibrate all 11 datasets.
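The circular linked design above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `chain_design` and the item IDs (`L..` for link items, `U..` for unique items) are hypothetical and are not part of Winsteps or of the original post.

```python
def chain_design(n_tests=11, unique_per_test=40, link_size=5):
    """Return one list of item IDs per test for a circular linked design."""
    # Link block k is shared by test k+1 and test k+2 (1-based), wrapping at the end.
    links = [[f"L{k + 1:02d}_{i}" for i in range(1, link_size + 1)]
             for k in range(n_tests)]
    tests = []
    for t in range(n_tests):
        unique = [f"U{t + 1:02d}_{i}" for i in range(1, unique_per_test + 1)]
        # links[t - 1] is shared with the previous test (index -1 wraps to the
        # last block, so Test 1 links back to Test 11); links[t] with the next.
        tests.append(links[t - 1] + unique + links[t])
    return tests


tests = chain_design()
all_items = {item for test in tests for item in test}
print(len(tests), "tests,", len(tests[0]), "items each,",
      len(all_items), "distinct items")
```

With the default arguments each test has 50 items and the bank has 11 x 40 + 11 x 5 = 495 distinct items, matching the count given above.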

2) Administering CAT tests. This may be done every day.
This uses the item difficulties from 1)
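As a rough illustration of how stage 2 can use the calibrated difficulties, here is a minimal Python sketch of one common selection rule: administer the unused item whose difficulty is closest to the current ability estimate, which is where a dichotomous Rasch item is most informative. The bank values and the name `next_item` are hypothetical, and an operational CAT would add the exposure control and content coverage discussed earlier in this thread.

```python
def next_item(ability, bank, administered):
    """Return the unused item whose difficulty is closest to `ability`."""
    candidates = {name: d for name, d in bank.items() if name not in administered}
    return min(candidates, key=lambda name: abs(candidates[name] - ability))


# Hypothetical calibrated bank: item name -> difficulty in logits.
bank = {"item_a": -1.2, "item_b": -0.3, "item_c": 0.4, "item_d": 1.5}

first = next_item(0.0, bank, administered=set())
# Suppose the ability estimate rises to 0.5 logits after a correct answer.
second = next_item(0.5, bank, administered={first})
print(first, second)
```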

3) Maintaining the item bank and adding new items. This may be done twice a year.


iyliajamil: Thank you, it's very helpful.

I will get back to you after I read the articles you linked, if there is anything in them I don't understand.

iyliajamil: Hi, it's me again.

I already have the data for my items.

It contains a total of 1080 MCQ-type items, divided into 36 sets. I have followed your earlier advice: in each set there are 25 unique items and 10 common items (5 common items as the first 5 items in the set and 5 more as the last 5 items).

The question now is:

How am I going to analyze these data? Most of the examples show a linking method for only two sets of data.

I'm using the common-item equating method.

Is it possible to do all 36 sets at once, or do I have to do them in pairs?
For example:
set 1 with set 2
set 2 with set 3, until set 36 with set 1

I'll be waiting for your reply.
Thank you.

iyliajamil: There is one more question.

Usually, for item analysis using Rasch, the first step is to look at
the fit table, eliminate the items that do not fit, and run the analysis again.
Am I correct?

So if we have to do the equating:

Do we eliminate the misfitting items in each set before we do the equating, or after?

Mike.Linacre: Glad to see you are making progress, Iyliajamil.

1. Analyze each of the 36 sets by itself. Verify that all is correct. Don't worry about misfit yet. Be sure that the person labels for each set include a set identifier.

2. Analyze all 36 sets together. In Winsteps, MFORMS= can help you.

3. Do an "item x set identifier" DIF analysis. This will tell you if some of the common items have behaved differently in different sets. If they have, use MFORMS= to split those common items into separate items.

4. Now worry about misfit - if it is really serious. See: https://www.rasch.org/rmt/rmt234g.htm - in this design, the tests are already short, so we really want to keep all the items that we can.

iyliajamil: thanks for your swift reply.
i will try to do it.
will get back to you asap.

iyliajamil: Hi, it's me again.

What do you mean by including a set identifier?
I tried to create a control file using the MFORMS= function but have not yet tried to run it.

I have now done as instructed in (1) and (2) of your previous post. I am just not sure what a set identifier really means.

I attach an example of my control file for analyzing each set individually and the control file using MFORMS= that I created following the example. (My data are dichotomous (0,1).)

Thank you.

iyliajamil: Example control file for an individual set.

I am not sure if you can open the file that I attached. If not, can I have your email so that I can send it directly to you?
Thank you.

iyliajamil: I labeled the persons as follows, for example:

persons who answered set 1: 0101 to 0134 (example)

persons who answered set 13: 1301 to 1340 (example)

The first two digits represent the set number and the third and fourth represent the person ID.

Is this what you mean by a person label with a set identifier?

Mike.Linacre: Yes, iyliajamil.

Now you can do an item by answer-set DIF analysis in a combined analysis of all the subsets.

If 0101 are in columns 1-4 of the person label, then
DIF = S1W2 ; the set number
for an analysis with Winsteps Table 30.
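To make the "start column, width" idea concrete, here is a small Python sketch of reading a set identifier from the person label. It only mimics the column arithmetic and is not Winsteps code; the helper name `set_identifier` is hypothetical, and the labels are the examples given earlier in this thread.

```python
from collections import defaultdict


def set_identifier(person_label, start=1, width=2):
    """Return `width` characters of the label, beginning at 1-based column `start`."""
    return person_label[start - 1:start - 1 + width]


# Example labels from the thread: set number (2 digits) + person ID (2 digits).
labels = ["0101", "0134", "1301", "1340"]

groups = defaultdict(list)
for label in labels:
    groups[set_identifier(label)].append(label)

for set_number, members in sorted(groups.items()):
    print(set_number, members)
```

Each distinct value found in those columns becomes one classification group for the item-by-set DIF comparison.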

iyliajamil: Thank you, Mr. Linacre.

I have tried running my control file using MFORMS=, and it is working.

I'll get back to you after I run the DIF analysis.

iyliajamil: Hi Mr. Linacre.
I have run all 36 of my test sets separately; some of them have several misfitting items.

I have also tried combining all 36 sets into one using MFORMS=.

My problem now is that I don't know how to do the DIF analysis to figure out which
items behave differently.

If there are such items, how do I separate them using MFORMS=?

Do I need to declare the set identifier in the control file of each set? If yes, how do I do it?

Looking forward to your explanation.

Thank you.

This is an example of my control file for one set:

MY mforms control file:

TITLE="Combination 36 set test item"
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=5 ;item 1-5 in colums 5-9
I31-35=10 ;item 31-35 in columns 10-14
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=10 ;item 1-5 in colums 10-14
I31-35=15 ;item 31-35 in columns 15-19
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=15 ;item 1-5 in colums 15-19
I31-35=20 ;item 31-35 in columns 20-24
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=20 ;item 1-5 in colums 20-24
I31-35=25 ;item 31-35 in columns 25-29
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=25 ;item 1-5 in colums 25-29
I31-35=30 ;item 31-35 in columns 30-34
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=30 ;item 1-5 in colums 30-34
I31-35=35 ;item 31-35 in columns 35-39
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=35 ;item 1-5 in colums 35-39
I31-35=40 ;item 31-35 in columns 40-44
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=40 ;item 1-5 in colums 40-44
I31-35=45 ;item 31-35 in columns 45-49
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=45 ;item 1-5 in colums 45-49
I31-35=50 ;item 31-35 in columns 50-54
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=50 ;item 1-5 in colums 50-54
I31-35=55 ;item 31-35 in columns 55-59
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=55 ;item 1-5 in colums 55-59
I31-35=60 ;item 31-35 in columns 60-64
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=60 ;item 1-5 in colums 60-64
I31-35=65 ;item 31-35 in columns 65-69
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=65 ;item 1-5 in colums 65-69
I31-35=70 ;item 31-35 in columns 70-74
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=70 ;item 1-5 in colums 70-74
I31-35=75 ;item 31-35 in columns 75-79
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=75 ;item 1-5 in colums 75-79
I31-35=80 ;item 31-35 in columns 80-84
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=80 ;item 1-5 in colums 80-84
I31-35=85 ;item 31-35 in columns 85-89
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=85 ;item 1-5 in colums 85-89
I31-35=90 ;item 31-35 in columns 90-94
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=90 ;item 1-5 in colums 90-94
I31-35=95 ;item 31-35 in columns 95-99
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=95 ;item 1-5 in colums 95-99
I31-35=100 ;item 31-35 in columns 100-104
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=100 ;item 1-5 in colums 100-104
I31-35=105 ;item 31-35 in columns 105-109
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=105 ;item 1-5 in colums 105-109
I31-35=110 ;item 31-35 in columns 110-114
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=110 ;item 1-5 in colums 110-114
I31-35=115 ;item 31-35 in columns 115-119
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=115 ;item 1-5 in colums 115-119
I31-35=120 ;item 31-35 in columns 120-124
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=120 ;item 1-5 in colums 120-124
I31-35=125 ;item 31-35 in columns 125-129
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=125 ;item 1-5 in colums 125-129
I31-35=130 ;item 31-35 in columns 130-134
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=130 ;item 1-5 in colums 130-134
I31-35=135 ;item 31-35 in columns 135-139
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=135 ;item 1-5 in colums 135-139
I31-35=140 ;item 31-35 in columns 140-144
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=140 ;item 1-5 in colums 140-144
I31-35=145 ;item 31-35 in columns 145-149
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=145 ;item 1-5 in colums 145-149
I31-35=150 ;item 31-35 in columns 150-154
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=150 ;item 1-5 in colums 150-154
I31-35=155 ;item 31-35 in columns 155-159
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=155 ;item 1-5 in colums 155-159
I31-35=160 ;item 31-35 in columns 160-164
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=160 ;item 1-5 in colums 160-164
I31-35=165 ;item 31-35 in columns 165-169
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=165 ;item 1-5 in colums 165-169
I31-35=170 ;item 31-35 in columns 170-174
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=170 ;item 1-5 in colums 170-174
I31-35=175 ;item 31-35 in columns 175-179
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=175 ;item 1-5 in colums 175-179
I31-35=180 ;item 31-35 in columns 180-184
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
I1-5=180 ;item 1-5 in colums 180-184
I31-35=5 ;item 31-35 in columns 5-9

Mike.Linacre: iyliajamil, please add your set number to the person label, then use it in your DIF analysis

TITLE="Combination 36 set test item"
DIF=6W2 ; the set number
L=1 ;one line per person
P1-4=1 ;person label in columns 1-4
C5-6 = "01" ; set 01
I1-5=5 ;item 1-5 in columns 5-9
I31-35=10 ;item 31-35 in columns 10-14
L=1 ;one line per person
P1-4=1 ;person label in columns 1-4
C5-6 = "02" ; set 02
I1-5=10 ;item 1-5 in columns 10-14
I31-35=15 ;item 31-35 in columns 15-19

iyliajamil: OK, I have added those commands to the control file for DIF.

What about the control file for each individual set? Do I need to do anything, or leave it as it is?

Mike.Linacre: iyliajamil, there is no change to the control files for each of the 36 datasets unless you want to do something different in their analyses.

The change in MFORMS= is so that you can do a DIF analysis of (item x dataset).

iyliajamil: I have done it.

But the result does not make sense, because the item labels are not the same as mine.
The result came out something like this:
TABLE 10.1 Combination 36 set test item ZOU421ws.txt Sep 5 15:06 2013
PERSON: REAL SEP.: .00 REL.: .00 ... ITEM: REAL SEP.: .00 REL.: .00


| 2 2 4 .55 1.07|1.47 1.2|1.56 1.3|A .10| I0002|
| 35 2 4 .62 1.06|1.01 .1| .97 .0|B .51| I0035|
| 31 4 5 -1.08 1.15| .99 .2| .82 .1|C .18| I0031|
| 32 2 4 .62 1.06| .99 .1| .95 .0|c .53| I0032|
| 1 3 4 -.69 1.20| .94 .1| .75 .0|b .32| I0001|
| 5 3 5 -.02 .96| .73 -.9| .67 -.7|a .68| I0005|
| MEAN 1.6 4.2 1.32 1.41|1.02 .1| .95 .1| | |
| S.D. 1.4 .4 1.70 .41| .22 .6| .29 .6| | |

TABLE 10.3 Combination 36 set test item ZOU421ws.txt Sep 5 15:06 2013


| 2 A 0 0 | 2 33 | .72 1.02 1.9 |I0002 | 0
| 1 1 | 4 67 | .89 .33 1.2 | | 1
| MISSING *** | 1648 100*| .81 .25 | |
| | | | |
| 35 B 0 0 | 2 33 | .41 .36 .9 |I0035 | 0
| 1 1 | 4 67 | 1.15 .39 1.1 | | 1
| MISSING *** | 1648 100*| .76 .27 | |
| | | | |
| 31 C 0 0 | 1 17 | .05 .8 |I0031 | 0
| 1 1 | 5 83 | .40 .39 1.0 | | 1
| MISSING *** | 1648 100*| 1.18 .16 | |
| | | | |
| 32 c 0 0 | 2 33 | .39 .38 .8 |I0032 | 0
| 1 1 | 4 67 | 1.16 .38 1.1 | | 1
| MISSING *** | 1648 100*| .76 .27 | |
| | | | |
| 1 b 0 0 | 1 20 | .01 .7 |I0001 | 0
| 1 1 | 4 80 | .59 .44 1.0 | | 1
| MISSING *** | 1649 100*| 1.01 .22 | |
| | | | |
| 5 a 0 0 | 2 33 | -.14 .16 .6 |I0005 | 0
| 1 1 | 4 67 | .85 .35 .8 | | 1
| MISSING *** | 1648 100*| 1.05 .24 | |

TABLE 10.4 Combination 36 set test item ZOU421ws.txt Sep 5 15:06 2013


TABLE 10.5 Combination 36 set test item ZOU421ws.txt Sep 5 15:06 2013


TABLE 10.6 Combination 36 set test item ZOU421ws.txt Sep 5 15:06 2013


| 0 | 0 | .77 | -.77 | -1.82 | 1.20 | 2 | 100 | I0002 | TAB03 |
| 0 | 0 | .75 | -.75 | -1.75 | 1.12 | 31 | 8 | I0031 | TAB01 |
| 0 | 0 | .67 | -.67 | -1.42 | .70 | 1 | 54 | I0001 | TAB02 |
| 1 | 1 | .35 | .65 | 1.35 | -.61 | 35 | 54 | I0035 | TAB02 |
| 1 | 1 | .36 | .64 | 1.33 | -.57 | 32 | 8 | I0032 | TAB01 |
| 1 | 1 | .37 | .63 | 1.31 | -.53 | 2 | 54 | I0002 | TAB02 |
| 0 | 0 | .54 | -.54 | -1.08 | .15 | 35 | 146 | I0035 | TAB04 |
| 0 | 0 | .54 | -.54 | -1.08 | .15 | 32 | 146 | I0032 | TAB04 |
| 0 | 0 | .51 | -.51 | -1.01 | .03 | 5 | 54 | I0005 | TAB02 |
| 1 | 1 | .56 | .44 | .89 | .23 | 2 | 146 | I0002 | TAB04 |
| 0 | 0 | .43 | -.43 | -.87 | -.28 | 5 | 192 | I0005 | TAB05 |
| 1 | 1 | .60 | .40 | .82 | .39 | 1 | 192 | I0001 | TAB05 |
| 1 | 1 | .69 | .31 | .68 | .78 | 31 | 192 | I0031 | TAB05 |

Is it because I labeled the items the same way in every set?

Do I have to label the items differently in every set, so that Winsteps produces the correct output?

Mike.Linacre: iyliajamil, the item labels are placed after &END in the Winsteps file that contains your MFORMS= instructions.

For instance:

L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
C5-6 = "36" ; set 36
I1-5=180 ;item 1-5 in colums 180-184
I31-35=5 ;item 31-35 in columns 5-9

iyliajamil: I have put the item names after the &END command and
the output looks like this:

TABLE 30.1 Combination 36 set test item ZOU785ws.txt Sep 9 13:44 2013

DIF class specification is: DIF=6W2

| 2 .85< 2.18 3 .87> 2.20 -.03 3.09 -.01 0 .0000 1 i1 |
| 2 .85< 2.18 4 -.08> 2.19 .93 3.08 .30 0 .0000 1 i1 |
| 2 .85< 2.18 5 -1.14> 2.18 1.99 3.08 .65 0 .0000 1 i1 |
| 3 .87> 2.20 4 -.08> 2.19 .96 3.10 .31 0 .0000 1 i1 |
| 3 .87> 2.20 5 -1.14> 2.18 2.02 3.09 .65 0 .0000 1 i1 |
| 4 -.08> 2.19 5 -1.14> 2.18 1.06 3.09 .34 0 .0000 1 i1 |
| 2 -.83> 2.18 3 2.58< 2.17 -3.40 3.08 -1.11 0 .0000 2 i2 |
| 2 -.83> 2.18 4 -.07> 2.18 -.75 3.08 -.24 0 .0000 2 i2 |
| 2 -.83> 2.18 5 .55< 2.18 -1.37 3.08 -.45 0 .0000 2 i2 |
| 3 2.58< 2.17 4 -.07> 2.18 2.65 3.08 .86 0 .0000 2 i2 |
| 3 2.58< 2.17 5 .55< 2.18 2.03 3.08 .66 0 .0000 2 i2 |
| 4 -.07> 2.18 5 .55< 2.18 -.62 3.08 -.20 0 .0000 2 i2 |
| 1 -.79> 2.18 2 .85< 2.18 -1.65 3.08 -.53 0 .0000 5 i5 |
| 1 -.79> 2.18 3 .89> 2.19 -1.68 3.09 -.54 0 .0000 5 i5 |
| 1 -.79> 2.18 4 -.08> 2.18 -.72 3.08 -.23 0 .0000 5 i5 |
| 1 -.79> 2.18 5 .54< 2.18 -1.34 3.08 -.43 0 .0000 5 i5 |
| 2 .85< 2.18 3 .89> 2.19 -.03 3.09 -.01 0 .0000 5 i5 |
| 2 .85< 2.18 4 -.08> 2.18 .93 3.08 .30 0 .0000 5 i5 |
| 2 .85< 2.18 5 .54< 2.18 .31 3.08 .10 0 .0000 5 i5 |
| 3 .89> 2.19 4 -.08> 2.18 .96 3.09 .31 0 .0000 5 i5 |
| 3 .89> 2.19 5 .54< 2.18 .34 3.09 .11 0 .0000 5 i5 |
| 4 -.08> 2.18 5 .54< 2.18 -.62 3.08 -.20 0 .0000 5 i5 |
| 1 .88< 2.17 2 -.84> 2.18 1.72 3.08 .56 0 .0000 31 i31 |
| 1 .88< 2.17 3 .86> 2.20 .02 3.09 .01 0 .0000 31 i31 |
| 1 .88< 2.17 4 -.09> 2.19 .97 3.09 .31 0 .0000 31 i31 |
| 1 .88< 2.17 5 -1.15> 2.18 2.03 3.08 .66 0 .0000 31 i31 |
| 2 -.84> 2.18 3 .86> 2.20 -1.70 3.10 -.55 0 .0000 31 i31 |
| 2 -.84> 2.18 4 -.09> 2.19 -.75 3.09 -.24 0 .0000 31 i31 |
| 2 -.84> 2.18 5 -1.15> 2.18 .31 3.09 .10 0 .0000 31 i31 |
| 3 .86> 2.20 4 -.09> 2.19 .95 3.10 .31 0 .0000 31 i31 |
| 3 .86> 2.20 5 -1.15> 2.18 2.01 3.10 .65 0 .0000 31 i31 |
| 4 -.09> 2.19 5 -1.15> 2.18 1.06 3.09 .34 0 .0000 31 i31 |
| 1 -.79> 2.18 2 .86< 2.18 -1.65 3.08 -.53 0 .0000 32 i32 |
| 1 -.79> 2.18 3 .90> 2.18 -1.69 3.08 -.55 0 .0000 32 i32 |
| 1 -.79> 2.18 4 1.61< 2.18 -2.40 3.08 -.78 0 .0000 32 i32 |
| 2 .86< 2.18 3 .90> 2.18 -.04 3.09 -.01 0 .0000 32 i32 |
| 2 .86< 2.18 4 1.61< 2.18 -.75 3.08 -.24 0 .0000 32 i32 |
| 3 .90> 2.18 4 1.61< 2.18 -.72 3.08 -.23 0 .0000 32 i32 |
| 1 .89< 2.18 2 -.83> 2.18 1.72 3.08 .56 0 .0000 35 i35 |
| 1 .89< 2.18 3 .90> 2.18 .00 3.09 .00 0 .0000 35 i35 |
| 1 .89< 2.18 4 1.61< 2.18 -.72 3.08 -.23 0 .0000 35 i35 |
| 2 -.83> 2.18 3 .90> 2.18 -1.72 3.08 -.56 0 .0000 35 i35 |
| 2 -.83> 2.18 4 1.61< 2.18 -2.44 3.08 -.79 0 .0000 35 i35 |
| 3 .90> 2.18 4 1.61< 2.18 -.72 3.08 -.23 0 .0000 35 i35 |

TABLE 30.2 Combination 36 set test item ZOU785ws.txt Sep 9 13:44 2013

DIF class specification is: DIF=6W2

| 2 1 .00 .67 -.69 -.67 .85< 2.18 1 i1 |
| 3 1 1.00 .92 -.69 .08 .87> 2.20 1 i1 |
| 4 1 1.00 .81 -.69 .19 -.08> 2.19 1 i1 |
| 5 1 1.00 .60 -.69 .40 -1.14> 2.18 1 i1 |
| 2 1 1.00 .37 .55 .63 -.83> 2.18 2 i2 |
| 3 1 .00 .77 .55 -.77 2.58< 2.17 2 i2 |
| 4 1 1.00 .56 .55 .44 -.07> 2.18 2 i2 |
| 5 1 .00 .30 .55 -.30 .55< 2.18 2 i2 |
| 1 1 1.00 .52 -.02 .48 -.79> 2.18 5 i5 |
| 2 1 .00 .51 -.02 -.51 .85< 2.18 5 i5 |
| 3 1 1.00 .85 -.02 .15 .89> 2.19 5 i5 |
| 4 1 1.00 .69 -.02 .31 -.08> 2.18 5 i5 |
| 5 1 .00 .43 -.02 -.43 .54< 2.18 5 i5 |
| 1 1 .00 .75 -1.08 -.75 .88< 2.17 31 i31 |
| 2 1 1.00 .75 -1.08 .25 -.84> 2.18 31 i31 |
| 3 1 1.00 .94 -1.08 .06 .86> 2.20 31 i31 |
| 4 1 1.00 .86 -1.08 .14 -.09> 2.19 31 i31 |
| 5 1 1.00 .69 -1.08 .31 -1.15> 2.18 31 i31 |
| 1 1 1.00 .36 .62 .64 -.79> 2.18 32 i32 |
| 2 1 .00 .35 .62 -.35 .86< 2.18 32 i32 |
| 3 1 1.00 .76 .62 .24 .90> 2.18 32 i32 |
| 4 1 .00 .54 .62 -.54 1.61< 2.18 32 i32 |
| 1 1 .00 .36 .62 -.36 .89< 2.18 35 i35 |
| 2 1 1.00 .35 .62 .65 -.83> 2.18 35 i35 |
| 3 1 1.00 .76 .62 .24 .90> 2.18 35 i35 |
| 4 1 .00 .54 .62 -.54 1.61< 2.18 35 i35 |

Is it correct now?

Now I can see the item numbers, that is, items 1, 2, 5, 31, 32 and 35.

What happened to items 3, 4, 33 and 34?

Mike.Linacre: Thank you for sharing, iyliajamil.

1. Please look at Table 14.1 - are all the items listed correctly?
2. Please look at Table 18.1 - are all the persons listed correctly?
3. Look at the person labels. In which columns are the set numbers? These should go from 01, 02, 03, 04, .... , 34, 35, 36
4. If the set numbers start in column 5 of the person labels, then
DIF = 5W2
5. Then look at Table 30.

iyliajamil: Thank you.
1. I have checked Table 14; the item labels seem to be correct.
2. But when I checked Table 18, I found that the person labels are wrong.

When I analyze the sets individually, both Table 14 and Table 18 show the correct figures.
For set 1: item labels C1-C5, U0106-U0130, C6-C10 (35 items in total; code C is for common items and code U is for unique items); person labels run from 0101 to 0132.

But when I run my MFORMS= control file, I cannot get the person labels in Table 18 to match those in Table 18 of the individual set analysis.

Here is my MFORMS= control file again; maybe you can detect the mistake that I made:

TITLE="Combination 36 set test item"
DIF=6W2 ;the set number
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
C5-6="01" ;set 01
I1-5=7 ;item 1-5 in colums 7-11
I6-30=12 ;item 6-30 unique set1 in columns 12-36
I31-35=37 ;item 31-35 in columns 37-41
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
C5-6="02" ;set 02
I1-5=37 ;item 1-5 in colums 37-41
I6-30=42 ;item 6-30 unique set2 in columns 42-66
I31-35=67 ;item 31-35 in columns 67-71
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
C5-6="35" ;set 35
I1-5=1027 ;item 1-5 in colums 1027-1031
I6-30=1032 ;item 6-30 unique set35 in columns 1032-1056
I31-35=1057 ;item 31-35 in columns 1057-1061
L=1 ;one line per person
P1-4=1 ;person lebel in columns 1-4
C5-6="36" ;set 36
I1-5=1057 ;item 1-5 in colums 1057-1061
I6-30=1062 ;item 6-30 unique set36 in columns 1062-1086
I31-35=1087 ;item 31-35 in columns 7-11
U3630 ;item identification here

I labeled the common items C1-C180.

Example control file for an individual set:





I really need your help. Maybe you can give me some material to learn from, or an example similar to my analysis, so that I can identify where I went wrong.

Thank you.

Mike.Linacre: iyliajamil, you say: "when i run my mforms= control file i can not get person labels in table 18 same as in table 18 for individual set analysis"

Reply: for the DIF analysis, we need the MFORMS table 18 person labels to be:
"Table 18 for individual set analysis" (4 columns) + set number (2 columns)

DIF = 5W2 ; the set number in the person label

iyliajamil: Thank you.

I get it now.

The item labels in Table 14 and the person labels in Table 18 are now correct.

But when I open the Edit MFORMS= file, the data are not arranged as expected.
They should come out like this, for example:

020102 ..................10011....................11101
350135...................................................................... ....11111.............00101

but they did not.

Is something wrong with the arrangement?
Only the data for set 1 come out correctly;
for the others, only the person labels appear.

Mike.Linacre: iyliajamil, the problem may be that I have misunderstood your data layout.

Please post a data record from set1.txt and a data record from set2.txt. They must look exactly like they look in the data files.

For example,



iyliajamil: Thank you.
I have figured out where the mistake in my MFORMS= control file is, and I have corrected it.
I will continue with the analysis.
I am now correcting the arrangement of the data in my MFORMS= control file using your tutorial on concurrent common-item equating.

I will get back to you later.

Thank you again for your help.

Mike.Linacre: Excellent, iyliajamil.

iyliajamil: Hi again,
Mr. Linacre.

I have successfully linked 36 sets of items containing 1080 MCQ items in total (180 common items; the others are unique items) using concurrent common-item equating with the MFORMS= function. This item bank is for CAT use, so the 'measure' value will be used as the difficulty level of each item. Usually the items that fit (infit and outfit MNSQ values of 0.7 to 1.3 for MCQ type) will be included in the item bank. Am I correct?

You suggested using the DIF function across sets to identify the common items that behave differently across sets. Those common items should then be split, also using the MFORMS= function. Am I correct?

I have a few questions for you:

1. I have read through the Winsteps manual on the DIF function and find it complicated to understand. My study is not mainly about DIF analysis; I only use DIF to identify common items that behave differently. Is it enough to apply the common rule of thumb (consider the item to have DIF if the p value is < 0.05 and the t value is > 2) to conclude that those common items behave differently between the sets compared?

2. Why do we need to separate the items that have DIF into different groups? Let's say DIF is identified in common item C4 when comparing set 1 and set 2. If we separate it, does it mean that C4 has different measure values for persons answering set 1 and set 2?

3. How do I do the separation procedure? I have read about it but find it difficult to understand, maybe because the example is too different from my data type. Can you explain it to me in a simpler way?

4. Let's say I manage to do the separation procedure. Can I still use the common item in my item bank? Does it have different measure values? If so, which one should I use: the measure value from the focal group or from the reference group?

5. I came across the statement that DIF researchers suggest a group size of at least 200 for a DIF study. My minimum group size is only 30. So is it OK to use DIF in my case?

That is all for now. Waiting for your explanations. Thank you :)

Mike.Linacre: iyliajamil,

Do not worry about the DIF analysis. Your group sizes are only 30, so that the DIF findings are very weak.
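The weakness of the DIF findings can be seen in the arithmetic of a pairwise DIF test. The sketch below is an illustration in Python, assuming (since the column headings were lost from the output posted above) that the Table 30.1 columns are: class, DIF measure, S.E., class, DIF measure, S.E., DIF contrast, joint S.E., t. The function name `dif_t` is hypothetical.

```python
import math


def dif_t(measure_1, se_1, measure_2, se_2):
    """Return (contrast, joint standard error, t) for two DIF measures."""
    contrast = measure_1 - measure_2
    joint_se = math.sqrt(se_1 ** 2 + se_2 ** 2)
    return contrast, joint_se, contrast / joint_se


# Item i2, class 2 versus class 3, using the rounded values posted above.
contrast, joint_se, t = dif_t(-0.83, 2.18, 2.58, 2.17)
print(round(contrast, 2), round(joint_se, 2), round(t, 2))
```

With standard errors this large, even a contrast of more than 3 logits gives |t| well below 2, so the rule of thumb would not flag it.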

Your fit criteria are strict (0.7 to 1.3 for MCQ type) and are much more demanding than the DIF criteria. The fit criteria are enough.
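Applying the fit criterion is a simple filter. The following Python sketch is illustrative only: it assumes the two mean-square columns in the Table 10.1 excerpt posted earlier are infit and outfit MNSQ, and it reuses those values, which came from a preliminary run with very few responses per item. The constant and variable names are hypothetical.

```python
FIT_LOW, FIT_HIGH = 0.7, 1.3

# (item label, infit MNSQ, outfit MNSQ), as read from the earlier excerpt.
items = [
    ("I0002", 1.47, 1.56),
    ("I0035", 1.01, 0.97),
    ("I0031", 0.99, 0.82),
    ("I0032", 0.99, 0.95),
    ("I0001", 0.94, 0.75),
    ("I0005", 0.73, 0.67),
]

# Keep an item only when both mean-squares fall inside the accepted range.
kept = [label for label, infit, outfit in items
        if FIT_LOW <= infit <= FIT_HIGH and FIT_LOW <= outfit <= FIT_HIGH]
flagged = [label for label, _, _ in items if label not in kept]
print("kept:", kept)
print("flagged:", flagged)
```

Flagged items are candidates for review rather than automatic deletion, since the sets are short and it is desirable to keep as many items as possible.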

iyliajamil: OK.

So I do not need to do a DIF analysis for my data.

I just use the result from MFORMS= containing all the combined data from sets 1 to 36, and look only at the fit criteria to decide whether to keep an item or remove it from the item bank.

Am I correct?

Thank you for your explanation. It helps me a lot. :)

Mike.Linacre: iyliajamil,

DIF analysis is part of quality control. We usually like to reassure ourselves that the common items are maintaining their relative difficulty across sets. But we don't have to do this. It is like looking at the 4 tires of your car to verify that they are all about equally inflated.

iyliajamil: thank you for all the explanation.

iyliajamil: hi mr. linacre.

Let's say we want to administer 3 tests with 10 items in each test (dichotomous type). How many common items are needed if we want to link those tests using the common-item equating method?

Mike.Linacre: iyliajamil, the minimum number is 3 items, but 5 items would be safer.

223. distribution free

uve January 31st, 2012, 2:18am: Mike,
If you measure my height with a tape measure, the result is not dependent on other people whom you are also measuring at the same time. My height will be the same (error taken into consideration) regardless of anyone else being measured with me. Yet this is not true of person ability or item difficulty. Remove a person or item from the analysis and the results could be quite different. So for example, an item's true difficulty doesn't seem to be intrinsic because it could vary significantly depending on the other items incorporated into the analysis. This is quite contrary to height, and the reason I mention height is that we use it so frequently in our communications to illustrate the Rasch model. How then can we say that the measures we develop are truly distribution free when in fact they seem to be highly dependent upon each other?

Using the Rasch model, we can predict the probability of a correct score given the item difficulty and ability level of the person. I understand how among other things, this can help us with fit. However, if items change based on other items in the assessment, the fit results will also change. Though there is great logic in the math used to develop the Rasch models, there seems to be a great deal of fate involved with the ultimate outcome, something not as prevalent in standard test statistics.

I'm probably approaching all this from an invalid comparison, but I would greatly appreciate your comments.

Mike.Linacre: Uve, you have identified several problems here:
1. The problem of defining the local origin (zero point).
2. The problem of the sample of objects used to construct the measurement system.
3. The situation in which the measurement takes place
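Problem 1 can be illustrated with the dichotomous Rasch model, in which the probability of success depends only on the difference between person ability and item difficulty. The Python sketch below is a minimal illustration with made-up values: adding the same constant to every ability and every difficulty leaves all probabilities unchanged, so the data alone cannot fix the zero point.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical person and item measures.
p_original = rasch_probability(1.0, 0.25)
# Shift the local origin by 3 logits for both the person and the item.
p_shifted = rasch_probability(1.0 + 3.0, 0.25 + 3.0)
print(p_original, p_shifted)
```

This is why an analysis must choose a convention for the origin, for example centering the item difficulties at zero, and why that convention moves when the set of items changes.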

It is instructive to review the history of the measurement of temperature.

The very earliest thermometers looked like modern-day open-top test tubes and strongly exhibited problem 3 - until their manufacturers realized that air pressure was influencing their equipment. Then they sealed the top of the glass tube to eliminate air pressure as a variable and problem 3. was largely solved.

Galileo's thermometer - http://en.wikipedia.org/wiki/Galileo_thermometer - has problems 1. and 2. in its original version. See also my comment about the Galileo thermometer at https://www.rasch.org/rmt/rmt144p.htm

The problem of the local origin (zero point) was initially solved for thermometers by choosing definitive objects, e.g., the freezing point of water at a standard air pressure. The problem of the sample of objects was solved by choosing another definitive object, e.g., the boiling point of water at a standard air pressure.

We are heading in this direction with Rasch measurement. The Lexile system is a leading example - https://www.rasch.org/rmt/rmt1236.htm - but we have a long way to go.

In situations where this is true, "remove a person or item from the analysis and the results could be quite different", then the structure of the data is extremely fragile and findings are highly likely to be influenced by accidents in the data. In general, if we know the measures estimated from a dataset, we can predict the overall effect of removing one person or one item from that dataset, provided that person or item has reasonable fit to the Rasch model.

In the more general situation of change of fit, e.g., using the same test in high-stakes (high discrimination) and low-stakes (low discrimination) situations, then the "length of the logit" changes - https://www.rasch.org/rmt/rmt32b.htm

But we are all in agreement that our aim is to produce truly general measures: https://www.rasch.org/rmt/rmt83e.htm - meanwhile, we do the best that mathematics and statistics permit :-)

uve: Thanks Mike. These are great resources and you've given me much to think about.

Emil_Lundell: Hello, Dr. Linacre

You did a more comprehensible review about the measurement history of temperature that explains the points Duncan left implicit in his book (1984). Have you published your example, using better references, anywhere?

Best regards.

Mike.Linacre: Emil, my comments about thermometers are in Rasch Measurement Transactions, but there are almost no references.

In discussing the history of temperature in a Rasch context, Bruce Choppin blazed the path: Bruce Choppin, "Lessons for Psychometrics from Thermometry", Evaluation in Education, 1985, 9(1), 9-12. However, that paper does not have any references.

Many authors of Rasch papers mention thermometers in order to make abstract measurement concepts more concrete, but they do not have references related to thermometers.

Emil_Lundell: Thanks,

I will quote this the next time I write a paper about rasch.


P.s. The important thing for the reader is that Wikipedia isn't mentioned and that the reference doesn't go to an internet forum.

225. sample size for FACETS of a rating scale

GiantCorn October 19th, 2012, 7:54am: Hi everyone, I'm new to this forum and hope to learn a lot.

We are currently developing a rating scale to assess EFL students in a short speaking test. There will be 3 constructs on the scale (lexico-grammar, fluency, interaction skills) and the scale runs from 1 - 5 and half points can be given as well.

In order to evaluate this scale (in terms of how well it conforms to expectations about its use. i.e. do raters use all of the scale? consistently? is each construct robust etc?) can anyone provide a good guide on how to conduct a scale analysis using FACETS?

Also we are in a fairly small scale but busy department. Would it be possible to run such an analysis on the rating scale using only 3 raters (out of 14 we have) all rating say 10 video performances? each video has a pair of candidates. would this be a large enough sample for an initial trial of the rating scale? could we get any useful information on the scales performance to make changes and improvements? or would we need far more raters and /or videos?

Many thanks for your help in advance!


Mike.Linacre: Welcome, GC.

Here's a start ....

" the scale runs from 1 - 5 and half points can be given as well" - so your rating scale has 9 categories. The proposed dataset is 3 raters x 20 candidates = 60 observations for each construct = 7 observations for each category (on average). This is enough for a pilot run to verify scoring protocols, data collection procedures and other operational details, and also enough to investigate the overall psychometric functioning of the instrument. Precise operational definitions of the 9 categories may be the most challenging aspect.

"do raters use all of the scale? consistently? is each construct robust etc?"
Only 3 raters is insecure. (Notice the problems in Guilford's dataset because he only had 3 raters.) So 5 raters is a minimum. Also the choice of candidates on the videotapes is crucial. They must cover the range of the 9 categories for each construct, and also exhibit other behaviors representative of the spectrum of candidate behavior likely to be encountered.

GiantCorn: Mike,

thanks very much for your advice. I shall proceed as advised!


GiantCorn: Hi Mike,

Ok i finally got round to building the data table and my spec file (my first one in a very long time) but something has gone wrong - for some reason in the output it says that some of the half-point scores: -

"are too big or not a positive integer, treated as missing"

What's going wrong here?

Also on my Ruler table I would like each of the rating scale constructs (Fluency, Lexicogrmr, Interaction) to display so i can compare them but when i play about with the vertical= option nothing happens. How should i get each rating scale to display in the table?

I attach my spec file, data file for you. I'm sure I'm probably missing something very easy/obvious here......

Thank you for any help you may be able to offer.


Mike.Linacre: GC, good to see you are progressing.

1. Your Excel file:
A data line looks like this:
1 1 1 1-3a 1 2 1.5

Facets only accepts integer ratings, so please multiply all your ratings by 2, and weight 0.5
1 1 1 1-3a 2 4 3

?,?,?,?,R10K,0.5 ; highest doubled rating is 10 (K means Keep unobserved intermediate categories). Weight the ratings by 0.5

2. Rating scale constructs. Do you want each element of facet 4 to have its own rating scale? Then ...
?,?,?,#,R10K,0.5 ; # means "each element of this facet has its own rating scale"
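
The doubling step can be sketched in Python. This is only an illustration, assuming the column layout of the sample data line above (four facet element labels followed by the ratings); the function name `double_ratings` is made up for this example and is not part of Facets.

```python
def double_ratings(line, n_facets=4):
    """Double the half-point ratings on one whitespace-separated data line.

    Assumes the first n_facets fields are element labels and the rest are ratings.
    """
    fields = line.split()
    labels, ratings = fields[:n_facets], fields[n_facets:]
    doubled = []
    for r in ratings:
        value = float(r) * 2
        if value != int(value):
            # Anything that is not a multiple of 0.5 cannot become an integer rating
            raise ValueError(f"rating {r!r} is not a multiple of 0.5")
        doubled.append(str(int(value)))
    return " ".join(labels + doubled)


print(double_ratings("1 1 1 1-3a 1 2 1.5"))  # prints: 1 1 1 1-3a 2 4 3
```

The doubled ratings are integers from 0 to 10, which is why the model statement uses R10 with a weight of 0.5.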

GiantCorn: Mike,

1) Aha! thanks for a poke in the right direction, I think I've got it, Because im using actually an 11 point scale (if using half points from 0-5) I need to reconvert the raw scores to reflect the scale right?

So rater 1 giving a 2.5 for fluency should be 6 on the data sheet. Is this correct?

2) Thus some of my spec file commands need adjustment also. For example the model statement you gave above would be ?,?,?,#,R11

3) I remember reading a paper about a similar rubric where the author argued that the use of PCM would be better than RSM as it was assumed that each rating scale used different criteria that measure different constructs along a common 9 point scale. Would you agree that it would be better for me to use PCM over RSM in this case? It is chiefly the scales I am interested in.

Mike.Linacre: GC:

1) 0-5 x 2 = 0-10 -> R10

2.5 x 2 = 5

2) ?,?,?,#,R10

3) PCM or RSM? This is discussed in another thread. Please look there ....

GiantCorn: Dear Mike and others,

Haven't had a chance to write until now. But i have finally managed to run the data on this speaking test rubric we are experimenting with and thought I'd update you while also checking my thinking on a few points. I have attached the output.

Would you agree with the following brief comments regarding this piloting: -

1) seems like data methods and my setup of the spec file was ok, after your help, thnx mike!

2) I probably need more candidates for the middle section of the scales

3) It appears that, from the category probability curves, there is over categorisation for all 3 constructs but this is probably due to the small sample and fact i couldnt get videos that specifically fall into each category exactly. This may be an issue but the severity is too difficult to say at this stage. need more video performances.

4) the scale steps for each construct are not in-line suggesting (quite naturally I would assume) that these constructs are learned/acquired at different rates. Might there be argument for each to be weighted differently in a students overall score?

5) Is my following thought pattern correct regarding the nature of the 3 constructs (fluency, lexicogrmr and interaction): -

outfit msq and ICCs for each construct suggest they adhere to the rasch concept and that the categories and scale steps seem logical and sequential. Except, arguably, at the lower end of the ability scales but i would argue this noise is expected at this end due to the nature of beginning/early language acquisition?)

Model, Fixed (all same) chi-square significance value of .06 (Fig. 1) suggests that there is a 6% probability that all the constructs measures are the same. So does this mean that each construct probably does add some unique element to the overall measure of "English Speaking skill"? Or have i misunderstood something?

Thanks for your time and patience Mike!


Mike.Linacre: GC,

Overall the specifications and analysis look good, but did you notice these?

Warning (6)! There may be 3 disjoint subsets

Table 7.3.1 Prompt Measurement Report (arranged by mN).
| Total  Total  Obsvd    Fair-M | Model         | Infit       Outfit     | Estim. | Correlation  |                |
| Score  Count  Average  Avrage | Measure  S.E. | MnSq  ZStd  MnSq  ZStd | Discrm | PtMea  PtExp | N Prompt       |
| 147    30     4.9      4.92   |     .41   .17 |  .87   -.4   .85   -.5 |   1.18 |   .94    .93 | 3 Food         | in subset: 3
| 89.5   15     6.0      5.60   |    -.18   .23 |  .92   -.1   .90   -.1 |   1.13 |   .92    .93 | 2 Last Summer  | in subset: 2
| 187    30     6.2      5.67   |    -.23   .18 |  .96    .0   .91   -.1 |    .93 |   .94    .94 | 1 Hobbies      | in subset: 1

These "subsets" messages are warning us that there are three separate measurement systems. "Food", "Last Summer" and "Hobbies".
We cannot compare the measures of "Food" and "Last Summer", nor can we compare the measures of "Shinha" and "Scalcude", because they are in different subsets.
So we must choose:
Either (1) the prompts are equally difficult. If so, anchor them at zero.
Or (2) the groups of students who responded to each prompt are equally able. If so, group-anchor them at zero.

Rating scales: the samples are small and there are many categories, so the rating-scale structures are accidental. The probability curves in Tables 8.1, 8.2, 8.3 look similar enough that we can model all three constructs to share the same 0-10 rating scale, instead of 3 different rating scales. This will produce more robust estimates. See www.rasch.org/rmt/rmt143k.htm

GiantCorn: Mike,
Sorry i have been away from the office and have only just seen this reply. Thank you for your help so far it has been most useful.
Yes i had seen those subset warnings and for the purposes of this analysis will follow your advice (1) assume the prompts are equal and anchor them. About the rating scales - yes i see. I will need many more data points for each category of each construct (minimum 10 right?) before being able to say anything about this aspect of the analysis.

I will continue and try to process a much larger data set this coming semester.......

Thank you once again for your help thus far!

Mike.Linacre: Good, GC.

You write: "before being able to say anything about this aspect of the analysis."

You will be able to say a lot about your rating scale, but what you say will be tentative, because your findings may be highly influenced by accidents in the data. For instance, the Guilford example in Facets Help has only a few observations for many of its categories, but we can definitely say that the rating-scale is not functioning as its authors intended.

242. Rater Bias without affecting measurement

Tokyo_Jim December 2nd, 2012, 6:29am: Hi Mike,

I'm working on an analysis where 3-4 teachers rate student performances, and the students then add their own self-ratings after watching their performance on video. We are interested in severity, accuracy, and consistency of the self-raters and specifically want to examine the distribution of self bias scores. I have designated all 'self' ratings as a single element. We can see that "self" is generally severe, has poor fit, and produces a larger number of unexpected observations.

What we'd like to do is see how 'biased' the self-raters are in comparison to the expert raters, without the self ratings affecting the measurement estimation. Not sure how to set this up. I can model the "Self" element as missing, but in that case, it can't be examined for bias interactions. If I anchor "self" to zero while leaving the "expert" elements free, severity ratings are calibrated to "self = zero." but it seems the self-rating is still influencing measurement, as the rank order of student measures is different from the order produced when "self" is modeled as missing.

Is there a way to set this up so that we can produce bias charts and values, without the non-expert rating affecting the student calibrations?

Mike.Linacre: Thank you for your question, Jim.

There are two ways to do this.

1) Set the "Self" to missing. Analyze the data. Produce an Anchor file=anc.txt
Edit the Anchor file to add the "Self" element. Analyze the anchor file. The "Self" will be analyzed without changing the anchored measures.


2) Weight the "Self" observations very small, such as R0.0001, or with a Models= specification with very small weight: 0.0001

Tokyo_Jim: Thank you for the very quick answer Mike. Both options are clear. The second one looks easiest.

Tokyo_Jim: Hello Mike,

A follow up to this previous question. We would like to use the Self-evaluation bias size as a variable in other analyses. I created a rater by student bias report and found what I was looking for in Table 13 (Excel, Worksheet), Students relative to overall measure. Only problem is the bias analysis is limited to 250 cases, and we have N = 390. I do see the same information in Table 13.5.1 in Notepad, but can't export it.

Is there an easy way to get or create this file for all cases in a way that can be conveniently converted to an SPSS variable?


Mike.Linacre: Jim,

1. Copy the rows you want from Table 13.5.1 in NotePad into an Excel worksheet
2. In Excel, "Data", "Text to Columns"
This should give you the numbers you need in columns, suitable for importing into SPSS
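
For many rows, the same copy-and-split step can be scripted. The sketch below is an assumption-laden illustration, not a Facets feature: it treats each copied row as numbers separated by spaces and "|" characters, and the sample rows are invented, not taken from a real Table 13.5.1. Labels that contain spaces would be split into several cells and would need separate handling.

```python
import csv
import io


def table_rows_to_csv(text):
    """Convert rows delimited by pipes and spaces into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        # Treat "|" as just another separator, then split on runs of whitespace
        cells = line.replace("|", " ").split()
        if cells:  # skip blank lines
            writer.writerow(cells)
    return out.getvalue()


# Hypothetical rows, for illustration only
sample = "| 12.5  10.1 |  .45  .21 | 17 |\n| 9.0  11.3 | -.62  .20 | 18 |"
print(table_rows_to_csv(sample))
```

The resulting CSV text can be saved to a file and imported into SPSS or Excel.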

Tokyo_Jim: Thanks Mike. It worked

265. Longitudinal Growth Measures

uve December 19th, 2012, 11:48pm: Mike,
In our district, secondary students typically take 4 quarterly exams in math for the year. These usually vary in length between 30 and 50 items. Recently I rectangular copied the responses to calibrate items using all four exams in a concurrent equating process. So I now have ability measures based on all 170 items. I'd say there is probably 80 to 90 percent respondent overlap, so the four different exams are well linked. The problem is that each covers mostly different material. There are no identical Q2 items on Q1, etc. What I'd like to do is see if I can determine if a student's measure has changed significantly across the exams. Now that the items are calibrated together, here would be my next steps:

1) Calibrate ability levels for Q1 using only the items found on Q1 anchored to the equated 170-item version.

2) Repeat for Q2-Q4.

3) To determine significant year change: (Q4 score - Q1 score)/SQRT(S.E.^2 Q4 + S.E.^2 Q1). Not sure about degrees of freedom, maybe total items answered between the two tests minus one?

Would this work?
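
The statistic in step 3) can be written out as a short Python sketch. The measures and standard errors below are invented for illustration, and the function name `change_t` is hypothetical. The calculation assumes the two measures are already on the same scale, and it does not include any extra uncertainty from equating the tests.

```python
import math


def change_t(measure_late, se_late, measure_early, se_early):
    """Difference between two measures divided by the joint standard error."""
    joint_se = math.sqrt(se_late ** 2 + se_early ** 2)
    return (measure_late - measure_early) / joint_se


# Hypothetical student: Q4 measure 1.20 (S.E. 0.35), Q1 measure 0.40 (S.E. 0.30)
t = change_t(1.20, 0.35, 0.40, 0.30)
print(round(t, 2))  # prints: 1.74
```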

Mike.Linacre: Uve, is this the situation?
a) There are 4 separate tests with no common items: Q1, Q2, Q3, Q4 (30 to 50 items each)
b) Students have responded to one or more of the tests.
c) Student abilities have changed between tests.

If this is correct, then the test sessions do not have common items nor common persons, and so cannot be co-calibrated.

Since both the students and the items have changed between tests, then "virtual equating" (equating on item content) is probably the most successful equating method: https://www.rasch.org/rmt/rmt193a.htm

After equating the tests, then the 4 measures for each student can be obtained. Then your 3) will apply, with a little extra due to the uncertainty of the equating.

For significance t-statistic, use Welch-Satterthwaite https://www.winsteps.com/winman/index.htm?t-statistics.htm

uve: Mike,

a) yes
b) most students have responded to all four
c) The mean and std dev ability measures are different: .38/1.04, -.07/.84, -.12/.95, and -.23/.86 (mean and std dev respectively)

If one student takes exams 1-3 and another takes 2-4, then both have taken 2 and 3. Wouldn't this mean that 1 and 4 are now on the same scale? I thought this represented a common person equating procedure.

I've attached an example of the response file. The dashes refer to skipped items and the X's refer to items from an exam the respondent did not take. The dashes are scored incorrect and the X's are not counted against the respondent.

Mike.Linacre: Uve, are we expecting the students to change ability between Qs? If so, the students have become "new" students. Common-person equating requires that students maintain their same performance levels.

uve: :-/

I want to be able to compare change over time between the Q's. Since these exams were initially calibrated separately, I didn't feel that comparing scores was viable. I wanted to place the items onto one scale if possible even though there were no common items. I had hoped the common persons would facilitate this, but it seems not.

If we can only use common person equating if we expect no change between Q's then what would be the point of linking them in the first place?

Mike.Linacre: Uve, in the Q design, we usually want to track the change of ability of each student across time.

"Virtual equating" is a technique to equate the Q tests.

Alternatively, a linking test is constructed using a few items from each of the Q tests. This test is administered to a relevant sample of students who have not seen the Q tests.

uve: Mike,

Sorry for the late response. As you probably know well by now, it often takes me quite some time to digest all your responses, sometimes months! :)

I suppose I could use the method you mentioned, but I guess I'm trying to work with data I already have. Remember, we give over 100 exams each year. Teachers feel we test too much as it is, so administering a linking test would be out of the question. I would need curriculum staff to review the standards to provide me the info I need to do virtual equating since I am not a content expert. Because of severe budget cuts, time and resources are constrained beyond reason. They would not be able to assist me.

So that brings me back to my original question, and perhaps the answer is still no. But I was hoping that since at least 70% of a given grade level takes all the benchmark exams (probably closer to 80-90%) in the series throughout the year, there is enough overlap that I could calibrate all items together. So if 500 kids all take Q1-Q4 out of 700 kids total, then wouldn't this be the common person group I would need to use common person equating to link the items? Couldn't I take each person's response to all Q's, string them together in Winsteps and calibrate all the items at once?

Mike.Linacre: Uve, if there are 4 tests with no common items, and the students change ability between test administrations, it appears that there is no direct method of comparing the tests. But perhaps someone else knows of one. The Rasch Listserv is a place to ask: https://mailinglist.acer.edu.au/mailman/listinfo/rasch

Antony: Hi, Mike. I am running a similar test. The same cohort of students completed 4 subscales from Questionnaire A. A year later, the same subscales were administered, with item order randomized. I suppose students' responses would change due to changes of some attributes (say self-concept). Is this sound? Would the randomization matter?

Mike.Linacre: Antony, reordering the items should not matter (provided that the Questionnaire continues to make sense).
Analyze last year's data and this year's data separately, then cross-plot the two sets of person measures and then cross-plot the two sets of item difficulties. The plots will tell the story ....

Antony: Thank you Mike.
I have read articles studying cohort effect by Rasch-scaling Ss' responses. Results were listed clearly, but articles skip the technical know-how. Would you please refer me to other posts or useful websites?

Mike.Linacre: Antony, there is usually something useful on www.rasch.org or please Google for the technical know-how.

280. Odd dimensionality pattern

uve October 16th, 2012, 5:20am: Mike,

We have implemented a new elementary English adoption and have begun our first set of assessments. I'm finding identical dimensionality distributions like the one attached. I'm probably not recalling accurately, but for some reason I think I remember a similar post with residual patterns loading like the one below. Any guesses as to what may be causing this?

Mike.Linacre: Yes, Uve. This pattern is distinctive. The plot makes it look stronger than it really is (eigenvalue 2.8). We see this when there is a gradual change in behavior as the items become more difficult, or similarly a gradual change in item content. Both can have the effect of producing a gradual change in item discrimination.

Suggestion: in Winsteps Table 23.3, are the loadings roughly correlated with the fit statistics?

uve: Mike,

Below are the correlation results.

           loading        measure        Infit          Outfit
loading    1
measure    -0.856533849   1
Infit      -0.85052961    0.781652331    1
Outfit     -0.866944739   0.868780613    0.9363955      1

Mike.Linacre: Yes, Uve, the high correlation between Infit and Outfit is expected. All the other correlations are high and definitely not expected. Measures (item difficulties) are negatively correlated with Outfit. So it seems that the most difficult items are also the most discriminating. The "dimension" on the plot appears to be "item discrimination".

uve: Mike,

My apologies for the bad correlation matrix, but I see that the Measures are positively correlated to outfit, not negatively as you stated. Maybe I'm reading it wrong. Anyway, if I'm right, would that change your hypothesis?

Mike.Linacre: Oops, sorry Uve. But the story remains the same ....

uve: Mike,

Here's another situation that's very similar except it appears there are two separate elements to the 2nd dimension of item discrimination. Any suggestions as to what might cause something like this?

Mike.Linacre: Uve, the person-ability correlation between clusters 1 and 3 is 0.0. This certainly looks like two dimensions (at least for the operational range of the test). Does the item content of the two clusters of items support this?

uve: Yes, it appears that all but one of the 8 items in cluster 1 are testing the same domain, while all but 4 of the 11 items in cluster 3 are the same domain.

Mike.Linacre: Great! It is always good to see a reasonable explanation for the numbers, Uve. :-)

uve: Mike,

I am revisiting this post because I am attempting to define the 1st component in the second example. I have three additional questions:

1) In reference to the first example, you suggested I correlate the loadings with the fit statistics. After doing this and providing the matrix, you focused on the measure/outfit correlation, not the loading/outfit correlation. Why?

2) How does the measure/outfit correlation suggest discrimination?

3) In reference to the second example, you stated that the correlation between the two items clusters suggested two dimensions. However, it was my understanding that each PCA component is measuring a single orthogonal construct. How then can the 1st component be two dimensions?

Thanks again as always

Mike.Linacre: OK, Uve ... this is going back a bit ....

1) "you focused on the measure/outfit correlation, not the loading/outfit correlation. Why?"
Answer: The measure/outfit correlation is a correlation between two well-understood aspects of the data. The loadings are correlations with hypothetical latent variables. So it is usually easier to start with the known (measures, outfits) before heading into the unknown (loadings).

2) High outfits (or infits) usually imply low discrimination. Low outfits (or infits) imply high discrimination. So we correlate this understanding with the measures.

3) A single orthogonal construct is often interpreted as two dimensions. For instance, "practical items" vs. "theory items" can be a single PCA construct, but is usually explained as a "practical dimension" and a "theory dimension". We discover that they are two different "dimensions" because they form a contrast on a PCA plot.

uve: Mike,

Two more questions then in relation to your answer to #3:

1) The first example has a single downward diagonal clustering of residuals the construct of which you suggested might be gradual discrimination. I assume then that this is the 2nd construct the two dimensions of which I would need to interpret. These two dimensions bleed into one another gradually which results in the single downward grouping we see. Would these last two comments be fair statements?

2) The second example has two sets of downward diagonal clusterings of residuals also suggesting possible discrimination though I am baffled by the overlap and separation of the two clusters. A single construct appears to be elusive but its two dimensions are much easier to interpret because they load predominantly on just two math domains.

So it appears I have two opposite situations here: the first 2nd construct appears rather easy to interpret but its two dimensions are elusive, while the second 2nd construct is rather elusive but its two dimensional elements are much easier to interpret.

Mike.Linacre: Yes, Uve. There is not a close alignment between the mathematical underpinnings (commonalities shared by the correlations of the residuals of the items) and the conceptual presentation (how we explain those commonalities). In fact, a problem in factor analysis is that a factor may have no conceptual meaning, but be merely a mathematical accident. See "too many factors" - https://www.rasch.org/rmt/rmt81p.htm

uve: Mike,

Very interesting and very illuminating. Thanks again for your valuable insights and help.

284. How do you think about these?

dachengruoque December 31st, 2012, 2:07pm: "It says;
The Winsteps "person reliability" is equivalent to the traditional
"test" reliability.

The Winsteps "item reliability" has no traditional equivalent....[item
reliability is] "true item variance / observed item variance"."

I quoted from language testing community discussion list. How do you think about them, Professor Linacre? Thanks a lot for your insightful and precinct explanation on Rasch of 2012. Happy new year to you and all Rasch guys!

Mike.Linacre: Those are correct, Dachengruoque.

We go back to Charles Spearman (1910): Reliability = True variance / Observed variance

"Test reliability" should really be reported as "Reliability of the test for this sample", but it is often reported as though it is the "Reliability of the test" (for every sample).

dachengruoque: Therefore, item reliability in the classical test theory is sample based and could vary from one sample to another while as Rasch-unique reliability test reliability is sample-free or constant for every sample. Could I understand like that? Thanks a lot for your prompt feedback and citation of literature.

Mike.Linacre: Dachengruoque, reliabilities (Classical and Rasch) are not sample-free.

The person (test) reliability depends on the sample-distribution (but not the sample size) of the persons
The item reliability depends on the sample-size of the persons, and somewhat on the sample distribution of the persons.

dachengruoque: Thanks a lot, Dr Linacre!

285. Opposite of logic/intentions

drmattbarney December 29th, 2012, 12:08pm: Thank you for your analysis, Matt.

Looking at Table 7.5.1, Facet 5 is oriented positively, so that higher score -> higher measure
"global" has an observed average of 5.1 and a measure of -2.03
"small changes" has an observed average of 6.4 and a measure of 2.19

In Table 8.1, "5" is "med" and "6" is "agree"

What are we expecting to see here?

drmattbarney: thanks for your fast reply, as always, Mike.

To give more context, the scale is a persuasion reputation scale - consistent with Social Psychologist Robert Cialdini's assertion that the grandmasters of influence have a reputation for successfully persuading highly difficult situations.

If you look at the item content in Table 7.5.1, the qualitative meaning of the easiest item, logit -2.03 relates to the board-of-directors level - the highest possible level in an organization, so this should be exceedingly difficult to endorse, not very easy.

Similarly, the item content of the item with a logit of 2.19 involves very easy persuasion tasks of persuading small changes. Taken together, it looks to me like the raw data are recoded, as the items fit the Rasch model but they are qualitatively the opposite of what should be expected (Board persuasion > small changes).

I'm not too worried about Table 8.1....it's reasonably okay. but Table 7.5.1's qualitative item content looks absolutely backwards.

Hopefully that clarifies

Thanks again


Mike.Linacre: Yes, Matt, it looks like there are two possibilities:

1) data recoding somewhere.

2) misunderstanding of the items (paralleling my misunderstanding).
In his Questionnaire class, Ben Wright remarked that respondents tend not to read the whole prompt, but rather pick out a word or two and respond to that. A typical symptom of this is respondents failing to notice the word "not" in a prompt, and so responding to it as a positive statement.

drmattbarney: thanks, Mike and Happy New Year

286. understanding results

JoeM December 30th, 2012, 11:47pm: silly question: I always thought that a person's ability was based on which answers they got correct, not necessarily how many questions they got correct (a person who got 3 hard questions correct would have a higher ability than a person who answered 3 easy questions correct). but the simulations I have created are not showing this... all persons who got 15 out of 20 have the same ability, regardless of the difficulty of the items they answered correctly...

if this is the case, how would I weigh the answers so they affect the instrument participants based on which answers they got correct versus how many they correctly answered...

Mike.Linacre: JoeM, it sounds like you want "pattern scoring".

Abilities based on solely raw scores (whatever the response pattern) take the position that "the credit for a correct answer = the debit for an incorrect answer". Of course, this is not always suitable. For instance, on a practical driving test, the debit for an incorrect action is far greater than the credit for a correct action. In other situations, it may be surmised that the extra ability evidenced by success on a difficult item outweighs mistakes on items that are "too easy".

In Winsteps, we can trim "unexpected successes on very difficult items (for the person)" (= "lucky guesses") and/or "unexpected failures on very easy items (for the person)" (= "careless mistakes"), by using the CUTLO= and CUTHI= commands. www.winsteps.com/winman/cuthi.htm
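
JoeM's observation, that everyone with the same raw score on the same items gets the same ability, can be reproduced with a small Python sketch. It is only an illustration of maximum-likelihood estimation with item difficulties treated as known, not a description of how Winsteps estimates; the difficulties and response patterns are invented. In the estimation equation the responses enter only through their sum, so the pattern of successes cannot change the estimate.

```python
import math


def rasch_ability(responses, difficulties, tol=1e-6, max_iter=100):
    """Maximum-likelihood ability (logits) for 0/1 responses, item difficulties known."""
    score = sum(responses)  # the only way the responses enter the estimate
    if score == 0 or score == len(responses):
        raise ValueError("extreme scores have no finite estimate")
    theta = 0.0
    for _ in range(max_iter):
        probs = [1 / (1 + math.exp(d - theta)) for d in difficulties]
        expected = sum(probs)                   # expected raw score at theta
        info = sum(p * (1 - p) for p in probs)  # model variance of the raw score
        step = (score - expected) / info        # Newton-Raphson update
        theta += step
        if abs(step) < tol:
            break
    return theta


difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical item difficulties
easy_right = [1, 1, 1, 0, 0]                # succeeds on the three easiest items
hard_right = [0, 0, 1, 1, 1]                # succeeds on the three hardest items

a = rasch_ability(easy_right, difficulties)
b = rasch_ability(hard_right, difficulties)
print(a == b)  # prints: True
```

The two patterns differ in fit (the second is far more unexpected), but not in the estimated measure.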

287. interaction report

Li_Jiuliang December 21st, 2012, 4:09am: Hi professor Linacre,
I have some problems with interpreting my FACETS interaction output which is like this (please see attached).
There is some confusion as to the target contrast in the table. I don't have a problem with 1 MIC and 2 INT. However, for 4 SU, I think the contrast should be .41 - .22 = .19, so why does the table give me .20? Also, for 3 LU, I think it should be -.34 - (-.59) = .25, so why does the table give me .26?
Thank you very much!

Mike.Linacre: Thank you for your post, Li Jiuliang.

The reason is "half-rounding".

Please ask Facets to show you more decimal places:

Or "Output Tables" menu, "Modify specifications"
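
A small Python sketch shows how half-rounding produces this kind of discrepancy. The unrounded values 0.414 and 0.216 are invented for illustration; they are not taken from the attached output.

```python
# Hypothetical unrounded measures (assumed values, not from the attached report)
a, b = 0.414, 0.216

shown_a = f"{a:.2f}"             # what a two-decimal report displays for a
shown_b = f"{b:.2f}"             # what a two-decimal report displays for b
shown_contrast = f"{a - b:.2f}"  # contrast computed from the unrounded values
hand_contrast = f"{float(shown_a) - float(shown_b):.2f}"  # contrast from displayed values

print(shown_a, shown_b, shown_contrast, hand_contrast)  # prints: 0.41 0.22 0.20 0.19
```

The reported contrast is computed before rounding, so it can differ by .01 from the difference of the rounded measures.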

Li_Jiuliang: thank you professor Linacre! i got it!

288. Problem with interpretation

marlon December 21st, 2012, 10:18am: Good morning prof. Linacre,
Good morning Rasch-people,

I am doing my Rasch analyses for couple of years and would like to share with you the case I am not sure how to interpret. Maybe you could help me?

I conducted my Rasch analyses on the sample of more than 50.000 students. My analyses of several items resulted in very good fit statistics for items. Most of them have OUTFITs and INFITs in the limits of .8-1.2. The hierarchy of items seems valid and reasonable.

At the same time, I faced the problem connected with the value of person reliability indexes which are quite low (see below):


Where should I look for the solution for this problem?

I consider the hypothesis that in this particular test the categories of items are not very much discriminating between good and bad students. I've noticed that the categories of the items seem to be very close (too close?) to each other in the picture with empirical item-category means of the Rasch measures (see the picture attached).

Am I right?
Is there the possibility to create a test with better reliability using these items on this sample?

Thank you for the help in advance.


Mike.Linacre: Thank you for your questions, Marlon.

As Spearman (1910) tells us: Reliability = True Variance / Observed Variance.

And: Observed Variance = True Variance + Error Variance

This is shown in your first Table. Let's use the "model" numbers:

True S.D. = .60, so True variance = .60^2 = .36

Model RMSE = .58, so Error variance = .58^2 = 0.3364

So, Observed Variance = .36 + .3364 = 0.6964. Then Observed S.D. = 0.6964^0.5 = 0.8345 - please compare this value with the Person S.D. in Winsteps Table 3.1. They should be the same.

Now, let's use these numbers to compute the reliability of a good 18-item test in this situation:

True variance = 0.36
Error variance of a test with 18 items with p-values around .7 (estimated from your = 1 / (18 * .7 * (1-.7)) = 0.264550265

So, expected reliability = (true variance) / (true variance + error variance) = (0.36)/(0.36+0.2646) = 0.58

We see that your observed reliability is around 0.50, but a well-behaved test of 18 items with your sample would be expected to produce a reliability near 0.58.

This suggests that some of the 18 items are not functioning well, or that we would need a test of 18 * 0.3364 / 0.2646 = 23 items like the ones on the 18-item test, if we want to raise the reliability of this test from 0.5 to 0.58.
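The arithmetic in this reply can be re-run with a short sketch. The input numbers (True S.D. = .60, Model RMSE = .58, 18 items, p-values around .7) are the ones quoted in the post; the function and variable names are only illustrative:

```python
import math

def reliability(true_var, error_var):
    """Reliability = true variance / (true variance + error variance)."""
    return true_var / (true_var + error_var)

true_sd = 0.60      # "True S.D." quoted in the post
model_rmse = 0.58   # "Model RMSE" quoted in the post
n_items = 18
p_value = 0.7       # approximate average item p-value used in the post

true_var = true_sd ** 2
observed_error_var = model_rmse ** 2
observed_sd = math.sqrt(true_var + observed_error_var)

# Expected error variance of a well-behaved dichotomous test.
expected_error_var = 1 / (n_items * p_value * (1 - p_value))

observed_rel = reliability(true_var, observed_error_var)
expected_rel = reliability(true_var, expected_error_var)

# Number of items like the current ones needed to reach the expected error variance.
items_needed = n_items * observed_error_var / expected_error_var

print(round(observed_sd, 4))
print(round(observed_rel, 2), round(expected_rel, 2))
print(round(items_needed))
```

Changing `n_items` or `p_value` gives a rough numerical counterpart to reading values off the nomogram mentioned in the reply.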

To improve the 18 items, the first place to look in Winsteps is "Diagnosis menu", "A. Polarity". Look at the list of items. Are there any negative or near-zero point correlations? Are there any correlations that are much less than their expected values? These items are weakening the test. Correcting these items is the first priority.

For the expected reliability of tests with different numbers of items, and different average p-values, please see the nomogram on https://www.rasch.org/rmt/rmt71h.htm

OK, Marlon?

289. interaction report

Li_Jiuliang December 21st, 2012, 3:50am: Hi professor Linacre, i have some problems with interpreting my interaction report. please see the attachment. thank you!

290. Simulating 1-parameter probit IRT model

RaschModeler_2012 December 19th, 2012, 12:06am: Hi Mike,

Thank you again for all the help you've provided the past several months. I have another question...

I am simulating data (N=1000) in another software package as follows:

Fix the following beta_i parameters:

beta1 = -2.00 logit
beta2 = -1.00 logit
beta3 = 0.00 logit
beta4 = 1.00 logit
beta5 = 2.00 logit

Randomly generate 1000 data points from N (0,1) = theta_j

eta_ij = (1/1.7)*(theta_j - beta_i)

p_ij = 1 / [1 + exp(-eta_ij)]

item and person specific response = 0 IF p_ij < .50
item and person specific response = 1 IF p_ij > .50

Am I correct in assuming that the above simulation "code" will generate data that approximate a 1-parameter PROBIT IRT model?

I want to generate data from a 1-parameter PROBIT IRT model, not a 1-parameter LOGIT IRT model. To do so, I'm getting confused as to whether I should be multiplying the linear predictor ("eta") by 1 /1.7 or by 1.7.

Why I am doing this...

Since the software program I'm using can only fit a 1-parameter probit IRT model, it seems to me that I should simulate data which conform to the 1-parameter probit IRT model. After fitting the 1-parameter probit IRT model on the simulated data, I will convert (approximately) the probit estimates to logit estimates.

I hope this makes sense.

As always, thank you for your insight.


Mike.Linacre: RM, yes, what you describe should work :-)
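For readers who want to experiment, here is a minimal sketch of one common way to simulate from a 1-parameter probit (normal-ogive) model. It is not the procedure in the post: instead of thresholding p_ij at .50, it draws each response as a Bernoulli trial with probability Phi(theta_j - beta_i), and it assumes the beta values are used directly in the probit metric. The function name and seed are hypothetical.

```python
import random
from statistics import NormalDist

def simulate_probit_1pl(betas, n_persons=1000, seed=2012):
    """Simulate 0/1 responses from a normal-ogive (probit) 1-parameter model."""
    rng = random.Random(seed)
    phi = NormalDist().cdf               # standard normal CDF
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)      # person ability drawn from N(0,1)
        # Bernoulli draw: success with probability Phi(theta - beta)
        row = [1 if rng.random() < phi(theta - b) else 0 for b in betas]
        data.append(row)
    return data

betas = [-2.0, -1.0, 0.0, 1.0, 2.0]      # item difficulties from the post
responses = simulate_probit_1pl(betas)

# Proportion of successes per item; easier items should show higher values.
p_values = [sum(col) / len(responses) for col in zip(*responses)]
print([round(p, 2) for p in p_values])
```

The random draw is what produces Guttman-violating response strings; a deterministic cut at .50 would give every person a perfect Guttman pattern.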

RaschModeler_2012: Thanks for the confirmation!

RaschModeler_2012: Dear Mike,

So, I proceeded as indicated previously (simulating data that conform to a 1-parameter probit IRT model). I then fit two models:

1. 1 parameter probit IRT model (using the initial software)
2. 1 parameter logit IRT model (using a different software)

To my surprise, both yielded item difficulties which were nearly identical. Anyway, I proceeded to convert the item difficulties from the first model (in probits) to the logit scale by applying the following formula:

[pi / sqrt(3)] * probit

but this formula yielded estimates that were VERY different from the item difficulties from the logit model.

Clearly, I'm using the wrong probit-to-logit conversion formula. Do you know how to convert item difficulties on the probit scale to item difficulties on the logit scale? I realize the conversion will only approximate the estimated logit, but what I get by applying the formula above is WAY OFF.

Where did I go wrong?

Thank you,


RaschModeler_2012: Mike,

One additional point--I do recall you referring me to this website that shows that the conversion changes based on the probability:


With that in mind, suppose we have the following item difficulties in probits:

item 1 = -1.34
item 2 = -.520
item 3 = .396
item 4 = 1.451

Is there a more precise approach to converting these probits to logits, as opposed to the single equation:

logit =~ [pi/sqrt(3)] * probit

This conversion equation just does not work very well.

Thanks again,


Mike.Linacre: RM: the best equation for probits to logits is:
logit = 1.7*probit
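Applied to the four item difficulties listed in the question, the two conversion factors can be compared with a short sketch (variable names are illustrative only):

```python
import math

PROBIT_TO_LOGIT = 1.7                     # factor recommended in the reply
PI_OVER_SQRT3 = math.pi / math.sqrt(3)    # factor tried in the question (about 1.81)

# Item difficulties in probits, as listed in the question.
probits = {"item 1": -1.34, "item 2": -0.520, "item 3": 0.396, "item 4": 1.451}

logits = {name: PROBIT_TO_LOGIT * value for name, value in probits.items()}
logits_pi = {name: PI_OVER_SQRT3 * value for name, value in probits.items()}

for name in probits:
    print(f"{name}: {logits[name]:.3f} logits (x 1.7), "
          f"{logits_pi[name]:.3f} logits (x pi/sqrt(3))")
```

The two factors differ by only about 7%, so a conversion that is "WAY OFF" probably points to some other difference between the two analyses, for example how each program sets the scale of the latent variable.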

291. Binary CFA versus Rasch/1-PL IRT

RaschModeler_2012 December 14th, 2012, 5:23am: Hi Mike,

Suppose one employed a confirmatory factor analysis (CFA) on binary-response data, where the model is parameterized with a single factor (e.g., depression) with direct causal paths to the binary-response items (e.g. crying: yes/no, sleep changes: yes/no). An underlying normal distribution is assumed to govern the binary-response items.

If I constrain the factor loadings to 1.0 (item discriminations, I think), the single factor to have a variance of 1.0, and the error terms associated with the manifest variables to have a mean of 0 and variance of 1, then the estimated intercepts of the manifest variables should reflect item difficulties, should they not? Does that sound correct to you?

Basically, I'm trying to see if there's a way to get as close to a Rasch or 1-PL model within a CFA modeling framework as humanly possible.



Mike.Linacre: RM, an interesting idea.

"constrain the factor loadings to 1.0" - in other words, all the item responses are constrained to be perfectly correlated with the latent variable except for random noise. This seems to contradict the philosophy underlying Factor Analysis, but it does match Classical Test Theory.

Rasch and IRT perform a non-linear transformation of the data, but the Rasch measures are often highly correlated with the raw scores for complete, reasonably well-behaved data: https://www.rasch.org/rmt/rmt121b.htm

RaschModeler_2012: Thank you for your reply, Mike.

I failed to mention that the program I'm using does convert the original scale to "probits" such that the intercepts (which I am assuming will approximate item difficulties) will be in probit units. I've been trying to see the connection between probit and logit. For example, if an item difficulty has an estimated probit of 1.5, what does that mean, and can it be approximated to a logit?



Mike.Linacre: RM, probits and logits: www.rasch.org/rmt/rmt112m.htm "The Normal Cumulative Distribution Function and the Logistic Ogive: Probit and Logit."

RaschModeler_2012: Hi Mike,

That website you provided is very helpful.

The standard dichotomous Rasch model is parameterized as follows:

logit(p_ij) = theta_j - beta_i


theta_j = person ability
beta_i = item difficulty

The assumption of this model is that the item discrimination parameter (alpha_i) is fixed at 1. That is, the equation could be re-written as follows:

logit(p_ij) = alpha_i(theta_j - beta_i)


alpha_i = item discrimination parameter
theta_j = person ability
beta_i = item difficulty

However, because alpha_i is constrained to equal 1.0 for all items, it is not included in the standard dichotomous Rasch equation.

The structural equation modeling (SEM) software I'm using does NOT allow me to parameterize the model as a logistic function as depicted above. Instead, I must fit a probit function.

Further, in the context of SEM, as I understand it, the following is true:

factor loadings = item discriminations
item intercepts = item difficulties

If I were able to fit a logistic function, I would simply constrain the factor loadings to 1.0, but I must use the probit function.

Question: In order to parameterize the one-factor CFA on binary indicators to be as close to a Rasch model as possible, given that the factor loadings reflect item discriminations and that the estimates are in probit units, what value should I constrain the factor loadings to be? For a reasonable approximation, should I just constrain them all to be equal to 1.7, or is there a more precise way? While the website you provided is very insightful, I'm still a bit confused about how to proceed.

Am I making a mountain out of a molehill, and should I just constrain the factor loadings to 1.7? Or is it worth exploring a more sensitive approach to determining the exact factor loading to which each item should be constrained?



Mike.Linacre: RM, since we do not know the true latent distributions, constrain the factor loadings to be the value that gives the best fit.

RaschModeler_2012: Thanks, Mike. I believe if we constrain the factor loadings (~ item discriminations) to be equal to each other, this is akin to a 1-parameter IRT model (in probit units).

Okay. Going to investigate.

Thanks again!


292. help with understanding maps

jjswigr December 11th, 2012, 5:40pm: Why don't the results from Table 17 coincide with the Person Map? For example, my persons (Subjects 111, 178 and 37) with the highest measures (3.4, 3.11 and 2.39 logits respectively) are found in the middle of the Person Map, at roughly 0, -1.5 and 0.5 logits. On the Person Map, subjects 113 and 38 are located at roughly 3.2 and 3.4 logits, but in Table 17, their measures are -3.21. What am I missing?

Mike.Linacre: Thank you for your question, jjswigr. Please give us some more information.

How about zipping the person map and Table 17 into a file and attaching it to a post to this Forum?

294. Comparing Rasch and Graded Response Model

jjohns December 9th, 2012, 5:32pm: For an exploratory project I am using both the Rasch model (Winsteps) and 2-PL graded response model (IRTpro) to look at the same data set. I have a Likert response scale.

I would like to compare difficulty in some way, but the GRM provides category thresholds, rather than a single difficulty parameter. I think these might be comparable to the Rasch-Andrich thresholds. Is that correct?

Also, the Rasch-Andrich thresholds would be assumed equal across items, right? The GRM thresholds vary by item, so I would need to compare the thresholds for each item with the 2-PL model to the Rasch-Andrich thresholds that are constant across items?

Mike.Linacre: Thank you for your questions, jjohns.

1) If you want different Rasch thresholds for each item, then please use the "Partial Credit" model. In Winsteps,

2) GRM thresholds are cumulative probability thresholds, so they are equivalent to Rasch-Thurstone thresholds. These are shown most conveniently in the Winsteps ISFILE= output.

3) What differences do we expect to see?
a) Winsteps usually reports in Logits. IRTPro may be reporting in probits. To instruct Winsteps to report in probits:
b) If the sample distribution is approximately normal (a usual assumption of GRM) and the items are reasonably homogeneous, then Winsteps and IRTPro should report statistically the same numbers. Cross-plot the threshold values. We expect them to fall on a statistically straight line.
c) If the sample distribution is not normal, then the GRM estimates will be skewed to force the sample distribution to approximate normality. The cross-plot will show a curvilinear relationship.

295. Fitting a Rasch model on Guttman-like data

RaschModeler_2012 December 8th, 2012, 10:15pm: Hi Mike,

I posted this question a few days ago and took it down because I thought I had resolved it on my own, but I'm now questioning my decision. Very briefly, a CTT-validated cognitive performance scale was administered in such a way that after an individual answered 4 questions incorrectly, the test was stopped. The items were ordered successively from (theoretically) easier to more difficult items. I want to re-evaluate the data by employing a Rasch model, but I'm just not sure if this is possible given the way in which the test was administered. I'm really hoping there is some way to salvage the data to perform a Rasch analysis. Here's an illustration (there are more items and more people):

person i_1 i_2 i_3 i_4 i_5 i_6 i_7 i_8 i_9 i_10 etc.
1 1 1 0 1 1 1 0 0 0 0 . .
2 1 1 1 1 0 0 0 0 . . . .
3 0 1 0 1 1 1 1 0 0 0 0 .

Can I try to assess the psychometric properties of the data by employing a Rasch model, despite the fact that it was administered this way?

Any thoughts would be most appreciated!


Mike.Linacre: RM, first Rasch-analyze the data to verify that the items really do have a Guttman-like pattern.
Symptoms are
1) all items dropped as extreme
2) many items with extreme scores
3) very wide spread of item difficulties

Solution: with Guttman-style data, add a couple of dummy records to the dataset:

These will make every item estimable, and the item difficulties slightly more central.

Omit these records (Pdelete= from the Specification menu box) when doing the reports for the persons.

These data records will make all the item difficulty estimates slightly more central

RaschModeler_2012: Thank you!


296. I need an equation  

eagledanny December 7th, 2012, 2:22pm: SOS! I need help to get an equation which can enable me to compare separation index given by 9 male raters and 18 female raters for 30 essay ratings, is that possible? Prof Linacre, could you please lend me a hand! Thanks a million!

Mike.Linacre: Eagledanny, is this the situation:
9 male raters have rated 30 essays. You have the separation index for the 30 essays.
18 female raters have rated 30 essays. You have the separation index for the 30 essays.

If "separation index" = Reliability, then use the Spearman-Brown Prophecy formula to see which raters have higher discrimination.

Otherwise, Reliability = ("separation index")^2 / (("separation index")^2 + 1), then use the Spearman-Brown Prophecy formula to see which raters have higher discrimination.

eagledanny: Prof. Linacre,
It should be the former case. However, I have no idea how to apply the Spearman-Brown Prophecy formula in this situation. The 30 essays are marked by both 9 males and 18 females, and the Facets results show that the separation indexes for men and women are 7.52 and 11.85 respectively. According to your suggestion, I will first calculate the female and male means of the raw scores for the 30 essays, and then calculate the correlation between male and female raters? Is that correct? Thanks a million!

Mike.Linacre: Eagledanny, please give more details of your analysis.
1. Are you doing two separate analyses (male and female)?
2. Is the separation index for the essays or for the raters?
3. There is no need to compute mean raw scores nor correlations.
4. The Spearman-Brown Prophecy Formula is the usual method for comparing reliabilities for tests of different lengths. In this case, for rater samples of different sizes.

Let's assume that we have separate analyses of male and female raters. And the separation index for the essays in each analysis.
Essay separation index for 9 male raters = 7.52, so the essay reliability for male raters = 7.52^2 / (1 + 7.52^2) = 0.982623926

Essay separation index for 18 female raters = 11.85, so the essay reliability for female raters = 11.85^2 / (1 + 11.85^2) = 0.992928989

Let's apply the Spearman-Brown Prophecy Formula to the male raters. What would the reliability be if there were 18 male raters?
M18 = (18/9) * 0.982623926 / (1 + (18/9 - 1)*0.982623926) = 0.991235819 [this is less than the reliability for the 18 female raters]
The separation index for 18 male raters is
S18 = sqrt (M18 / (1-M18)) = sqrt (0.991235819 / (1 - 0.991235819)) = 10.63488565
10.63 is less than the female 11.85, so the male raters are less discriminating between essays of different competence than the female raters.
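The same steps can be written as a short sketch. The separation indexes (7.52 and 11.85) and rater counts come from the thread; the function names are only illustrative:

```python
import math

def separation_to_reliability(sep):
    """Reliability = separation^2 / (1 + separation^2)."""
    return sep ** 2 / (1 + sep ** 2)

def reliability_to_separation(rel):
    """Separation = sqrt(reliability / (1 - reliability))."""
    return math.sqrt(rel / (1 - rel))

def spearman_brown(rel, factor):
    """Predicted reliability when the number of raters is multiplied by `factor`."""
    return factor * rel / (1 + (factor - 1) * rel)

male_rel = separation_to_reliability(7.52)       # 9 male raters
female_rel = separation_to_reliability(11.85)    # 18 female raters

# Project the male raters to the same panel size as the female raters.
male_rel_18 = spearman_brown(male_rel, 18 / 9)
male_sep_18 = reliability_to_separation(male_rel_18)

print(round(male_rel, 4), round(female_rel, 4))
print(round(male_rel_18, 4), round(male_sep_18, 2))
```

Only after projecting both groups to the same number of raters are the two separation indexes directly comparable.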

OK, Eagledanny?

eagledanny: Thank you so much, Prof. Linacre. What you list above is exactly what I want. You are a genius, thanks a million!

297. Is this sensible?

LyndaCochrane December 6th, 2012, 4:43pm: I'm using Facets (for the first time) to analyse multiple mini interview data. There are eight stations, 500 candidates and 220 raters. Stations measure a range of skills (technical, academic, leadership etc) and scores are between 0 and 100. I used a three-facet model (station, rater, candidate) but the results look unreliable. There are massive shifts from observed to fair scores, some infit / outfit statistics are very high / low and over 50% of variance is unaccounted for by the model. Am I using the wrong tool for the job? Any help would be greatly appreciated: I am totally baffled.

Mike.Linacre: Thank you for your question, Lynda.
"Over 50% variance unaccounted for" is expected.
But "massive shifts" are not. This suggests that the models specified in Models= may not match the observations at the stations. Are the ratings at the stations on a 0-100 rating scale? If so, please verify that the Models= specification has R100K.
Also, are the raters nested within station? If so, please group-anchor the raters within station.

LyndaCochrane: Thank you so much for your timely reply, much appreciated. The ratings are on a 0-100 scale and I included R100 in the model specification. I will try again with R100K (sorry if this sounds ridiculous, I am still a Baby User).

Raters assess candidates across a range of stations. The headline figure of over 50% variance unaccounted for is quite surprising but I will, of course, accept your expert advice. As Bridget Jones would say: Note to self - Facets training needed, must enrol on course.

Mike.Linacre: Lynda, if your rating scale is 0-100, then it may be better to model it as 100 binomial trials. This is because raters are unlikely to be able to discriminate 101 levels of performance, and also the frequencies of the different values will be irregular.

Model = ?,?,?,MyScale

Rating-scale = Myscale, B100

For expected "variance explained", please see https://www.rasch.org/rmt/rmt221j.htm for the simpler situation of two-facets and dichotomous data.

LyndaCochrane: Many thanks, I'll try this right away!

LyndaCochrane: This has been very useful. The observed to fair differences are not too dissimilar from before but the fit statistics are better. The differences can be explained by the balance of hawks and doves involved in the ratings. I am extremely grateful for your support, Mike, and motivated to learn much more about Rasch. Thank you!

298. RMSE & Mean Error

uve December 6th, 2012, 9:59pm: Mike,

I was hoping you could explain the difference and purpose of two measures that seem to be virtually identical to me: RMSE and mean person error found in Table 3.1.

As I understand it mean error is simply that, the average of all the person errors, but RMSE is the square root of the sum of all the squared person errors. Both measures always seem to yield virtually identical results. In fact you usually don't see the difference in Table 3.1 unless you ask Winsteps to report more decimal places or do the calculations yourself from a PFile.

With that said it seems both are interchangeable, so I'm not sure why we have both and how we use them differently.

Mike.Linacre: Uve,

"mean error" is the arithmetic mean of the standard errors, which is convenient for humans.

"RMSE" is the root-mean-square-error, a statistically more exact average for the standard errors.

For complete, reasonably-targeted tests these numbers are usually close, but for incomplete, short tests they can be noticeably different.

In Reliability calculations, we want "average variance" terms. For the standard errors, these are MSE terms = the square of the RMSE.
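A small sketch shows when the two averages agree and when they diverge. The standard errors below are invented for illustration; in practice they would come from a person file:

```python
import math

def mean_error(standard_errors):
    """Arithmetic mean of the standard errors."""
    return sum(standard_errors) / len(standard_errors)

def rmse(standard_errors):
    """Root-mean-square error: square root of the mean squared standard error."""
    return math.sqrt(sum(se ** 2 for se in standard_errors) / len(standard_errors))

# Hypothetical person standard errors (not from any real analysis).
similar = [0.40, 0.41, 0.39, 0.40]   # complete, well-targeted test
mixed = [0.30, 0.35, 0.90, 1.80]     # short or incomplete response strings

for label, ses in (("similar", similar), ("mixed", mixed)):
    print(label, round(mean_error(ses), 3), round(rmse(ses), 3))
```

The RMSE is never smaller than the mean error, and the gap widens as the standard errors become more unequal, which is why the RMSE squared (the MSE) is the term used for error variance in reliability.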

299. Agreement statistics in Facets

windy December 6th, 2012, 5:34pm: Hi Dr. Linacre,

I am working on some projects with raters, and I'm having some trouble figuring out how the observed rater agreement statistics are calculated in Facets. I can't seem to find an equation for this statistic in the manual. Any suggestions?

Thanks for your help.

Mike.Linacre: Stefanie, "rater agreement" is the proportion of times the raters agree on the same rating in situations where they are rating in the same situation. A detailed explanation (with worked example) is shown at https://www.winsteps.com/facetman/index.htm?inter_rater_reliability.htm which is also in Facets Help
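As a rough illustration of that definition, the sketch below counts exact agreements over every pair of raters who rated in the same situation. This is only one reading of the definition; the Facets Help page linked above is the authority on exactly how agreement opportunities are identified. The data structure and names are hypothetical.

```python
from itertools import combinations

def exact_agreement(ratings):
    """Proportion of rater pairs giving the same rating in the same situation.

    `ratings` maps a rating situation (for example, examinee and item) to a
    dict of {rater: rating}.
    """
    agreements = 0
    opportunities = 0
    for by_rater in ratings.values():
        for r1, r2 in combinations(sorted(by_rater), 2):
            opportunities += 1
            if by_rater[r1] == by_rater[r2]:
                agreements += 1
    return agreements / opportunities if opportunities else float("nan")

# Hypothetical ratings: two situations, rated by three and by two raters.
ratings = {
    ("essay 1", "content"): {"A": 3, "B": 3, "C": 2},
    ("essay 2", "content"): {"A": 1, "B": 1},
}
observed_agreement = exact_agreement(ratings)
print(observed_agreement)   # 2 agreements out of 4 opportunities = 0.5
```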

300. categories weighting differently

NaAlO December 5th, 2012, 6:13am: Hi Professor Linacre, we revised the rating scale for the writing assessment I mentioned last time, and now there are five categories: content, language, mechanics, number of words, and coherence. The problem is that each category is weighted differently, with content (0,3,4,5), language (0,1,2,3), mechanics (0,1), number of words (0,1), and coherence (0,1). Can we use Facets to examine the interactions among raters, examinees and rating categories? If yes, how do we write the specification? Can we use Model = ?,?,#, R6 ? If not, are there any other solutions? Thanks in advance. We are looking forward to your suggestions.

Mike.Linacre: Thank you for your questions, NaAIO.

Your rating scales have unobserved intermediate categories. For instance, 0,3,4,5 has 1 and 2 unobserved. But you want to keep these categories in the response structure.

The model is:

Model = ?,?,#, R5K

# says "each element of facet 3, the writing-category facet, has its own rating scale."
R5 says "the highest observed rating-scale-category for any element is 5"
K says "keep unobserved intermediate rating-scale-categories in the response structure"

301. Rasch assumptions

jenglund December 3rd, 2012, 9:22pm: Hi all,
I am new to the forum and couldn't find a comprehensive answer for the following question. Any help is appreciated. I am no mathematician and am learning Rasch from this web site and a Bond and Fox book on my own for my dissertation, so please limit references/resources recommended to those a beginner can understand (i.e., no hand calculations with more symbols than text would be nice!)

I have a dichotomous dataset from a test I created of various aspects of Working Memory. I have already run Rasch analyses in WINSTEPS and calibrated difficulties, person ability, etc. and tested the unidimensionality assumption using PCA in WINSTEPS.

My statistics advisor asked me to include details of the "other" Rasch assumptions and how I tested them in my Method section of my dissertation.

I found a resource explaining these assumptions:

conditional/local independence

I understand on a conceptual level what these are, what they mean for my data/model, etc. Yet I am having trouble finding a nontechnical paper explaining how to test them (in WINSTEPS or using WINSTEPS data) without taking 5 classes on how to interpret the equations in the papers themselves.

I can actually explain qualitatively/theoretically why my data can meet independence, and I just read on this forum that every fit test is really a test of sufficiency (I have misfit MNSQ stats in my output already), but is there a way to explain why the data should meet monotonicity without a statistical test? Is there an accepted method? Are all my analyses in WINSTEPS really tests of these assumptions in a way?

Any help appreciated...please excuse my greenness.

Mike.Linacre: Thank you for your post, Jenglund.

The fit statistics in Winsteps evaluate the empirical value of the data for constructing measures, but they are not hypothesis tests with the rigor expected of fit statistics used for model evaluation and selection. This is because Rasch is a prescriptive model (like Pythagoras Theorem), not a descriptive model (like a regression model).

However, hypothesis tests of the Rasch assumptions can be formulated and computed. For instance:

Local independence: https://www.rasch.org/rmt/rmt133m.htm suggests one approach.
Simulate 100 datasets similar to the current dataset. Count how many of them have worse local independence (inter-item correlations of the Rasch residuals) than the original data set.

Sufficiency: simulate 100 datasets. Count the number of datasets in which the Guttman Coefficient of Reproducibility is worse than the original dataset.

Monotonicity: simulate 100 datasets. Count the number of datasets in which there are more items with non-monotonic empirical ICCs than the original dataset.

If the counts are greater than 5, then the original dataset passes the hypothesis test.
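The counting rule can be sketched as follows. The sketch assumes that the statistic of interest has already been computed elsewhere for the original dataset and for each of the 100 simulated datasets, and that larger values mean "worse"; the numbers are invented for illustration.

```python
def simulation_test(observed_stat, simulated_stats, min_count=5):
    """Count simulated datasets that are worse than the observed dataset.

    Assumes a larger statistic is worse (for example, a larger residual
    correlation). Returns (count, passed), where passed means the count
    is greater than `min_count`.
    """
    worse = sum(1 for s in simulated_stats if s > observed_stat)
    return worse, worse > min_count

# Hypothetical statistics from 100 simulated datasets: 0.100, 0.102, ..., 0.298.
simulated = [i / 1000 for i in range(100, 300, 2)]

count_a, passed_a = simulation_test(0.251, simulated)   # 24 simulations are worse
count_b, passed_b = simulation_test(0.295, simulated)   # only 2 are worse

print(count_a, passed_a)
print(count_b, passed_b)
```

With 100 simulations, a count greater than 5 corresponds to the conventional 5% level.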

302. MRCML

uve August 19th, 2012, 5:32pm: Mike,

I am encountering the MRCML more and more in the literature. The data seem to suggest this model explains more variance than the unidimensional version, yet interpretations of respondent scores seem lacking. I am wondering what you think of this and if you have any links or resources to articles that critique the model more objectively.

Mike.Linacre: Uve, MRCML (Multidimensional Random Coefficient Multinomial Logit Model) is described at
http://bearcenter.berkeley.edu/publications/ConstructingMeasures.pdf - that document states it is implemented in ConQuest and GradeMap (see https://www.rasch.org/software.htm ).

MRCML is more like a descriptive IRT model than a prescriptive (unidimensional) Rasch model. Accordingly we would expect MRCML to explain more variance than a Rasch model, but we would not expect MRCML to produce easily-interpreted additive measures on a latent variable.

Here is the fundamental difference between IRT (in general) and Rasch:

IRT: we construct the model to describe the data (as well as possible)

Rasch: we construct the data that fulfills the requirements of the model (as well as possible)

This matches what George Bernard Shaw wrote:
"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man."

Unidimensional Rasch is the unreasonable approach ....

uve: Mike,

How then would the unidimensional model deal with something like the SAT or PISA? Or any instrument that has strong multidimensional aspects? My response would be that we should separate the subject matter, but established instruments like these will likely not be changed in order to conform with unidimensional requirements.

Mike.Linacre: Uve, you have the situation pegged :-)

"established instruments like these will likely not be changed in order to conform with unidimensional requirements."

The original thermometers, in the early 1600s, were multidimensional. They were open-ended glass tubes that combined the measurement of heat and of atmospheric pressure.

Imagine if physicists around 1650 had said "our established open-tube thermometers will likely not be changed in order to conform with unidimensional (heat or atmospheric pressure) requirements."

Now compare progress in the last 100 years for social science and for physical science. Which science has the more effective methodology? Could social-science research have done worse if it had imposed the unidimensional rigor of physical science upon itself?

SAT scores are treated as though they are unidimensional, so the SAT test should be designed that way. Of course, "unidimensional" depends somewhat on the context. In the context of learning difficulties, "subtraction" is a dimension. In the context of academic achievement, "math" is a dimension. It is the same in physics; in some situations, "heat transmission by radiation", "heat transmission by conduction" and "heat transmission by convection" are different dimensions, but, for most purposes, "heat" is one dimension.

uve: Thanks again as always! :)

uve: Mike,

This past week I attended the California Educational Research Association (CERA). As you may know, 45 states have adopted national standards which will be assessed using CAT systems developed by two consortia. Currently, the focus is on math and English in grades 3-8 and 11 set for 2015. Literacy will also be incorporated using science and history as well. I have seen many of these items. Some are multiple-choice in which more than one distractor may be correct and students have space below to type in their justifications for the choices they made.

In California we refer to this as the Common Core State Standards (CCSS): http://www.cde.ca.gov/re/cc/

California has joined the Smarter Balanced Assessment Consortium: http://www.cde.ca.gov/ta/tg/sa/smarterbalanced.asp

SBAC's homepage: http://www.smarterbalanced.org/

The SBAC CAT system is being developed using a system very similar to the one used in the state of Oregon.

The point is that these high-stakes accountability driven instruments will be multidimensional. Though I am very excited and pleased at the higher cognitive level we can go with these exams, I am also concerned about how we can interpret the results when multiple subjects and constructs are mixed into a single exam, and even a single item. I expressed my concern over the meaning of the overall scores that will be developed based on CCSS exams. Later I approached one of our state's psychometricians and made the following statement:

If I have an exam that measures your height and weight at the same time and we put the overall score on a scale of 0-100 and your score is 70, what does that mean? Do you need to gain or lose weight? Do you need to become shorter or taller?

He responded by saying that if we imagine the person as a tree for a moment, then 70 could tell us whether the person is an elm or an oak (the combination of height and weight).

It appears that whether I like it or not, I will be forced into the realm of MRCML. Our district benchmark exams are already being revised to incorporate the mixture of subjects and constructs. Publishers are scrambling to revise their curriculum materials to accommodate this new multiple construct methodology. I am worried that the Rasch methodology will fade away due to this. I am getting the impression that many researchers feel that an overall score is not very helpful anyway, and that subscores will become more important. I will not argue the value of subscores, but it seems to me that the unidimensional issue must resurface as each construct identified by the MRCML is reported. However, those calibrations would not be pure because they are influenced by the inclusion of other constructs. If subscores will take center stage, why not calibrate them separately in a unidimensional approach and forgo an overall score?

My question at this point: is there a way to tease out each construct of an item and calibrate each measure on them?

Just when I thought I had a handle on the Rasch model, policy makers and educators throw a wrench into the machinery-very frustrating.

Mike.Linacre: Yes, Uve, this is a frustrating situation.

Your height+weight analogy is useful. MRCML is used when our measurements of height and weight are very rough, but we expect that the empirical relationship between height and weight improves our measures of both height and weight. So we end up with a combined height+weight measure, a weight-improved height measure and a height-improved weight measure. Of course, this biases the measures for short, fat people and tall, thin people, but the advantage is that fewer resources are needed for measurement.

This trend in education is opposite to the trend in physical science, which is more precise measurement of more exactly-defined attributes.

uve: Mike,

Is there a way, through some type of contrast analysis, to remove the effects of the height construct from the weight construct, and the effects of weight from height?

Everette V. Smith Jr. proposed one such procedure which can be found in, "Detecting and Evaluating the Impact of Multidimensionality using Item Fit Statistics and Principal Component Analysis of Residuals", Introduction to Rasch Measurement, 575-600 JAM Press, 2004.

I'm not sure if such an approach would be valid to this discussion.

Mike.Linacre: Uve, the point of MRCML is that they want the height estimate to impact the weight estimate, and vice-versa. Then both height and weight can be measured sloppily (and cheaply). MRCML rescues the situation.

This is the opposite logic from the original application of MRCML. Height and weight were measured sloppily due to other constraints. MRCML was devised in order to improve their precision.

303. G-Theory: 1-facet or 2-facet design?

RaschModeler_2012 November 27th, 2012, 1:20am: Hi Mike,

As far as I am aware, there are two well-known contemporary theories of measurement:

1. Latent Trait Theory
2. Generalizability Theory

From what I understand, generalizability theory allows for multiple facets. Furthermore, a design in which there are multiple persons ("target") and multiple items ("facet") is considered a single facet design, according to a couple of measurement textbooks I've read. Now, if one incorporated multiple raters, the design would become a 2-facet design. However, on this webpage:


it is suggested that having multiple items and multiple persons is considered a 2-facet design.

I'm trying to make sense of the difference between what I've read in a couple of measurement textbooks and what is stated on this website.

Why do the measurement textbooks I've read consider persons the "target" (not a facet), and items, raters, situations, timepoints, etc. as facets?

Is it just a matter of semantics?

Thanks again.


RaschModeler_2012: As an example,


If you click on the first page provided in this google books website (page 293), you'll see an example of a one-facet design where there are multiple persons and multiple items. A variance component is estimated for persons (the target), items (facet), and the residual, within an ANOVA framework.
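
The one-facet (persons x items) design described above can be sketched numerically. This is only an illustration under stated assumptions: a fully crossed design with one observation per cell, the usual expected-mean-square equations, and a made-up `scores` matrix. The function name `variance_components` is hypothetical and is not taken from the textbook or the website.

```python
def variance_components(scores):
    """Variance components for a fully crossed persons x items design.

    scores[p][i] holds the single observation for person p on item i.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(scores[p][i] for p in range(n_p)) / n_p for i in range(n_i)]

    # Sums of squares for persons, items and the residual (p x i interaction + error)
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations; negative estimates are set to zero
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)

    # Generalizability coefficient for relative decisions based on n_i items
    g = var_p / (var_p + var_res / n_i) if var_p > 0 else 0.0
    return {"person": var_p, "item": var_i, "residual": var_res, "g_coefficient": g}


# Hypothetical ratings: 5 persons (rows) by 4 items (columns)
scores = [
    [4, 3, 5, 2],
    [2, 2, 3, 1],
    [5, 4, 5, 3],
    [3, 2, 4, 2],
    [1, 1, 2, 1],
]
vc = variance_components(scores)
print(vc)
```

Here persons are the "target" and items are the single facet, so three components are reported: person, item and residual.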



Mike.Linacre: RM, at https://www.rasch.org/rmt/rmt162h.htm see: 3. The "Generalizability" approach

The different terminologies can be confusing :-(

RaschModeler_2012: Mike,

You always know the answer!

"3." answers my question.



RaschModeler_2012: Mike,

The intersection between g-theory and item response theory is quite intriguing--the idea of employing a multiple-facet design (e.g., multiple persons, multiple items, multiple raters) to construct a unidimensional Rasch model. Two questions:

1. Are you aware of how to simulate data using Winsteps and/or Facets that conforms to such a model?

2. My gut reaction is that being able to construct a Rasch measure of depression by using ratings on multiple persons (John, Jane, etc.) on multiple aspects of depression (e.g., sadness, guilt, etc.) made by multiple clinically-trained observers (Mary, Jack, etc.) is highly unlikely. That is, the chances that such data would conform to a Rasch measurement model seem low to me. Of course, this is a gut reaction--I need to do the necessary research, and would love your opinion on the matter.

As always, I understand if you're too busy to respond or if it takes you a while to get around to this post.



Mike.Linacre: RM:

1. Simulate data: www.rasch.org/rmt/rmt213a.htm

2. "chances that such data would conform to a Rasch measurement model seem low" - yes, in the same way that the chances that a random triangle would conform to Pythagoras Theorem. But, as you say, we have to deliberately construct the data. This may take considerable effort. In the same way that constructing effective thermometers originally took considerable effort.

RaschModeler_2012: Hi Mike,

I hate to be a nag, but I've been trying for a while to incorporate another facet in the simulation code using the website you provided, without any success. Would you mind providing a couple of hints as to how to incorporate an additional facet into the following simulation code? For example, suppose we had 4 individuals rating each person on each of the 50 items. Again, I understand if you're too busy to respond--just not sure how to proceed. :-(


TITLE="My simulation"
ITEM1 = 1
NI = 50
NAME1 = 52
CODES = 01
1 -4.493
2 -3.869
3 -3.502
4 -3.259
5 -3.036
6 -2.951
7 -2.862
8 -2.859
9 -2.834
10 -2.824
11 -2.76
12 -2.729
13 -2.639
14 -2.633
15 -2.610
16 -2.570
17 -2.503
18 -2.473
19 -2.461
20 -2.429
21 -2.41
22 -2.416
23 -2.38
24 -2.387
25 -2.374
26 -2.373
27 -2.346
28 -2.341
29 -2.317
30 -2.291
31 -2.281
32 -2.275
33 -2.241
34 -2.234
35 -2.232
36 -2.226
37 -2.224
38 -2.215
39 -2.203
40 -2.203
41 -2.183
42 -2.180
43 -2.177
44 -2.175
45 -2.174
46 -2.173
47 -2.163
48 -2.157
49 -2.156
50 -2.154
51 -2.135
52 -2.133
53 -2.116
54 -2.112
55 -2.101
56 -2.089
57 -2.087
58 -2.057
59 -2.047
60 -2.042
61 -2.031
62 -2.006
63 -1.994
64 -1.961
65 -1.951
66 -1.94
67 -1.939
68 -1.931
69 -1.921
70 -1.914
71 -1.905
72 -1.896
73 -1.86
74 -1.865
75 -1.864
76 -1.856
77 -1.850
78 -1.84
79 -1.830
80 -1.809
81 -1.802
82 -1.792
83 -1.780
84 -1.766
85 -1.764
86 -1.756
87 -1.753
88 -1.739
89 -1.73
90 -1.737
91 -1.732
92 -1.726
93 -1.702
94 -1.690
95 -1.682
96 -1.671
97 -1.669
98 -1.669
99 -1.666
100 -1.659
101 -1.65
102 -1.655
103 -1.649
104 -1.645
105 -1.634
106 -1.60
107 -1.593
108 -1.591
109 -1.574
110 -1.571
111 -1.555
112 -1.550
113 -1.546
114 -1.541
115 -1.535
116 -1.523
117 -1.520
118 -1.514
119 -1.506
120 -1.486
121 -1.466
122 -1.459
123 -1.452
124 -1.451
125 -1.441
126 -1.435
127 -1.430
128 -1.422
129 -1.41
130 -1.409
131 -1.403
132 -1.397
133 -1.392
134 -1.389
135 -1.38
136 -1.387
137 -1.387
138 -1.373
139 -1.372
140 -1.370
141 -1.363
142 -1.360
143 -1.359
144 -1.355
145 -1.342
146 -1.335
147 -1.333
148 -1.329
149 -1.327
150 -1.323
151 -1.306
152 -1.295
153 -1.285
154 -1.285
155 -1.271
156 -1.269
157 -1.267
158 -1.266
159 -1.262
160 -1.262
161 -1.240
162 -1.23
163 -1.237
164 -1.231
165 -1.231
166 -1.229
167 -1.221
168 -1.206
169 -1.202
170 -1.194
171 -1.18
172 -1.183
173 -1.176
174 -1.166
175 -1.150
176 -1.150
177 -1.141
178 -1.140
179 -1.132
180 -1.127
181 -1.125
182 -1.122
183 -1.119
184 -1.113
185 -1.075
186 -1.075
187 -1.073
188 -1.071
189 -1.064
190 -1.060
191 -1.046
192 -1.045
193 -1.044
194 -1.037
195 -1.035
196 -1.031
197 -1.021
198 -1.019
199 -1.017
200 -1.014
201 -1.013
202 -1.00
203 -1.000
204 -.999
205 -.999
206 -.995
207 -.994
208 -.987
209 -.980
210 -.966
211 -.964
212 -.95
213 -.949
214 -.947
215 -.947
216 -.945
217 -.941
218 -.933
219 -.932
220 -.926
221 -.922
222 -.921
223 -.919
224 -.91
225 -.916
226 -.914
227 -.913
228 -.90
229 -.906
230 -.905
231 -.904
232 -.902
233 -.895
234 -.893
235 -.892
236 -.889
237 -.887
238 -.884
239 -.880
240 -.87
241 -.876
242 -.872
243 -.872
244 -.871
245 -.871
246 -.862
247 -.859
248 -.854
249 -.851
250 -.849
251 -.84
252 -.840
253 -.836
254 -.831
255 -.830
256 -.827
257 -.827
258 -.817
259 -.812
260 -.794
261 -.791
262 -.790
263 -.780
264 -.776
265 -.774
266 -.766
267 -.764
268 -.763
269 -.752
270 -.751
271 -.751
272 -.746
273 -.743
274 -.743
275 -.742
276 -.740
277 -.729
278 -.725
279 -.723
280 -.719
281 -.716
282 -.714
283 -.713
284 -.713
285 -.713
286 -.710
287 -.70
288 -.705
289 -.691
290 -.68
291 -.684
292 -.681
293 -.680
294 -.67
295 -.677
296 -.677
297 -.673
298 -.672
299 -.666
300 -.664
301 -.663
302 -.661
303 -.659
304 -.655
305 -.64
306 -.645
307 -.639
308 -.637
309 -.637
310 -.636
311 -.635
312 -.630
313 -.627
314 -.625
315 -.623
316 -.620
317 -.61
318 -.614
319 -.609
320 -.606
321 -.605
322 -.604
323 -.604
324 -.596
325 -.593
326 -.590
327 -.590
328 -.585
329 -.576
330 -.574
331 -.569
332 -.56
333 -.566
334 -.563
335 -.560
336 -.557
337 -.556
338 -.554
339 -.54
340 -.547
341 -.547
342 -.547
343 -.544
344 -.541
345 -.540
346 -.53
347 -.53
348 -.530
349 -.529
350 -.524
351 -.522
352 -.522
353 -.51
354 -.507
355 -.499
356 -.49
357 -.495
358 -.493
359 -.493
360 -.489
361 -.48
362 -.483
363 -.475
364 -.471
365 -.464
366 -.460
367 -.457
368 -.456
369 -.454
370 -.452
371 -.452
372 -.451
373 -.449
374 -.445
375 -.444
376 -.441
377 -.439
378 -.434
379 -.433
380 -.42
381 -.425
382 -.425
383 -.424
384 -.422
385 -.421
386 -.412
387 -.411
388 -.409
389 -.401
390 -.397
391 -.396
392 -.391
393 -.391
394 -.390
395 -.37
396 -.364
397 -.35
398 -.357
399 -.357
400 -.356
401 -.355
402 -.351
403 -.339
404 -.33
405 -.337
406 -.334
407 -.330
408 -.330
409 -.323
410 -.323
411 -.321
412 -.319
413 -.306
414 -.305
415 -.301
416 -.300
417 -.29
418 -.296
419 -.296
420 -.295
421 -.292
422 -.289
423 -.289
424 -.28
425 -.286
426 -.283
427 -.283
428 -.282
429 -.276
430 -.275
431 -.271
432 -.270
433 -.266
434 -.263
435 -.25
436 -.257
437 -.256
438 -.252
439 -.252
440 -.249
441 -.239
442 -.237
443 -.237
444 -.231
445 -.229
446 -.22
447 -.22
448 -.226
449 -.220
450 -.216
451 -.215
452 -.20
453 -.204
454 -.204
455 -.19
456 -.193
457 -.186
458 -.186
459 -.184
460 -.184
461 -.184
462 -.183
463 -.181
464 -.176
465 -.173
466 -.173
467 -.155
468 -.153
469 -.149
470 -.147
471 -.144
472 -.135
473 -.12
474 -.124
475 -.123
476 -.122
477 -.120
478 -.109
479 -.106
480 -.101
481 -.099
482 -.099
483 -.09
484 -.09
485 -.097
486 -.095
487 -.08
488 -.08
489 -.086
490 -.086
491 -.082
492 -.079
493 -.076
494 -.074
495 -.071
496 -.069
497 -.06
498 -.065
499 -.062
500 -.059
501 -.051
502 -.050
503 -.045
504 -.045
505 -.040
506 -.03
507 -.035
508 -.032
509 -.032
510 -.031
511 -.030
512 -.027
513 -.026
514 -.022
515 -.022
516 -.021
517 -.010
518 -.006
519 -.004
520 -.003
521 -.002
522 -.002
523 -.001
524 .007
525 .00
526 .00
527 .00
528 .010
529 .011
530 .012
531 .014
532 .014
533 .016
534 .016
535 .019
536 .024
537 .026
538 .029
539 .031
540 .032
541 .041
542 .042
543 .043
544 .050
545 .052
546 .054
547 .066
548 .069
549 .071
550 .074
551 .076
552 .077
553 .085
554 .08
555 .090
556 .092
557 .092
558 .09
559 .099
560 .101
561 .103
562 .110
563 .121
564 .137
565 .137
566 .141
567 .150
568 .155
569 .155
570 .155
571 .15
572 .15
573 .159
574 .162
575 .163
576 .165
577 .167
578 .177
579 .182
580 .189
581 .190
582 .194
583 .197
584 .200
585 .202
586 .21
587 .227
588 .230
589 .231
590 .235
591 .239
592 .239
593 .239
594 .242
595 .24
596 .254
597 .265
598 .265
599 .271
600 .271
601 .273
602 .274
603 .276
604 .276
605 .277
606 .281
607 .283
608 .287
609 .291
610 .296
611 .296
612 .299
613 .300
614 .301
615 .313
616 .314
617 .316
618 .31
619 .322
620 .323
621 .32
622 .334
623 .342
624 .347
625 .350
626 .351
627 .351
628 .365
629 .367
630 .372
631 .373
632 .381
633 .382
634 .392
635 .393
636 .402
637 .404
638 .404
639 .412
640 .414
641 .415
642 .421
643 .422
644 .425
645 .426
646 .427
647 .434
648 .446
649 .451
650 .452
651 .453
652 .462
653 .466
654 .467
655 .470
656 .473
657 .482
658 .484
659 .485
660 .493
661 .503
662 .505
663 .506
664 .510
665 .511
666 .514
667 .516
668 .523
669 .525
670 .533
671 .537
672 .539
673 .542
674 .542
675 .54
676 .552
677 .553
678 .553
679 .555
680 .565
681 .57
682 .582
683 .583
684 .585
685 .587
686 .587
687 .58
688 .590
689 .597
690 .599
691 .600
692 .601
693 .601
694 .603
695 .604
696 .606
697 .615
698 .623
699 .62
700 .632
701 .635
702 .642
703 .645
704 .66
705 .671
706 .671
707 .674
708 .675
709 .676
710 .677
711 .679
712 .680
713 .680
714 .682
715 .687
716 .695
717 .699
718 .701
719 .706
720 .707
721 .709
722 .723
723 .726
724 .732
725 .735
726 .737
727 .740
728 .742
729 .749
730 .753
731 .753
732 .764
733 .770
734 .770
735 .771
736 .781
737 .78
738 .793
739 .794
740 .797
741 .803
742 .805
743 .807
744 .811
745 .813
746 .821
747 .822
748 .829
749 .834
750 .843
751 .855
752 .856
753 .863
754 .864
755 .879
756 .890
757 .897
758 .900
759 .904
760 .909
761 .911
762 .911
763 .920
764 .935
765 .936
766 .940
767 .941
768 .942
769 .947
770 .951
771 .952
772 .952
773 .953
774 .953
775 .957
776 .959
777 .972
778 .973
779 .97
780 .981
781 .981
782 .986
783 .990
784 .993
785 .993
786 .996
787 .99
788 1.007
789 1.007
790 1.011
791 1.017
792 1.023
793 1.03
794 1.040
795 1.045
796 1.05
797 1.05
798 1.065
799 1.067
800 1.067
801 1.06
802 1.070
803 1.070
804 1.071
805 1.071
806 1.075
807 1.083
808 1.084
809 1.089
810 1.090
811 1.096
812 1.107
813 1.10
814 1.112
815 1.11
816 1.119
817 1.125
818 1.129
819 1.129
820 1.130
821 1.131
822 1.132
823 1.133
824 1.149
825 1.160
826 1.161
827 1.166
828 1.173
829 1.177
830 1.179
831 1.182
832 1.190
833 1.193
834 1.200
835 1.205
836 1.213
837 1.216
838 1.223
839 1.229
840 1.23
841 1.241
842 1.250
843 1.251
844 1.251
845 1.254
846 1.259
847 1.262
848 1.267
849 1.285
850 1.287
851 1.325
852 1.325
853 1.334
854 1.344
855 1.351
856 1.352
857 1.370
858 1.394
859 1.397
860 1.39
861 1.402
862 1.403
863 1.41
864 1.419
865 1.419
866 1.429
867 1.449
868 1.451
869 1.453
870 1.456
871 1.459
872 1.466
873 1.46
874 1.46
875 1.477
876 1.480
877 1.496
878 1.503
879 1.509
880 1.512
881 1.522
882 1.526
883 1.540
884 1.55
885 1.563
886 1.570
887 1.577
888 1.587
889 1.59
890 1.613
891 1.61
892 1.624
893 1.632
894 1.641
895 1.644
896 1.655
897 1.663
898 1.663
899 1.664
900 1.673
901 1.679
902 1.683
903 1.714
904 1.720
905 1.721
906 1.72
907 1.737
908 1.73
909 1.739
910 1.741
911 1.74
912 1.754
913 1.76
914 1.769
915 1.770
916 1.771
917 1.786
918 1.78
919 1.790
920 1.791
921 1.795
922 1.801
923 1.803
924 1.806
925 1.807
926 1.816
927 1.822
928 1.830
929 1.831
930 1.834
931 1.837
932 1.867
933 1.872
934 1.879
935 1.882
936 1.885
937 1.885
938 1.889
939 1.897
940 1.916
941 1.935
942 1.945
943 1.947
944 1.975
945 2.001
946 2.00
947 2.010
948 2.032
949 2.033
950 2.03
951 2.040
952 2.05
953 2.064
954 2.069
955 2.081
956 2.106
957 2.111
958 2.117
959 2.124
960 2.139
961 2.139
962 2.175
963 2.196
964 2.209
965 2.244
966 2.245
967 2.256
968 2.262
969 2.270
970 2.293
971 2.306
972 2.330
973 2.354
974 2.367
975 2.390
976 2.40
977 2.41
978 2.434
979 2.463
980 2.467
981 2.49
982 2.530
983 2.564
984 2.59
985 2.600
986 2.649
987 2.67
988 2.703
989 2.705
990 2.754
991 2.791
992 2.813
993 2.956
994 3.042
995 3.106
996 3.111
997 3.365
998 3.644
999 3.88
1000 4.015

1 -4.60
2 -4.1
3 -3.89
4 -3.66
5 -3.4
6 -3.32
7 -2.94
8 -2.59
9 -2.44
10 -2.20
11 -1.95
12 -1.73
13 -1.55
14 -1.39
15 -1.24
16 -1.10
17 -.97
18 -.85
19 -.73
20 -.62
21 -.51
22 -.41
23 -.30
24 -.20
25 -.10
26 .10
27 .20
28 .30
29 .41
30 .51
31 .62
32 .73
33 .85
34 .97
35 1.10
36 1.24
37 1.39
38 1.55
39 1.73
40 1.95
41 2.20
42 2.44
43 2.59
44 2.94
45 3.32
46 3.4
47 3.66
48 3.89
49 4.1
50 4.60

0 0
1 0

1-1000 1-50 1


Mike.Linacre: RM, this simulates 2-facet data, so, for another facet, we would simulate another dataset. For instance, add 1 logit to all the items. Then analyzing both datasets together would simulate a 3-facet dataset.
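
The stacking idea in this reply can be sketched outside Winsteps. The following is a minimal illustration, not the procedure from the linked page: it assumes dichotomous data, a rater severity that simply adds to every item difficulty, and made-up values for the person measures, item measures and the four rater severities. The names `simulate_block` and `rater_severities` are hypothetical.

```python
import math
import random


def simulate_block(person_measures, item_measures, rater_severity, rng):
    """Simulate one persons x items block of 0/1 responses for a single rater.

    The rater severity is added to every item difficulty, which is the
    "add 1 logit to all the items" step generalized to any shift.
    """
    block = []
    for b in person_measures:
        row = []
        for d in item_measures:
            logit = b - (d + rater_severity)
            p = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if rng.random() < p else 0)
        block.append(row)
    return block


rng = random.Random(2012)  # fixed seed so the run is repeatable

# Hypothetical generating values (not the ones listed in the post above)
person_measures = [rng.gauss(0.0, 1.5) for _ in range(200)]
item_measures = [-2.0 + 4.0 * k / 49 for k in range(50)]
rater_severities = {"R1": -1.0, "R2": 0.0, "R3": 0.5, "R4": 1.0}

# One 2-facet dataset per rater, stacked into (rater, person, item, response) records
records = []
for rater, severity in rater_severities.items():
    block = simulate_block(person_measures, item_measures, severity, rng)
    for p_idx, row in enumerate(block, start=1):
        for i_idx, x in enumerate(row, start=1):
            records.append((rater, p_idx, i_idx, x))

print(len(records), records[:3])
```

Each record carries one element number per facet followed by the observation, so the stacked records form a 3-facet (rater, person, item) dataset.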

RaschModeler_2012: Understood! I will try out the approach. Thanks!

304. Compare PCM and RSM

Not_Dr_Pepper November 27th, 2012, 12:35pm: Dear Mike et al

I'm comparing a PCM and an RSM for my data. Is there a neat way of creating an output comparing statistics such as the fit, measures, thresholds, discrimination, reliability and anything else I haven't thought of but should?

With thanks
David Pepper

Mike.Linacre: Thank you for your question, David.

For a philosophical comparison of RSM and PCM, see https://www.rasch.org/rmt/rmt143k.htm

In general, if the items are designed to share the same rating scale, then RSM vs. PCM should report only small differences.

If the items are designed to have different rating scales, then imposing one rating-scale structure on all the items (RSM) probably does not make sense. However, RSM applied to groups of items may make sense.

Differences will be most obvious in item-specific fit statistics. (Winsteps IFILE=, Facets Scorefile=).

Not_Dr_Pepper: Thanks for your response, Mike.

I had reviewed the arguments for PCM and RSM and spent quite a bit of time manually producing and comparing the outputs but a colleague had suggested there might be a way of getting software "to do this for you".

May I ask a follow-up question?

I have a big sample and the items I am using have the same response-scale structure on paper. On the whole, there are modest differences between the various statistics for the PCM and RSM. However, the PCM suggests substantial variation in the category thresholds between the items. In RSM, this is masked by the use of the same rating scale structure for all items. However, I think the various PCM thresholds provide insights into the ways respondents may be interpreting the different items.

Assuming I can communicate the complicated results, perhaps this is an exemplification of the kind of "strong argument" needed for PCM when all items share the same rating-scale structure?

With thanks again

Mike.Linacre: David:

There are two aspects here:
1) investigation of item functioning (PCM)
2) inferences about person performance and future items (RSM)

Cross-plot the person measures from (1) and (2). If they are statistically the same, then use (2) to communicate your findings about the sample to a non-technical audience.
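
One rough way to judge "statistically the same" for the cross-plot is to standardize each person's difference between the two analyses. The sketch below is an assumption-laden illustration: the measures and standard errors are made up, the function name `compare_measures` is hypothetical, and the two standard errors are combined as if they were independent, which is only an approximation when both analyses use the same responses.

```python
import math


def compare_measures(pcm, rsm, critical=1.96):
    """Flag persons whose PCM and RSM measures differ by more than sampling error.

    pcm and rsm are lists of (measure, standard_error) pairs in logits,
    in the same person order.
    """
    flagged = []
    for idx, ((m1, se1), (m2, se2)) in enumerate(zip(pcm, rsm)):
        t = (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)
        if abs(t) > critical:
            flagged.append((idx, round(t, 2)))
    return flagged


# Hypothetical person measures and standard errors from the two analyses
pcm = [(-1.20, 0.35), (0.10, 0.30), (0.85, 0.32), (2.40, 0.45)]
rsm = [(-1.05, 0.34), (0.15, 0.30), (0.80, 0.31), (0.90, 0.40)]

flagged = compare_measures(pcm, rsm)
print(flagged)
```

If few or no persons are flagged, the points in the cross-plot fall within the confidence bands around the identity line.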

305. RE: Master Thesis

Roy_Ho November 27th, 2012, 4:12am: Hi Mike,

I am writing a master's thesis relating to educational measurement with the Rasch model and I intend to use Winsteps to process the data, but I have a few questions that I want to ask you before I proceed, since I am not an expert in the Rasch model. I have been trying to find your email through Google, because I know this might not be an appropriate place for me to say this, but Google failed me this time. Can you give me your email? Or you can email me at my profile email. I would really appreciate it if you could help me out.


Mike.Linacre: Roy: please use the mail form on most pages of www.winsteps.com

306. Question Regarding ML Estimation of Binomial Dist

RaschModeler_2012 November 22nd, 2012, 1:59pm: Hi Mike,

This question is a bit off-topic, but is somewhat related since the Rasch model is often based on a type of logistic regression model which assumes that the distribution of the dependent variable y is binary/binomial.

With that said, I'm reviewing old notes that show a simple example of employing maximum likelihood estimation of a binomial distribution. The textbook states:

If we assume the underlying probability of a discrete random variable y is binomial, we have:

pr(y; theta) = nCy * theta^y * (1-theta)^(n-y)

---the possible values of y are the (n+1) integer values 0,1,2,...,n

---nCy = n! / [y!(n-y)!]


1. The statistical estimation problem concerns how to use n and y to obtain an estimator of theta, "theta_hat", which is a random variable since it is a function of the random variable, y.

2. The likelihood function gives the probability of the observed data (i.e., y) as a mathematical function of the unknown parameter, theta.

3. The mathematical problem addressed by maximum likelihood estimation is to determine the value of theta, "theta_hat", which maximizes L(theta)

--The maximum likelihood estimator of theta is the numerical value that agrees most closely with the observed data, in the sense of providing the largest possible value for the probability L(theta).

Calculus is used to maximize the function (nCy*theta^y)(1-theta)^(n-y): set the derivative of L(theta) with respect to theta equal to zero, and then solve the resulting equation for theta to obtain theta_hat.


I understand how the authors use the chain rule to take the derivative, but I cannot see how they simplify it to the final equation (second equation in the attachment).

I can't believe I'm stuck on what is supposed to be the "easiest" step in the problem.

I'd really appreciate any help.

As always, thank you.


Mike.Linacre: RM:

Inside the outermost parentheses:
theta^(Y-1) is common to both expressions
(1 - theta)^(n-Y-1) is common to both expressions
we are left with
Y*(1-theta) - (n-Y)*theta
Y - Y*theta - n*theta + Y*theta
Y - n*theta
which is the third term in the second equation
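
Written out in full, the factoring goes as follows. This is a reconstruction from the function quoted in the question, since the attached equations are not reproduced here; y is the observed count and n the number of trials.

```latex
\begin{aligned}
L(\theta) &= \binom{n}{y}\,\theta^{y}(1-\theta)^{n-y} \\
\frac{dL}{d\theta}
  &= \binom{n}{y}\left[\,y\,\theta^{y-1}(1-\theta)^{n-y}
     - (n-y)\,\theta^{y}(1-\theta)^{n-y-1}\right] \\
  &= \binom{n}{y}\,\theta^{y-1}(1-\theta)^{n-y-1}
     \left[\,y(1-\theta) - (n-y)\,\theta\,\right] \\
  &= \binom{n}{y}\,\theta^{y-1}(1-\theta)^{n-y-1}\,(y - n\theta)
\end{aligned}
```

For 0 < theta < 1 the first two factors are positive, so the derivative is zero only when y - n*theta = 0, which gives theta_hat = y/n.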

RaschModeler_2012: Got it. Thanks!

RaschModeler_2012: Hi Mike,

In the same vein of this thread, is there a simple calculus-based numerical example showing how to obtain the initial estimates of items and persons for a dichotomous Rasch model?

For example,

Suppose we had the following binary data on the following 5 people and 5 items:

           1   2   3   4   5
       1 | 1 | 0 | 0 | 0 | 0 |
       2 | 1 | 1 | 0 | 0 | 0 |
person 3 | 1 | 1 | 1 | 0 | 0 |
       4 | 1 | 1 | 1 | 1 | 0 |
       5 | 1 | 1 | 1 | 0 | 1 |

I'd really appreciate any guidance.



RaschModeler_2012: Sorry, but I can't seem to line up the table above, despite having used courier font. Apologies.


Mike.Linacre: RM, that dataset is too Guttman-like (deterministic). It does not have enough randomness.

This would be estimable:

Observe that every row and column has at least one 1 and one 0. They also form a linked network. This one does not:

For an Excel spreadsheet that does what Winsteps does, see

RaschModeler_2012: This is exactly what I was after. Thanks!
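
The two conditions mentioned in this thread (no row or column made up of a single value, and a linked network) can be checked mechanically. The sketch below is one possible reading, not Winsteps' own procedure: "linked" is interpreted here as every item being reachable from every other item through "someone succeeded on item a and failed item b" comparisons, and the second matrix is a made-up example, not the one originally posted. The function names are hypothetical.

```python
def extreme_rows_and_columns(data):
    """Return 1-based indices of rows and columns that contain only one value."""
    rows = [r for r, row in enumerate(data, start=1) if len(set(row)) < 2]
    cols = [c + 1 for c in range(len(data[0])) if len({row[c] for row in data}) < 2]
    return rows, cols


def items_strongly_linked(data):
    """Check that every item can be reached from every other item.

    beats[a][b] is True when some person succeeded on item a and failed item b.
    """
    n_items = len(data[0])
    beats = [
        [any(row[a] == 1 and row[b] == 0 for row in data) for b in range(n_items)]
        for a in range(n_items)
    ]

    def reachable(start, forward):
        seen = {start}
        stack = [start]
        while stack:
            a = stack.pop()
            for b in range(n_items):
                linked = beats[a][b] if forward else beats[b][a]
                if linked and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    return len(reachable(0, True)) == n_items and len(reachable(0, False)) == n_items


# The 5 x 5 dataset from the question (persons in rows, items in columns)
rm_data = [
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
]

# A hypothetical dataset with more randomness in it
mixed = [
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 1, 1, 1],
]

print(extreme_rows_and_columns(rm_data), items_strongly_linked(rm_data))
print(extreme_rows_and_columns(mixed), items_strongly_linked(mixed))
```

In the dataset from the question, item 1 is answered 1 by every person, so it has no 0 to anchor its difficulty, and the check fails for that reason alone.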

307. I got stuck while using Facets

Big_pongo November 20th, 2012, 1:45am: The research aims at studying the rater characteristics among 130 raters, and I would like to know which of the 130 raters show differential rater functioning (DRF) and differential rater functioning over time (DRIFT).

What I have now is:
- the rater 130 elements
- students 50 elements (boys 26 and girls 24)
- essay 4 skills
- time 2 days

The problems I have using Facets are:
1. I need the z-statistic and its significance for the gender difference, reported for the group of boys (n=26) and the group of girls (n=24), and also a summary z-statistic.
(Remark: I can find the value of 'z', but it is shown for each rater*student combination (by case) instead of for the groups.)

2. I need the standardized difference for each of the 130 raters in order to identify who shows DRIFT. What syntax can be used, and how?

Any suggestion?!?

Mike.Linacre: Yes, you can do these, Big_pongo.

For the gender effect, we need a "student gender" facet, then we can look at:
gender x rater
gender x skill
gender x day
in the Output Tables Menu, Table 14

The easiest way to add a gender facet is to have a gender code as the first letter of the elements labels for the students. Then:

Facets = 5 ; raters, students, essays, times, gender

Models = ?,?,?,?,?,R

Labels =
1, Raters
2, Students
1 = B John ; B means Boy
2 = B Mark
3 = G Maria
4 = G Anna
5 = ....
3, Essay
4, Time
5, Gender, D ; dummy demographic facet
1 = B: Boy
2 = G: Girl

Delements = NL ; element identifiers in data are element numbers and element labels

Dvalues =
5, 2, 1, 1 ; element identifier for facet 5 is in the element label for facet 2, character 1 width 1

Data =
(rater number), (student number), (essay number), (time number), rating

Output Tables menu, Table 14, rater x time.

Big_pongo: Thanks so much. Let me try with your guideline. Will come back later for the result!

Thank you so much!

308. Is this an example of when NOT to employ a Rasch?

RaschModeler_2012 November 21st, 2012, 3:11am: Hi Mike,

Suppose we are constructing an instrument intended to measure the severity of a victim's sexual abuse. We have yes/no items on whether someone has reported enduring a specific type of sexual abuse, some of which are clearly more severe than others. I see two potential problems with fitting a Rasch measurement model:

1. Items which are clearly less severe have higher difficulty levels than items which are more severe. Since an item's difficulty is based upon the extent to which the item is endorsed (e.g., low probability of endorsement = high severity; high probability of endorsement = low severity), it would not be surprising to envision such a scenario.

2. The assumption that items which have low difficulty estimates are likely to have been endorsed by individuals who have endorsed more difficult items may NOT hold...Example: Suppose one item asks whether someone has endured unwanted kissing and another item asks whether someone has been raped. Further, suppose that rape has a higher difficulty level than unwanted kissing (expected). HOWEVER, we find that person responses are not consistent with the Rasch model; that is, individuals who report being raped did not tend to report unwanted kissing.

In general, what's your take on such scales? Is a Rasch model simply not appropriate for such scenarios? Or is there a way to accommodate these anomalies within a Rasch model?

Another example--Suppose we were measuring severity of criminal activity and find that setting places on fire is more difficult to endorse than murder, simply because murder is more common than setting places on fire. Clearly murder is more severe than setting a place on fire, despite the fact that it is more common.

I ask because I've encountered this type of instrument a couple of times, and I'm not sure exactly what to do.



RaschModeler_2012: Mike,

I realize you haven't responded yet; I am still very interested in your response to my original post. Having said that, it dawned on me...Maybe the approach proposed in the original post for measuring these types of psychological attributes is inappropriate.

Here's my point...If we want to determine the severity of acts of sexual (or physical) abuse or criminal activity, perhaps a better approach would be to administer a survey which asks a large representative sample of the population of interest (e.g., U.S.) the extent to which they think each of the acts is severe (on a graded scale with anchored responses). This would result in item difficulties which reflect a measure of U.S. society's *current* view of which acts are more severe than which.

Does that make sense? Anyway, I still am VERY interested in how to deal with the original message I posted.

Again, I am just so grateful for your insight. I am grateful to have access to your expertise--I do not forget that.

Thank you,


Mike.Linacre: Yes and yes, RM.

1. Rasch analysis of criminal behavior. George Karabatsos made this the focus of his Ph.D. dissertation, "Measuring Non-Additive Conjoint Structures". He devised some analytical approaches that made sense of these data. They are based on having reasonable preconceptions about the relative severity of crimes.

2. "current" view. When we look back at L.L. Thurstone's analyses, we see the expression of social perspectives that are now entirely unacceptable.

RaschModeler_2012: Thanks, Mike. I will try to get a hold of that dissertation. So, am I correct in assuming that you agree with the second approach I proposed (surveying a population, i.e., the US population, on attitudes towards specific types of acts, with the understanding that the items themselves, item calibrations, etc. would likely change if this same instrument were developed and administered several years from now)?

Somewhat off topic from this thread: I'm curious what your take is on a "multidimensional Rasch"; that is, an instrument which measures multiple unidimensional psychological attributes which happen to be correlated. I've seen research which has shown that estimates tend to be more precise (less measurement error) when accounting for such correlations as opposed to fitting separate Rasch models. I wonder what your take is on such an approach (e.g., Adams, Wilson, & Wang, 1997).

Thanks again,


Mike.Linacre: RM, yes, measurement several years from now of almost anything will reveal "item drift". Some things drift quickly, such as computer skills (pre-1950 unknown, pre-1980 rare, now even small children have them).

Multidimensional Rasch: the history of the measurement of heat indicates that progress is faster when we make our measurement processes as unidimensional as possible. But no measuring process is ever perfectly unidimensional.

309. Simulate Differential Item Function

RaschModeler_2012 November 20th, 2012, 1:50pm: Hi Mike,

Question. Is there a way to simulate data in WINSTEPS which conform to a dichotomous Rasch measurement model [as you have shown me previously], with the exception of one item? That is, is there a way to have one item show differential item functioning (e.g., DIF for one item with respect to gender--males versus females)?

Any tips would be much appreciated!


Mike.Linacre: RM, organize your data so that the top half are males and the bottom half are females.
1) Simulate data.
2) Rectangle copy-and-paste an extra item: take the top half of its responses from a simulated easy item, and the bottom half of its responses from a simulated hard item.

Rectangle copy-and-paste can be done with freeware NotePad++ http://notepad-plus-plus.org
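
The same effect can be sketched in a few lines of code. Note that this is a variant of the copy-and-paste recipe above, not a reproduction of it: instead of copying columns from an existing easy item and an existing hard item, the extra item is simulated directly with a different difficulty for each gender group. All generating values, the "M"/"F" codes and the function name `rasch_response` are assumptions made for the illustration.

```python
import math
import random


def rasch_response(ability, difficulty, rng):
    """Draw one dichotomous response under the Rasch model."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return 1 if rng.random() < p else 0


rng = random.Random(309)  # fixed seed so the run is repeatable
n_persons, n_items = 1000, 50

# Hypothetical generating values
abilities = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
difficulties = [-2.0 + 4.0 * k / (n_items - 1) for k in range(n_items)]

# Top half of the file is male, bottom half is female
genders = ["M"] * (n_persons // 2) + ["F"] * (n_persons - n_persons // 2)

# Item 51 is easy for the first group and hard for the second: a 2-logit DIF
dif_difficulty = {"M": -1.0, "F": 1.0}

data = []
for ability, gender in zip(abilities, genders):
    row = [rasch_response(ability, d, rng) for d in difficulties]
    row.append(rasch_response(ability, dif_difficulty[gender], rng))
    data.append(row)

# One text line per person: 51 responses, a space, then the gender code
lines = ["".join(str(x) for x in row) + " " + g for row, g in zip(data, genders)]
print(lines[0])
print(lines[-1])
```

Choosing moderately easy and moderately hard difficulties for the DIF item, rather than the most extreme items in the test, keeps both groups away from near-zero or near-perfect scores on that item.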

RaschModeler_2012: Hi Mike,

Thank you for the tips. Here's what I did:

1. Simulated data from a dichotomous Rasch model with 50 items and 1000 individuals.
2. Took the first 500 observations from the easiest item (1-500) and placed it in the first 500 rows of item 51 (1-500).
3. Took the second 500 observations from the most difficult item (501-1000) and placed it in rows 501 to 1000 of item 51.

I don't think this is correct based on the output...

Here's the Winsteps file (with the data truncated to save space) and Table 30.1:

Title= "dataset_gender_diff.sav"
; SPSS file created or last modified: 11/20/2012 6:34:51 PM
; SIFILE= SIMULATED DATA FILE FOR My simulation Nov 20 18:24 2012
; SPSS Cases processed = 1000
; SPSS Variables processed = 53
ITEM1 = 1 ; Starting column of item responses
NI = 51 ; Number of items
NAME1 = 53 ; Starting column for person label in data record
NAMLEN = 7 ; Length of person label
XWIDE = 1 ; Matches the widest data value observed
CODES = 01 ; matches the data
TOTALSCORE = Yes ; Include extreme responses in reported scores
; Person Label variables: columns in label: columns in line
@ENTRY = 1E4 ; $C53W4 ; Entry
@GENDER = 6E6 ; $C58W1
&END ; Item labels follow: columns in label
I1 ; I0001 ; Item 1 : 1-1
I2 ; I0002 ; Item 2 : 2-2
I3 ; I0003 ; Item 3 : 3-3
I4 ; I0004 ; Item 4 : 4-4
I5 ; I0005 ; Item 5 : 5-5
I6 ; I0006 ; Item 6 : 6-6
I7 ; I0007 ; Item 7 : 7-7
I8 ; I0008 ; Item 8 : 8-8
I9 ; I0009 ; Item 9 : 9-9
I10 ; I0010 ; Item 10 : 10-10
I11 ; I0011 ; Item 11 : 11-11
I12 ; I0012 ; Item 12 : 12-12
I13 ; I0013 ; Item 13 : 13-13
I14 ; I0014 ; Item 14 : 14-14
I15 ; I0015 ; Item 15 : 15-15
I16 ; I0016 ; Item 16 : 16-16
I17 ; I0017 ; Item 17 : 17-17
I18 ; I0018 ; Item 18 : 18-18
I19 ; I0019 ; Item 19 : 19-19
I20 ; I0020 ; Item 20 : 20-20
I21 ; I0021 ; Item 21 : 21-21
I22 ; I0022 ; Item 22 : 22-22
I23 ; I0023 ; Item 23 : 23-23
I24 ; I0024 ; Item 24 : 24-24
I25 ; I0025 ; Item 25 : 25-25
I26 ; I0026 ; Item 26 : 26-26
I27 ; I0027 ; Item 27 : 27-27
I28 ; I0028 ; Item 28 : 28-28
I29 ; I0029 ; Item 29 : 29-29
I30 ; I0030 ; Item 30 : 30-30
I31 ; I0031 ; Item 31 : 31-31
I32 ; I0032 ; Item 32 : 32-32
I33 ; I0033 ; Item 33 : 33-33
I34 ; I0034 ; Item 34 : 34-34
I35 ; I0035 ; Item 35 : 35-35
I36 ; I0036 ; Item 36 : 36-36
I37 ; I0037 ; Item 37 : 37-37
I38 ; I0038 ; Item 38 : 38-38
I39 ; I0039 ; Item 39 : 39-39
I40 ; I0040 ; Item 40 : 40-40
I41 ; I0041 ; Item 41 : 41-41
I42 ; I0042 ; Item 42 : 42-42
I43 ; I0043 ; Item 43 : 43-43
I44 ; I0044 ; Item 44 : 44-44
I45 ; I0045 ; Item 45 : 45-45
I46 ; I0046 ; Item 46 : 46-46
I47 ; I0047 ; Item 47 : 47-47
I48 ; I0048 ; Item 48 : 48-48
I49 ; I0049 ; Item 49 : 49-49
I50 ; I0050 ; Item 50 : 50-50
I51 ; Item 51 : 51-51
101000000000000000000000000000000000000000000000001 1 0
101110100000000000000000000000000000000000000000001 2 0
111001100000001000000000000000000000000000000000001 3 0
010101010110000000000010000000000000000000000000000 4 0
100101100000010000000000000000000000000000000000001 5 0
111011100000000000000000000000000000000000000000001 6 0
111101100110000100000000000000000000000000000000001 7 0
010100010000100000000000000000000000000000000000000 8 0
110100001000100100000010000000000000000000000000001 9 0
111111110000100000010000001100000000000000000000001 10 0
101001101001010110000100000000000000000000000000001 11 0
110101111000000100000000000000000000000000000000001 12 0
111010101101000100000011000000000000000000000000001 13 0
101011010000000010100000010000000000000000000000001 14 0
010110100100001000000000000000010000000000000000000 15 0
111101101000000000000001000000000000000000000000001 16 0
111110101001000000000000000000000000000000000000001 17 0
001111100000011000000000000000000000000000000000000 18 0
111111000110000010001000000010000000000000000000001 19 0
101111100010010000000000000000000000000000000000001 20 0

DIF class specification is: DIF=@GENDER
| PERSON Obs-Exp DIF DIF PERSON Obs-Exp DIF DIF DIF JOINT Welch Mantel-Haenszel Size ITEM |
| CLASS Average MEASURE S.E. CLASS Average MEASURE S.E. CONTRAST S.E. t d.f. Prob. Chi-squ Prob. CUMLOR Number Name |
| 0 .01 -4.87 .28 1 -.01 -3.67 .38 -1.19 .47 -2.53 973 .0116 1.3603 .2435 -2.01 1 I1 |
| 0 -.01 -3.91 .19 1 .01 -5.64 1.00 1.73 1.02 1.69 679 .0906 .1030 .7482 .29 2 I2 |
| 0 -.01 -3.57 .17 1 .01 -4.24 .50 .67 .53 1.27 795 .2050 .9204 .3374 .91 3 I3 |
| 0 -.01 -3.46 .16 1 .01 -6.82> 1.80 3.36 1.81 1.86 586 .0636 2.5078 .1133 4 I4 |
| 0 .00 -3.29 .15 1 .00 -3.53 .36 .24 .39 .62 855 .5362 .1088 .7415 -.43 5 I5 |
| 0 .00 -3.27 .15 1 .00 -3.53 .36 .26 .39 .68 854 .4982 .0025 .9605 .20 6 I6 |
| 0 .00 -3.01 .14 1 .00 -2.89 .26 -.12 .30 -.40 907 .6857 .3659 .5452 -.44 7 I7 |
| 0 -.01 -2.44 .12 1 .01 -2.64 .24 .19 .26 .73 897 .4666 .5029 .4782 .40 8 I8 |
| 0 .00 -2.42 .12 1 .01 -2.58 .23 .17 .26 .64 902 .5225 .9730 .3239 .61 9 I9 |
| 0 -.01 -2.23 .11 1 .01 -2.64 .24 .40 .26 1.54 886 .1230 1.7399 .1872 .55 10 I10 |
| 0 -.02 -1.69 .10 1 .02 -2.15 .19 .46 .22 2.11 913 .0351 .6860 .4075 .30 11 I11 |
| 0 -.02 -1.51 .10 1 .02 -1.91 .17 .41 .20 2.03 929 .0422 1.1595 .2816 .35 12 I12 |
| 0 -.02 -1.28 .10 1 .02 -1.55 .15 .26 .18 1.47 953 .1415 .0140 .9059 .01 13 I13 |
| 0 -.02 -1.32 .10 1 .02 -1.60 .15 .27 .18 1.50 950 .1345 .0068 .9341 .02 14 I14 |
| 0 -.01 -1.21 .10 1 .01 -1.42 .14 .21 .17 1.21 961 .2283 .0286 .8657 .09 15 I15 |
| 0 -.01 -1.03 .10 1 .01 -1.17 .13 .14 .16 .84 973 .3994 .2952 .5869 .18 16 I16 |
| 0 -.02 -.80 .10 1 .02 -1.10 .13 .30 .16 1.85 975 .0643 .0047 .9452 -.01 17 I17 |
| 0 -.04 -.68 .10 1 .04 -1.19 .13 .50 .16 3.07 971 .0022 4.8740 .0273 .56 18 I18 |
| 0 -.01 -.54 .10 1 .01 -.70 .12 .16 .15 1.08 989 .2807 .2093 .6473 .14 19 I19 |
| 0 -.02 -.60 .10 1 .02 -.87 .12 .27 .15 1.72 985 .0860 .4740 .4912 .20 20 I20 |
| 0 -.02 -.37 .10 1 .02 -.62 .11 .25 .15 1.69 992 .0913 .1748 .6758 .13 21 I21 |
| 0 -.05 -.11 .10 1 .05 -.67 .11 .56 .15 3.66 993 .0003 1.3202 .2506 .31 22 I22 |
| 0 -.04 -.02 .10 1 .04 -.47 .11 .45 .15 3.00 996 .0027 3.4089 .0648 .45 23 I23 |
| 0 .01 -.34 .10 1 -.01 -.26 .11 -.08 .14 -.52 996 .6016 .0048 .9450 -.04 24 I24 |
| 0 -.03 .08 .10 1 .03 -.23 .10 .31 .15 2.13 997 .0332 2.0934 .1479 .35 25 I25 |
| 0 -.01 .20 .11 1 .01 .12 .10 .08 .14 .56 997 .5749 .6985 .4033 -.22 26 I26 |
| 0 -.05 .50 .11 1 .05 -.03 .10 .53 .15 3.48 995 .0005 8.2031 .0042 .68 27 I27 |
| 0 -.04 .64 .12 1 .04 .18 .10 .46 .15 3.02 991 .0026 .1326 .7157 .12 28 I28 |
| 0 -.02 .52 .11 1 .02 .31 .10 .21 .15 1.39 992 .1640 .1740 .6766 .13 29 I29 |
| 0 -.03 .69 .12 1 .03 .37 .10 .32 .15 2.12 988 .0343 .0809 .7760 .10 30 I30 |
| 0 -.01 .68 .12 1 .01 .58 .10 .10 .15 .69 988 .4887 .6852 .4078 -.24 31 I31 |
| 0 .01 .59 .11 1 -.01 .66 .10 -.07 .15 -.48 990 .6313 .7487 .3869 -.24 32 I32 |
| 0 -.01 .98 .13 1 .01 .84 .10 .14 .16 .87 977 .3867 .0071 .9326 -.01 33 I33 |
| 0 -.01 1.01 .13 1 .01 .87 .10 .14 .16 .89 976 .3713 1.2147 .2704 .32 34 I34 |
| 0 -.01 1.23 .14 1 .01 1.15 .10 .08 .17 .48 967 .6335 2.5047 .1135 .49 35 I35 |
| 0 -.02 1.39 .15 1 .02 1.04 .10 .34 .18 1.97 957 .0494 2.3726 .1235 .45 36 I36 |
| 0 -.01 1.50 .15 1 .01 1.29 .10 .21 .18 1.17 952 .2424 .1082 .7422 -.14 37 I37 |
| 0 -.02 1.50 .15 1 .02 1.24 .10 .26 .18 1.43 951 .1518 .1065 .7441 .14 38 I38 |
| 0 .01 1.52 .15 1 -.01 1.64 .10 -.12 .18 -.66 958 .5081 1.1693 .2796 -.35 39 I39 |
| 0 .02 1.55 .15 1 -.02 1.86 .11 -.31 .19 -1.67 963 .0957 2.1866 .1392 -.48 40 I40 |
| 0 -.02 2.51 .23 1 .02 2.01 .11 .50 .25 1.97 883 .0490 .4078 .5231 .32 41 I41 |
| 0 -.02 2.88 .27 1 .02 2.16 .11 .72 .29 2.44 849 .0150 2.3009 .1293 .74 42 I42 |
| 0 .00 2.37 .22 1 .00 2.37 .12 .00 .25 .00 918 1.000 .5995 .4388 -.35 43 I43 |
| 0 -.01 3.23 .32 1 .01 2.82 .13 .41 .35 1.18 851 .2399 .0037 .9517 .13 44 I44 |
| 0 -.01 3.75 .41 1 .01 3.22 .15 .53 .44 1.22 821 .2243 .0019 .9650 -.17 45 I45 |
| 0 -.01 4.86 .71 1 .01 3.47 .16 1.40 .73 1.92 718 .0554 .0843 .7716 .56 46 I46 |
| 0 .00 3.46 .36 1 .00 3.53 .17 -.07 .40 -.18 884 .8543 .3674 .5444 -.50 47 I47 |
| 0 .00 4.46 .58 1 .00 3.84 .19 .62 .61 1.01 792 .3111 .0811 .7758 -.16 48 I48 |
| 0 .00 3.94 .45 1 .00 4.42 .24 -.48 .51 -.94 912 .3487 2.1277 .1447 1.31 49 I49 |
| 0 .00 4.46 .58 1 .00 4.52 .25 -.07 .63 -.11 863 .9154 .0356 .8503 -.28 50 I50 |
| 0 .65 -4.87 .28 1 -.65 4.54 .25 -9.40 .37 -25.1 995 .0000 99.9999 .0000 51 I51 |
| 1 -.01 -3.67 .38 0 .01 -4.87 .28 1.19 .47 2.53 973 .0116 1.3603 .2435 2.01 1 I1 |
| 1 .01 -5.64 1.00 0 -.01 -3.91 .19 -1.73 1.02 -1.69 679 .0906 .1030 .7482 -.29 2 I2 |
| 1 .01 -4.24 .50 0 -.01 -3.57 .17 -.67 .53 -1.27 795 .2050 .9204 .3374 -.91 3 I3 |
| 1 .01 -6.82> 1.80 0 -.01 -3.46 .16 -3.36 1.81 -1.86 586 .0636 2.5078 .1133 4 I4 |
| 1 .00 -3.53 .36 0 .00 -3.29 .15 -.24 .39 -.62 855 .5362 .1088 .7415 .43 5 I5 |
| 1 .00 -3.53 .36 0 .00 -3.27 .15 -.26 .39 -.68 854 .4982 .0025 .9605 -.20 6 I6 |
| 1 .00 -2.89 .26 0 .00 -3.01 .14 .12 .30 .40 907 .6857 .3659 .5452 .44 7 I7 |
| 1 .01 -2.64 .24 0 -.01 -2.44 .12 -.19 .26 -.73 897 .4666 .5029 .4782 -.40 8 I8 |
| 1 .01 -2.58 .23 0 .00 -2.42 .12 -.17 .26 -.64 902 .5225 .9730 .3239 -.61 9 I9 |
| 1 .01 -2.64 .24 0 -.01 -2.23 .11 -.40 .26 -1.54 886 .1230 1.7399 .1872 -.55 10 I10 |
| 1 .02 -2.15 .19 0 -.02 -1.69 .10 -.46 .22 -2.11 913 .0351 .6860 .4075 -.30 11 I11 |
| 1 .02 -1.91 .17 0 -.02 -1.51 .10 -.41 .20 -2.03 929 .0422 1.1595 .2816 -.35 12 I12 |
| 1 .02 -1.55 .15 0 -.02 -1.28 .10 -.26 .18 -1.47 953 .1415 .0140 .9059 -.01 13 I13 |
| 1 .02 -1.60 .15 0 -.02 -1.32 .10 -.27 .18 -1.50 950 .1345 .0068 .9341 -.02 14 I14 |
| 1 .01 -1.42 .14 0 -.01 -1.21 .10 -.21 .17 -1.21 961 .2283 .0286 .8657 -.09 15 I15 |
| 1 .01 -1.17 .13 0 -.01 -1.03 .10 -.14 .16 -.84 973 .3994 .2952 .5869 -.18 16 I16 |
| 1 .02 -1.10 .13 0 -.02 -.80 .10 -.30 .16 -1.85 975 .0643 .0047 .9452 .01 17 I17 |
| 1 .04 -1.19 .13 0 -.04 -.68 .10 -.50 .16 -3.07 971 .0022 4.8740 .0273 -.56 18 I18 |
| 1 .01 -.70 .12 0 -.01 -.54 .10 -.16 .15 -1.08 989 .2807 .2093 .6473 -.14 19 I19 |
| 1 .02 -.87 .12 0 -.02 -.60 .10 -.27 .15 -1.72 985 .0860 .4740 .4912 -.20 20 I20 |
| 1 .02 -.62 .11 0 -.02 -.37 .10 -.25 .15 -1.69 992 .0913 .1748 .6758 -.13 21 I21 |
| 1 .05 -.67 .11 0 -.05 -.11 .10 -.56 .15 -3.66 993 .0003 1.3202 .2506 -.31 22 I22 |
| 1 .04 -.47 .11 0 -.04 -.02 .10 -.45 .15 -3.00 996 .0027 3.4089 .0648 -.45 23 I23 |
| 1 -.01 -.26 .11 0 .01 -.34 .10 .08 .14 .52 996 .6016 .0048 .9450 .04 24 I24 |
| 1 .03 -.23 .10 0 -.03 .08 .10 -.31 .15 -2.13 997 .0332 2.0934 .1479 -.35 25 I25 |
| 1 .01 .12 .10 0 -.01 .20 .11 -.08 .14 -.56 997 .5749 .6985 .4033 .22 26 I26 |
| 1 .05 -.03 .10 0 -.05 .50 .11 -.53 .15 -3.48 995 .0005 8.2031 .0042 -.68 27 I27 |
| 1 .04 .18 .10 0 -.04 .64 .12 -.46 .15 -3.02 991 .0026 .1326 .7157 -.12 28 I28 |
| 1 .02 .31 .10 0 -.02 .52 .11 -.21 .15 -1.39 992 .1640 .1740 .6766 -.13 29 I29 |
| 1 .03 .37 .10 0 -.03 .69 .12 -.32 .15 -2.12 988 .0343 .0809 .7760 -.10 30 I30 |
| 1 .01 .58 .10 0 -.01 .68 .12 -.10 .15 -.69 988 .4887 .6852 .4078 .24 31 I31 |
| 1 -.01 .66 .10 0 .01 .59 .11 .07 .15 .48 990 .6313 .7487 .3869 .24 32 I32 |
| 1 .01 .84 .10 0 -.01 .98 .13 -.14 .16 -.87 977 .3867 .0071 .9326 .01 33 I33 |
| 1 .01 .87 .10 0 -.01 1.01 .13 -.14 .16 -.89 976 .3713 1.2147 .2704 -.32 34 I34 |
| 1 .01 1.15 .10 0 -.01 1.23 .14 -.08 .17 -.48 967 .6335 2.5047 .1135 -.49 35 I35 |
| 1 .02 1.04 .10 0 -.02 1.39 .15 -.34 .18 -1.97 957 .0494 2.3726 .1235 -.45 36 I36 |
| 1 .01 1.29 .10 0 -.01 1.50 .15 -.21 .18 -1.17 952 .2424 .1082 .7422 .14 37 I37 |
| 1 .02 1.24 .10 0 -.02 1.50 .15 -.26 .18 -1.43 951 .1518 .1065 .7441 -.14 38 I38 |
| 1 -.01 1.64 .10 0 .01 1.52 .15 .12 .18 .66 958 .5081 1.1693 .2796 .35 39 I39 |
| 1 -.02 1.86 .11 0 .02 1.55 .15 .31 .19 1.67 963 .0957 2.1866 .1392 .48 40 I40 |
| 1 .02 2.01 .11 0 -.02 2.51 .23 -.50 .25 -1.97 883 .0490 .4078 .5231 -.32 41 I41 |
| 1 .02 2.16 .11 0 -.02 2.88 .27 -.72 .29 -2.44 849 .0150 2.3009 .1293 -.74 42 I42 |
| 1 .00 2.37 .12 0 .00 2.37 .22 .00 .25 .00 918 1.000 .5995 .4388 .35 43 I43 |
| 1 .01 2.82 .13 0 -.01 3.23 .32 -.41 .35 -1.18 851 .2399 .0037 .9517 -.13 44 I44 |
| 1 .01 3.22 .15 0 -.01 3.75 .41 -.53 .44 -1.22 821 .2243 .0019 .9650 .17 45 I45 |
| 1 .01 3.47 .16 0 -.01 4.86 .71 -1.40 .73 -1.92 718 .0554 .0843 .7716 -.56 46 I46 |
| 1 .00 3.53 .17 0 .00 3.46 .36 .07 .40 .18 884 .8543 .3674 .5444 .50 47 I47 |
| 1 .00 3.84 .19 0 .00 4.46 .58 -.62 .61 -1.01 792 .3111 .0811 .7758 .16 48 I48 |
| 1 .00 4.42 .24 0 .00 3.94 .45 .48 .51 .94 912 .3487 2.1277 .1447 -1.31 49 I49 |
| 1 .00 4.52 .25 0 .00 4.46 .58 .07 .63 .11 863 .9154 .0356 .8503 .28 50 I50 |
| 1 -.65 4.54 .25 0 .65 -4.87 .28 9.40 .37 25.08 995 .0000 99.9999 .0000 51 I51 |
Width of Mantel-Haenszel slice: MHSLICE = .010 logits

Did I construct the file correctly? I specified DIF=58 because gender is in the 58th column of the Winsteps file.

It appears that item 51 is not the only item with substantial DIF, which leads me to believe that I did not construct the DIF item (item 51) correctly.

Where did I go wrong?


RaschModeler_2012: Thanks for the confirmation, Mike!

I tried both approaches, and found that simulating the data again with a different seed resulted in no significant DIF for any item except item 51. I've figured out how to construct a DIF item.


Thank you!


Mike.Linacre: Congratulations, RM :-)
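
For readers checking the DIF table in this thread by hand, here is a minimal sketch. It assumes that the DIF CONTRAST is the difference between the two class DIF measures, that the JOINT S.E. is the root-sum-of-squares of the two class standard errors, and that the reported t is their ratio. The function name is hypothetical, and the inputs are the rounded values printed for item 18, so the results only agree with the table to within rounding.

```python
import math

def dif_contrast(measure_a, se_a, measure_b, se_b):
    """Contrast, joint S.E. and t for two class DIF measures (assumed formulas)."""
    contrast = measure_a - measure_b
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return contrast, joint_se, contrast / joint_se

# Item 18: class 0 measure -.68 (S.E. .10), class 1 measure -1.19 (S.E. .13)
contrast, joint_se, t = dif_contrast(-0.68, 0.10, -1.19, 0.13)
print(f"contrast={contrast:.2f}  joint S.E.={joint_se:.2f}  t={t:.2f}")
```

The table reports .50, .16 and 3.07 for this item; the small differences come from using two-decimal inputs.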

310. Convergent and Discriminant Validity

RaschModeler_2012 November 19th, 2012, 4:11am: Hi Mike,

Suppose I created a new self-report measure of depression employing a Rasch measurement model. The global fit, item fit, and person fit statistics all conform to the Rasch measurement model. Item discriminations are near 1.0. Person and Item reliability estimates are near 1.0. PCA analysis performed on the residuals on the probability scale suggests no meaningful secondary dimension. Further, the hierarchy of the items [as demonstrated by the construction of the Wright map] is consistent with a priori theory (e.g., more severe items such as suicidal ideation are the most difficult to endorse, while mild items such as occasional low mood are among the least difficult to endorse). I can see how this by itself is a form of construct validity, and is a unique strength of Rasch modeling as compared to the CTT approach of summative scales.

Still, I'm wondering if it would not be reasonable to output the person Rasch measures, and assess the extent to which they are correlated with other measures of depression, such as the raw scores on the BDI-II (convergent validity), diagnosis of depression (convergent validity), and IQ (discriminant validity--since IQ is generally not found to be correlated with depression).

Would you agree that correlating the Rasch measures with other validated instruments, albeit CTT-based, could be worthwhile? Or is this approach counterintuitive to Rasch modeling; that is, correlating interval-level measures of depression with ordinal level measures (at best) of depression, and with unrelated scales which have not gone through the rigorous methods of Rasch modeling (e.g., a CTT-based IQ instrument).

I'd really appreciate your take on this matter; that is, assessing convergent and discriminant validity of newly developed Rasch-based measures with CTT based measures.



Mike.Linacre: RM, in your description of this situation, the correlations are really assessing the convergent validity of those CTT instruments, because it is highly unlikely that they were subjected to the detailed psychometric scrutiny which the Rasch instrument has undergone. But this type of analysis is useful if your plan is to persuade users of those other instruments to switch to your instrument.

In a similar situation we discovered that the empirical success of a CTT instrument was due to luck. All attempts to remedy the obvious flaws in that instrument produced instruments with lower validity. It seemed that the flaws in the original instrument had cancelled each other out.

RaschModeler_2012: Makes sense. Thanks!


311. chdir Problem in batch mode

moffgat November 19th, 2012, 8:59am: Hello,

I've created a batch file to run Winsteps in batch mode. The file looks like this:

chdir /d "X:\yyy\zzz"

START /Wait c:\winsteps\winsteps.exe batch=y controlfile.con item_out\itemoutfile.txt ^
PFILE="person_out\personfile.sav" ^

Problem is, Winsteps completely ignores the "chdir" command and always looks for the files in the last directory it was working in. The only way out that I found so far is to place the whole path into every command in the batch file which is rather uncomfortable and will be a real pain once we change the path. The command file consists of 19 calls to winsteps with 4 filenames in each call, meaning I'd have to put in and probably change the path 76 times in this one batch file. Is there another way out of this misery? Thank you in advance

Yours Frank

PS: I use the most recent Version of Winsteps 3.75 and I tested it on Windows XP and Windows 7

Mike.Linacre: My apologies, Frank. This is a known bug that will be remedied in the next Winsteps release.

Meanwhile, please email me mike \at/ winsteps.com for instructions to download a patched version of Winsteps.

moffgat: I've sent you an email. Thank you very much for your quick reply.

Yours Frank

Mike.Linacre: Frank, have emailed the link to you.
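
Until a patched release is installed, one way to avoid typing the path 76 times is to generate every command line from a single base-directory setting. The following is only a sketch: the executable location, the placeholder path and the file names are copied from Frank's post, while the function name `winsteps_command` is hypothetical. It builds the argument list but does not launch Winsteps.

```python
from pathlib import PureWindowsPath

WINSTEPS_EXE = r"c:\winsteps\winsteps.exe"   # location used in the post
BASE_DIR = PureWindowsPath(r"X:\yyy\zzz")    # placeholder path from the post

def winsteps_command(control, output, **extra_files):
    """Build one batch-mode command line with every file given as a full path."""
    args = [WINSTEPS_EXE, "batch=y",
            str(BASE_DIR / control), str(BASE_DIR / output)]
    for keyword, filename in extra_files.items():
        # e.g. PFILE=X:\yyy\zzz\person_out\personfile.sav
        args.append(f"{keyword}={BASE_DIR / filename}")
    return args

cmd = winsteps_command("controlfile.con", r"item_out\itemoutfile.txt",
                       PFILE=r"person_out\personfile.sav")
print(" ".join(cmd))
# On Windows, subprocess.run(cmd, check=True) would start the program
# and wait for it to exit before the next call.
```

Changing `BASE_DIR` in one place then updates all 19 calls.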

312. longitudinal modelling in Winstep/Facet

GrahamD November 14th, 2012, 3:19am: Dear Mike,

Winsteps is simply the best statistical software package I have ever worked with. Even R's flexibility does not compare to the scope of Winsteps.

My question relates to modelling the growth of individual people over time as discussed by Bond & Fox (p. 179-182) and Dawson (2000). Is it possible to conduct this kind of analysis in either Winsteps or Facets?

I scanned your manual for Facets, but did not find anything specific on longitudinal Rasch modelling.

Kind regards,

Mike.Linacre: Thank you for your kind remarks, Graham.

Winsteps and Facets are similar to tape-measures. They measure at an instant of time. We record those measurements and compare them externally with the measurements obtained at other instants of time.

In principle, every Rasch measurement is independent (as are measurements with a ruler), so we can put all our observations from all time points into one analysis, with each set of observations suitably annotated by person and time-point. We then extract the Rasch measurements from the output (Winsteps PFILE=, Facets Scorefile=) to construct a plot with, say, Excel similar to Bond & Fox Figure 9.7, p. 181.

If dependency between measurements of the same person could be thought to bias the measurement process, then a bias-free measurement framework can be constructed using the random-selection procedure, as described in https://www.rasch.org/rmt/rmt251b.htm

Facets also supports other conceptualizations of longitudinal change. See, for instance, "Stress after 3 Mile Island" - https://www.winsteps.com/facetman/index.htm?threemileisland.htm

Cheryl: Hello,

I am new to Rasch and have a follow up question to analyzing longitudinal data. I have a pretest and posttest, some students took the pretest only, some took the posttest only. Would I be able to use these students to construct an anchor file instead of using a random selection procedure on students who took both the pretest and the posttest?


Mike.Linacre: Cheryl, there is probably something different about students who sat only one test. But "random" selection does not need to be elaborate. Assuming that the student data records are in somewhat random order in the data file, then take the pre-test record from the first of every two student records, and the post-test record from the second.
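
The simple alternating selection described above might be sketched as follows. This is one reading of the suggestion, not a definitive procedure: it assumes a list of identifiers for students who sat both tests, already in roughly random order, and the function name and the IDs are hypothetical.

```python
def alternate_selection(person_ids):
    """Split students alternately: first of each pair contributes the
    pre-test record, second of each pair contributes the post-test record."""
    use_pretest = person_ids[0::2]
    use_posttest = person_ids[1::2]
    return use_pretest, use_posttest

students = ["S01", "S02", "S03", "S04", "S05"]   # hypothetical IDs
use_pretest, use_posttest = alternate_selection(students)
print("pre-test records from: ", use_pretest)
print("post-test records from:", use_posttest)
```

Each student then appears only once in the analysis that produces the anchor values, so no person's two administrations can influence each other there.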

313. Table 3.2 when fitting a Rating Rasch Model

RaschModeler_2012 November 14th, 2012, 3:59am: Hi Mike,

I just want to make sure I'm interpreting these summary values correctly. The Andrich thresholds, average measures, and category measures all reflect trait levels relative to the item measures. That is, these are not the absolute trait levels. If one wanted to know the actual person trait level at which a person has an equal probability of endorsing adjacent categories for a PARTICULAR item, one would need to examine the probability category curves, specifying the ABSOLUTE x-axis via the graph feature in Winsteps.

I guess my point is that the summary statistics are useful in terms of how far apart the trait values are from each other across all items (e.g., it would be interesting to see that the Andrich thresholds are at least 1 logit unit away from each other) and whether they are ordered, but they do not provide the actual trait level with respect to categories for a SPECIFIC item along the entire trait continuum. As a result, interpretation of a single value (e.g., an Andrich threshold of -1.0 for the second category) by itself does not seem to provide anything interpretable. How far that threshold is from the next threshold would be useful, certainly. Whether that threshold is higher or lower than the next threshold is useful, as well. But one cannot directly determine the trait level necessary to have an equal probability of endorsing adjacent categories for a particular item. Same goes for the average measures and category measures. These do not reflect absolute trait levels, and as a result, have limited interpretation with respect to examining the absolute trait levels associated with a category for a specific item (e.g., what is the average trait level associated with endorsing this category for this item?; what trait level has the highest probability of endorsing this category for this item?...)

Do you agree? Am I way off the mark here?


Mike.Linacre: RM, yes, if you need information about the thresholds, etc., for a specific item, then Table 3.2 is generally not suitable.

In Winsteps, useful sources of detailed (but summary) item-level information are:

Table 14.3 for sample-related statistics:

ISFILE= for model-related parameters, thresholds, etc.

1 1 2 0 -2.47 1 1 -1.25 .00 -1.58 -.40 -1.40 2 2 .46 .00 .79 1.68 .61
2 1 2 0 -2.78 1 1 -1.57 .00 -1.89 -.71 -1.71 2 2 .15 .00 .48 1.37 .29

Do these numbers help you, RM?

RaschModeler_2012: Hi Mike,

Yes, this is exactly what I was hoping for.

Thank you!
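
As an illustration of the relative-versus-absolute distinction discussed in this thread, here is a small sketch under the rating scale model. It assumes that an item-specific threshold on the absolute scale is the item difficulty plus the Andrich threshold, and checks that adjacent categories are equally probable at that point. The item difficulty and thresholds are hypothetical values, not taken from any output above.

```python
import math

def category_probabilities(theta, difficulty, thresholds):
    """Rating-scale category probabilities for one person on one item."""
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (theta - difficulty - f))
    total = sum(math.exp(v) for v in log_numerators)
    return [math.exp(v) / total for v in log_numerators]

item_difficulty = -0.40          # hypothetical item measure
andrich = [-0.85, 0.85]          # hypothetical thresholds, relative to the item

# Thresholds located on the absolute (person measure) scale for this item
absolute_thresholds = [item_difficulty + f for f in andrich]

for k, point in enumerate(absolute_thresholds, start=1):
    p = category_probabilities(point, item_difficulty, andrich)
    print(f"threshold {k} at {point:.2f}: "
          f"P(cat {k - 1})={p[k - 1]:.3f}, P(cat {k})={p[k]:.3f}")
```

A person located exactly at one of these absolute thresholds has the same probability of being observed in the two adjacent categories of that item.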


314. Data Simulation Code

RaschModeler_2012 November 10th, 2012, 4:43am: Dear Mike,

I'm finding that examining output from the simulated data [generated from the code you gave me in the other thread] is a powerful learning tool. As a result, I'm trying to figure out how to create three different simulations to generate data which conform to a Rasch Rating measurement model that needs adjustment:

Simulation 1. Two categories should be combined for all items (e.g., output shows ordered Andrich thresholds but disordered average measures)

Simulation 2. One category should be removed for all items (e.g., output shows disordered Andrich thresholds because one of the categories has an unexpectedly low frequency)

Simulation 3: Categories are actually disordered (e.g., expected order is never--> sometimes-->often-->almost always, but empirically is found to be never-->sometimes-->almost always-->often)

I'd really appreciate any tips. Again, thank you VERY much for your help. Needless to say, I understand if you're too busy to respond.



Mike.Linacre: RM: the situations you describe only need small tweaks to standard Rasch simulations:

For Simulation 1: take a data file generated from your standard simulation,
then analyze while recoding the categories in a different order.
Try different NEWSCORE= until you get the output you want to see.

For Simulation 2: try an SAFILE= for the standard simulation similar to:
0 0
1 -1
2 5
3 -4

Simulation 3: same as Simulation 1 above, but with a different NEWSCORE=
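
To see why the SAFILE= values suggested for Simulation 2 should make one category rare, here is a rough sketch. It assumes the second column of the SAFILE= lists the Andrich thresholds for categories 1, 2 and 3 (the value for category 0 being a placeholder) and applies the rating scale model directly. The function and variable names are hypothetical.

```python
import math

def category_probabilities(measure, thresholds):
    """Category probabilities at a given (person minus item) measure.

    thresholds[k-1] is the Andrich threshold between categories k-1 and k.
    """
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (measure - f))
    total = sum(math.exp(v) for v in log_numerators)
    return [math.exp(v) / total for v in log_numerators]

safile_thresholds = [-1.0, 5.0, -4.0]   # categories 1, 2, 3 from the SAFILE=

# Scan measures from -6 to +6 logits for the largest category-2 probability
grid = [x / 10 for x in range(-60, 61)]
peak_category_2 = max(category_probabilities(x, safile_thresholds)[2] for x in grid)
at_zero = category_probabilities(0.0, safile_thresholds)

print("probabilities at 0 logits:", [round(p, 4) for p in at_zero])
print("largest category-2 probability on the grid:", round(peak_category_2, 4))
```

With a large positive threshold into category 2 and a large negative threshold out of it, category 2 is never the likely response, so data simulated from these values should show it with a very low frequency.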

RaschModeler_2012: Hi Mike,

I tried out your suggestion for the first scenario, where one would decide to combine categories (1 and 2). I'm curious if that's the decision you would make based on the output below.

Thank you,



| 0 0 16292 33| -1.60 -1.46| .64 .66|| NONE |( -1.79)| 0
| 1 1 8799 18| .37 -.43| 1.92 1.66|| -.30 | -.48 | 2
| 2 2 8971 18| -.37* .42| 1.90 1.66|| -.03 | .48 | 1
| 3 3 15938 32| 1.58 1.43| .65 .66|| .33 |( 1.80)| 3
OBSERVED AVERAGE is mean of measures in category. It is not a parameter estimate.

| 0 NONE |( -1.79) -INF -1.15| | 89% 64% .6354| | 0
| 1 -.30 .01 | -.48 -1.15 .00| -.78 | 19% 30% 1.0019| 1.42| 2
| 2 -.03 .01 | .48 .00 1.15| -.01 | 20% 30% .9984| -.36| 1
| 3 .33 .01 |( 1.80) 1.15 +INF | .79 | 89% 64% .6409| 1.44| 3
M->C = Does Measure imply Category?
C->M = Does Category imply Measure?

CATEGORY PROBABILITIES: MODES - Structure measures at intersections
P -+--------------+--------------+--------------+--------------+-
R 1.0 + +
O | |
B | |
A |0 |
B .8 + 0000 3333+
I | 000 3333 |
L | 000 333 |
I | 00 33 |
T .6 + 000 33 +
Y | 00 333 |
.5 + 00 33 +
O | 00 33 |
F .4 + 00 33 +
| 00 **222 |
R | 111111111111**1***2** 22222222 |
E | 11111 222*0 3*111 22222 |
S .2 + 111111 2222 3*0 111 22222 +
P |111 2222 333 000 11111 222|
O | 22222 3333 0000 11111 |
N | 2222222222 3333333 0000000 1111111111 |
S .0 +**3333333333333 0000000000000**+
E -+--------------+--------------+--------------+--------------+-
-2 -1 0 1 2

Mike.Linacre: RM: In a practical situation,
First, we would notice that the Observed Average measures for the middle categories are disordered, and that there is considerable category-level misfit.
Second, we would look at the definition of the categories. We see numbers like this when the category definitions are not homogeneous, for instance: 0=strongly disagree, 1= no, 2 = yes, 3 = strongly agree.
Third we could try several options:
a) if it makes sense to combine the central categories, then we would try that remedy, which has the benefit that all the category frequencies will be about the same.
b) if it makes sense to reverse the central categories, .....
c) if it makes sense to delete one or both central categories, ... (I would try this first, to verify that the extreme categories are functioning correctly by themselves. This would be the benchmark. Any other option must do better to be acceptable. "better" = greater person reliability.)
d) if it makes sense to combine a central category with an extreme category, ....

Usually, there is no clear-cut winner. A lot depends on the nature of the rating scale. For instance, in medical applications, each category tends to have a precise clinical definition. However, in survey work, the categories may be merely numbers with no definitions at all, such as "on a scale from 1 to 10, what do you think of ....".

RaschModeler_2012: Mike,

Thank you for sharing your thoughts. Much appreciated!



RaschModeler_2012: Hi Mike,

I am continuing to explore various simulations that lead to different kinds of scenarios. I'm curious if you know of a simple way to generate a dichotomous Rasch model where there is one item that clearly misfits (say, a high infit MNSQ). That is, with the exception of a single item, all other items fit the Rasch model. I'd really appreciate any tips.



Mike.Linacre: RM, this is usually easier if you think of the rectangle of responses transposed so that the rows are items and columns are persons (Winsteps Output Files menu can do this for you). Add the misfitting item as another row. Then transpose back again.

For the misfitting row, build on one of the misfitting patterns in https://www.rasch.org/rmt/rmt82a.htm - there is no recommended way to do this. Only the easiest way for you ....
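
Outside Winsteps, the same idea can be sketched in a few lines. This is an illustration under stated assumptions, not the method described in the linked note: it simulates dichotomous Rasch data, appends one extra item answered by coin flip (one arbitrary choice of misfit pattern), and computes an infit mean-square using the generating abilities and difficulties rather than estimated ones. All names and parameter values are hypothetical.

```python
import math
import random

def rasch_p(ability, difficulty):
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def infit_mnsq(responses, abilities, difficulty):
    """Information-weighted mean-square: sum of squared residuals / sum of variances."""
    num = den = 0.0
    for x, b in zip(responses, abilities):
        p = rasch_p(b, difficulty)
        num += (x - p) ** 2
        den += p * (1.0 - p)
    return num / den

rng = random.Random(2012)
abilities = [rng.gauss(0.0, 1.5) for _ in range(1000)]
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Rows are persons, columns are items that follow the Rasch model
data = [[1 if rng.random() < rasch_p(b, d) else 0 for d in difficulties]
        for b in abilities]

# Append one item whose responses ignore ability entirely (coin flips)
for row in data:
    row.append(rng.randint(0, 1))

fit_good = infit_mnsq([row[2] for row in data], abilities, 0.0)
fit_bad = infit_mnsq([row[-1] for row in data], abilities, 0.0)
print(f"fitting item infit: {fit_good:.2f}   coin-flip item infit: {fit_bad:.2f}")
```

The coin-flip item should show a mean-square well above 1, while the model-conforming item stays close to 1. In a real analysis the statistics come from estimated measures, so the values will differ somewhat.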

RaschModeler_2012: Perfect! Thank you!

Out of curiosity, what are the top 3 or so books you'd recommend to learn/teach Rasch modeling, leaning towards those that use WINSTEPS.


Mike.Linacre: RM:

Bond & Fox, "Applying the Rasch Model"
dichotomies: Wright & Stone, "Best Test Design"
polytomies: Wright & Masters, "Rating Scale Analysis"
Then, relevant chapters in a book that aligns with your activities, such as:
Smith & Smith, "Introduction to Rasch Measurement"
Or, for a broader view,
Wilson, "Constructing Measures"

RaschModeler_2012: Thank you, Mike. I have been reading through the Bond and Fox book--very informative! I will seek out the other books you suggest.

315. Category Probability Curves for a Rating Scale

RaschModeler_2012 November 6th, 2012, 1:27pm: Hi Mike,

I could really use your help.

I'm struggling with interpreting category probability curves with polytomous items. Say we have the following instrument: 20 items with Likert-type response options (strongly disagree, disagree, neutral, agree, strongly agree).

After reviewing several books on the matter, I have seen what the "ideal" category probability curve would be for ordered ratings in the context of a Rasch measurement model, but I'm still struggling with how to interpret it and, equally important, how to decide when it appears that two or three categories should be collapsed.

Might you have a couple of visual examples that show (a) the ideal category probability curve for a Likert-type item and (b) a category probability curve that is indicative of the need for two or three categories to be collapsed?

Any help would be so much appreciated.

I'm stumped... :-(



RaschModeler_2012: I'm not sure if this will present clearly in this thread, but let's take this for example:

CATEGORY PROBABILITIES: MODES - Structure measures at intersections
P -+---------+---------+---------+---------+---------+---------+-
R 1.0 + +
O | |
B |111 6666|
A | 111 6666 |
B .8 + 111 666 +
I | 11 66 |
L | 11 66 |
I | 11 6 |
T .6 + 11 66 +
Y | 1 6 |
.5 + 11 6 +
O | 1 66 |
F .4 + 1 6 +
| 2222*22 6 |
R | 2222 11222 5*5555555 |
E | 222 *33**33 556 5555 |
S .2 + 222 333 11 4*****444 555 +
P | 2222 333 44*552* 33 444 5555 |
O |2222 3333 444 551*6 222 333 4444 555555|
N | 3333333 4444*555*666 1111 22223333 4444444 |
S .0 +**********************6666 11111*********************+
E -+---------+---------+---------+---------+---------+---------+-
-3 -2 -1 0 1 2 3

1 = Strongly Disagree
2 = Disagree
3 = Slightly Disagree
4 = Slightly Agree
5 = Agree
6 = Strongly Agree

Which categories might you think of collapsing and what would be the reason? Again, I really appreciate your help with this.



Mike.Linacre: RM, it looks like those thresholds are ordered, but let's start at the beginning:

1. Are the category "average measures" ordered?
Our hypothesis is "higher person measure <-> higher category". We verify this by looking at the category "average measure" in Winsteps Table 3.2

2. Is each category behaving reasonably?
Our hypothesis is "people with measures near the category tend to select the category, but people with measures far from the category tend not to select the category". We verify this by looking at the category fit statistics in Winsteps Table 3.2

3. Do we intend the categories to represent clearly different levels of agreement (advancing thresholds) or do we intend some categories to be transitional (one category blends into the next one, thresholds may not advance, but average measures do advance)?

"Slightly agree" sounds transitional to me. In contrast, we would expect these categories to be clearly different:
1 = Strongly Disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly Agree

But "slightly agree" and "slightly disagree" do not sound like distinct levels, they sound more like transitional levels.

How do those categories sound to you, RM??

RaschModeler_2012: Mike,

Thank you so much for replying, and helping me with interpretation. It appears as though the average measures are ordered, with the exception of 6, maybe? Average measures are:

1 = -1.31, 2 = -.75, 3 = -.48, 4 = -.29, 5=-.12, BUT 6 = -.13

I think these values are interpreted as the average ability of people responding to each category. Fit appears to be OK implying that the categories are ordered correctly, right?

I'm confused by the Andrich thresholds; it appears as though thresholds 4 and 5 are mixed up:

None-->- .75--> - .21--> + .21--> + .19--> + .56

I'm not sure exactly how to interpret these values.

I see your point about slightly disagree and slightly agree. I would generally be more interested in advancing thresholds, not transitional.

Wouldn't this be the general goal of developing a rating scale for a Rasch measure?

HERE IS Table 3.2. What would you conclude based on Table 3.2 below?

Thanks again!


| 1 1 3491 36| -1.31 -1.23| .95 1.00|| NONE |( -2.17)| 1
| 2 2 2694 28| -.75 -.83| 1.01 .84|| -.75 | -.86 | 2
| 3 3 1656 17| -.48 -.56| .87 .78|| -.21 | -.21 | 3
| 4 4 869 9| -.29 -.32| .94 .94|| .21 | .27 | 4
| 5 5 588 6| -.12 -.09| 1.05 1.11|| .19 | .87 | 5
| 6 6 348 4| -.13* .16| 1.43 1.95|| .56 |( 2.05)| 6

| 1 NONE |( -2.17) -INF -1.51| | 84% 33% .9914| | 1
| 2 -.75 .02 | -.86 -1.51 -.49| -1.17 | 35% 66% .6650| 1.42| 2
| 3 -.21 .02 | -.21 -.49 .04| -.38 | 29% 40% .8017| 1.06| 3
| 4 .21 .03 | .27 .04 .54| .07 | 21% 18% 1.3215| .84| 4
| 5 .19 .04 | .87 .54 1.44| .42 | 27% 9% 1.9614| .42| 5
| 6 .56 .06 |( 2.05) 1.44 +INF | 1.06 | 40% 1% 2.9195| .23| 6
M->C = Does Measure imply Category?
C->M = Does Category imply Measure?

Mike.Linacre: Thank you for these numbers, RM.

(Please use the Courier font to make the Tables align correctly).

1. Observed average measures. We can compare the observed average with its expectation. We see that category 6 observed = -.13 is noticeably less than its expectation = .16

2. We see that the fit for category 6 (outfit mean-square = 1.95) is very noisy. Some people with low measures must be selecting category 6. These responses need to be investigated before any further action is taken. Are these people believable? Should their responses be eliminated from the analysis?

3. The Andrich thresholds between 4 and 5 are slightly disordered.
How would we combine the categories? The observed averages tell us that the respondents to category 5 (-.12) are more like those to category 6 (-.13) than to category 4 (-.29).
We also prefer to have more uniform category frequencies. 1,2,3, 4, 5+6 would be more uniform than 1,2,3,4+5,6. However, since the original category frequencies make a smoother distribution, we would prefer to keep them. Also the threshold disordering is so small we could probably attribute it to sampling error.

Conclusion: no collapsing!!
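
The ordering checks in points 1 and 3 above can be reproduced mechanically. The sketch below uses the observed averages and Andrich thresholds quoted from Table 3.2 in this thread; the helper name is hypothetical. It only flags where the values fail to advance and says nothing about whether the disordering is large enough to matter, which remains a judgment call as described above.

```python
def disordered_steps(values):
    """Index pairs (i, i+1) where a value fails to exceed its predecessor."""
    return [(i, i + 1) for i in range(len(values) - 1)
            if values[i + 1] <= values[i]]

categories = [1, 2, 3, 4, 5, 6]
observed_average = [-1.31, -0.75, -0.48, -0.29, -0.12, -0.13]
andrich_thresholds = [-0.75, -0.21, 0.21, 0.19, 0.56]   # for categories 2..6

# Observed averages: one value per category
average_problems = [(categories[i], categories[j])
                    for i, j in disordered_steps(observed_average)]

# Thresholds: entry k belongs to category k+2, so shift the labels by one
threshold_problems = [(categories[i + 1], categories[j + 1])
                      for i, j in disordered_steps(andrich_thresholds)]

print("disordered observed averages between categories:", average_problems)
print("disordered Andrich thresholds between categories:", threshold_problems)
```

The two checks point at different category pairs, which is consistent with the advice to treat the rating scale categories as a set rather than pair by pair.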

RaschModeler_2012: I will be sure to use Courier font to make sure the Tables align. Your input is EXTREMELY helpful. If you don't mind, I'm going to list the usual steps in examining the rating scale for all items combined:

Look at Table 3.2 to examine the observed average measures.
--one should see that the observed average measures increase as you go up the rating scale
--one should see that the observed average measures are similar to their sample expected values.
--the fit statistics should be well below 2.0; otherwise people with low trait levels may be endorsing higher categories and vice versa. If there is misfit, examine these persons' response patterns.
--Andrich thresholds: see if the thresholds between categories are disordered.

Three follow-up questions:

1. If observed average measures are very similar for two categories, the Andrich thresholds are disordered for the two categories, AND combining them makes the frequency distribution more uniform, then one MIGHT consider combining the two categories. Is that right?

2. Exactly what is the definition of an Andrich threshold? Is it essentially the log odds of going from one category to the next?

3. How does the category probability curve come into play? I'm still struggling with how to interpret it.

Regarding 3, I think we see, in general, that higher trait levels have higher probabilities of endorsing higher ratings. But I feel like I'm looking at it superficially. Speaking in probability terms, what can I say about going from category 1 to category 2 to category 3 to category 4 etc.?

Sorry for all the questions...just trying to wrap my mind around the category probability curve.

Thanks again. I'm learning a great deal from your posts!


Mike.Linacre: Good thinking, RM.

Your rules are along the right lines, but somewhat too rigid. Remember that these are humans trying to make sense of (usually poorly-defined) rating scale categories. These humans often have trouble aligning the categories with the stem of the question. How often in our day-to-day conversation do we say something like "I disagree strongly that education is ....".

1. Let's examine this:
"If observed average measures are very similar for two categories," - this implies that the sample of persons perceive these two categories to represent about the same amount of the latent variable.
"the andrich thresholds are disordered for the two categories" - which two categories? Disordered thresholds implies that at least one category has somewhat lower than desired category frequency. In your example the disordering is between categories 4 and 5, but the problem is between categories 5 and 6. Rating scale categories act as a set. So we have to think about them as a set.
Disordering relates to category frequency. If the thresholds are highly disordered, then a category probably has very low frequency. We may want to eliminate it on those grounds. But categories with ordered thresholds can have disordered average measures, so we may want to combine them for that reason.

2. Andrich threshold = point of equal probability of adjacent categories = where the category probability curves intersect on the probability graph.

3. Category probability is the key. Here is how it works:
The frequency of the categories in the data -> polytomous Rasch model -> the category probability curves (parameterized by the andrich thresholds) -> expected average measures

A polytomous Rasch model gives the relationship between the probability of observing adjacent categories for a person on an item

log (Pnij / Pni(j-1) ) = Bn - Di - Fj

where Pnij = probability that person n on item i is observed in category j
Pni(j-1) = probability that person n on item i is observed in category j-1
Bn = measure of person n
Di = difficulty of item i
Fj = Andrich threshold between categories j-1 and j
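
A minimal Python sketch of the model above, assuming the bottom category carries no threshold and using illustrative values (Bn = 0.5, Di = 0, thresholds -1, 0, +1). The function name is hypothetical. It computes the category probabilities and confirms that the adjacent-category log-odds equal Bn - Di - Fj.

```python
import math

def category_probabilities(b, d, thresholds):
    """Return P(category j), j = 0..m, for person measure b and item difficulty d.

    thresholds = [F1, ..., Fm] are the Andrich thresholds.
    """
    # Cumulative sums of (b - d - Fj) are the log-numerators of the categories.
    log_num = [0.0]
    for f in thresholds:
        log_num.append(log_num[-1] + (b - d - f))
    peak = max(log_num)  # subtract the maximum for numerical stability
    num = [math.exp(v - peak) for v in log_num]
    total = sum(num)
    return [v / total for v in num]

probs = category_probabilities(b=0.5, d=0.0, thresholds=[-1.0, 0.0, 1.0])
# log(Pnij / Pni(j-1)) for j = 1..3 should reproduce Bn - Di - Fj.
log_odds = [math.log(probs[j] / probs[j - 1]) for j in range(1, len(probs))]
print([round(p, 3) for p in probs])
print([round(v, 3) for v in log_odds])
```

When Bn - Di equals a threshold Fj, the two adjacent categories come out equally probable, which is the intersection of the category probability curves described in point 2.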

RaschModeler_2012: Dear Mike,

I cannot express in words how much I appreciate your taking the time to explain this to me. (I can't imagine how many times you've had to explain this very issue to others...).

At any rate, I think I have a much clearer understanding of the polytomous Rasch model. The equation was very illuminating.

I hope it's okay if I ask a few more questions. I completely understand if you do not have time to respond or if it takes you a while to write back...

1. The andrich threshold is literally the person trait level (logit value on the x-axis of the category probability curve) where there is an equal probability of endorsing adjacent categories. Is that correct?

2. Assuming the polytomous Rasch measurement model fits the data, regarding the category probability curve, one would expect to see a downward curve for "strongly disagree" until it intersects (first Andrich threshold) with the upward curve "disagree." As "strongly disagree" continues to decrease, "disagree" continues to climb until it hits a peak and then begins to decrease until it intersects (second Andrich threshold) with "slightly disagree." As "disagree" continues to decrease, "slightly disagree" continues to climb until it hits a peak and then begins to decrease until it intersects (third Andrich threshold) with "slightly agree", etc. If that sounds correct, is there a general rule of thumb when evaluating the curve as to what is considered a reasonable distance [along the trait level] between intersections or peaks? Is there a general rule of thumb regarding the peaks (the probability y-axis)?

3. Even if the category probability curve is not *clean* as described above, one may still decide that it's close enough after investigating Table 3.2, right?

4. In Table 3.2, what does the "category measure" literally mean?

5. Table 3.2 appears to be a summary across all items. If one wanted to examine each item, I know it's possible to look at the item probability curves via the Graph option. However, is there a way to look at the Andrich thresholds etc. for each item in a tabular form? I ask because I assume one could find that thresholds and/or categories are disordered for a particular item, which could lead to collapsing of categories and ultimately a partial credit model.

Apologies for all the questions, but the information you're sharing is very, very helpful!



Mike.Linacre: RM:

1. "The andrich threshold is literally the person trait level (logit value on the x-axis of the category probability curve) relative to the item difficulty where there is an equal probability of endorsing adjacent categories. Is that correct?"

Reply: yes.

2. You describe the situation when the Andrich thresholds are ordered.
General rules: See https://www.rasch.org/rn2.htm

3. Yes, and we may not expect (or want) a "clean" set of curves if some categories are intended to be transitional.

4. "Category measure". This is used in two ways:
a) The average of the person measures of those responding in the category
b) The measure that best corresponds to the category. Usually the measure at which the category has the highest probability of being observed.

5. "examine each item". This is the "partial credit" model. In Winsteps, "ISGROUPS=0".

RaschModeler_2012: Hi Mike,

Thank you so much for the help. I have a question (surprise, surprise!). :-)

You stated: The polytomous Rasch model gives the relationship between the probability of observing adjacent categories for a person on an item

log (Pnij / Pni(j-1) ) = Bn - Di - Fj

where Pnij = probability that person n on item i is observed in category j
Pni(j-1) = probability that person n on item i is observed in category j-1
Bn = measure of person n
Di = difficulty of item i
Fj = Andrich threshold between categories j-1 and j

Question: Is there a way to simulate data which conform to the model above using Winsteps? For example, could I simulate a sample size of N=1000 persons measured on 50 items with 4 ordered categories that are based on the equation above? Or does this type of simulation have to occur outside of Winsteps?



Mike.Linacre: RM: yes, you can simulate the data using Winsteps.
1) Use Excel to generate the measures for the 1000 persons.
1 (value)
2 (value)
1000 (value)
2) Use Excel to generate the measures for the 50 items.
1 (value)
2 (value)
50 (value)
3) choose the values for the 4 categories:
0 0
1 (value for first threshold)
2 (value for second threshold)
3 (value for third threshold)
4) give Winsteps some dummy data
1-1000 1-50 1 ; every response is "1" to make Winsteps function
5) Run Winsteps
There should be the same score for every person, with an anchored measure.
There should be the same score for every item, with an anchored measure.
Table 3.2: the thresholds should be anchored.

6) Output simulated data:
Output Files menu: SIFILE=
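
For readers who want to see what such a simulation does, here is a rough Python sketch of the same idea outside Winsteps. It is not the algorithm Winsteps uses for SIFILE=; it simply draws each response from the rating-scale model probabilities. The seed, the evenly spaced item difficulties and the function name are assumptions made for illustration.

```python
import math
import random

def simulate_response(b, d, thresholds, rng):
    """Draw one category (0..m) for person measure b on item difficulty d."""
    log_num = [0.0]
    for f in thresholds:
        log_num.append(log_num[-1] + (b - d - f))
    peak = max(log_num)
    weights = [math.exp(v - peak) for v in log_num]  # proportional to category probabilities
    return rng.choices(range(len(weights)), weights=weights)[0]

rng = random.Random(2012)                              # arbitrary seed, for repeatability
persons = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # 1000 person measures
items = [-2.0 + 4.0 * i / 49 for i in range(50)]      # 50 item difficulties, -2 to +2
thresholds = [-1.0, 0.0, 1.0]                          # 4 categories: 0, 1, 2, 3

data = [[simulate_response(b, d, thresholds, rng) for d in items] for b in persons]
print("".join(str(x) for x in data[0]))                # first simulated response string
```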

RaschModeler_2012: Hi Mike,

I'm sorry for the long file you're about to see, but when I try to run the code BELOW, I get a warning that there is a problem with item1. But before you see the code below, please note that I simply opened WINSTEPS, and told WINSTEPS that the notepad file with the code below is the "control file name" file.

Does this code need to be embedded in another file? Apologies if I'm making a silly mistake. This is the first time I've tried to simulate data using Winsteps.

Thanks again for your help.


p.s. I'm not sure why smiley faces are coming up on some of the lines in this post. I assure you that this is NOT happening in the notepad file to which I am referring.

Code to follow:

1 (-4.493)
2 (-3.869)
3 (-3.502)
4 (-3.259)
5 (-3.036)
6 (-2.951)
7 (-2.862)
8 (-2.859)
9 (-2.834)
10 (-2.824)
11 (-2.768)
12 (-2.729)
13 (-2.639)
14 (-2.633)
15 (-2.610)
16 (-2.570)
17 (-2.503)
18 (-2.473)
19 (-2.461)
20 (-2.429)
21 (-2.418)
22 (-2.416)
23 (-2.388)
24 (-2.387)
25 (-2.374)
26 (-2.373)
27 (-2.346)
28 (-2.341)
29 (-2.317)
30 (-2.291)
31 (-2.281)
32 (-2.275)
33 (-2.241)
34 (-2.234)
35 (-2.232)
36 (-2.226)
37 (-2.224)
38 (-2.215)
39 (-2.203)
40 (-2.203)
41 (-2.183)
42 (-2.180)
43 (-2.177)
44 (-2.175)
45 (-2.174)
46 (-2.173)
47 (-2.163)
48 (-2.157)
49 (-2.156)
50 (-2.154)
51 (-2.135)
52 (-2.133)
53 (-2.116)
54 (-2.112)
55 (-2.101)
56 (-2.089)
57 (-2.087)
58 (-2.057)
59 (-2.047)
60 (-2.042)
61 (-2.031)
62 (-2.006)
63 (-1.994)
64 (-1.961)
65 (-1.951)
66 (-1.948)
67 (-1.939)
68 (-1.931)
69 (-1.921)
70 (-1.914)
71 (-1.905)
72 (-1.896)
73 (-1.868)
74 (-1.865)
75 (-1.864)
76 (-1.856)
77 (-1.850)
78 (-1.848)
79 (-1.830)
80 (-1.809)
81 (-1.802)
82 (-1.792)
83 (-1.780)
84 (-1.766)
85 (-1.764)
86 (-1.756)
87 (-1.753)
88 (-1.739)
89 (-1.738)
90 (-1.737)
91 (-1.732)
92 (-1.726)
93 (-1.702)
94 (-1.690)
95 (-1.682)
96 (-1.671)
97 (-1.669)
98 (-1.669)
99 (-1.666)
100 (-1.659)
101 (-1.658)
102 (-1.655)
103 (-1.649)
104 (-1.645)
105 (-1.634)
106 (-1.608)
107 (-1.593)
108 (-1.591)
109 (-1.574)
110 (-1.571)
111 (-1.555)
112 (-1.550)
113 (-1.546)
114 (-1.541)
115 (-1.535)
116 (-1.523)
117 (-1.520)
118 (-1.514)
119 (-1.506)
120 (-1.486)
121 (-1.466)
122 (-1.459)
123 (-1.452)
124 (-1.451)
125 (-1.441)
126 (-1.435)
127 (-1.430)
128 (-1.422)
129 (-1.418)
130 (-1.409)
131 (-1.403)
132 (-1.397)
133 (-1.392)
134 (-1.389)
135 (-1.388)
136 (-1.387)
137 (-1.387)
138 (-1.373)
139 (-1.372)
140 (-1.370)
141 (-1.363)
142 (-1.360)
143 (-1.359)
144 (-1.355)
145 (-1.342)
146 (-1.335)
147 (-1.333)
148 (-1.329)
149 (-1.327)
150 (-1.323)
151 (-1.306)
152 (-1.295)
153 (-1.285)
154 (-1.285)
155 (-1.271)
156 (-1.269)
157 (-1.267)
158 (-1.266)
159 (-1.262)
160 (-1.262)
161 (-1.240)
162 (-1.238)
163 (-1.237)
164 (-1.231)
165 (-1.231)
166 (-1.229)
167 (-1.221)
168 (-1.206)
169 (-1.202)
170 (-1.194)
171 (-1.188)
172 (-1.183)
173 (-1.176)
174 (-1.166)
175 (-1.150)
176 (-1.150)
177 (-1.141)
178 (-1.140)
179 (-1.132)
180 (-1.127)
181 (-1.125)
182 (-1.122)
183 (-1.119)
184 (-1.113)
185 (-1.075)
186 (-1.075)
187 (-1.073)
188 (-1.071)
189 (-1.064)
190 (-1.060)
191 (-1.046)
192 (-1.045)
193 (-1.044)
194 (-1.037)
195 (-1.035)
196 (-1.031)
197 (-1.021)
198 (-1.019)
199 (-1.017)
200 (-1.014)
201 (-1.013)
202 (-1.008)
203 (-1.000)
204 (-.999)
205 (-.999)
206 (-.995)
207 (-.994)
208 (-.987)
209 (-.980)
210 (-.966)
211 (-.964)
212 (-.958)
213 (-.949)
214 (-.947)
215 (-.947)
216 (-.945)
217 (-.941)
218 (-.933)
219 (-.932)
220 (-.926)
221 (-.922)
222 (-.921)
223 (-.919)
224 (-.918)
225 (-.916)
226 (-.914)
227 (-.913)
228 (-.908)
229 (-.906)
230 (-.905)
231 (-.904)
232 (-.902)
233 (-.895)
234 (-.893)
235 (-.892)
236 (-.889)
237 (-.887)
238 (-.884)
239 (-.880)
240 (-.878)
241 (-.876)
242 (-.872)
243 (-.872)
244 (-.871)
245 (-.871)
246 (-.862)
247 (-.859)
248 (-.854)
249 (-.851)
250 (-.849)
251 (-.848)
252 (-.840)
253 (-.836)
254 (-.831)
255 (-.830)
256 (-.827)
257 (-.827)
258 (-.817)
259 (-.812)
260 (-.794)
261 (-.791)
262 (-.790)
263 (-.780)
264 (-.776)
265 (-.774)
266 (-.766)
267 (-.764)
268 (-.763)
269 (-.752)
270 (-.751)
271 (-.751)
272 (-.746)
273 (-.743)
274 (-.743)
275 (-.742)
276 (-.740)
277 (-.729)
278 (-.725)
279 (-.723)
280 (-.719)
281 (-.716)
282 (-.714)
283 (-.713)
284 (-.713)
285 (-.713)
286 (-.710)
287 (-.708)
288 (-.705)
289 (-.691)
290 (-.688)
291 (-.684)
292 (-.681)
293 (-.680)
294 (-.678)
295 (-.677)
296 (-.677)
297 (-.673)
298 (-.672)
299 (-.666)
300 (-.664)
301 (-.663)
302 (-.661)
303 (-.659)
304 (-.655)
305 (-.648)
306 (-.645)
307 (-.639)
308 (-.637)
309 (-.637)
310 (-.636)
311 (-.635)
312 (-.630)
313 (-.627)
314 (-.625)
315 (-.623)
316 (-.620)
317 (-.618)
318 (-.614)
319 (-.609)
320 (-.606)
321 (-.605)
322 (-.604)
323 (-.604)
324 (-.596)
325 (-.593)
326 (-.590)
327 (-.590)
328 (-.585)
329 (-.576)
330 (-.574)
331 (-.569)
332 (-.568)
333 (-.566)
334 (-.563)
335 (-.560)
336 (-.557)
337 (-.556)
338 (-.554)
339 (-.548)
340 (-.547)
341 (-.547)
342 (-.547)
343 (-.544)
344 (-.541)
345 (-.540)
346 (-.538)
347 (-.538)
348 (-.530)
349 (-.529)
350 (-.524)
351 (-.522)
352 (-.522)
353 (-.518)
354 (-.507)
355 (-.499)
356 (-.498)
357 (-.495)
358 (-.493)
359 (-.493)
360 (-.489)
361 (-.488)
362 (-.483)
363 (-.475)
364 (-.471)
365 (-.464)
366 (-.460)
367 (-.457)
368 (-.456)
369 (-.454)
370 (-.452)
371 (-.452)
372 (-.451)
373 (-.449)
374 (-.445)
375 (-.444)
376 (-.441)
377 (-.439)
378 (-.434)
379 (-.433)
380 (-.428)
381 (-.425)
382 (-.425)
383 (-.424)
384 (-.422)
385 (-.421)
386 (-.412)
387 (-.411)
388 (-.409)
389 (-.401)
390 (-.397)
391 (-.396)
392 (-.391)
393 (-.391)
394 (-.390)
395 (-.378)
396 (-.364)
397 (-.358)
398 (-.357)
399 (-.357)
400 (-.356)
401 (-.355)
402 (-.351)
403 (-.339)
404 (-.338)
405 (-.337)
406 (-.334)
407 (-.330)
408 (-.330)
409 (-.323)
410 (-.323)
411 (-.321)
412 (-.319)
413 (-.306)
414 (-.305)
415 (-.301)
416 (-.300)
417 (-.298)
418 (-.296)
419 (-.296)
420 (-.295)
421 (-.292)
422 (-.289)
423 (-.289)
424 (-.288)
425 (-.286)
426 (-.283)
427 (-.283)
428 (-.282)
429 (-.276)
430 (-.275)
431 (-.271)
432 (-.270)
433 (-.266)
434 (-.263)
435 (-.258)
436 (-.257)
437 (-.256)
438 (-.252)
439 (-.252)
440 (-.249)
441 (-.239)
442 (-.237)
443 (-.237)
444 (-.231)
445 (-.229)
446 (-.228)
447 (-.228)
448 (-.226)
449 (-.220)
450 (-.216)
451 (-.215)
452 (-.208)
453 (-.204)
454 (-.204)
455 (-.198)
456 (-.193)
457 (-.186)
458 (-.186)
459 (-.184)
460 (-.184)
461 (-.184)
462 (-.183)
463 (-.181)
464 (-.176)
465 (-.173)
466 (-.173)
467 (-.155)
468 (-.153)
469 (-.149)
470 (-.147)
471 (-.144)
472 (-.135)
473 (-.128)
474 (-.124)
475 (-.123)
476 (-.122)
477 (-.120)
478 (-.109)
479 (-.106)
480 (-.101)
481 (-.099)
482 (-.099)
483 (-.098)
484 (-.098)
485 (-.097)
486 (-.095)
487 (-.088)
488 (-.088)
489 (-.086)
490 (-.086)
491 (-.082)
492 (-.079)
493 (-.076)
494 (-.074)
495 (-.071)
496 (-.069)
497 (-.068)
498 (-.065)
499 (-.062)
500 (-.059)
501 (-.051)
502 (-.050)
503 (-.045)
504 (-.045)
505 (-.040)
506 (-.038)
507 (-.035)
508 (-.032)
509 (-.032)
510 (-.031)
511 (-.030)
512 (-.027)
513 (-.026)
514 (-.022)
515 (-.022)
516 (-.021)
517 (-.010)
518 (-.006)
519 (-.004)
520 (-.003)
521 (-.002)
522 (-.002)
523 (-.001)
524 (.007)
525 (.008)
526 (.008)
527 (.008)
528 (.010)
529 (.011)
530 (.012)
531 (.014)
532 (.014)
533 (.016)
534 (.016)
535 (.019)
536 (.024)
537 (.026)
538 (.029)
539 (.031)
540 (.032)
541 (.041)
542 (.042)
543 (.043)
544 (.050)
545 (.052)
546 (.054)
547 (.066)
548 (.069)
549 (.071)
550 (.074)
551 (.076)
552 (.077)
553 (.085)
554 (.088)
555 (.090)
556 (.092)
557 (.092)
558 (.098)
559 (.099)
560 (.101)
561 (.103)
562 (.110)
563 (.121)
564 (.137)
565 (.137)
566 (.141)
567 (.150)
568 (.155)
569 (.155)
570 (.155)
571 (.158)
572 (.158)
573 (.159)
574 (.162)
575 (.163)
576 (.165)
577 (.167)
578 (.177)
579 (.182)
580 (.189)
581 (.190)
582 (.194)
583 (.197)
584 (.200)
585 (.202)
586 (.218)
587 (.227)
588 (.230)
589 (.231)
590 (.235)
591 (.239)
592 (.239)
593 (.239)
594 (.242)
595 (.248)
596 (.254)
597 (.265)
598 (.265)
599 (.271)
600 (.271)
601 (.273)
602 (.274)
603 (.276)
604 (.276)
605 (.277)
606 (.281)
607 (.283)
608 (.287)
609 (.291)
610 (.296)
611 (.296)
612 (.299)
613 (.300)
614 (.301)
616 (.313)
616 (.313)
617 (.316)
618 (.318)
619 (.322)
620 (.323)
621 (.328)
622 (.334)
623 (.342)
624 (.347)
625 (.350)
626 (.351)
627 (.351)
628 (.365)
629 (.367)
630 (.372)
631 (.373)
632 (.381)
633 (.382)
634 (.392)
635 (.393)
636 (.402)
637 (.404)
638 (.404)
639 (.412)
640 (.414)
641 (.415)
642 (.421)
643 (.422)
644 (.425)
645 (.426)
646 (.427)
647 (.434)
648 (.446)
649 (.451)
650 (.452)
651 (.453)
652 (.462)
653 (.466)
654 (.467)
655 (.470)
656 (.473)
657 (.482)
658 (.484)
659 (.485)
660 (.493)
661 (.503)
662 (.505)
663 (.506)
664 (.510)
665 (.511)
666 (.514)
667 (.516)
668 (.523)
669 (.525)
670 (.533)
671 (.537)
672 (.539)
673 (.542)
674 (.542)
675 (.548)
676 (.552)
677 (.553)
678 (.553)
679 (.555)
680 (.565)
681 (.578)
682 (.582)
683 (.583)
684 (.585)
685 (.587)
686 (.587)
687 (.588)
688 (.590)
689 (.597)
690 (.599)
691 (.600)
692 (.601)
693 (.601)
694 (.603)
695 (.604)
696 (.606)
697 (.615)
698 (.623)
699 (.628)
700 (.632)
701 (.635)
702 (.642)
703 (.645)
704 (.668)
705 (.671)
706 (.671)
707 (.674)
708 (.675)
709 (.676)
710 (.677)
711 (.679)
712 (.680)
713 (.680)
714 (.682)
715 (.687)
716 (.695)
717 (.699)
718 (.701)
719 (.706)
720 (.707)
721 (.709)
722 (.723)
723 (.726)
724 (.732)
725 (.735)
726 (.737)
727 (.740)
728 (.742)
729 (.749)
730 (.753)
731 (.753)
732 (.764)
733 (.770)
734 (.770)
735 (.771)
736 (.781)
737 (.788)
738 (.793)
739 (.794)
740 (.797)
741 (.803)
742 (.805)
743 (.807)
744 (.811)
745 (.813)
746 (.821)
747 (.822)
748 (.829)
749 (.834)
750 (.843)
751 (.855)
752 (.856)
753 (.863)
754 (.864)
755 (.879)
756 (.890)
757 (.897)
758 (.900)
759 (.904)
760 (.909)
761 (.911)
762 (.911)
763 (.920)
764 (.935)
765 (.936)
766 (.940)
767 (.941)
768 (.942)
769 (.947)
770 (.951)
771 (.952)
772 (.952)
773 (.953)
774 (.953)
775 (.957)
776 (.959)
777 (.972)
778 (.973)
779 (.978)
780 (.981)
781 (.981)
782 (.986)
783 (.990)
784 (.993)
785 (.993)
786 (.996)
787 (.998)
788 (1.007)
789 (1.007)
790 (1.011)
791 (1.017)
792 (1.023)
793 (1.038)
794 (1.040)
795 (1.045)
796 (1.058)
797 (1.058)
798 (1.065)
799 (1.067)
800 (1.067)
801 (1.068)
802 (1.070)
803 (1.070)
804 (1.071)
805 (1.071)
806 (1.075)
807 (1.083)
808 (1.084)
809 (1.089)
810 (1.090)
811 (1.096)
812 (1.107)
813 (1.108)
814 (1.112)
815 (1.118)
816 (1.119)
817 (1.125)
818 (1.129)
819 (1.129)
820 (1.130)
821 (1.131)
822 (1.132)
823 (1.133)
824 (1.149)
825 (1.160)
826 (1.161)
827 (1.166)
828 (1.173)
829 (1.177)
830 (1.179)
831 (1.182)
832 (1.190)
833 (1.193)
834 (1.200)
835 (1.205)
836 (1.213)
837 (1.216)
838 (1.223)
839 (1.229)
840 (1.238)
841 (1.241)
842 (1.250)
843 (1.251)
844 (1.251)
845 (1.254)
846 (1.259)
847 (1.262)
848 (1.267)
849 (1.285)
850 (1.287)
851 (1.325)
852 (1.325)
853 (1.334)
854 (1.344)
855 (1.351)
856 (1.352)
857 (1.370)
858 (1.394)
859 (1.397)
860 (1.398)
861 (1.402)
862 (1.403)
863 (1.418)
864 (1.419)
865 (1.419)
866 (1.429)
867 (1.449)
868 (1.451)
869 (1.453)
870 (1.456)
871 (1.459)
872 (1.466)
873 (1.468)
874 (1.468)
875 (1.477)
876 (1.480)
877 (1.496)
878 (1.503)
879 (1.509)
880 (1.512)
881 (1.522)
882 (1.526)
883 (1.540)
884 (1.558)
885 (1.563)
886 (1.570)
887 (1.577)
888 (1.587)
889 (1.598)
890 (1.613)
891 (1.618)
892 (1.624)
893 (1.632)
894 (1.641)
895 (1.644)
896 (1.655)
897 (1.663)
898 (1.663)
899 (1.664)
900 (1.673)
901 (1.679)
902 (1.683)
903 (1.714)
904 (1.720)
905 (1.721)
906 (1.728)
907 (1.737)
908 (1.738)
909 (1.739)
910 (1.741)
911 (1.748)
912 (1.754)
913 (1.768)
914 (1.769)
915 (1.770)
916 (1.771)
917 (1.786)
918 (1.788)
919 (1.790)
920 (1.791)
921 (1.795)
922 (1.801)
923 (1.803)
924 (1.806)
925 (1.807)
926 (1.816)
927 (1.822)
928 (1.830)
929 (1.831)
930 (1.834)
931 (1.837)
932 (1.867)
933 (1.872)
934 (1.879)
935 (1.882)
936 (1.885)
937 (1.885)
938 (1.889)
939 (1.897)
940 (1.916)
941 (1.935)
942 (1.945)
943 (1.947)
944 (1.975)
945 (2.001)
946 (2.008)
947 (2.010)
948 (2.032)
949 (2.033)
950 (2.038)
951 (2.040)
952 (2.058)
953 (2.064)
954 (2.069)
955 (2.081)
956 (2.106)
957 (2.111)
958 (2.117)
959 (2.124)
960 (2.139)
961 (2.139)
962 (2.175)
963 (2.196)
964 (2.209)
965 (2.244)
966 (2.245)
967 (2.256)
968 (2.262)
969 (2.270)
970 (2.293)
971 (2.306)
972 (2.330)
973 (2.354)
974 (2.367)
975 (2.390)
976 (2.408)
977 (2.418)
978 (2.434)
979 (2.463)
980 (2.467)
981 (2.498)
982 (2.530)
983 (2.564)
984 (2.598)
985 (2.600)
986 (2.649)
987 (2.678)
988 (2.703)
989 (2.705)
990 (2.754)
991 (2.791)
992 (2.813)
993 (2.956)
994 (3.042)
995 (3.106)
996 (3.111)
997 (3.365)
998 (3.644)
999 (3.888)
1000 (4.015)

1 (-4.60)
2 (-4.18)
3 (-3.89)
4 (-3.66)
5 (-3.48)
6 (-3.32)
7 (-2.94)
8 (-2.59)
9 (-2.44)
10 (-2.20)
11 (-1.95)
12 (-1.73)
13 (-1.55)
14 (-1.39)
15 (-1.24)
16 (-1.10)
17 (-.97)
18 (-.85)
19 (-.73)
20 (-.62)
21 (-.51)
22 (-.41)
23 (-.30)
24 (-.20)
25 (-.10)
26 (.10)
27 (.20)
28 (.30)
29 (.41)
30 (.51)
31 (.62)
32 (.73)
33 (.85)
34 (.97)
35 (1.10)
36 (1.24)
37 (1.39)
38 (1.55)
39 (1.73)
40 (1.95)
41 (2.20)
42 (2.44)
43 (2.59)
44 (2.94)
45 (3.32)
46 (3.48)
47 (3.66)
48 (3.89)
49 (4.18)
50 (4.60)

0 0
1 (-1.00)
2 (0.00)
3 (1.00)

1-1000 1-50 1


Mike.Linacre: Good so far, RM.

1) Please remove "(" and ")". They are not needed. See www.winsteps.com/winman/iafile.htm (also in Winsteps Help)

2) In your list, person 616 is there twice. Person 615 is missing

3) Yes, please embed these instructions in a valid Winsteps control file, something like:
TITLE="My simulation"
ITEM1 = 1
NI = 50
NAME1 = 52
CODES = 0123

<- Your instructions here


4) Omit SIFILE=. Do this from the Output Files pull-down menu, where there are lots of simulation options.

Your simulated data will look something like:

01211000010000000000000000000000000000000000000000 -4.4930 1
12121300100010000000000000000200000000000000000000 -3.8690 2
23221220200000000001100001100000000000000000000000 -3.5020 3
33333333333333333333333333333333333333332212232020 3.8800 999
33333333333333333333333333333333332333222332123213 4.0150 1000
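
Points 1) and 2) above can be checked mechanically before running Winsteps. The following Python sketch is only an illustration: the function names are hypothetical and the four sample lines are copied from the list posted earlier. It strips the parentheses and reports duplicated and missing entry numbers.

```python
from collections import Counter

def parse_anchor_lines(lines):
    """Turn lines such as '614 (.301)' into (entry_number, value) pairs."""
    pairs = []
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        entry_text, value_text = line.split(None, 1)
        pairs.append((int(entry_text), float(value_text.strip().strip("()"))))
    return pairs

def check_entries(pairs, first, last):
    """Return (duplicated entry numbers, missing entry numbers) in first..last."""
    counts = Counter(entry for entry, _ in pairs)
    duplicates = sorted(e for e, n in counts.items() if n > 1)
    missing = [e for e in range(first, last + 1) if e not in counts]
    return duplicates, missing

sample = ["614 (.301)", "616 (.313)", "616 (.313)", "617 (.316)"]
pairs = parse_anchor_lines(sample)
duplicates, missing = check_entries(pairs, 614, 617)
cleaned = [f"{entry} {value:.3f}" for entry, value in pairs]  # parentheses removed
print(duplicates, missing)
print("\n".join(cleaned))
```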

RaschModeler_2012: Mike,

Brilliant! It worked!

Thank you so much for taking the time to answer all of my questions and for providing detailed instructions on the data simulation.

Much appreciated!


RaschModeler_2012: Hi Mike,

I'm really sorry for continuing to ask questions.

1. I'm trying to wrap my head around the difference between the "average measure" and the "category measure" reported in Table 3.2.

From our conversations and research, I gather the following:
a. The "average measure" for a category is the average ability of the people who respond in that category
b. The "Category measure" is the average of the person measures of those responding in the category

These definitions sound identical to me?! I can't decipher the difference. Could you please help clarify?

2. I have seen others report the following graph when discussing the viability of the Rasch rating scale (note that this is based on the simulated data):

TABLE 2.5 simulated_data.sav ZOU738WS.TXT Nov 8 19:32 2012

-5 -4 -3 -2 -1 0 1 2 3 4
|-----+-----+-----+-----+-----+-----+-----+-----+-----| NUM ITEM
| 0 1 2 | 50 I50
| 0 1 2 | 49 I49
| 0 1 2 | 48 I48
| 0 1 2 3 47 I47
| 0 1 32 | 46 I46
| 0 1 2 3 | 45 I45
| 0 1 2 3 | 44 I44
| 0 1 2 3 | 43 I43
| 0 1 2 3 | 42 I42
| 0 1 2 3 | 41 I41
| 0 1 2 3 | 40 I40
| 0 1 2 3 | 39 I39
| 0 1 2 3 | 38 I38
| 0 1 2 3 | 37 I37
| 0 1 2 3 | 36 I36
| 0 1 2 3 | 35 I35
| 0 1 2 3 | 34 I34
| 0 1 2 3 | 33 I33
| 0 1 2 3 | 32 I32
| 0 1 2 3 | 31 I31
| 0 1 2 3 | 30 I30
| 0 1 2 3 | 29 I29
| 0 1 2 3 | 28 I28
| 0 1 2 3 | 27 I27
| 0 1 2 3 | 26 I26
| 0 1 2 3 | 24 I24
| 0 1 2 3 | 25 I25
| 0 1 2 3 | 23 I23
| 0 1 2 3 | 22 I22
| 0 1 2 3 | 21 I21
| 0 1 2 3 | 20 I20
| 0 1 2 3 | 19 I19
| 0 1 2 3 | 18 I18
| 0 1 2 3 | 17 I17
| 0 1 2 3 | 16 I16
| 0 1 2 3 | 15 I15
| 0 1 2 3 | 14 I14
| 0 1 2 3 | 13 I13
| 0 1 2 3 | 12 I12
| 0 1 2 3 | 11 I11
| 0 1 2 3 | 10 I10
| 0 1 2 3 | 9 I9
| 0 1 2 3 | 8 I8
| 0 1 2 3 | 7 I7
| 0 1 2 3 | 5 I5
| 0 1 2 3 | 6 I6
| 0 1 2 3 | 4 I4
| 0 1 2 3 | 3 I3
| 0 1 2 3 | 2 I2
| 1 2 3 | 1 I1
|-----+-----+-----+-----+-----+-----+-----+-----+-----| NUM ITEM
-5 -4 -3 -2 -1 0 1 2 3 4

11 12222563574544362322221 2
1 1 1 134679037789161445382730079975499489266 11111 PERSON
0 10 20 30 50 60 80 90 99 PERCENTILE

Would you mind helping me interpret this graph?

Let me see if I understand this graph. Winsteps calculates the average ability of the people who respond in that category for each item. One would hope to see the average abilities are ordered appropriately for each item.

**From what I've read, I think the rating Rasch scale assumes that the distance between the average abilities between categories would be the same across items. Is that correct?

Finally, Table 3.2 essentially averages across all items to obtain an average ability for each category. Am I way off here? Are there other aspects to this graph that I should consider?

I realize I've asked you many questions. I really am trying to do as much research as possible on my own.

With that said, feel free to disregard this email if you're too busy. I do not want to take up much of your time.

Thank you,


Mike.Linacre: RM, congratulations on your success with the simulation :-)

1. a. = observed average measure = shown in your Table 2.5
b. (from my answer above)
b) The measure that best corresponds to the category. Usually the measure at which the category has the highest probability of being observed.
These are shown by the category numbers in Winsteps Table 2.2

2. The graph: Table 2.5
"Let me see if I understand this graph. Winsteps calculates the average ability of the people who respond in that category for each item. One would hope to see the average abilities are ordered appropriately for each item. " - Yes, that is correct.

"**From what I've read, I think the rating Rasch scale assumes that the distance between the average abilities between categories would be the same across items. Is that correct? " - No, there is no such assumption. In fact, it is highly unlikely. Look at your simulated data Table 2.5 to see the pattern that the Rasch model expects.

"Finally, Table 3.2 essentially averages across all items to obtain an average ability for each category.?" - Correct.

"Are there other aspects to this graph that I should consider?" -
Does the person ability distribution (bottom of Table 2.5) look reasonable? (approximately normal)
Does the item difficulty hierarchy look reasonable? (look at the item numbers)
Does the advance up the categories for each item look reasonable? Oops! Item 46. Is the "3" merely a random accident or a systematic problem? Answer: with simulated data, we don't believe one dataset too much. We need at least 10 datasets. Maybe 100 or 1000. If we see a prevalent pattern, then we believe it.

uve: Mike,

My apologies for interjecting here, but I have been following the thread and have some additional questions:

1a) When generating calibrations for persons and items in Excel, what would dictate these values? 1b) Would you recommend a random generation? 1c) I guess a lot depends on the purpose, but is there a general methodology for how I choose the values?

2) You stated the observed average measures can be seen in Table 2.5. But I understand this to be (sum(Bn))/N whereas the observed average measure in Table 3.2 is (sum(Bn-Di))/N. In Table 3.2 are we still making the same interpretation which we make in Table 2.5, which is the average of observed measures in each category (but for Table 3.2 across all items)?

3a) I am trying to reconcile EDFILE and SIFILE. I'm guessing that we are asking Winsteps to work backwards from item and person calibrations we chose and to produce responses accordingly. But what then is the purpose in the EDFILE of having all persons scoring 1 on all the items? 3b) Why would we need to run the SIFILE if the EDFILE is generating the values?

Mike.Linacre: Thank you for your post to the Forum, Uve.

1. Generating values. Either: (a) match previous empirical results, or (b) choose reasonable values, such as N(0,1) logits for person abilities, U(-2,+2) for items, and an Andrich threshold advance of 1 logit.

2) Your remark is correct. The choice of Table 2.5 or Table 3.2 depends on whether we are making item-level or rating-scale-level decisions. If we specified the Partial Credit model (ISGROUPS=0) then Table 2.5 and 3.2ff. report the same information.

3) EDFILE= is merely to generate a trivial dataset that will enable Winsteps to function. In the example above, every observation in the trivial dataset is "1". These "1"s do not influence the simulated values. The simulated values are based on PAFILE=, IAFILE= and SAFILE=. Instead of EDFILE= we can use any dataset of the desired size (persons and items).
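
As an alternative to Excel for option (b) in point 1, the generating values could be produced with a few lines of Python. This is a sketch under stated assumptions: the seed, the sorting, the three-decimal formatting and the helper name are illustrative choices, and the output is just "entry-number value" lines without parentheses.

```python
import random

rng = random.Random(316)  # arbitrary seed, so the values can be regenerated

# Person abilities ~ N(0,1) logits; item difficulties ~ U(-2,+2) logits.
person_measures = sorted(rng.gauss(0.0, 1.0) for _ in range(1000))
item_measures = sorted(rng.uniform(-2.0, 2.0) for _ in range(50))

# Andrich thresholds advancing by 1 logit and centred on zero: -1, 0, +1.
n_categories = 4
thresholds = [j - n_categories / 2 for j in range(1, n_categories)]

def anchor_lines(values):
    """Format values as 'entry-number value' lines, numbered from 1."""
    return [f"{i} {v:.3f}" for i, v in enumerate(values, start=1)]

person_lines = anchor_lines(person_measures)
item_lines = anchor_lines(item_measures)
# The bottom category (0) is listed with a placeholder value of 0.
threshold_lines = ["0 0"] + [f"{j} {f:.2f}" for j, f in enumerate(thresholds, start=1)]
print("\n".join(threshold_lines))
```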

316. Tagging Categories of Missing Data

uve November 10th, 2012, 10:45pm: Mike,

I just made my first attempt at running a concurrent equating of 4 MC exams totaling 170 items. However, I realized that many students did not take all 4 exams. I think I should code missing items for a test which the student took as incorrect and missing items for a test which the student did not take as skipped so the items and persons are calibrated based only on the exams taken, but I'm not sure if this is the best method. If it is, how would I tell Winsteps that missing items are to be handled differently depending on the test? For example, if the first exam is 30 items and I see two dashes (represents skipped in my file), then those two items are scored incorrect (missing =0). Then if the next 50 items are all dashes, then I know this second exam of 50 items was not taken and those dashes need to be recoded somehow as a different character and treated as skipped (missing= -1). I hope I'm on the right track.

Mike.Linacre: Uve, if you want skipped=wrong, then the data files need different codes for "skipped" and "not administered". The "skipped" code is included in CODES= but the "not administered" code is not.

uve: Mike,

These are 4 distracter (ABCD) MC exams and there are a total of 4 exams that I'm calibrating together at one time. Because it was hard to tell if I was aligning these exams in the proper way, I converted the blanks to dashes. I can go back into the file and convert the dashes that correspond to items of which an entire test was not taken into a 5th different character, say E. What I'm not clear on then is what coding to put in the control file that will tell Winsteps that this new character should be scored not administered and that any remaining dashes correspond to skipped items that should be scored incorrect.

Mike.Linacre: Uve,
codes included in CODES= are scored "1" if they match the KEY1= and 0 otherwise,
also, assuming you are using the default,
then all codes not in CODES= are scored "not administered",
so if "-" means "deliberately skipped" and "E" means not administered,
CODES="ABCD-" ; all "-" will be scored 0
MISSING-SCORED = -1 ; this will score "E" as "not administered"
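
The scoring rule described above can be mimicked in a few lines of Python to check a data layout before analysis. This sketch is not Winsteps code: the function name, the use of None for "not administered", and the sample responses and keys are assumptions for illustration.

```python
def score_response(code, key, codes="ABCD-"):
    """Return 1 (correct), 0 (wrong or skipped) or None (not administered)."""
    if code not in codes:
        return None                   # e.g. "E": not in the valid codes
    return 1 if code == key else 0    # "-" never matches a key, so it scores 0

responses = "AB-DEEE"   # one skipped item ("-"), then a test not taken ("E")
keys = "ABCDABC"        # hypothetical answer key
scored = [score_response(c, k) for c, k in zip(responses, keys)]

administered = [s for s in scored if s is not None]
print(scored, sum(administered), len(administered))
```

Only the administered items enter the person's raw score and count, which is what allows persons to be measured on just the exams they took.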

uve: Thanks! :)

317. 1PL-IRT equals Rasch or not?

dachengruoque November 3rd, 2012, 2:09pm: Today I bumped into a LTEST-L thread; it said that Rasch is different from 1PL-IRT, which differs from what I read in extant references. Is that the case? Thanks a lot!
The differences are as follows,
"Jim, we often hear that the logistic Rasch model, developed by Georg Rasch, is the same as the 1PL model, developed by Allan Birnbaum. Indeed, they are not, although they have a great deal in common. For example,

1) the Rasch model would need a minimum sample size of around 30, whereas the 1PL would require a far larger sample of at least 200 people for a reliable parameter estimation.

2) Terminology: We say the data does/not fit the Rasch model (deterministic), whereas we say 1PL does/not fit the data (descriptive).

3) Units: the Rasch model logits; whereas 1PL: probits.

4) Although both fix the slope of their item characteristic curves (ICCs), the slope in R model is unity (1), but 1.7 in 1PL.

5) The R model is not based around the normality assumption.

6) Indeed, Wright argued that the Rasch model has two parameters: item difficulty and person ability. So, viewing it as a 1PL is erroneous. (Here is a link to Wright's debate with Hambleton: https://www.rasch.org/rmt/rmt61a.htm ; you will see he does not feel comfortable with the assumption that 1PL = Rasch model.).

7) Item person map in Rasch model analysis, original with Wright, is a helpful visual aid for researchers who wish to graphically present the distribution of persons and items. IRT models do not provide this map as I explain briefly before.

Next, 2PL and 3PL: the former adds the slope parameter (discriminations) and letter adds both slope and lower asymptote (guessing) parameters (4-PL and 5-PL models have recently been proposed, too). I believe adding these parameters will make us, as practitioners, a bit confused. For example, if two items have different slope parameters (i.e., discriminate test takers differently), their ICCs will overlap. The overlap renders the interpretation of item difficulty difficult. And with a guessing parameter (e.g., if a 3PL model fits), you would presume that all test takers guessed, whereas as practitioners we know that it does not seem to be the casea 3PL model considers a penalty for all testees.

How do we find the guesser? How do we decide whether the test was discriminatory? The R model flags the persons that are prone to guessing. By examining the fit statistics of the model, one can learn about guessing in data; by looking at the Rasch reliability and separation indices as well as the item-person map, one can learn about discrimination and sources of measurement error, etc.

Sorry for such a lengthy note." (by Vahid ARYADOUST with NUS)

Mike.Linacre: Thanks for this post, dachengruoque (corrected).

Some comments:

1) IRT-1PL estimation often uses Rasch estimation procedures, so the sample sizes would be the same. If IRT-1PL uses numerical quadrature, then the bigger sample size is needed.

2) Yes. IRT is descriptive. Rasch is prescriptive.

3) Yes. IRT is using the 1PL model as a substitute for the Normal Ogive model.

4) Yes. This is how (3) is accomplished.

5) So R uses Rasch estimation.

6) Yes. 1PL always means "one item parameter". In IRT individual persons are not parameterized. They are summarized as a distribution.

7) Yes. In general a person-item map does not make sense for IRT, because the perceived item-difficulty hierarchy differs at different ability levels.

8) Yes. Discriminations and asymptotes make interpreting person-item interactions more difficult, but their purpose is to describe the data better.

9) In general, Rasch analysts are much more interested in the details of item and person fit than are IRT analysts.

Vahid, for more comparisons of the Rasch dichotomous model and 1PL-IRT, see https://www.rasch.org/rmt/rmt193h.htm
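
Comments 3) and 4) can be checked numerically. The following is a minimal Python sketch, assuming an item difficulty of 0 and a grid from -4 to +4 (both chosen only for illustration): a logistic curve with slope 1.7 stays close to the normal ogive, which is why the 1.7 constant lets a logistic model report in approximate probits, while the Rasch curve (slope 1) reports in logits.

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch dichotomous model: slope fixed at 1, measures in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def scaled_logistic_probability(ability, difficulty, scale=1.7):
    """Logistic curve with slope 1.7, so measures are roughly in probits."""
    return 1.0 / (1.0 + math.exp(-scale * (ability - difficulty)))

def normal_ogive_probability(ability, difficulty):
    """Cumulative normal (normal ogive) curve, measures in probits."""
    return 0.5 * (1.0 + math.erf((ability - difficulty) / math.sqrt(2.0)))

# Illustrative grid of abilities from -4 to +4 in steps of 0.01.
grid = [x / 100.0 for x in range(-400, 401)]
max_gap = max(
    abs(scaled_logistic_probability(x, 0.0) - normal_ogive_probability(x, 0.0))
    for x in grid
)
print(f"largest gap between 1.7-logistic and normal ogive: {max_gap:.4f}")

# The same curve on the two scales: 1 probit is about 1.7 logits.
print(rasch_probability(1.7, 0.0), scaled_logistic_probability(1.0, 0.0))
```

The gap stays below about 0.01 across the grid, so the two curves are practically indistinguishable; only the unit of measurement changes.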

dachengruoque: Thanks a lot, Dr Linacre!

I am not Vahid. I came across a thread on the LTEST-L discussion list in which Vahid, who wrote the quoted passage, compared and laid out all the differences mentioned in the first post.

318. Bias interactions

NaAlO October 30th, 2012, 2:21am: Hi Mike, I have another problem, this time with interpreting the rater-criteria interaction. I want to see the number of significant bias terms for categories shown by each rater. I attached the rater-category bias interaction report. Table 9 says "There are empirically 50 Bias terms"; however, when I read 13.4.1, the number is not the same. And what is the pairwise report about? I am confused by those graphs.
Many thanks for your help in advance.

Mike.Linacre: NaAlO, if you want to see all 50 bias/interaction terms, please change the default selection criteria to Zscore= 0,0, which is explained at https://www.winsteps.com/facetman/index.htm?zscore.htm

My best explanation of the pairwise table is at https://www.winsteps.com/facetman/index.htm?table14.htm

Which graphs are confusing you?

NaAlO: Well, Mike, I tried again. I selected Ascending, Element number order, and set Bias reportable size to 1 and Bias reportable significance to 2 to run the interaction analysis. I attached the report file. I have difficulty interpreting the tables. For example, are only those with significant bias presented in Table 13.3.1? There is no bias report about R10; does that mean that R10 has no significant bias interaction with the five categories? In the criteria column, there are 7 items about the coherence category; does that mean that raters have difficulty with this category? Thank you for your patience and help ~

Mike.Linacre: NaAlO, please set Zscore=0,0 (size=0, significance=0) so that you can see all the bias/interaction terms. Then look at the explanation for Table 13 in Facets Help, also at https://www.winsteps.com/facetman/index.htm?table13.htm

NaAlO: Thank you very much. Professor Linacre, I tried what you suggested, and i got all the 50 bias items. Thanks!

319. Good fit with lucky guessing?

uve October 29th, 2012, 9:31pm: Mike,

Attached is an oddity I can't quite wrap my mind around. This was a diagnostic math exam of 50 items given to about 36 7th graders. The items are sorted by fit with only the top 4 shown. Item 11 is very odd. The fit stats are within reasonable ranges and even the item discrimination index isn't too bad. But the lower asymptote is .83 which I interpret as an item for which it is very easy to guess. It would seem to me that we would get much worse fit stats with a lower asymptote this high. Another oddity is the fact the point-biserial is .04, yet the item discrimination index is .73. Those two data points seem to be at odds with each other. Any suggestions you might have would be most helpful. Thanks again as always.

Mike.Linacre: That item is unusual, Uve, but its statistics do make sense. 30 students succeeded on it, and 6 failed. So the item is easy for this sample. The mean-squares are on the noisy side of 1.0, indicating some unexpected behavior by the students. What is it? The lower asymptote of .83 indicates that there is unexpected success by low performers. The point-biserial of .04 indicates that these unexpected successes are flattening out the empirical item characteristic curve. The estimated discrimination (0.71) is computed using responses close to the item difficulty. The responses in the lower asymptote do not enter the computation.

Suggestion: In Winsteps, "graphs" menu, "empirical ICC".

uve: Mike,

Thanks for the clarifications, but now there's something else that is newly confusing :-/

I've attached the graph of item 11, but it certainly doesn't look like it represents an item whose Infit is 1.24 and whose Outfit is 1.38. By looking at the picture, I would have predicted the fit stats would have been much much worse. And the unexpected scores represented by table 10.4 are for likely higher performers who did not get the item correct (4th item down). Puzzling.

        |13 23132 1231 111 2222231 1 21 2
1.72 A|0............................1....
1.71 B|.00..........................1.111
1.57 C|......0................0..........
1.38 D|.......0.......0..0...0...........
1.35 E|..0.0.............................
1.34 F|.......0.............0............
1.33 G|......0..0.............0..........
1.30 H|....0..0..00......................
1.25 I|..0..............................1
1.20 J|..........00...........0..........
1.19 K|........0.0.00....................
1.17 L|....0...0.........................
 .98 M|............0....0..0.............
1.01 N|...........00....0................
1.13 O|..........................1..1....
1.12 P|.................................1
1.10 R|.....0........0...................
1.10 S|.................1............1...
1.08 T|.........0.....0..0...............
1.04 U|.....0............................
1.03 V|.0................................
1.03 W|............0..0..0...............
 .82 X|..............0......0............
 .97 Y|......0..0........................
 .91 x|.........................1........
        |13 86052 4512 978 7641025 6 93 3

Mike.Linacre: Uve, item 11 is item D in the Table. It has 4 unexpected 0 responses. This Table demonstrates that it is very difficult to quantify statistical misfit by eye. It is easier to quantify-by-eye from Table 22.1 - https://www.winsteps.com/winman/index.htm?table22_1.htm - because we can draw in the diagonal fit-zone lines.

In the plot, how about increasing the empirical intervals until there are only two plotted points? This will give a clearer idea of the slope of the empirical ICC.

uve: Mike,

I've cleaned up Table 10.4 and added it to the document along with the scalograms, person measures and the graph reduced to just two points. The green highlights the persons flagged by scalogram 22.2 as deviating beyond .5 of the expected response. But with the exception of person 23, all deviations are due to incorrect responses, not unexpected correct responses. I also noticed that the ordering of persons in Table 10.4 is somewhat different for persons scoring in the middle versus the ordering in 22.2.

Mike.Linacre: OK, Uve.

Table 22 is sorted by person measure (vertically), item difficulty (horizontally)
Table 10.4 is sorted by item fit (vertically), person measure (horizontally).
My apologies, there are slightly different sorting rules for persons with the same measures in the different Tables. I had not noticed this.

Based on Table 22.2, we can compute all the Winsteps statistics for Item 11 by hand. Which statistic is your primary concern?

uve: Mike,

Did the graph with just two points provide you with any insight? My current concern is still reconciling the good fit values with an item for which guessing is so apparently extreme. The scalogram doesn't appear to show me why this could be the case. Even if there were a significant number of low performers who got this item right, it would seem to me that this would manifest itself in greater MNSQ values than what I currently see. I guess I would not have a problem if the lower asymptote was, say, under .30, but at its current value this should be affecting the fit values far more.

Mike.Linacre: Uve, the two-point ICC was not informative, but look at the 3-point with empirical interval of 4.0 logits. It is V-shaped. That is not what we want to see.

So it is the computation of the lower asymptote that is the focus, so let us do it. :-)
1. Analyze the data.
2. Output the XFILE= for Item 11 to Excel.
3. The formula for computation of the lower asymptote is at www.winsteps.com/winman/asymptote.htm
This is the crucial line:
ci = sum(Wni mi (Xni - Eni)) / sum(Wni (mi - Eni)) for Bni<B(Eni=0.5)
4. Sort the Excel responses by "MEASURE DIFFERENCE"
5. Delete all rows where "MEASURE DIFFERENCE" > 0
6. We are left with only one observation!! It is a correct response.
7. The crucial line simplifies to:
ci = (Xni=1 - Eni) / (mi=1 - Eni) = 1.0 so that the lower asymptote is estimated to be 1.0
8. But Winsteps reports 0.83, so what has happened?
There is an undocumented constraint. The lower asymptote is not allowed to be greater than the item p-value. Here are the numbers:

| 30 36 -.86 .48|1.24 .9|1.38 .7| .11 .32| 80.6 83.5| .83 .95| .83| 11 |
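
Steps 5 to 8 can be sketched in Python. This is a minimal illustration of the formula quoted in step 3 plus the cap described in step 8, not the Winsteps code itself. The expected score of 0.4 for the one remaining person and the use of the binomial variance E(1 - E) as the weight Wni are assumptions made only for the example; with a single observation the weight cancels, so its value does not affect the result.

```python
def lower_asymptote(observations, item_p_value, max_score=1):
    """Estimate a lower asymptote from responses of persons below the item difficulty.

    observations: list of (observed, expected, weight) tuples, one per person
                  whose ability is below the item difficulty.
    Returns (raw_estimate, reported_estimate), where the reported value is
    capped at the item p-value, as described in step 8.
    """
    numerator = sum(w * max_score * (x - e) for x, e, w in observations)
    denominator = sum(w * (max_score - e) for x, e, w in observations)
    raw = numerator / denominator
    return raw, min(raw, item_p_value)

# Hypothetical values: one correct response (x = 1) with expected score 0.4.
expected = 0.4
weight = expected * (1 - expected)   # assumed weight: binomial variance
raw, reported = lower_asymptote([(1, expected, weight)], item_p_value=30 / 36)
print(raw, round(reported, 2))
```

With one correct response below the item difficulty, the raw estimate is 1.0 whatever the expected score, and the cap at the p-value (30 of 36 correct) gives the reported .83.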

uve: Thanks Mike.

One more question and a comment: why are we only looking at measured differences less than zero in the analysis?

Then, given that we only have the data in front of us with which to make a decision presently, I am still puzzled with what to do with this item. On the one hand, the fit stats don't set off any alarm bells for me--all seem to be good. But the point-biserial (I use raw score because for some of my staff this is more familiar) is alarmingly low and the ability to guess seems extremely high. Perhaps because we are only using one observation from which to calculate the index (again, I'm not sure why we filtered out all observations below zero and so were left with only one observation), the index may not be reliable. If it is reliable, then I have an item correlation and guessing index which sharply contrast with the fit data.

I don't know what to recommend we do with this item. Certainly a review by content experts is warranted, but they will ask me why. Right now I would say that this item functions as we would expect, but it's easy to guess, which is not what we would expect. Those two statements contradict in my opinion.

Mike.Linacre: Uve, there are only 36 responses to an easy item. We are in a "pilot test" situation. The data are suggestive, but not definitive. Only one person in the sample has an ability lower than the item difficulty. The item's lower asymptote is really inestimable, but Winsteps takes a stab at it.

A useful rule-of-thumb is "at least 10 observations in every category". This item has only 6 observations in the "0" category.

uve: Thanks Mike,

This has been very enlightening. I appreciate your patience with me as I stumble through the learning process. :)

320. Distance in item difficulty

Juan November 1st, 2012, 8:22am: Hi Mike,

I have a simple question on the output of two tables regarding the difficulty of the items. The item map shows the relative distance in item difficulty but Table 2.6 (Empirical item-category measures) does not. What could be the reason for this, as other binary tests show the distance on both tables?

Keep well,

Mike.Linacre: Juan-Claude,

Table 12 is showing the Rasch measures of the individual items and persons.

Table 2.6 shows the average Rasch measure of the persons who responded in each category.

Example: Every person has the same ability, but the p-value of dichotomous item 1 is .25 and of dichotomous item 2 is .75.

Table 12: all the persons are at the same ability measure. Item 1 has difficulty +1.1 logits, Item 2 has difficulty -1.1 logits.

Table 2.6: the average ability of the persons responding in every category of every item is the same. It is the ability shown in Table 12.
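
The +1.1 and -1.1 logit values in this example follow from the dichotomous Rasch model. A minimal Python sketch, assuming every person sits at 0 logits so that each item's difficulty is the log-odds of failure on that item:

```python
import math

def difficulty_from_p_value(p_value, person_ability=0.0):
    """Item difficulty (logits) at which persons of the given ability succeed
    with probability p_value under the dichotomous Rasch model."""
    return person_ability + math.log((1.0 - p_value) / p_value)

item1 = difficulty_from_p_value(0.25)   # hard item: 25% success
item2 = difficulty_from_p_value(0.75)   # easy item: 75% success
print(round(item1, 1), round(item2, 1))
```

Both values are ln(3) in size, about 1.1 logits, with opposite signs, which is the spread Table 12 displays even though every person has the same measure.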

Juan: Thank you Mike!

321. specification

NaAlO October 28th, 2012, 3:42pm: Hi, Mike
Here is my situation and I would love some advice or direction. I am concerned about rater severity and consistency, and the interactions between raters and the rating categories in a writing test. The facets include: test taker ability, rater severity (10 raters), and rating categories (5 different categories, that is content, language, mechanics, Number of words, coherence, each with a different maximum score).
The rating scale has 6 categories, ranging from 0-5. The proposed dataset is 10 raters x 24 candidates = 48 observations for each construct = 40 observations for each category (on average). Then, is the sample enough to conduct a multi-facet analysis of the rater interactions? I tried to write the specification. I used Model = ?B,?B,#B, R6, and the data was presented as Data=
but it halted with Error 35, No usable data. What does that mean? I don't know what was wrong.
Many thanks for your help in advance!

Mike.Linacre: NaAlO, yes, this analysis should work.
Does your specification file include something like:
1, Candidates
2, Raters
3, Construct
1= content
2= language
3= mechanics
4= Number of words
5= coherence

And please omit the B's.
Model = ?,?,#, R6
Three B's means "three-way interactions". These are unlikely to be informative.
Please use the Facets "Output Tables" menu to perform 2-way interactions.

NaAlO: Mike, I followed your suggestions and it worked. It is really exciting to see all those beautiful tables and graphs. I am now trying to interpret the data; hopefully it will go well. Thanks a lot!

322. Contrast Grouping

uve October 22nd, 2012, 2:41am: Mike,

In the attached contrast plot Winsteps grouped items A-D into Group 1, E-R into Group 2 and the remaining into Group 3. When plotting person measures, would you recommend plotting A-D (Group 1) with a-d, or Group 1 with all items in Group 3?

When using the latter method, the following statistics were produced using all 1,003 persons:

Mean 1.540892857
S.D. 1.645502166
Identity trend 4.242466982
Identity trend -1.160681267
Empirical trend 4.831897189
Empirical trend -1.750111474
Identity intercept 0.880178571
Identity slope 1
Empirical intercept with x-axis 0.511411126
Empirical intercept with y-axis -0.328220139
Empirical slope 0.641793114
Correlation 0.336790704
Disattenuated Correlation -4.345225843

Mike.Linacre: Uve, are you using Winsteps 3.75 (or later)? If so, in Table 23.0 will be the disattenuated correlation between person measures on the 3 clusters of items:
Approximate relationships between the KID measures
PCA       ACT       Pearson      Disattenuated  Pearson+Extr  Disattenuated+Extr
Contrast  Clusters  Correlation  Correlation    Correlation   Correlation
1         1 - 3     0.1404       0.2175         0.1951        0.2923
1         1 - 2     0.2950       0.4675         0.3589        0.5485
1         2 - 3     0.8065       1.0000         0.8123        1.0000

In this example, item clusters 1 differ dimensionally from 3, but clusters 2 and 3 are similar.
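
For readers unfamiliar with the "Disattenuated Correlation" columns, here is a minimal Python sketch of the standard correction for attenuation: the observed correlation divided by the square root of the product of the two reliabilities, limited to the range -1 to +1. The reliabilities below (0.70 and 0.80) are hypothetical, since Table 23.0 as quoted does not show them, and the function is an illustration rather than the Winsteps computation.

```python
import math

def disattenuated_correlation(observed_r, reliability_x, reliability_y):
    """Correct a correlation for measurement error in both sets of measures.
    The result is limited to the range -1..+1."""
    corrected = observed_r / math.sqrt(reliability_x * reliability_y)
    return max(-1.0, min(1.0, corrected))

# Observed correlation from the clusters 2 - 3 row; reliabilities are hypothetical.
print(disattenuated_correlation(0.8065, 0.70, 0.80))
# A low observed correlation stays low after correction.
print(round(disattenuated_correlation(0.1404, 0.70, 0.80), 4))
```

A disattenuated value at or near 1.0 suggests the two item clusters measure the same thing once measurement error is allowed for; a value well below 1.0 points to a dimensional difference.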

uve: Thanks Mike. I'm feeling a bit embarrassed at the moment :B

I just downloaded the new version recently and noticed the new output but hadn't had time to take a closer look until just now. Thanks for the upgrade.

But while we're on the subject, do you see any advantage of contrasting A-D with a-d as opposed to the standard cluster correlation?

Mike.Linacre: Uve, A-D against a-d is informative. Cluster analysis usually enables:
1) the actual split of an instrument into usable subsets of items
2) confirmation that differences cannot be attributed to measurement error

uve: Mike,

I thought you and/or others would find this interesting. I approached our ELA coordinator about the top four high loading items A-D. The questions were the same and asked whether the sentence below was an example of a simile or a metaphor. As it turns out the identity of the 2nd dimension is word recognition. The students are likely simply trying to identify the word "as" or "like". Also, there are only two options for each item because of this. We discussed the possibility of providing four sentence options per item with the words "as" and "like" used in all but in different manners with only one option being the correct choice. This would force students to actually know the meaning of simile and metaphor without using the "crutch" of the two words to rule out one or both possibilities.

Mike.Linacre: Then this is a useful investigation, Uve. Excellent :-)

323. Analysable with Facets?

miet1602 August 10th, 2012, 11:21am: Hi,
I am struggling to set up a data file for Facets - partly due to the fact that I am not sure my data are actually amenable for this analysis.
Very briefly about the data: Each candidate is marked twice, by one ordinary marker, and then by a senior marker. There is no overlap between candidates marked by ordinary markers, i.e. every ordinary marker has a different allocation of candidates. The (potential) link is provided by the senior markers. Each senior marker remarks the scripts marked by several ordinary markers. I don't have item-level marks but just grades (pass or fail) at script level.
The actual scripts are different between candidates (I think there are several versions between them), but I have ignored this for the trial analysis as I am only interested in seeing differences in marker severity and reliability.
I am not sure that such a set up will provide enough linking for a Facets analysis... But I wanted to try and see if the connection works.
I have set up a spec file for 2 facets (markers and candidates), but the analysis cannot be executed. I have obviously specified something (probably more than one thing) wrongly (as you can see from the attached screen shot) and my attempt at data file spec.
I would be grateful for any advice regarding setting up this data file, and, indeed, whether such data can be analysed in this way in the first place.
Many thanks,

miet1602: I was unable to attach the spec file in the previous post, so here it is.

Mike.Linacre: Good so far, Milja.

Every element needs a number, see attached file.

Are your ordinary markers nested within senior markers? If so, you may need to anchor the senior markers.

Your data are very thin, so you may need to make a Bayesian adjustment to the analysis to make every element (=parameter) estimable.

miet1602: Thanks for this, Mike.
I have amended the spec file (and also added another module marked by the same markers to bulk up the data a bit), and managed to get some results. As you can see from the attached output file, only about half the markers' measures are estimable. Also, the correlations are very low and lots are negative, so I am not sure if it is safe to rely on the derived measures at all...

Also, not sure why there is the list of unspecified elements in the output file (elements 660-826)?? There are only 659 elements (candidates) in the spec file - not sure why Facets was expecting more than that...

Regarding the nesting of the markers - about half are nested within senior markers, but there are some where two different senior markers have remarked the same ordinary marker's allocation (I guess these are the cases where Facets was able to derive a measurement). I attach a pivot table with all the markers for you to see. These are operational data from a not very clever remarking design...

Could you point me to some literature to find out how to specify Bayesian adjustment for the analysis?

BTW, for some reason the attachment facility only allows me to attach one file per post... Is this a bug?

Many thanks for your help!

miet1602: This is just so I can attach another file as per the previous post...

Mike.Linacre: Some more thoughts, Milja.

1. If you want to attach multiple files, please zip them together into one file.

2. Please upgrade to the current version of Facets: 3.70.1.
Then rerun your analysis with the current version of Facets.

3. Bayesian adjustment. When the data are very thin (many inestimable), then we need to add reasonable "dummy" data. One approach is to add two more observations to every element of the target facet: a good performance on a very easy item and a bad performance on a very hard item. Anchor the very easy item at a very low measure. Anchor the very hard item at a very high measure. Then every measure will be estimable between the very hard measure and the very easy measure.
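
The dummy-data idea in point 3 can be sketched in Python. This is only an illustration of building the two extra observations per element; the labels "VERY_EASY" and "VERY_HARD", the anchor values of -10 and +10 logits, and the tuple layout are all hypothetical, and the rows would still need to be written in whatever data format the Facets specification file uses.

```python
# Hypothetical anchor values (logits) for the two dummy items.
DUMMY_ANCHORS = {"VERY_EASY": -10.0, "VERY_HARD": 10.0}

def dummy_observations(element_ids, top_score=1, bottom_score=0):
    """Return two reasonable 'dummy' observations for every element:
    a good performance on the very easy item and a bad performance
    on the very hard item."""
    rows = []
    for element in element_ids:
        rows.append((element, "VERY_EASY", top_score))     # success on the easy item
        rows.append((element, "VERY_HARD", bottom_score))  # failure on the hard item
    return rows

# Three hypothetical marker numbers from a pass/fail (1/0) data set.
for row in dummy_observations([1, 2, 3]):
    print(row)
```

Because every element now has one success and one failure against anchored items, no element can have an extreme (all-pass or all-fail) record, so every measure becomes estimable between the two anchor values.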

miet1602: Thanks for your help, Mike. Unfortunately, I will have to wait for our IT dept to process the Facets update request.

miet1602: Hi again,
I was wondering - if the data I have are raw total scores out of 25 on a paper for each candidate (rather than item level data), is it possible to use such data for a Facets analysis? Not sure which model to specify because the data are not a rating scale, and obviously not dichotomous. Does it make sense to conceptualise this as a rating scale and recode somehow?

Thanks again!

Mike.Linacre: Milja, if there is only one score for each candidate, then Facets cannot be used effectively.

If there is more than one score for each candidate, then, statistically, those scores are rating-scale items.
The Facets model would look like:

miet1602: Thanks, Mike. Each candidate has two scores - provided by two markers - ordinary and senior. The candidates are nested within ordinary markers, however, there is some overlap provided via senior markers as the same senior markers will have double marked more than one ordinary marker's work.

May I just ask something else as I am struggling to translate this into Facets logic. So, my data are total marks on papers (rather than item marks). Supposing that all the candidates did the same paper, do I need the 'paper' facet specified at all? If I do, then there would be only 1 element, right? Which does not make much sense to me... So my facets would be markers, candidates and paper (?,?,?,R25).

Or do I just need two facets in this case, candidates and markers (?,?,R25)? This is what I was going to use because, given that I only have paper-level marks, no stats can be derived about items in terms of difficulty anyway.

Or am I not interpreting this properly?

Thanks again!

Mike.Linacre: You are correct, Milja. Two facets are enough.

miet1602: Thanks, Mike.
I have just installed the latest version of Facets and reanalysed a file that was coming up with an error on a previous version. You suggested running it on the new version should get rid of that. However, this has happened again. The problem is that the output contains a long list of unspecified elements (e.g. for facet 1, elements 67-107), but I only have 66 markers in my data and these are specified in the spec file. Not sure why Facets expects more than that. This is also the case with facet 2.
I attach the spec and the output file.
Do you have any suggestions how to sort this out?

miet1602: Apologies, managed to identify what the problem was - it was in my data, as might have been expected. Sorry again for wasting your time on this.

miet1602: Hi,
I have been playing around with some data trying a bias analysis. I got an error message from Facets:
Facets 3.70.1 failed at Facets 23150:9
Subscript out of range

The message also asked for this to be reported to Winsteps.com, which I will do, but I also wanted to post here in case this is due to a mistake in my spec file, which is attached.

Grateful for any suggestions,

Mike.Linacre: Apologies for the failure in Facets, Milja.

The problem is mismatch between the element identifiers and the data. Please try this:

Delements = NL
200037= UK200037
200073= UK200073
200319= UK200319
201116= UK201116
205576= UK205576
205594= UK205594
205595= UK205595
205599= UK205599
205603= UK205603
205604= UK205604
205607= UK205607
205609= UK205609
205610= UK205610
205613= UK205613
205614= UK205614
205617= UK205617
205626= UK205626
205645= UK205645
205655= UK205655
205661= UK205661
205668= UK205668
205670= UK205670
205672= UK205672
205677= UK205677
205696= UK205696
205716= UK205716
205743= UK205743
205791= UK205791
205801= UK205801
205842= UK205842
205850= UK205850
205914= UK205914
206096= UK206096
206100= UK206100
206164= UK206164
206167= UK206167
206176= UK206176
206179= UK206179
206182= UK206182
206187= UK206187
206188= UK206188
206194= UK206194
206209= UK206209
206316= UK206316
206323= UK206323
206324= UK206324
206326= UK206326
206331= UK206331
206339= UK206339
206348= UK206348
206350= UK206350
206352= UK206352
206388= UK206388
206389= UK206389
206515= IN206515
206516= IN206516
206517= IN206517
206519= IN206519
206520= IN206520
206523= IN206523
206524= IN206524
206710= IN206710
206719= IN206719
206720= IN206720
206721= IN206721
206726= IN206726
206728= IN206728


3,Country, A

Dvalues =
3, 1, 1, 2 ; Marker country elements for facet 3 are indicated in the label of Facet 1, column 1 length 2

324. Scaling with CAT

uve October 16th, 2012, 5:04am: Mike,

In the U.S., the Common Core assessments will be a reality in 2015 for grades 3-8 and 11 for math and English. California is a member of the Smarter Balanced Assessment Consortium which plans to roll this out using CAT.

I'm a bit puzzled how a CAT can be scaled if each student essentially takes a different form of the test. I realize that each item will be field tested first giving us difficulty levels on the same scale we can bank, but how can we produce a theta level for a student who takes, say, 20 items which may have only 10 items in common with another student who takes, say, only 15 items. I work right now exclusively with fixed forms. In this example, let's say 30 items, then I can assess the theta level of 15 versus 20, but CAT produces different upper limits for each student once a predetermined standard error has been reached.

In other words, how will I produce Table 20 when administering a CAT?

Mike.Linacre: Uve, under CAT, each student is taking a different combination of items, and so has a different Winsteps Table 20. But Table 20 becomes meaningless.
Suppose that the CAT target success is set at 75%, and everyone is administered 20 different items. Then, at the end of the test, we expect everyone to have succeeded on 75% of 20 items = a raw score of 15 out of 20! This is possible because less able students will have been administered easier items.
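
The "15 out of 20 for everyone" point can be illustrated with a minimal Python sketch. It assumes an idealized CAT in which every administered item is exactly ln(3), about 1.1 logits, easier than the student's ability, which is the offset that gives a 75% success probability under the dichotomous Rasch model. The three ability values are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

target_success = 0.75
# Items must be this many logits easier than the student for 75% success.
offset = math.log(target_success / (1.0 - target_success))

expected_scores = {}
for ability in (-2.0, 0.0, 2.0):                 # hypothetical student abilities
    difficulties = [ability - offset] * 20       # 20 perfectly targeted items
    expected_scores[ability] = sum(
        rasch_probability(ability, d) for d in difficulties
    )

print(expected_scores)
```

Every student has an expected raw score of 15, so the raw score carries no information about ability in a CAT; the measure comes from the difficulties of the items administered.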

Usually we start a CAT system with pre-calibrated items. However, from then on, we use the CAT data to re-calibrate the items. This is straightforward. One of the Winsteps example files, exam5.txt, https://www.winsteps.com/winman/index.htm?example5.htm, is the data from a CAT test. We used this to calibrate the items and measure the students.

uve: Mike,

My apologies for being so repetitive, but as I've mentioned before, we align our district assessments to the state exams using an equipercentile method. So a score of, say, 30 out of 50 on a district test may have the same percentile ranking as a 350 out of 600 on the state test. We then run the district exams through Winsteps and find the logit value that corresponds to a raw score of 30 in Table 20. We then transform that logit to 350 so that our exams now are reported on the same reporting scale as the state exam. Through this process I'm able to set criterion cut points that have similar meaning as the state exams.

So my challenge is how to continue to produce meaningful and aligned cut points on our exams when the state exams will now change over to a CAT format while ours remain in the traditional fixed form format.

Mike.Linacre: Uve, surely the state reports scale-scores for the CAT test? If so, you can apply the same equipercentile method to those scale-scores as you did to the previous "350 out of 600" scale scores.

It is probably a good guess that the CAT test will be reported with exactly the same numbers ("350 out of 600") as the conventional test. You may not need to make any changes to your procedures :-)
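
For reference, the equipercentile idea discussed here can be sketched in a few lines of Python. This is a crude, unsmoothed illustration with made-up score lists, not the procedure any state or district actually uses: a district score is mapped to the state scale-score that has the same percentile rank.

```python
from bisect import bisect_right

def equipercentile_equivalent(district_scores, state_scores, district_value):
    """Return the state score whose percentile rank matches that of
    district_value among the district scores (no smoothing)."""
    district_sorted = sorted(district_scores)
    state_sorted = sorted(state_scores)
    # Number of district scores at or below the value of interest.
    count_at_or_below = bisect_right(district_sorted, district_value)
    # Matching position in the state distribution (ceiling division, integers only).
    position = -(-count_at_or_below * len(state_sorted) // len(district_sorted))
    index = max(0, min(len(state_sorted) - 1, position - 1))
    return state_sorted[index]

# Hypothetical data: district raw scores out of 50, state scale-scores out of 600.
district = [10, 20, 30, 40, 50]
state = [200, 300, 350, 450, 600]
print(equipercentile_equivalent(district, state, 30))
```

The function only needs a list of reported state scores, so it works the same way whether those scores come from a fixed form or from a CAT.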

uve: Mike,

Let's hope you're right! :)

325. non-centered

wonny August 7th, 2012, 3:37pm: Dear Sir,
first of all, thank you so much for this great program and the tutorial you have posted for us. This has been a tremendous help for me in writing my thesis.
My paper is based on 4 facets; tasks, criteria, takers and raters.
1) As for the Non-centered specification: I am quite unsure of this concept and what to non-center.
2) Suppose the raw scores are on different scales: say test 1 is measured on a 0-5 point scale and test 2 has 0-3 points as its own 'scoring rubric'. If I need to use these raw scores as the basis of my comparison, could I still use the Facets program? If so, how can this be adjusted?
Thank you so much!

Mike.Linacre: Thank you for your post, Won.
1) Non-center.
You have 4 facets; tasks, criteria, takers and raters.
Do you want to measure the takers relative to the tasks, criteria, and raters? If so, the taker facet is non-centered.

2) Yes, please use the Facets program, and then use the "Fair Average" column in Table 7 or the Scorefile= for your reporting.

wonny: Sir, thank you so much!
I really appreciate your reply and kind assistance to this matter. I am just amazed to be using your program and getting reply from you as well!
Just one more thing about the 'Fair Average' column: so basically I just use the raw scores as they are (test 1: 0-3 and test 2: 0-5 scale), run the Facets program, and then use the "Fair Average" column in Table 7 or the Scorefile= for my reporting?

Thank you so much.

Mike.Linacre: Yes, Won. If you must report using the original rating-scales, then "Fair Average".

wonny: Thank you so much! I will try and run it!

wonny: Hi Sir!
Another question came up that I couldn't quite get the grasp on.
In Table 6, in the far right-hand column, we have the 'scale', which I know indicates the raw rating scale; however, the horizontal line indicates some probability that I am quite unsure of.
Thank you

Mike.Linacre: Won: the scale is the expected score. The --- line indicates 0.5
The "scale" column in Table 6 is the x-axis of the Expected Score ICC

wonny: Thank you once again!
I did read all your manuals and just couldn't see ^^ Thank you!

wonny: Dear sir,
I have pulled out all the data and found out that my separation ratio is as high as 23! Compared to other examples from your manual and others (which are usually from 2-3), what can be said about this high ratio?
My data is from 10 different tasks, all different in their purposes, and I therefore want to separate what is useful and what is not for my research.
Anyway, what can be said about this high ratio?
I have attached table 4.1.1.
Thank you so much!

Mike.Linacre: Won, separation is an indication of the precision of the measures.

Suppose we measure our heights.
Measurement to the ...
Nearest meter = very low separation
Nearest centimeter = low separation
Nearest millimeter = high separation
Nearest micrometer = very high separation

Thus we see that very high separation tells us that our measures are very precise, probably more precise than the natural variation in the object we are measuring.

In your example, the very high separation tells us that the Rasch measures are definitely precise enough for whatever decisions are to be based upon them.
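
The link between separation and the more familiar reliability coefficient can be shown with a short Python sketch. It uses the standard relationship reliability = G^2 / (1 + G^2), where the separation G is the "true" standard deviation of the measures divided by their root-mean-square standard error.

```python
def reliability_from_separation(separation):
    """Reliability implied by a separation ratio G: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)

for g in (2, 3, 23):
    print(g, round(reliability_from_separation(g), 3))
```

Separations of 2 and 3 correspond to reliabilities of 0.8 and 0.9, while a separation of 23 corresponds to a reliability of about 0.998, so the measures are reported far more precisely than the differences between the elements require.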

wonny: Dear sir!
Thanks to your prompt and kind reply I was able to report my first findings. Thank you.
In line with my first questions: 6 tasks have different scoring systems.
This is the comparison I am interested in:
All 6 tasks have 1 scoring rubric in common (0-5 scale): my scoring rubric,
and all 6 also have another scoring rubric of their own (1-3 scale, 1-9 scale, 0-10 scale and so forth): their own scoring rubric.
You have told me to use the 'fair-m' average column for adjustment.
However, in this case, what I want to find out is:
- which 'scoring rubric of their own' is similar to, or most comparable with, my scoring rubric?
- what can be said when they are compared together?

I wasn't sure how the scores should be first inputted on excel/ txt in such cases.

examinee  rater  task  score 1           score 2
                       (rubric common    (each task's own rubric,
                       to every task)    all with different max scores)
1         1      1     3 (out of 5)      2 (out of 3)
1         1      2     4 (out of 5)      6 (out of 10)
1         1      3     3 (out of 5)      4 (out of 9)
1         1      4     3 (out of 5)      2 (out of 4)

Can they still be in the same column? Or should I run FACETS separately; say first task 1, then task 2 and so forth? BTW, they are all 'holistic' scores.

Thank you so much!

Mike.Linacre: Won, here is an approach:

A. Analyze each task separately and analyze each scale separately.
11. Analyze (task 1 score 1)
12. Analyze (task 1 score 2)

B. Correlate the examinee measures for 11 and 12. Result = corr(11, 12)

C. Do this for all the tasks.

D. The highest correlation of corr(11, 12), corr(21, 22), corr(31, 32), ... is the most comparable with your scoring rubric.

Is this what you need?

wonny: Thank you so much sir!
You really don't know how much this means to me!
Anyway.. I will try this approach.. BTW The suggested methods are all done by FACETS, right?
Thank you!

Mike.Linacre: Won, Yes, Facets is one way to do these analyses.

wonny: Thank you sir!
I will work on it right away!
Won :)

wonny: Hi!
While I was working on the suggested method:
So, after using FACETS to get the 'measures' for each task (1,1) (1,2) and so forth. - have finished this part ^^

The correlations should be run by SPSS, correct?
Or is there a way to do this with the FACETS program? I can't find a way to put two measures together...

Thank you!
Won. :)

Mike.Linacre: Yes, Wonny. Correlations done by SPSS, Excel or whatever software is convenient.
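
For readers without SPSS, here is a minimal Python sketch of the correlation step, assuming the examinee measures from each pair of analyses have been copied out of the Facets output in the same examinee order. The measures shown are invented, and the names `measures_by_task` and `pearson_correlation` are illustrative only.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equal-length lists of measures."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical examinee measures: (common rubric, task's own rubric) per task
measures_by_task = {
    "task 1": ([-1.2, -0.3, 0.4, 1.1, 2.0], [-1.0, -0.5, 0.6, 0.9, 1.8]),
    "task 2": ([-1.2, -0.3, 0.4, 1.1, 2.0], [0.5, -0.8, 1.4, -0.2, 0.9]),
}

correlations = {task: pearson_correlation(a, b)
                for task, (a, b) in measures_by_task.items()}
most_comparable = max(correlations, key=correlations.get)
print(correlations, most_comparable)
```

The task with the highest correlation is the one whose own rubric orders the examinees most like the common rubric.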

wonny: thank you once again..

Will work again right away!

wonny: hello~
Hi Sir!
I have worked on and made so much progress using FACETs & Correlation. Thank you!
One thing came up: one of the datasets has 4 scores (4, 5, 4, 4), which need to be averaged into one holistic score of 4.25 (this is the way that test is designed).
1) the raw data should not be in decimals, correct?
2) following the manual, I have changed the numbers into integers -> 4.25 x 4 = 17
3) tried to use these numbers and then
4) Model = ?,?,?,R,0.25 on the fac file, however

on the output file... this is the message I am getting:
Check (2)? Invalid datum location: 1,1,Model = ?,?,?,R,.251,18 in line 97. Datum "18" is too big or not a positive integer, treated as missing.

-> and all the two-digit numbers were treated as missing for being too big,
how can I solve this problem?

Thank you!!

wonny: Sir.. I've realized..
instead of using the average raw scores of say, 4, 5, 4, 4, -> 17/4 = 4.25 -> then converting to integer of x 4 -> 17
And getting error messages,

I guess I should run facets with the raw score -> get the measure and use that as the average score?

Is this correct?
Thank you!

Mike.Linacre: Won, please use the model specification:

Model = ?,?,?,R20,0.25

where R20 is R(maximum possible value)
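
The arithmetic behind this specification can be sketched as follows, assuming each observation is the sum of four ratings on a 0-5 scale, so that the summed score can reach 20. The integer sum goes into the data file, and the 0.25 weight brings its contribution back to the averaged metric. The variable names are illustrative, and the number of ratings per observation is taken from the example in the question.

```python
# Four component ratings on a 0-5 scale (values from the question above)
ratings = [4, 5, 4, 4]

summed_score = sum(ratings)              # integer datum entered in the data file
max_possible = 5 * len(ratings)          # highest possible sum, hence "R20"
weight = 1 / len(ratings)                # weight in the Model= statement, 0.25
weighted_score = summed_score * weight   # back on the averaged 0-5 metric

print(summed_score, max_possible, weight, weighted_score)
```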

wonny: Dear Sir,
A little belated thank you~ ^^
Thank you so much for the reply.
I am working on it!

wonny: Dear sir,
hope you are well!
Thanks to your help I was able to complete some data analysis! Thank you!
A question has arisen: can we do factor analysis with FACETS as well?
Thank you

Mike.Linacre: Won, FACETS does not do factor analysis.

wonny: thank you sir!

326. Data log-likelihood chi square

windy October 10th, 2012, 12:29pm: Dear Rasch-stars,

I'm trying to figure out the data log-likelihood chi square statistic that is computed in Facets. Specifically, I used the ltm package in R to do a Rasch analysis, and found that my data set did not fit the Rasch model. However, when I run the same model in Facets, the data log-likelihood chi square statistic indicates that the model fits the data. Any thoughts about this discrepancy?


Mike.Linacre: Stefanie, it is rare for a chi-squared to report that any empirical dataset fits a Rasch model, so the R finding is more believable than the Facets finding.

However, it is easy to verify the Facets results. In the Facets "Residuals" output file is a column labeled "LProb". This shows the log-probability of each observation. The sum of these is the log-likelihood. The chi-squared is -2 * Log-likelihood. The degrees of freedom are the count of these, less the number of estimated elements. There are some other adjustments, but the adjustments are unlikely to change the finding of fit or misfit.
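
A minimal Python sketch of this check, assuming the "LProb" values have already been read from the Residuals file into a list. The function name and the numbers are hypothetical, and the "other adjustments" mentioned above are not applied.

```python
import math

def loglik_chi_square(log_probs, n_estimated_elements):
    """Return (chi_square, degrees_of_freedom) from per-observation log-probabilities."""
    log_likelihood = sum(log_probs)                   # sum of the LProb column
    chi_square = -2.0 * log_likelihood
    df = len(log_probs) - n_estimated_elements        # observations less estimated elements
    return chi_square, df

# Hypothetical example: four observations, each with probability 0.5,
# and one estimated element
example_log_probs = [math.log(0.5)] * 4
chi_square, df = loglik_chi_square(example_log_probs, n_estimated_elements=1)
print(chi_square, df)
```

The chi-squared value and its degrees of freedom can then be looked up in any chi-squared table or calculator.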

327. Differential Test Functioning

CraigAus September 27th, 2012, 1:08pm: I'm analysing a data set with ~20,000 respondents and am currently trying to test the assumption that the parameters remain constant for different samples.

When I carry out the Differential Test Functioning process the resultant scatter plot places the black curved lines (95% confidence bands) extremely close together. None of the Items sit within these lines.

Is the confidence interval affected by my high sample size or is this output highlighting a serious issue with my model? If so does that mean that this test is not valid for my sample?

When I carry out the Differential Item Functioning I also get worrying results. Whilst the diff plot appears to show men and women quite closely aligned, the differences (despite being very small) are nearly all significant (prob=0.0000). Would this be due to sample size impacting the test statistic or unacceptable differences?

I've noticed in some other studies that an Anderson Likelihood Ratio Test is utilised for this purpose.

Is this test available with Winsteps?

Thanks Again


Mike.Linacre: Craig, your large samples are making the standard error of the estimates exceedingly (unbelievably) small. It is like measuring our heights to the nearest millimeter when we know our heights can vary by much more than that during the day. This natural variation in item difficulty is why Educational Testing Service set a lower limit of 0.43 logits for meaningful changes in item difficulty. Their chart is shown at https://www.rasch.org/rmt/rmt203e.htm

Since LRTs approximate the difference between two chi-squared statistics, they would also report significant differences with large sample sizes.

Winsteps does not report LRTs, but they can be computed from the response probabilities in the Winsteps XFILE=.

CraigAus: Hi Mike,

Thank you very much for your assistance with this. I've managed to run the item measures for males and females individually and I've calculated a normalised SE for each item. What is the formula for the statistical test to see if the difference between female and male item measures is significant?

Thanks again,


Mike.Linacre: Craig: if you are interested in item comparisons for males and females, please do a standard pair-wise DIF analysis. Mantel-Haenszel is the most widely known.

Otherwise, the general formula is Student's t-statistic:
t = (M1 - M2) / sqrt (SE1^2 + SE2^2)

http://www.graphpad.com/quickcalcs/ttest1.cfm?Format=SEM has a calculator.
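
A minimal Python sketch of this t-statistic, in which the joint standard error is the square root of the sum of the squared standard errors. The item measures and standard errors below are invented for illustration.

```python
import math

def t_statistic(m1, se1, m2, se2):
    """t for the difference between two measures, each with its own standard error."""
    joint_se = math.sqrt(se1 ** 2 + se2 ** 2)
    return (m1 - m2) / joint_se

# Hypothetical item difficulty for females and for males, in logits
t = t_statistic(m1=0.50, se1=0.03, m2=0.10, se2=0.04)
print(round(t, 2))
```

With very large samples the standard errors shrink, so even a small difference in measures gives a large t. This is why a substantive size threshold is also needed.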

CraigAus: Thanks again Mike.

Is there a way to conduct a Mantel Haenszel test within Winsteps using new normalised SEs?

I've calculated normalised SEs for each item to account for the large sample size but am unsure how best to calculate the MH test statistics.

Mike.Linacre: Craig: Winsteps computes MH in Table 30.1.
MH does not require estimates of the individual item difficulties for each group. It estimates directly from the data.

CraigAus: Thanks Mike, will MH be impacted adversely by the large sample size? More than half of the MH probabilities that I'm getting are 0.000 even though graphically the measures look quite close (less than 0.5 logits).

Also, is there any way to conduct this MH test comparing subgroups of score level rather than gender?

When I enter MA2 or MA13 instead of gender the MH statistics aren't provided.

Mike.Linacre: Craig, all hypothesis-test fit statistics are influenced by sample-size. The more data that we have, the more certain we are that the data contains imperfections, and so the more likely that the usual null hypothesis of "these data are perfect" will be rejected.

To avoid this problem, analysts sub-sample, say, 500 or 1000 cases from the two groups that are to be compared.
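
One way to draw such a sub-sample, sketched in Python under the assumption that each group's cases are held in a list (for example, person record numbers). The function name, the seed and the group sizes are illustrative.

```python
import random

def subsample(cases, size, seed=None):
    """Randomly draw `size` cases without replacement; return all cases if fewer exist."""
    if len(cases) <= size:
        return list(cases)
    rng = random.Random(seed)      # seeded so that the draw can be reproduced
    return rng.sample(cases, size)

# Hypothetical person record numbers for the two groups to be compared
females = list(range(0, 11000))
males = list(range(11000, 20000))

female_sample = subsample(females, 500, seed=1)
male_sample = subsample(males, 500, seed=2)
print(len(female_sample), len(male_sample))
```

Repeating the draw with different seeds gives a feel for how stable the DIF findings are across sub-samples.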

Tests across score-level are termed "non-uniform DIF". These tests are much more awkward to conduct. https://www.winsteps.com/winman/non-uniformdif.htm discusses this for Winsteps.

CraigAus: Hi Mike,

Thank you very much for all of your assistance with this and the other questions I've had!

You've really helped me to clarify a great deal.



ayisam: i need to know the meaning of the following assumptions as they are used in economics
a) axiom of completeness.
b) axiom of transitivity
c) monotonicity

i will be very grateful to receive solutions from you people
i thank you in advance

Mike.Linacre: Ayisam, please try an Economics chat room, such as http://economics.about.com/mpchat.htm

328. Assumption of Monotonicity

CraigAus October 1st, 2012, 7:50am: Hi Mike,

I've read a number of papers which test the assumption of monotonicity of the Rasch model using the R1C statistic.

Are you familiar with this statistic? Is it possible to calculate this using Winsteps?

If R1C isn't calculated within Winsteps, are there any other inbuilt tests which can assist me in testing this assumption?


Mike.Linacre: Craig, the book, "Rasch Models: Foundations, Recent Developments and Applications" by Fischer and Molenaar discusses many different fit statistics for Rasch models from a mathematical perspective. The information in the Winsteps XFILE= can be used to compute these statistics. However, many of these statistics (including R1C) are formulated for rectangularly-complete dichotomous data estimated using CMLE. These statistics are not a good match with Winsteps (which is designed to accommodate incomplete, polytomous data estimated using JMLE).

R1C is also a global fit statistic. In practice, global fit tests tend to be unhelpful because we are rarely accepting or rejecting an entire dataset. We are usually looking for, and diagnosing, bad spots (miskeyed items, misbehaving respondents, etc.) so that they can be set aside or corrected. In fact, we expect a big-enough empirical Rasch dataset to fail a global fit test in the same way as an empirical right-angled triangle always fails Pythagoras' Theorem, provided that the sides are measured precisely enough.

329. U Molenaar statistic

CraigAus October 1st, 2012, 7:50am: I've come across the U Molenaar statistic in a number of papers where it is used in order to test the Rasch hypothesis of equal slope of items.

Is ensuring equal slope for items the same as testing dimensionality? If this is the case can PCA of residuals be used instead of the U Molenaar statistic?

Is the U Molenaar statistic able to be calculated within Winsteps or is there an easy way to calculate this using Winsteps outputs?



Mike.Linacre: Craig, items on the same dimension can have different slopes (item discriminations). PCA of residuals is usually insensitive to differences in item slope.

Winsteps does not report Molenaar's U statistic. The response-level numbers required to compute the U statistic are output in the Winsteps XFILE=, but the computation is awkward. The Winsteps OUTFIT ZSTD statistic gives essentially the same information.

330. Sufficiency of Score

CraigAus September 30th, 2012, 5:27am: Im trying to understand the various assumptions of the Rasch model and link each assumption with tests that confirm the assumption. Sufficiency of the raw score is one of the assumptions and according to https://www.rasch.org/rmt/rmt63c.htm a raw score is a sufficient statistic for an ability measure

If individuals of equal ability are getting different raw scores then either the persons are a bad fit for the model or the items (questions) are. A good fitting Rasch model will mean that, any persons of equal ability will achieve the same raw score as each other for different sets of items and conversely items of equal difficulty will achieve the same raw score as each other regardless of person sample?

Put another way: if questions exist that have different difficulty parameters depending on the individual, then individuals of equal ability won't achieve the same raw score and therefore the item won't fit the Rasch model?

Is my understanding of this statistical sufficiency assumption correct? Is this assumption therefore basically the same as the assumption of Specific Objectivity?

How is this assumption best tested? If the person and item fit is good does this mean that raw scores are sufficient?

Thanks again,


Mike.Linacre: Craig: in statistical terminology, "sufficient" means "all the information that this dataset has about ...". So, from a Rasch perspective, the "raw score" has all the information there is in the dataset about the "ability" of the respondent. This is also the Classical Test Theory perspective. But this is not the 2-PL or 3-PL perspective. They say that the "pattern of responses" has all the information there is in the dataset about the "ability" of the respondent.

If we see two respondents with the same raw score on the same items, then we estimate them to have the same ability.

If we have two respondents whose true abilities are the same, then we would predict that each has the same distribution of possible raw scores, but we would not know, in advance, what raw score would be observed for each. They may be observed with the same raw score, so they would have the same estimated ability. They may be observed with different raw scores, so they would have different estimated abilities. Each of those estimated abilities would have a precision (standard error). We would expect that their two estimated measures would differ by more than twice their joint standard error less than 5% of the time.

From a Rasch perspective, "misfit" means that the observed response string has either too low probability or too high probability of being observed for someone of this estimated ability. Perhaps this deviation from the Rasch-predicted probability is due to an intervening variable. However, the misfit does not change the ability estimate. Every respondent with the same raw score on the same items is estimated to have the same ability, regardless of misfit.

CraigAus: Thanks Mike,

This makes sense.

In regard to testing this assumption of sufficiency is it best to compare the expected scores compared to the raw scores such that if the p values are less than 0.05 then they are acceptable?


Mike.Linacre: Craig: Rasch model estimates are based on the assumption of sufficiency, so a mismatch between observed and expected scores indicates that the estimation process was deficient in some way.

We expect every observation in a typical dichotomous dataset to be explained by f(raw score of person, raw score on item, model-conforming random noise). So, every Rasch fit test is a test of the hypothesis that the raw score is sufficient. We investigate: "Is there anything else in the data that is not explained by the raw score, other than model-conforming random noise?"

331. Reliability or DIF

CraigAus September 30th, 2012, 1:01pm: Hi Mike,

The measure of person reliability decreases whenever an item is removed. In order to make sample bias and uni-dimensionality acceptable though bad fitting items must be removed.

What is more important out of reliability and acceptable sample bias (Differential Item Functioning) outputs when deciding on the items to include?



Mike.Linacre: Craig: if the Differential Item Functioning (DIF) is big enough to influence the decisions that will be drawn from the person scores or measures, then action must be taken to counteract the DIF.

Since "reliability" means "reproducibility", high reliability with high DIF implies that the measures or scores are reliably biased.

332. Person Missfit

CraigAus September 30th, 2012, 4:44am: Hi Mike,

I have a sample size of ~20,000 respondents which I'm using to run the Rasch model.

The Rasch measure for each respondent will be used in further analysis, therefore I'd like to be able to acquire a score for each individual regardless of fit.
Would you suggest that I remove these poor fitting respondents whilst the Rasch model is developed and then, once I've agreed on a Rasch model, re-run it with all respondents included?

Alternatively should I just go through the usual Rasch model development process ignoring Person misfit?

Thanks in advance


Mike.Linacre: Craig, if this is an MCQ dataset with low-ability guessers and high-ability snoozers then please use CUTLO= and CUTHI= to get good estimates of the item difficulties in IFILE= (ad SFILE= if polytomies)
Anchor the items at those difficulties with IAFILE= (and SAFILE=)
Then remove CUTLO= and CUTHI= and get ability measures for all your respondents.

In general, a good procedure to follow is: https://www.rasch.org/rmt/rmt234g.htm

333. Pweight

CraigAus September 23rd, 2012, 6:44am: Hi There,

I have what is probably a basic question relating to person weights.

Within my base data set I have a variable for Person Weighting.

Within the Winsteps data file the weighting is placed from column 60 to 68, therefore it is my understanding that to apply these weights I must use the specification; PWEIGHT=$S60W8.

When I select specification on the Winsteps menu and enter this code, the winsteps log tells me pweight=$60W8... not done.

Alternatively, when I open the file and am prompted for extra specifications I enter pweight=$60W8 however the output tables are not impacted.

Could someone please help me understand what I am doing wrong when trying to apply these weightings and how I can overcome this problem?

Many thanks in advance.


Mike.Linacre: Craig, two things here.

1. 60 to 68 is 9 columns so $S60W9

but, the crucial aspect, about which the documentation is vague :-(

2. $S60W9 says "starts in column 60 of the person label ...."
Do you mean "starts in column 60 of the data record ...."? This is:

CraigAus: Hi Mike,

Thanks for getting back to me about this so quickly.

The weight starts at 60 in the data record. I've tried entering PWEIGHT=$C60W9 into the specification but I still get the answer ...not done.

Have you got any further advice?



Mike.Linacre: Craig, that is strange.

Have just tested PWEIGHT= with Winsteps 3.74. It works fine for me.

Perhaps you are trying to weight after the estimates are made using the "Specification" pull-down menu. Sorry, that won't work. Weighting must be done before estimation either in the Winsteps control file or at the "Extra specifications?" prompt.


Add this control line at the top:

Add these two data lines at the start of the data:
2211011012222122122121111 M Rossner, Toby G. 3
1010010122122111122021111 M Rossner, Michael T. 2

The "3" is in column 60, the "2" is in column 68.

In Table 18 we see:
0| 88.0 62.5| 3.00| M Rossner, Toby G. |
1| 72.0 60.4| 2.00| M Rossner, Michael T|

CraigAus: Hi Mike,

I've started using a different data set and PWEIGHT is now working for me.

Still not sure exactly what the problem was. Perhaps some quirk with the data I was using.

Thanks for your assistance with this.


Mike.Linacre: Craig: a common problem in data-formatting is "tab" characters. They look like 6 or 8 characters or 0.5 inches or 1 centimeter on your screen, but they are only one character in the data file. These really foul up our attempts to count columns!
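
A small Python sketch for spotting this problem before counting columns. It assumes the data lines have been read into a list of strings. The tab width of 8 is an assumption; the width that matches how the file was originally laid out may differ.

```python
def find_tab_lines(lines):
    """Return the 1-based numbers of the lines that contain a tab character."""
    return [number for number, line in enumerate(lines, start=1) if "\t" in line]

def expand_tabs(line, tabsize=8):
    """Replace each tab with spaces up to the next tab stop, so columns can be counted."""
    return line.expandtabs(tabsize)

# Hypothetical data lines: the second one hides a tab between response string and label
data_lines = [
    "2211011012222122122121111 M Rossner, Toby G.",
    "1010010122122111122021111\tM Rossner, Michael T.",
]
print(find_tab_lines(data_lines))
print(expand_tabs(data_lines[1]))
```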

334. Comparing groups of item estimates

LKWilliams September 25th, 2012, 11:57pm: Sorry if this is an obvious question, but I'm new to Rasch and want to make sure I'm not doing something inappropriate to the method/results. We're developing a math assessment and estimated the difficulty of each item using a unidimensional Rasch model. In looking at the item estimates, we found a number of consistent patterns, such as items that use a triangle in the question are harder than items that use a circle (kids apparently have a harder time with triangles for some reason). I'd like to be able to quantify this difference, via hypothesis testing and/or CI, and show that, assuming all other factors about the question are equal, using a triangle is significantly harder than using a circle. I found references to compare a single item against another single item, but I would like to compare one group of items against another group of items.

Here's how I *think* I need to do it. Say I want to compare Group Circle against Group Triangle, and I have the b and se estimates for every item in those groups. I need to construct a t statistic where the numerator is the difference between mean(C) and mean(T). For the denominator, I need to first construct se(C) and se(T) by converting each item se to variance and then finding the pooled variance for Group C and Group T (as in https://en.wikipedia.org/wiki/Pooled_variance).

I then need to use the pooled variances to find se[b(C)-b(T)] via sqrt[(pooled var(C))/n(C) + (pooled var(T))/n(T)], where n(C) and n(T) are the number of items in each group and not the number of respondents. I can compare the t statistic to a t distribution with df from the Welch-Satterthwaite approximation using each group's pooled variance and number of items.

Did I misstep in my logic anywhere? Thanks in advance for the help!

Mike.Linacre: LKW, you are planning to do more than is necessary.

The error variance is already incorporated into the item difficulty estimates. (We wish we could remove the error variance, for then we would have the "true" values of the estimates).

So, the t-test is a standard two-sample "difference between the means" test. I prefer the Welch version, but Student's original version is well-accepted.
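
A minimal Python sketch of the Welch version, treating the item difficulty estimates in each group as ordinary sample values. The difficulties are invented, and the degrees of freedom use the Welch-Satterthwaite approximation.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch two-sample t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)      # sample variances (n - 1)
    se = math.sqrt(va / na + vb / nb)
    t = (mean(group_a) - mean(group_b)) / se
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return t, df

# Hypothetical item difficulties (logits) for the two item groups
triangle_items = [0.8, 1.1, 1.4, 0.9, 1.3]
circle_items = [0.1, 0.4, -0.2, 0.3, 0.0]
t, df = welch_t(triangle_items, circle_items)
print(round(t, 2), round(df, 1))
```

The resulting t is compared with a t distribution on the computed degrees of freedom, exactly as for any two-sample comparison of means.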


LKWilliams: So I can ignore the fact that each of the values in my sample is a parameter estimate and treat them as if they were scores in a sample, doing a t test (Student or Welch) as normal?

Out of curiosity, would the approach I outlined have been acceptable? Was it just more that I needed to do or actually wrong?

Mike.Linacre: LKW, "scores in a sample" also have standard errors, but they are rarely computed or reported on an individual-score basis, though the errors may be summarized into "reliabilities". Nothing we encounter is ever exactly precise or true. Everything is estimated!

My quick perusal of your approach suggests that it would have double-counted the error.

335. Infit & outfit in Facets

Raschmad September 21st, 2012, 6:45am: Dear Mike,
In some papers in the context of Facets I read that some authors consider acceptable infit and outfit range as +/-2 SD of the outfit/infit around the mean of outfit/infit. I read it in this article which cites Pollitt and Hutchinson (1987).


Can you please comment on this? How is this different from 0.70 to 1.30 recommendation? It's not practiced very much.


Mike.Linacre: Anthony, this acceptable range is new to me. My recommendation is to regard arbitrary rules as arbitrary. If you must have one:

Interpretation of parameter-level mean-square fit statistics:
> 2.0       Distorts or degrades the measurement system.
1.5 - 2.0   Unproductive for construction of measurement, but not degrading.
0.5 - 1.5   Productive for measurement.
< 0.5       Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.

But better is to follow a procedure similar to: "When to stop removing persons and items in Rasch analysis?" - https://www.rasch.org/rmt/rmt234g.htm
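
The mean-square guideline above can be written as a small Python lookup. How the boundary values of exactly 1.5 and 2.0 are assigned is an assumption here, since the guideline does not say; the function name is illustrative.

```python
def interpret_mean_square(mnsq):
    """Map a mean-square fit statistic to the guideline's interpretation."""
    if mnsq > 2.0:
        return "Distorts or degrades the measurement system"
    if mnsq > 1.5:
        return "Unproductive for construction of measurement, but not degrading"
    if mnsq >= 0.5:
        return "Productive for measurement"
    return "Less productive for measurement, but not degrading"

# Hypothetical infit/outfit mean-squares
for value in (2.5, 1.7, 1.0, 0.3):
    print(value, "->", interpret_mean_square(value))
```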

uve: Mike,

I too have read something similar:

Weaver, Christopher. (2011). Optimizing the compatibility between rating scales and measures of productive second language competence. Advances in Rasch Measurement, Vol.2, (239-257).

The reference is about the 2nd or 3rd paragraph down on page 246. The standard deviation of the mean fit statistic of .7 was multiplied by 2 to determine the acceptable underfit 1.4 fit criterion for the items. Not sure how overfit is handled using this method though unless we go with the .7 value initially mentioned.

Mike.Linacre: Raschmad and Uve, are we talking about the observed S.D. of the standardized mean-square fit statistics?

If so, then those references are following a procedure similar to that implemented in Winsteps with LOCAL=Yes - https://www.winsteps.com/winman/index.htm?local.htm. That page points out that this procedure is valid, provided that we make clear to the reader that the null hypothesis we are testing is something like:
"These data fit the Rasch model exactly, after allowing for a random normal distribution of standardized fit statistics equivalent to that observed for these data."

uve: Mike,

My apologies for the last line or two of my comment. For some reason I was thinking in terms of MNSQ. However, I am now a bit unclear. Here's the quote:

"The frame of reference for the outfit and infit statistics was determined with simulated data that fit the Rasch model. This simulated data was based on the distribution of item and person estimates from a calibration of the real data. The standard deviation for the infit and outfit statistics was 0.07, which was then multiplied by two to provide a benchmark yielding an approximate Type I error rate of 5%. Thus, moves with outfit and /or infit statistics exceeding +/-1.4 were considered to be contributing more off-variable noise than useful information."

The term "move" I believe refers to a rated task/item. Again, the author never specifically mentions which statistic is used, but I would guess it would have to be ZSTD for there to be a +/- element. Then again, if we take the -1.4 extreme, how could this be interpreted as off-variable noise? If we accept -1.4 as significant, then it would be indicating significant overfit, not underfit which I believe is related to muted information not noise.

Mike.Linacre: Uve, our English language is deficient in describing statistical niceties. We don't have a convenient term (that I know of) to describe both too much randomness and too little randomness.

The author wrote "Thus, moves with outfit and /or infit statistics exceeding +/-1.4 were considered to be contributing more off-variable noise than useful information."

More precisely, "more off-variable noise than useful information." is the situation when the mean-square is greater than 2.0

So, the author apparently intends something like:
"Thus, moves with outfit and /or infit standard mean-square statistics outside +/-1.4 were considered to be too far away from the Rasch ideal of useful information."

uve: Thanks Mike,

Your link: https://www.winsteps.com/winman/index.htm?local.htm

with reference to Local=Y for ZEM output makes better sense for me, since I am working with testing populations of 2,000+ consistently.

336. Improving Prediction of a Screener

RaschModeler_2012 September 20th, 2012, 10:52pm: Dear Mike,

Suppose there is a screener consisting of 20 self-report items with graded response options (e.g., 1=never thru 5=always). The screener was developed to predict a specific behavior one year post-assessment (e.g., relapse of drug abuse), and has been shown, across various samples, to have a sensitivity and specificity of approximately .70. The ROC analyses performed to determine the accuracy of the prediction were based on summative scores from the screener and a criterion measure which is highly accurate.

Question: Suppose we have new data (along with the various datasets used previously) and we want to maximize precision as much as possible; that is, our PRIMARY objective is to develop a prediction equation that is more accurate than the existing one (which is currently based on simply summing the items). We want to develop the most highly predictive equation possible. What would be the general steps you would take in doing so? I realize this question is potentially a bit off topic, but given your expertise in measurement, I am curious what your response would be.



Mike.Linacre: RM,

Classification and Regression Tree (CART) Analysis was developed for exactly this situation. See http://en.wikipedia.org/wiki/Decision_tree_learning .

But my experience with optimizing future predictions based on previous datasets is that it is easy to overfit the previous dataset, leading to worse prediction of the future. See my example at https://www.rasch.org/rmt/rmt222b.htm - So, at least, split your data into two sets: prediction dataset (for constructing the prediction model) and confirmation dataset (for verifying that the prediction model really does perform better).
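
A minimal Python sketch of such a split, assuming the respondent records are held in a list. The 50/50 proportion, the seed and the function name are illustrative choices, not recommendations from the thread.

```python
import random

def split_dataset(records, prediction_fraction=0.5, seed=0):
    """Randomly split records into a prediction set and a confirmation set."""
    rng = random.Random(seed)          # seeded so that the split can be reproduced
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prediction_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical respondent identifiers
respondents = list(range(10))
prediction_set, confirmation_set = split_dataset(respondents)
print(prediction_set, confirmation_set)
```

The prediction model is built only on the first set; its sensitivity and specificity are then checked on the second set, which played no part in building it.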

337. Calaulating standard error

turbetjp September 20th, 2012, 2:39pm: I am new to Rasch analysis and I am trying to replicate a previous analysis. How would I calculate a standard error of the mean from summary statistics provided by Winsteps? Thanks.

Mike.Linacre: Thank you for your question, Turbetjp.

For most numbers, the "standard error of the mean" is the standard deviation of the numbers divided by the square-root of the count of the numbers. But you already know that, .... :-)
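
In Python this is a one-line calculation. The sketch below assumes the standard deviation and the count are read from the summary statistics table; the numbers are made up.

```python
import math
from statistics import stdev

def sem_from_summary(sd, count):
    """Standard error of the mean from a reported S.D. and count."""
    return sd / math.sqrt(count)

def sem_from_values(values):
    """Standard error of the mean computed directly from the values (sample S.D.)."""
    return stdev(values) / math.sqrt(len(values))

# Hypothetical summary statistics: S.D. of 1.2 logits across 36 persons
print(round(sem_from_summary(1.2, 36), 3))
```

Whether the reported S.D. is a population or a sample standard deviation makes little difference unless the count is small.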

338. factor analyse-Could I categorize each student?

ace September 14th, 2012, 10:54pm: Hi,

I collected a 7-day food frequency questionnaire and the demographic information of the same 300 students. After running a factor analysis on the students' 7-day food frequency table, I found 8 principal components (identified as food patterns afterward) with eigenvalue > 1.

1) Could I use the 5 principal components (diet patterns) that have the highest eigenvalues, out of the 8 principal components (diet patterns) I got, for my further analysis?

2) Is it unfair if I assign each student to the principal component (diet pattern) on which they got the highest factor score?

3) If not, is there any way I could classify each student into a principal component (diet pattern) after factor analysis? (because finally I want to find out each student's diet pattern)

Thank you!...

Mike.Linacre: Thank you for your questions, Ace.

As usual, the challenge is how to communicate our findings to ourselves, and then to others. Your analytical situation is similar to the Myers-Briggs Type Indicator and other psychological instruments. For these, we usually identify the dominant trait(s) and identify the subjects by those. So, it makes sense to do the same with your instrument.

Please look at the content of the 8 principal components. One or more may be uninformative for your purposes. For instance, one component may contrast "beef consumption" with "pork consumption". This contrast may be important ethnographically, but may not be important nutritionally.

ace: Again Many Thanks Mike...Could you tell me could I assign each student to a dominant diet pattern like below..

I reduced the number of components (dietary patterns) to 5 out of 8. I labeled the five components (dietary patterns) as A, B, C, D and E. Now I have 5 factor scores (one per diet pattern) for each student, as below.

Student no.   A         B         C         D         E
1             1.8976    0.0234   -0.3456    0.9872    1.3412
2            -2.2356    1.2890   -0.4500   -2.9065    1.5680
3            -0.6712    1.9056    1.8567   -0.8790    1.0967
4            -0.0214   -0.0567    1.5432    1.3240   -2.7658
5             1.7895    1.1567   -0.4567    2.0123   -0.0345

Can I label each student's diet pattern as A, B, C, D or E by looking at the factor scores? Could I assign to each student the diet pattern with the highest factor score? Otherwise, is there any other statistical way I can label each student as A, B, C, D or E using the factor scores?

Thank you!...

Mike.Linacre: Ace: the usual approach is to categorize the students according to their absolute biggest factor scores, but this is outside of my area of expertise.

Ace, please ask this question on a SAS or SPSS Discussion Board.
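
The "absolute biggest factor score" rule can be sketched in Python using the five students' scores from the table above. Student 2's D score is read as -2.9065, which is an assumption about a garbled entry. Note that when the largest absolute score is negative, the student is strongly unlike that pattern, so the sign is reported alongside the label.

```python
# Factor scores per student for diet patterns A-E (from the table above;
# student 2, pattern D is assumed to be -2.9065)
factor_scores = {
    1: {"A": 1.8976, "B": 0.0234, "C": -0.3456, "D": 0.9872, "E": 1.3412},
    2: {"A": -2.2356, "B": 1.2890, "C": -0.4500, "D": -2.9065, "E": 1.5680},
    3: {"A": -0.6712, "B": 1.9056, "C": 1.8567, "D": -0.8790, "E": 1.0967},
    4: {"A": -0.0214, "B": -0.0567, "C": 1.5432, "D": 1.3240, "E": -2.7658},
    5: {"A": 1.7895, "B": 1.1567, "C": -0.4567, "D": 2.0123, "E": -0.0345},
}

def dominant_pattern(scores):
    """Return the pattern label whose factor score is largest in absolute value."""
    return max(scores, key=lambda label: abs(scores[label]))

for student, scores in factor_scores.items():
    label = dominant_pattern(scores)
    sign = "+" if scores[label] >= 0 else "-"
    print(student, label, sign)

assignments = {s: dominant_pattern(sc) for s, sc in factor_scores.items()}
```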

ace: OK Mike. Many thanks for the advice you have continually provided until now. I really appreciate it. :)

339. RSM and PCM nested in Facets?

Raschmad September 14th, 2012, 3:57pm: Dear Mike,
I think Facets is a different and independent member of the family of Rasch models.
However, I guess that when we analyse rating scales with the Facets model to account for rater severity and leniency, we still need to use the rating scale model, the partial credit model, or the dichotomous model, depending on our data. Am I right?
So, are the other models nested within Facets?


Mike.Linacre: Anthony, for unidimensional Rasch models, yes.

RSM, PCM, DM and other models specify how the different possible scored categories of an observation relate to each other.

MFRM (Many-Facets Rasch Model) specifies how the different components of the situation interact to generate the observations.

340. Factor analysis-Large number of components?

ace September 13th, 2012, 5:44pm: Hi
I have done a factor analysis using SPSS 16.0 and the result gave 11 components with eigenvalue > 1. The total variance explained by the 11 components is 57.875%. Is it unusual to have such a large number of components with eigenvalue > 1? Is there any suggestion for reducing the number of factors to a smaller number, like three or four?

Thank you!

Mike.Linacre: Thank you for your question, Ace.

1) The number of components = number of variables - 1, but the smaller components are not usually reported.

2) In an unrotated, orthogonal PCA analysis, roughly half of the components are expected to have eigenvalues > 1.0

3) In an unrotated orthogonal Common Factor analysis, the proportion of components expected to have eigenvalues > 1.0 depends on the size of the auto-correlations in the main diagonal of the correlation matrix.

4) The number of reported components (usually with eigenvalue > 1.4) can be decreased by manipulating the components using rotation of axes and obliqueness of axes. SPSS supports these operations.

ace: Many thanks Mike :) :) :)

341. disattenuated clarification

uve September 10th, 2012, 11:58pm: Mike,

In the literature, the disattenuated correlation of two tests is defined as the correlation coefficient divided by the square root of the product of the two test reliabilities. My question is: are the reliabilities mentioned the Rasch person reliabilities or the Cronbach alphas?

Mike.Linacre: Uve, if you are correlating person raw scores, then use Cronbach Alpha (a reliability based on raw scores). If you are correlating Rasch person measures, then use the Rasch person reliability (a reliability based on person measures).
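
The definition quoted in the question can be written as a short function. This is only a sketch of the arithmetic: the function name and the example values (a correlation of 0.60 with reliabilities of 0.80 and 0.90) are hypothetical, and the reliabilities passed in should match the kind of score being correlated, as described above.

```python
import math


def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Observed correlation divided by sqrt of the product of the reliabilities."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical example values, not taken from the thread.
example = disattenuated_correlation(0.60, 0.80, 0.90)
print(round(example, 3))  # about 0.707
```

With low reliabilities the result can exceed 1.0, which is usually read as "the two tests are measuring the same thing, within measurement error".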

uve: Thanks as always!

342. Hx of Scale Scores for MMPI and MMPI-2

RaschModeler_2012 September 11th, 2012, 3:47am: Hi again Mike,

You are such a wealth of information and you make yourself so accessible that I thought I might ask you yet another question. Please forgive me for all my questions.

From what I can tell from this website...


....traditionally, the MMPI scale scores were standardized to a mean of 50 and standard deviation of 10 (without normalization), allowing for the original shape of the distributions to remain positively skewed. Problem is, specific scores on a given scale were quite different with respect to percentile ranks compared to other scales. As a result, an additional study was conducted to normalize the standardized scores; that is, to change the shape of the scores to a normal distribution, thereby rendering the percentile ranks equivalent across scale scores. In the creation of the MMPI-2, however, they decided this was an unwise decision, and now they've created what's known as "uniform T scores" by combining the raw scores of the 8 clinical scales into a composite distribution, followed by regressing the composite scales against the overall composite to obtain T-score conversion formulas. They are basically trying to reduce the discrepancies across the scale distributions (without using an extreme approach such as normalization), while maintaining the greatest discrimination at the high levels of the scale scores.

As I see it, they're basically trying to retain the distributions of the scale scores as best they can such that they can equate a T-score of say 65 to be considered the clinical threshold for all scales. This certainly makes it easier for the clinician in the field, but is it worth it?

I realize the statement above does not do them justice (I have left out key details), but I believe it gets to my fundamental question. As someone so knowledgeable in measurement theory, and particularly the Rasch model, is it really so important to equate scale scores measuring relatively unique personality traits to have equivalent "cut-off" scores?

Regardless of how they developed these scales (e.g., CTT based approach or a hybrid type of approach), would it not be better to simply develop a MEASURE of each trait separately (assuming unidimensionality, which may not be the case!) and stop worrying about having cut-off scores that are equivalent across various personality traits; in other words, to stop worrying about forcing the distributions to be the same or roughly the same across measures of various personality traits?

Perhaps I'm missing the larger picture. Your thoughts would be most appreciated on this matter.

What's your take?

As always, thank you. Apologies for rambling...


Mike.Linacre: RM, good thoughts :-)

Let's consider the same situation in the physical world:
8 physical clinical scales = body temperature, blood pressure, ....

With some mathematical manipulation, we could transform each indicator to have the same threshold score, of say, 65, and also the same S.D. (Medical variables are sometimes graphed this way, with the x-axis labels maintaining their original values.) These transformed indicators could then be averaged or summed. Thus we could summarize all the physical indicators into a "health measure".

In practice, however, medical practitioners would probably prefer the composite "health measure" to be strongly influenced by "bad" indicator-values, because "good" values are unlikely to compensate for "bad" values.

343. Compare same item (different contexts)

JeySee September 4th, 2012, 6:57pm: Hi there!

I'm currently analysing a partial credit test which basically consists of two parts: test-takers have to do the same thing twice in different contexts (item a1, a2, b1, b2...). I assigned the same codes to the different contexts and now I want to check whether the items differ significantly according to contexts.
Can DIF / DPF help me there? I tried to work that out but wasn't very successful :(

I would be very thankful for some help!

Mike.Linacre: Thank you for your question, JeySee. What software are you using?

Here is how I would do this in Winsteps:
1. Each item-context is a column
2. Each test-taker is a row
3. We can then compare the two matching items in any way we want

JeySee: Hi Mike,
thanks for your fast answer!

I'm using Winsteps.
I've done steps 1&2, but I wonder how I would tell the program which items have the same context (which pairs belong together).
I tried to stack the items and entered a new variable (1 = context 1; 2 = context 2). Then I computed the DIF for 1 vs. 2.
But I'm wondering about two things:
1. If I stack the items, then the program doesn't know that the two items were answered by the same person (and test statistics like reliability etc. change) and
2. I treat the context-variable as person-variable which it clearly is not. Can that be right?

I appreciate your help!

Mike.Linacre: JeySee, what comparison between the two versions of the items do you want?

For instance, if you want to compare the item difficulties to discover which pairs have significantly different difficulties, then

1. in the item labels, put the item-context code "a1", "a2", ... "b1", "b2", ....

2. model the two versions of each item to share the same partial-credit structure
ISGROUPS = 12......12.......

3. The analysis will produce a measure, S.E., and count of observations for each item.
So, for each pair of items, we can use Excel to compute Welch's variant of Student's t-statistic: https://www.winsteps.com/winman/index.htm?t-statistics.htm
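
Step 3 can also be done outside Excel. The sketch below assumes the standard Welch-Satterthwaite form of the statistic, with each item's S.E. used as its standard error and its count of observations used as its sample size. The function name and the example numbers are hypothetical, not output from any analysis in this thread.

```python
import math


def welch_t(measure1, se1, n1, measure2, se2, n2):
    """Welch's t and approximate d.f. for the difference between two measures."""
    var_sum = se1 ** 2 + se2 ** 2
    t = (measure1 - measure2) / math.sqrt(var_sum)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = var_sum ** 2 / (se1 ** 4 / (n1 - 1) + se2 ** 4 / (n2 - 1))
    return t, df


# Hypothetical pair: item a1 vs. item a2, each answered by 100 test-takers.
t, df = welch_t(0.50, 0.15, 100, 0.10, 0.20, 100)
print(round(t, 2), round(df, 1))  # 1.6 183.6
```

With this many degrees of freedom, |t| of about 1.96 or more would be the usual two-sided .05 criterion, so the hypothetical pair above would not be flagged as significantly different.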


JeySee: Mike,
that's exactly what I need! Thanks a lot!
I'll try this!

344. Interpretation of interval and Scale of Ratios

RaschModeler_2012 August 30th, 2012, 10:48am: Hello:

Suppose we parameterize the following binary-response Rasch model:

logit_ij = log[(pij) / (1-pij)] = theta_j - beta_i

The model is fitted employing maximum likelihood estimation. Further, suppose the first person obtains a "Rasch score" of logit=+1.5.

With the assumptions of the Rasch model met, one would interpret this as the first person having an ability level of 1.5 along the logit scale. Because this scale is at the interval level, the distance between a logit of 1.5 and 1.3 is the same as the distance between a logit of 1.3 and 1.1, correct?

Further, suppose the underlying construct reflects overall intelligence (e.g., IQ), and I want to determine how many times higher on intelligence a person with a logit of +1.5 is than a person with a logit of +1.0. I assume one would need to exponentiate these values as follows:

Person 1 = exp(+1.5) = 4.48 (rounded to the 2nd decimal)
Person 2 = exp(+1.0) = 2.72 (rounded to the 2nd decimal)

The ratio of these two odds would be: OR= 4.48 / 2.72 = 1.65.

Exactly how does one interpret this OR? Does this indicate that person 1 has 1.65 times higher intelligence than person 2?

Let's try another example:

Person 3 = exp(2.5) = 12.18 (rounded to the 2nd decimal)
Person 2 = exp(1.0) = 2.72 (rounded to the 2nd decimal)

The ratio of these two odds would be: OR=12.18 / 2.72 = 4.48.

Again, would this indicate that Person 3 has 4.48 times higher intelligence than person 2?

I suppose the fundamental question is how does one interpret a Scale of Ratios; that is, while there may not be an absolute zero (a.k.a. Ratio scale), is it possible to make a statement about how many times greater/higher one person is relative to another, and if so, how is it done?

Any thoughts would be most appreciated.


Rasch Modeler

Mike.Linacre: Thank you for your question and explanation, Rasch Modeler.

For questions like this, it is always useful to think about the same situation in physical measurement. Instead of measuring ability, let us measure mountains.

How do we decide that one mountain is twice as high as another?
If we measure mountain-heights from the sun, then the relative heights of the mountains are always changing. Meaningless!
If we measure mountain-heights from the center of the earth, then all mountains are almost the same height. Useless!
No, mountain-heights are compared from an arbitrary, convenient, somewhat fictional, commonly-understood reference point: "sea-level". In my part of the world, the level of the sea is always changing. Someone, somewhere, must have defined "sea-level" as a constant value that everyone, everywhere, now uses as the mountain-height reference point (except for mountains under the ocean).

It is exactly the same on a logit scale. Someone, somewhere, must define the arbitrary, convenient, somewhat fictional, commonly-understood reference point.

Let us define some possible reference-points:

Math ability: a convenient zero point could be the average math ability of children at the beginning of Grade 1.

Intelligence: a convenient zero-point could be the lowest intelligence level required for someone to function independently in society.

How does this sound to you, RM?

RaschModeler_2012: Dear Mike,

Thank you VERY much for responding. This dispels a myth of mine that raw scores which are converted into Rasch interval-level scores (on the logit scale) could be converted (by exponentiating the log odds) into a meaningful multiplicative scale (e.g., person a's ability is x times person b's ability) without requiring a reference zero point.

At any rate, I'd like to follow through with this question to make sure I understand fully how to achieve what I desire. On a typical IQ test (based on CTT) for which the approximately normally distributed scores have been standardized to a mean of M=100 and standard deviation of SD=15, a score below 20 (~5.33 standard deviations below the mean) is considered profound mental retardation which requires constant supervision.

With that in mind, suppose we conducted a validation study of a new IRT-based IQ test based on a large representative sample of the U.S. (e.g., N=50,000), and assuming we obtain an approximately normal distribution of logit scores, we consider a cut-off of 5.33 standard deviations below the mean as the 0 reference point. What would be the steps to convert this interval-level measure to a ratio level measure such that I could make statements such as "Person A is x times more/less intelligent than person B?"

I could see how the first step would be to standardize the logit scores to a mean of 0 and standard deviation of 1. This would tell us precisely which scores fall at or below a z-score of -5.33, but what would be the next step (i.e., conversion formula) in making a z-score of -5.33 be the reference zero point such that I could create an approximate ratio scale? Again, the end goal would be for me to be able to say that "Person A is x times more/less intelligent than person B."

Sorry if the answer is obvious.

Thanks again for all your help.


RaschModeler_2012: As I think about this more, perhaps standardizing the scale to a SD of 1 is the wrong approach. I have a feeling that by doing this, I might remove the interval level property of the scores!

Clearly I need some guidance...

Thanks again,


RaschModeler_2012: I apologize for posting three times in a row, but there is an obvious error that I've made that really should be rectified before Mike (hopefully) responds. That is, the assumption that 5.33 standard deviations below the mean would reflect someone who has "profound mental retardation." If the goal is to create an interval-level scale and then derive a zero reference point, this should NOT be informed by a CTT based test which is likely not on an interval-level scale. As such, we would need to determine a cut-off point that constitutes "profound mental retardation" based on our own investigation during the validation process. Since that cut-off point (in logits) is not going to be exactly 0.00, my question is how do we convert the measure such that that cut-off point becomes the zero-reference point, thereby allowing us to create an approximate ratio based scale.

Ok. No more comments until Mike responds. :-)



Mike.Linacre: Thank you for this exploration, RM2012. It highlights a fundamental difference between the conventional descriptive-statistics approach to solving a problem and a (Rasch) measurement approach.

Your description of IQs is norm-referenced. We choose a sample. Assume it has a certain distribution (usually normal). Choose a cut-off point on the distribution (usually expressed as n S.D.s or n percent).

The Rasch approach, which is similar to the physical measurement approach, is first to define the cut-off point substantively (e.g., "freezing point", "boiling point").

"What are the indicators of: profound mental retardation which requires constant supervision?"

When we have defined these, we can draw a ruler for intelligence similar to these rulers for the WRAT - https://www.rasch.org/rmt/rmt84q.htm - We can then apply the usual conventions for measuring that we apply to rulers. "Twice" = "Twice as far from the point we have chosen to mark as zero on the ruler" (which may be in the middle of the ruler in some applications).

Now, how do we handle "twice as hot" for temperature? We usually take the reference point as "room temperature", even though there is also an absolute zero of temperature (which is never used in the course of ordinary activity). Similar logic applies to psychological variables.

RaschModeler_2012: Mike,

Thank you so much for the clarification. Just to make sure I am understanding this concept in the context of the standard dichotomous Rasch model:

logit_ij = log[(pij) / (1-pij)] = theta_j - beta_i

Suppose indicators of "profound mental retardation requiring constant care" are located approximately on the Wright map around -3.6 logits. So, we decide to make this value the zero point. Further, suppose that person A's estimated intelligence score is -2.6 logits and person B's estimated intelligence is -1.6 logits. Would I be correct in stating that "Person B is estimated to be twice as intelligent as person A, relative to an individual who has profound mental retardation requiring constant care"?

Thanks again! Much appreciated!


Mike.Linacre: RM, exactly right!

Give me a place to stand and I will move the world.
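
The "twice as far from the chosen zero" reading can be checked with a few lines of Python. This is a minimal sketch using the -3.6, -2.6 and -1.6 logit values from the thread; the function name and the alternative zero point of -4.6 are hypothetical, added to show how strongly the ratio depends on where the zero is placed.

```python
def distance_ratio(measure_a, measure_b, zero_point):
    """Ratio of B's distance above the zero point to A's distance above it."""
    dist_a = measure_a - zero_point
    dist_b = measure_b - zero_point
    if dist_a == 0:
        raise ValueError("person A sits exactly on the zero point")
    return dist_b / dist_a


ratio = distance_ratio(-2.6, -1.6, zero_point=-3.6)  # zero point from the thread
alt = distance_ratio(-2.6, -1.6, zero_point=-4.6)    # hypothetical alternative zero
print(round(ratio, 2), round(alt, 2))  # 2.0 1.5
```

The same two persons are "twice" or "one and a half times" as far along the variable depending on the reference point, which is why the zero has to be defined substantively before any ratio statement is made.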

RaschModeler_2012: Mike,

Illuminating! Thank you very much. I have a follow-up question, if I may. It pertains to the Rasch logit model I've presented a couple of times in this thread. In logistic regression, if one were to exponentiate a regression coefficient (log odds ratio), the result would be an odds ratio. My question is this...

Using the intelligence measure example I provided before with -3.6 as the "zero point," if we were to exponentiate the difference in logits for person A and person B, would those estimates be interpretable? Concretely,

Person A: exp[-2.6 - (-3.6)] = exp(1) = 2.72
Person B: exp[-1.6 - (-3.6)] = exp(2) = 7.39

I'm not really sure if these estimates are interpretable/meaningful.

Thank you,


Mike.Linacre: RM, we have transformed from "1" and "2" into "x" and "x^2". We would see something similar if we exponentiated the heights of mountains above sea-level.

For the odds-ratio perspective, you may find "Log-Odds in Sherwood Forest" interesting: https://www.rasch.org/rmt/rmt53d.htm
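
A small Python sketch of what exponentiation does here. It uses the dichotomous Rasch model from the thread, under which the odds of success for a person on an item are exp(theta - beta); the function names are hypothetical. It shows that logit distances of 1 and 2 become x and x^2, and that the odds ratio between two persons depends only on their logit difference, not on the item.

```python
import math


def success_odds(theta, beta):
    """Odds of success under the dichotomous Rasch model: exp(theta - beta)."""
    return math.exp(theta - beta)


def odds_ratio(theta_a, theta_b, beta=0.0):
    """Ratio of two persons' odds of success on the same item."""
    return success_odds(theta_a, beta) / success_odds(theta_b, beta)


zero_point = -3.6
person_a = math.exp(-2.6 - zero_point)  # about 2.72, i.e. x
person_b = math.exp(-1.6 - zero_point)  # about 7.39, i.e. x squared
print(round(person_a, 2), round(person_b, 2))

# The odds ratio is the same whichever item difficulty is used.
print(round(odds_ratio(1.5, 1.0), 2), round(odds_ratio(1.5, 1.0, beta=2.0), 2))
```

So the 1.65 computed earlier in the thread is a statement about relative odds of success on any one item, not a statement that one person has 1.65 times the intelligence of the other.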

RaschModeler_2012: Thanks, Mike. Based on your response and the website you linked to, it seems to me that staying on the original interval-level [logit] scale is far easier to interpret. One can easily convert the interval-level [logit] into an equal-interval ratio scale by determining a meaningful zero point. I do not see the utility in exponentiating the values.

Thanks again.


Mike.Linacre: Agree, RM. Physiologists tell us that our eyes see in ratios, but our brains convert those ratios to interval distances, which are much easier to think with.

345. test equating - concurrent and anchor methods

Bonnie May 10th, 2012, 1:17am: Hi Mike,
I am working on test equating using four alternate test forms. Each form has 10 common items and 30 unique items. The common items are the same across all 4 forms. I've tried two different methods for equating the forms based on what I've read in the literature: Concurrent calibration - running all 4 forms together as one test with empty spaces for missing items; and the anchor item method - first running the common (anchor) items together to get scores for those items, then running each form separately with the scores for the anchor items fixed. My question is related to the results. When looking at the person statistics, the results of the concurrent analysis are all about .39 logits higher than the results using the anchor method (The difference ranges by person from .34 to .44, and the results of the two methods correlate nearly perfectly at .999). The item statistics also show a difference in results between the two methods of about .40 logits and correlate nearly perfectly. My questions are:

Is it typical for one method to have slightly higher results than the other?

Also, I'm assuming that as they correlate, both methods are equally good. You just have to pick one method, set your cut points and stay with it, is that correct?

Finally, I was also going to try the mean sigma method as an equating method, but it seems to be designed with the assumption that you have only 1 base form to refer back to. For the other methods, I have been putting all 4 forms together rather than using 1 form as a base. Is there a problem with analyzing 4 forms together to get the anchor measures rather than getting them from 1 base form?

Thank you.

Mike.Linacre: Thank you for your explorations, Bonnie.

Please verify that the shift of 0.39 is not merely a shift in the local zero.

What is the average difficulty of the common items in the "common items together" analysis? Probably zero.

What is the average difficulty of the common items in the "all items" analysis? Probably about 0.39 logits.

If so, then the choice of zero point is the difference.

In the "all items" analysis.
From the Specification menu, assuming items 1-10 are the common items
Output Tables
Table 3.1 Summary

This will tell you the mean item difficulty of the anchor items (expected to be 0.39 logits)
Specification menu
UMEAN = - (mean common-item difficulty)

The item difficulties should now agree.
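
The arithmetic behind this re-centering can be sketched in Python. This only illustrates the shift of the local zero; it is not a reproduction of what Winsteps does internally, and the item labels and difficulty values are hypothetical.

```python
def recenter_on_common_items(difficulties, common_ids):
    """Shift all difficulties so the common items have a mean of zero."""
    mean_common = sum(difficulties[i] for i in common_ids) / len(common_ids)
    shifted = {item: d - mean_common for item, d in difficulties.items()}
    return shifted, mean_common


# Hypothetical "all items" difficulties: two common items and one unique item.
all_items = {"c1": 0.09, "c2": 0.69, "u1": -0.50}
shifted, mean_common = recenter_on_common_items(all_items, ["c1", "c2"])
print(round(mean_common, 2))                       # 0.39
print({k: round(v, 2) for k, v in shifted.items()})
```

Because every difficulty moves by the same constant, differences between items are untouched, which is why the two analyses can correlate at .999 and still differ by about 0.39 logits.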

Yes, the equating methods are equally good in principle. In practice, we look at the situation and see what part of the data we trust the most. Usually we trust the common items the most, but we may trust one test form the most, particularly if there is a master form with some minor variants.

timstoeckel: Hi Mike,

In a follow up to this thread on equating, I have a question about testing the equivalence of reliability coefficients for multiple test forms. In the literature on test equating, it is frequently mentioned that the reliability of each equated form should be the same. I have also come across some papers which explain how to test the equality of alpha coefficients. Would you be aware of a procedure for testing the equality of person reliability measures produced by Winsteps (Table 3.1)?

My 240-item bank has a real person reliability of .92, and the four 60-item forms have .88, .87, .86 and .92.

Thanks in advance for your help.


Mike.Linacre: Tim,

They say "the reliability of each equated form should be the same" - I wonder why? It is like saying that we can only equate the temperatures on two thermometers if the thermometers have equal precision (=reliability).

But, if we must compare multiple reliabilities, then please Google "Hakstian-Whalen test"

timstoeckel: Hi Mike,

Thanks for pointing me to the Hakstian-Whalen test. I was even able to dig up an online calculator posted by someone at Penn State.

About the equating criteria of equal reliability, it all depends on what the test is to be used for. If two thermometers are both used to find out if someone has a fever, their level of precision does not need to be equal, but each had better be accurate to within a degree (or less?) of true temperature, no?

Cheers and thanks again,


Mike.Linacre: Tim: your thermometer example is great! So we don't need "equal reliability", we need "reliable enough". And, of course, reliability (= precision) is internal to the instrument. Accuracy (comparison with an external standard) is external to the instrument. We usually consider "accuracy" to be addressed by "test validity".

timstoeckel: Mike,

Thanks as always for your insights. A colleague and I will be presenting on the validation and equating work we have done with our vocab tests, and your feedback inspires confidence. =)

346. Number of Person

davis August 25th, 2012, 6:04pm: There are 12 items in my test. I will use the Partial Credit Model. What is the minimum number of people needed to apply the Rasch model? Is there a limit?

Mike.Linacre: Davis, please see www.rasch.org/rmt/rmt74m.htm

The most important aspects of your design are the intermediate scores on your partial-credit items. You will need at least 10 observations of every score level on every item.

Assuming your person sample is reasonably targeted on the items, and that your intermediate score levels are reasonably wide on the latent variable, your minimum reasonable sample size will be about 100 students, but, if these assumptions are not met, then perhaps 1000 students. OK?
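
The "at least 10 observations of every score level on every item" guideline is easy to check before running an analysis. A minimal Python sketch, assuming responses are stored as one list of integer scores per person with None for missing data; the function name, the data layout and the tiny example are all hypothetical.

```python
from collections import Counter


def sparse_categories(responses, max_scores, min_count=10):
    """List (item index, score level, count) for levels seen fewer than min_count times."""
    problems = []
    for item, top in enumerate(max_scores):
        counts = Counter(row[item] for row in responses if row[item] is not None)
        for level in range(top + 1):
            if counts[level] < min_count:
                problems.append((item, level, counts[level]))
    return problems


# Tiny hypothetical data set: 5 persons, 2 items scored 0-2, threshold lowered to 2.
data = [[0, 1], [1, 2], [2, 2], [1, 0], [0, 2]]
flagged = sparse_categories(data, max_scores=[2, 2], min_count=2)
print(flagged)  # [(0, 2, 1), (1, 0, 1), (1, 1, 1)]
```

Score levels that are flagged are candidates for collecting more data or for collapsing with an adjacent level.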

347. Facets Linking Questions

djanderson07 August 23rd, 2012, 4:12pm: Hi all, I have a couple questions regarding linking with Facets:

1) I have multiple test forms that were constructed to be of equivalent difficulty with a Rasch model. They are intact forms with no common items between forms. I am currently in the early stages of planning an alternate form reliability study and I am interested in the possibility of using Facets, with "Form" being one of the facets. However, because there are no common items I'm guessing I would need to embed a set of common items across all test forms in the study to get the analysis to be fully linked? I worry about embedding additional items within each form because then the "form" for the study will be different from the operational test form, which would obviously not include those additional embedded items. Is there any other way to make the analysis linked? For background, the design I was thinking would have 4 facets: Person, Item, Form, and Occasion (there will be two testing occasions spaced one week apart). Any recommendations?

2) I also am interested in evaluating whether the domain the item was written to is a meaningful facet. For instance, some of the items were written to measure Statistics and Probability skills, while others were written to measure Geometry skills. However, it seems to me that the item type (i.e., the domain) is inherently confounded with the item itself. That is, it seems that even if I were attempting to do a Facets analysis with only one test form, in a 3 Facet design (Person, Item, Item type) that I would end up with subsets for each group of item types. Is my thinking correct here? If so, do you have any recommendations on how to best investigate item type as a meaningful facet?

Thanks so much for your help

Mike.Linacre: dj, this is a tricky situation.

Unfortunately neither (1) nor (2) appear to be suitable for a Facets-style analysis.

Both seem better suited to a rectangular (person-by-item) analysis. The person labels would include classifying variables such as Form and Occasion. The item labels would include the classifying variable of Item type. The classifying variables could then be used for subtotals, differential-functioning analyses, and also for data selections for sub-analyses.

For equating the forms, there are two approaches:
a) assume the Persons are randomly assigned to each Form, and then match the person means and S.D.s for the two Forms.
b) use Virtual Equating to match the item-difficulty hierarchies for the two forms. https://www.rasch.org/rmt/rmt193a.htm
We would hope that (a) and (b) would produce statistically the same equating constants.
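
Approach (a) amounts to a linear transformation of one form's measures. Here is a minimal sketch, assuming random assignment of persons to forms so that the two person distributions should match; the function names and the person measures are hypothetical.

```python
import statistics


def mean_sd_constants(reference_measures, new_measures):
    """Slope and intercept that give the new form the reference mean and S.D."""
    slope = statistics.stdev(reference_measures) / statistics.stdev(new_measures)
    intercept = statistics.mean(reference_measures) - slope * statistics.mean(new_measures)
    return slope, intercept


def to_reference_scale(measure, slope, intercept):
    """Express a measure from the new form in the reference form's frame."""
    return slope * measure + intercept


# Hypothetical person measures (logits) on two forms.
form_1 = [-1.0, 0.0, 1.0]
form_2 = [0.0, 2.0, 4.0]
slope, intercept = mean_sd_constants(form_1, form_2)
print(slope, intercept)  # 0.5 -1.0
print([to_reference_scale(m, slope, intercept) for m in form_2])  # [-1.0, 0.0, 1.0]
```

If the slope is close to 1.0, only the intercept matters, and it can be compared directly with the equating constant obtained from approach (b).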

djanderson07: Thanks Mike, I appreciate the recommendations.

348. Error F31

drmattbarney August 20th, 2012, 8:02am: Dear Mike

I'm experiencing an error "F31 in line 10: Too many responses: excess: 17"

The file looks okay, and I can't see where it is trying to import too many responses. I checked the Facets manual, without success.

I appreciate any guidance you can give

Thanks in advance,


Mike.Linacre: Matt, what version of Facets are you running? Is it the latest version, 3.70.1?
If not, please email me directly for download instructions: mike \~\ winsteps.com

If you are, please email me your specification file to match the Excel file.

drmattbarney: Yes, Mike, it's the most recent version. But curiously, it ran this time around. Either way, I sent you both files just now.

Thanks as always


Mike.Linacre: Thanks, Matt.

Am investigating.

Mike L.

349. Items with zero loading on the 1st contrast

Raschmad August 17th, 2012, 12:25pm: Hi Mike,
In PCA of residuals what does it mean when some items have zero or near zero loadings on the first contrast?
Does it mean that they measure neither the target dimension nor the secondary dimension?


Mike.Linacre: Raschmad, when the data fit the Rasch model, then the response residuals will look like normally distributed noise. We expect the items to have loadings near zero on the first and subsequent contrasts.

Zero loading on a contrast means "the residuals for this item are not correlated with the hypothetical component underlying the contrast."

350. Separation and reliability

Deej August 15th, 2012, 5:36am: Hi,
From going through past discussion, I gathered that
Real person separation of 3.50 and reliability of .92
Real item separation of 4.60 and reliability of .95
are very high, which is a good sign?
Thank you.

Mike.Linacre: Deej, any reliability greater than 0.9 is very good.

There is some unmodeled noise in your data. Please look at persons and items with mean-squares greater than 2.0. Are there some persons in your sample who are misbehaving?
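
For reference, separation and reliability are two expressions of the same information: reliability = separation^2 / (1 + separation^2). A minimal Python sketch (the function names are hypothetical) confirms that the values reported above are consistent with each other.

```python
import math


def reliability_from_separation(separation):
    """Reliability = G^2 / (1 + G^2), where G is the separation index."""
    return separation ** 2 / (1 + separation ** 2)


def separation_from_reliability(reliability):
    """Inverse relation: G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1 - reliability))


print(round(reliability_from_separation(3.50), 2))  # 0.92 (persons)
print(round(reliability_from_separation(4.60), 2))  # 0.95 (items)
```

Separation is often the easier number to compare across studies, because reliability is squeezed toward its ceiling of 1.0.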

Deej: Hi Mike,
Yes, there are a few respondents who are misbehaving. I've used "some reasonable item mean square range for infit and outfit" and went with the range of 0.6 to 1.4.

I would like to consult you with regard to how I am relying on Rasch measurements for my study.

(1) Used Winsteps to create the variable map.

(2) Calculated significant gaps because I wanted to model and measure developmental discontinuity. I followed the method in Bond and Fox, p.125: a t-test, the ratio of the difference between adjacent item estimates to the sum of the measurement errors for those 2 items. Here comes the first question: in the example in Bond and Fox, a significant gap is when t>2.0. But in my data, the most significant gap is only t>0.96, assuming I calculated it correctly, because I did it manually. So what I did was to find the standard deviation of the 59 gaps, and I regarded the values that were 2 SDs from the mean as significant gaps. I could not find past research to recommend whether I should rely on 1 SD or 2 SDs. In any case, does this look logical to you?

(3) Based on the method mentioned in (2), 5 gaps were identified. Effectively, the population could be categorised into 6 bands.

(4) Here is when I mixed in the qualitative data. My research questions were (a) What are the environmental attitudes of the population? (b) How were their environmental attitude influenced? Up to step (3), was my answer to research question (a). As for research question (b), I separated the varying reasons which each individual had cited into one of the 6 bands. I would like to find out if people are influenced differently with varying levels of environmental attitudes. I sort of hit a wall at this stage, because my sample is small, only 69, I could see differences in the reasons, for example, in the top band, there were 6 people, extended time spent in nature was mentioned 4 times. In the bottom band, there were 26 people, extended time in nature was cited 5 times. According to past research, people with better attitudes towards nature should be able to recall time spent in nature better because they simply spent more time in it and are more likely to have some form of gratifying experience which is crucial to the formation of favourable environmental attitudes. The problem, however, is, what if the people in the bottom band merely forgot to mention extended time spent in nature. The data was collected via a web survey, so there is not an interviewer to probe the respondents. In any case, I could mention this as a limitation in the study and leave it as such as a way to get around the wall?
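
The manual gap calculation described in step (2) can be scripted to reduce the risk of arithmetic slips. The sketch below follows the description given in this post (difference between adjacent item estimates divided by the sum of their two standard errors). The function name and the item values are hypothetical, and it is not checked against the Bond and Fox text.

```python
def adjacent_gap_statistics(items):
    """items: (label, measure, se) tuples. Returns (lower, upper, statistic) per adjacent pair."""
    ordered = sorted(items, key=lambda item: item[1])
    gaps = []
    for (label1, m1, se1), (label2, m2, se2) in zip(ordered, ordered[1:]):
        gaps.append((label1, label2, (m2 - m1) / (se1 + se2)))
    return gaps


# Hypothetical item estimates in logits with their standard errors.
example_items = [("i2", 0.0, 0.25), ("i1", -1.0, 0.25), ("i3", 0.1, 0.25)]
gaps = adjacent_gap_statistics(example_items)
for lower, upper, stat in gaps:
    print(lower, upper, round(stat, 2))  # i1 i2 2.0, then i2 i3 0.2
```

A common alternative divides by the joint standard error, sqrt(se1^2 + se2^2), which gives a larger statistic for the same gap, so it is worth stating in the report which denominator was used.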

Thank you.

Deej: After removing the misbehaving items, the item mean is 0. Is it a good sign?

Mike.Linacre: Thank you for asking for advice, Daojia.Ng.

The item mean is set to zero in every analysis, so this is neither good nor bad. If items are removed, all the item difficulties are changed, so that the mean stays at zero.

It looks like you are getting a lot of information from your small sample of 69. This is similar to a pilot study. There is nothing definite, but it may support or contradict previous research. This study has many limitations, such as self-reporting. In your report, you could state the limitations you think are most important.

Here are the questions you need to answer in your report:
Does this research confirm what other researchers have reported? Does this research contradict any findings by other researchers? Does this study suggest anything new that other researchers have not noticed?

Deej: Hi Mike,
This is gold. Thank you very much.
Now I can stop mumbling to myself when I meet my supervisors. :)

351. local independence and two-tier items

jack_b August 15th, 2012, 7:28pm: I am working on an analysis where some of the items are two-tiered: the first item asks for a response about a situation, and the second asks for the reason behind choosing the first response. Looking at Table 23.99 in Winsteps, several of these tiered items exhibit high residual correlations (see below), indicating local dependence. No other items exhibit high residual correlations.

My question is, because these items are meant to be paired is this OK (i.e., NOT a violation of local independence that I need to worry about) or do I need to score the items as only being correct if an individual gets BOTH tiers correct?

Pair Correlation
7-8 = 0.82
10-11 = 0.86
12-13 = 0.76
18-19 = 0.53
20-21 = 0.16

This is my first post here on the forum but I have been reading them as a guest for awhile now.

Thank you in advance for any help you can provide.


Mike.Linacre: Jack, there are several approaches to this problem, depending on how the relationship between the items is conceptualized.
You do need to be concerned about local dependence.
Approach 1: code one of the two items as missing data if its value can be inferred from the other item's value
Approach 2: sum the two items together into a "partial credit" item.
Approach 2 usually works better unless the items have a very strong incremental pattern.
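
Approach 2 can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, responses are coded 0/1 with `None` for missing, and a missing response in either tier is treated as making the combined item missing (a choice the thread does not discuss).

```python
def combine_pair(first, second):
    """Sum two dichotomous (0/1) responses into one 0-1-2 score.

    Assumption: a missing response (None) in either tier makes
    the combined score missing.
    """
    if first is None or second is None:
        return None
    return first + second

def combine_tiers(responses, pairs):
    """Replace each (tier-1, tier-2) column pair with one summed item.

    responses: list of per-person lists containing 0, 1 or None
    pairs: list of (i, j) zero-based column indexes to be summed
    """
    second_tiers = {j for _, j in pairs}
    partner_of = {i: j for i, j in pairs}
    combined = []
    for row in responses:
        new_row = []
        for col, value in enumerate(row):
            if col in second_tiers:
                continue  # absorbed into its tier-1 partner
            if col in partner_of:
                value = combine_pair(value, row[partner_of[col]])
            new_row.append(value)
        combined.append(new_row)
    return combined
```

For the data in this thread, the pairs would be the columns holding items 7-8, 10-11, 12-13 and so on, converted to zero-based indexes.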

jack_b: Thanks Mike!

352. Table 1.0 for polytomies

Ida August 15th, 2012, 11:16am: The interpretation is quite intuitive for dichotomies. Is there some material on how to interpret the graphics for polytomies? If the items are on different sections of the scale, can one then conclude that the questions are all too easy or too difficult?
Best regards

Mike.Linacre: Ida, for polytomies, please use a Winsteps Table like 2.2.
Table 1 is too cryptic until you are really familiar with interpreting polytomous-item functioning.

353. PCM analysis

Raschmad August 10th, 2012, 5:42pm: Mike,
I have analysed a scale with PCM. There are 30 items. Should I report the threshold estimates for all the items?
Only one item has disordered thresholds. Should I recategorize the response options for the whole scale now? Or should I recategorize the options for only that item?
Does the distance of 1.4 to 5 logits between thresholds also hold for individual items in a PCM analysis?
The disorder is very slight: -1.31, .69, .62. Is it very serious? It's only for one item.



Mike.Linacre: Raschmad, if the items all have the same rating-scale category definitions, then please group the items together as much as possible (ISGROUPS=) to share the same rating-scale category definition. This simplifies communication with your audience.

The very slight disordering of thresholds is probably due to chance, but, even if it is not, your audience will be confused if you recode the rating scale for one item but not for the others. How will they (and you) compare outcomes between the recoded item and the other items?

Raschmad: Thanks Mike,
I meant the partial-credit model. The reason I used PCM instead of the rating scale model is that it fits much better. Of course, communicating results from RSM to an audience is easier. But in RSM several items misfit or overfit, which I can't explain from a substantive point of view. PCM with the same data provides excellent item fit values. The only problem is that each item has unique threshold estimates.

1. Should all threshold estimates be reported for every single item?
2. The distance between some thresholds in some items is .40. They are close but they are all ordered. This means that respondents could distinguish among the category choices. Is the close distance a problem?


Mike.Linacre: Raschmad, Fit is a secondary consideration in the choice between PCM and RSM, particularly if items with similar threshold patterns are grouped together using ISGROUPS=. This is because differences in fit tend to be dominated by accidents in the data. Please see https://www.rasch.org/rmt/rmt143k.htm

Suggestion: model the data with PCM, output the person measures (PFILE=)
model the data with RSM
cross-plot the two sets of person measures
What difference has the choice made to the person measures?

If the choice makes no substantive difference, then use RSM.
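
The comparison of the two sets of person measures can also be summarized numerically. The sketch below is only an illustration: the measures are made-up values, not Winsteps output, and `compare_measures` is a hypothetical helper. It assumes the two lists hold the same persons in the same order, as they would if taken from two PFILE= outputs for the same data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def compare_measures(pcm, rsm):
    """Summarize how far apart two sets of person measures are."""
    differences = [a - b for a, b in zip(pcm, rsm)]
    return {
        "correlation": pearson(pcm, rsm),
        "max_abs_diff": max(abs(d) for d in differences),
    }

# Hypothetical person measures (logits) from the two analyses.
pcm_measures = [-1.80, -0.65, 0.10, 0.72, 1.95]
rsm_measures = [-1.74, -0.60, 0.08, 0.70, 1.88]
summary = compare_measures(pcm_measures, rsm_measures)
print(summary)
```

Whether a given maximum difference is "substantive" is a judgment about the intended use of the measures; it is often compared with the standard errors of the person measures.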

1. If you decide on PCM, then communicate to your audience the substantive implications of the differences in the thresholds.

2. Are you sure that "threshold ordering = ability to distinguish between categories"?
Surely: "fit of responses within category = ability to distinguish between categories"
"threshold ordering -> width of category on the latent variable"
Where strong disordering = narrow category, and strong ordering = wide category.
(Remember that thresholds are pairwise between adjacent categories, not global across all categories. If you want to think about global thresholds then please use Rasch-Thurstone thresholds, instead of Rasch-Andrich thresholds.)

Raschmad: Thanks Mike,
I shifted to RSM. The threshold estimates are -1.46, .37, 1.09.
They are not within your guidelines of 1.4 to 5 logits. Should I recategorize?


Mike.Linacre: Raschmad, these are wonderful thresholds.
Please look at my guidelines again. 1.4 logits is only if you want your rating scale to act like a set of dichotomous items. Do you really? That is an unusual requirement for a rating scale.
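
To see what these thresholds imply, the category probabilities of the rating scale model can be computed directly. This is a sketch under the standard Rasch-Andrich formulation, in which the log-odds of category k versus category k-1 equal (person ability - item difficulty - threshold k). The function name is hypothetical.

```python
from math import exp

def category_probabilities(measure, thresholds):
    """Rating-scale-model probabilities for categories 0..len(thresholds).

    measure: person ability minus item difficulty, in logits
    thresholds: Rasch-Andrich thresholds, one per step between categories
    """
    cumulative = [0.0]  # category 0 is the reference
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (measure - tau))
    weights = [exp(value) for value in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

thresholds = [-1.46, 0.37, 1.09]  # the RSM estimates quoted above

at_zero = category_probabilities(0.0, thresholds)
at_first_threshold = category_probabilities(-1.46, thresholds)
print([round(p, 3) for p in at_zero])
print([round(p, 3) for p in at_first_threshold])
```

At a measure equal to a threshold, the two adjacent categories are equally probable, which is the pairwise meaning of a Rasch-Andrich threshold mentioned earlier in this thread.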

354. Winsteps - choosing columns in PFILE

moffgat August 10th, 2012, 7:28am: Hello,

Is there any way that I can specify in the Winsteps control file which columns are exported into the PFILE? I can't find a way myself; I always get all the columns. Thank you all in advance.

Yours Frank

Mike.Linacre: Frank, please try this:
After a Winsteps analysis
Menu bar
Click on Output Files -
Click on Field selection
Click on desired fields
Set as default

moffgat: Thank you Mr. Linacre,

I was already familiar with this approach. My problem is that I want to run a batch file in which different control files need to output different PFILEs. Therefore I cannot set anything as the default. So I am looking for a way to specify the columns in the PFILE command in the control file. Of course it is not a problem to delete the columns I don't need afterwards; I just wanted to make the later processing a little easier.

Yours Frank

Mike.Linacre: Sorry, Frank. PFILE Column selection is not available in Batch mode.
Suggestion: Use the default column selection to eliminate all the columns you will never need. Then you will have fewer unwanted columns (but you have probably already done this).

The freeware version of http://www.parse-o-matic.com/parseomatic_free.aspx may do what you want.

moffgat: Thank you, now I know at least, that I don't need to continue looking for the right command ;) I will do as you recommended and try the software you proposed.

Yours Frank

Mike.Linacre: Frank: If you feel really bold, you could rewrite Winsteps.ini in your batch file.
Append the column selection line that you want each time.
1) Find the path to Winsteps.ini on your computer disk
For me it is:
"C:\Documents and Settings\Mike\Local Settings\Application Data\Winsteps.com\Winsteps\Winsteps.ini"
2) Launch Winsteps, analyze a data file,
3) Output Files menu: make the PFILE- field selection. Click on Make default
4) Open Winsteps.ini in NotePad. Copy out the OFSFIELDS=line
5) In your batch file, add this line before calling Winsteps in order to append the Field selection to the end of Winsteps.ini (overriding other field selections in Winsteps.ini)
6) back to 3) for the next field selection
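
The append step could be scripted, for example with a small Python helper called from the batch file before each Winsteps run. This is only a sketch under assumptions: Python is available on the machine, the helper name `append_field_selection` is hypothetical, and the OFSFIELDS= line must be the one copied from your own Winsteps.ini in step 4 (no real value is shown here).

```python
from pathlib import Path

def append_field_selection(ini_path, ofsfields_line):
    """Append a saved OFSFIELDS= line to the end of Winsteps.ini.

    ofsfields_line must be the complete line copied from Winsteps.ini
    after making the field selection the default (step 4 above).
    """
    if not ofsfields_line.startswith("OFSFIELDS="):
        raise ValueError("expected a line beginning with OFSFIELDS=")
    path = Path(ini_path)
    text = path.read_text()
    if text and not text.endswith("\n"):
        text += "\n"  # keep the appended setting on its own line
    path.write_text(text + ofsfields_line.rstrip("\n") + "\n")

# Example call (placeholders, to be replaced with your own values):
# append_field_selection(r"<path to your Winsteps.ini>",
#                        "OFSFIELDS=<line copied in step 4>")
```

Because each call appends another line, Winsteps.ini grows with every run; restoring a clean copy of the file at the start of the batch job would avoid that.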

moffgat: Oh Thank you, a little unconventional but if it works I am pleased :) I will try it this way.

Thank you and a nice weekend to you

Yours Frank

355. Rubric category stability

gjanssen August 8th, 2012, 1:05am: Good afternoon,

We are concerned about the rubric category stability of two different data sets (with the "same" test taker population), a speaking rubric and a writing rubric, and hope you might be able to share some insight.

CONTEXT. Across all six exam administration groups (each group being at least n = 50), the rater severity measures never vary by more than 0.15 logits, which we take to mean that the raters are consistent in their severity across exam administrations. We see the test-taker scores vary (but this is also somewhat to be expected, as different test-taker administrations have targeted different departments in the university, which are anecdotally known to have different English language abilities).

QUESTION. However--and what we don't understand--is why the different individual rubric categories (e.g., writing-content; writing-organization; writing-language control, etc.) vary widely--as much as 2 logits--between the different administrations. Are we wrong in thinking that category severity--similar to item severity--should be stable? Also, why would there be this rubric category instability, while the rater severity remains constant?

Thank you very much in advance for your thoughts.

Gerriet Janssen and Valerie Meier

Mike.Linacre: Thank you for your question, Gerriet and Valerie.

What are facets in your analysis? Which facet is non-centered? Do all rating-scale categories have at least 10 observations in Table 8 of Facets?

gjanssen: For our speaking test, the facets include: test taker ability, rater severity (2 raters), rubric categories (6 different categories), and question prompt (2 questions). We have completed Rasch analyses for both the global performance of the exam (n = 468) and for 6 different administration groups. These smaller groups have a minimum n-size of 50.

So, yes, there are more than 10 observations in the Table 8 series (one table for each rubric category).

For our writing test, the facets include: test taker ability, rater severity (2 raters), and rubric categories (5 different categories). There was only one question prompt. The same global/ test sub-groups were used. Our non-centered facet was test taker ability.

Again, thank you in advance for your ideas.

Gerriet and Valerie

Mike.Linacre: Gerriet and Valerie, yes, you are trying to obtain a lot of information from a small dataset.

1) Remove any test-takers whose performance is completely irregular from the analyses.

2) Can your content experts give you what they think is the correct difficulty order for the rubrics? If so, you can compare the difficulty order from each administration with the experts' order. This may help you identify any problems.

3) Please analyze all the data together for each test, and then do an (administration x rubric) bias analysis. This will tell you which rubric-administration combinations are the most problematic.

valmeier: Thanks for your speedy reply!

I think Gerriet meant to write n=468, which is the total test taker population. But our individual administration groups range between 50 and 110, so we'll follow your advice and see what happens.

Just to clarify: for step three, you are suggesting running all the administrations together (i.e. all 468 test takers) and then doing a bias analysis looking for an interaction between administration and each of the rubric categories. Did I get that right?

Thanks again,

Valerie & Gerriet

Mike.Linacre: Valerie & Gerriet, yes, you got Step 3 right :-)

356. Infit and outfit MNSQ

Raschmad August 8th, 2012, 3:02pm: Dear Mike,
In the WINSTEPS manual, for interpreting infit and outfit mean-square statistics, you have written that:
>2.0 Distorts or degrades the measurement system.
1.5 - 2.0 Unproductive for construction of measurement, but not degrading.
0.5 - 1.5 Productive for measurement.
<0.5 Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.

However, this is very different from what is practiced. Many researchers say, .70 to 1.3 for binary items and .60 to 1.4 for Likert items. And they remove items on the basis of these criteria.
Your guidelines are different. You say that values above 2 should be deleted.
Can you please explain?


Mike.Linacre: Raschmad, please see www.rasch.org/rmt/rmt83b.htm

We must choose the fit criteria to match the situation. Let's use the analogy of a road. When is it smooth enough to drive along? If we are racing sports cars, then even undulations may make the cars "bottom out". If we are driving a bull-dozer, then big pot-holes are no problem. Only crevasses and swamps must be avoided.

So, when we have the luxury of being able to construct a well-formulated test-instrument and administer it under well-controlled conditions, then we can impose tight fit criteria. If we are trying to extract useful information from a poorly-controlled dataset, then our fit criteria will aim at removing the worst of the garbage from the dataset.
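
The manual's bands quoted in the question can be written as a small lookup, which makes it easy to screen a list of mean-squares before deciding how strict to be. This is only a sketch: the function name is hypothetical, and the handling of values that fall exactly on 0.5, 1.5 or 2.0 is an assumption, since the quoted table does not say which band they belong to.

```python
def describe_mean_square(mnsq):
    """Return the manual's description for an infit or outfit mean-square."""
    if mnsq > 2.0:
        return "distorts or degrades the measurement system"
    if mnsq > 1.5:
        return "unproductive for construction of measurement, but not degrading"
    if mnsq >= 0.5:
        return "productive for measurement"
    return "less productive for measurement, but not degrading"

# Hypothetical item mean-squares, for illustration only.
for item, mnsq in [("item 1", 0.42), ("item 2", 1.08), ("item 3", 2.31)]:
    print(item, mnsq, describe_mean_square(mnsq))
```

Tighter criteria, such as the .70 to 1.3 range mentioned in the question, would simply change the cut-points, in line with the road analogy above.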

357. What type Output are suitable

iyas August 8th, 2012, 4:32am: Dear Rasch-stars,

I am researching gender differences in achievement on the topic of vectors. I have no idea which types of output are suitable. Can anyone advise me? Thanks.

Mike.Linacre: Iyas, please ask this question to the Rasch Facebook group:

358. Winsteps or Facet

hm7523 August 6th, 2012, 2:15am: Hi, Mike,

I am conducting a standard setting involving items and raters. I used the Yes/No Angoff method. Each rater was asked whether each item could be answered correctly by a borderline student. If the item could be answered correctly by the borderline student, the rater wrote yes; otherwise, the rater wrote no.

I want to evaluate both item difficulty and rater's severity. Thus, I think the two facet model is suitable for my study and the FACET software was used. My model is specified as follows:

ln(P1 /P0 )=B-D-W-T
P1= the probability of a Yes being rated on item i by panelist j.
P0= the probability of a No being rated on item i by panelist j.
B= panelists view of item difficulty,
D= difficulty of item i,
W= the severity of panelist j
T= difficulty of rating a Yes relative to No.

However, when I submit this paper, I get the reviewer's comment as follows:

"With two facets you are not really using the many-facet Rasch model but a simple rating scale Rasch model, where people answer to items using either 1 or 0 (i.e. yes at this level they can answer the item correctly or no they cannot answer correctly); this analysis could have been performed with Rasch computer programs such as Winsteps or RUMM, which do not implement the many-facet Rasch model. "

I am wondering whether the Winsteps can perform two facet analysis? In my case, can I still use FACET to conduct my analysis? or I have to use winsteps? Thanks!

Mike.Linacre: Hm7523, Facets can definitely analyze your data. It can analyze almost every dataset that is analyzable with RUMM or Winsteps. Winsteps or RUMM are used in preference to Facets, where possible, because conceptualizing the dataset and communicating findings are usually easier when the dataset is a rectangle.

But some details of your analysis need clarification.
1) With the usual definition of item difficulty, T is zero.
2) B and W appear to be the same thing, or else B and D are the same thing, or is B an interaction term between D and W?

In Winsteps, we would expect the main model to be:
ln(P1 /P0 )=B-D where B = rater leniency and D = item difficulty
Then we would investigate interactions between B and D with a secondary analysis.
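
The main model above converts directly into a probability. A minimal Python sketch, with a hypothetical function name and made-up values for B and D:

```python
from math import exp, log

def probability_yes(leniency, difficulty):
    """Dichotomous Rasch model: ln(P1/P0) = B - D.

    leniency: rater leniency B, in logits
    difficulty: item difficulty D, in logits
    Returns P1, the probability that the rater says Yes.
    """
    return 1.0 / (1.0 + exp(-(leniency - difficulty)))

# Hypothetical values: a lenient rater (B = 1.0) and a hard item (D = 1.5).
p1 = probability_yes(1.0, 1.5)
log_odds = log(p1 / (1.0 - p1))
print(round(p1, 3), round(log_odds, 3))
```

When B equals D the probability of a Yes is 0.5, and the log-odds recovered from the probability are again B - D.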

hm7523: Hi, Mike,
Thanks for your quick response. You are right, it seems that B and D are the same thing. If I change B to be the interaction term between D and W, should I still use Winsteps?
I am just wondering whether there are any advantages using FACET over Winsteps in my case (so I can keep my original analysis results). Or maybe add one more facet (such as gender)?

Mike.Linacre: Hm7523, Winsteps and Facets give the same results for similarly configured analyses.
For interactions, both do a post-hoc analysis. In Facets, this is Table 13. In Winsteps, this is Table 30.

My advice: agree with your reviewer's statement that "this analysis could have been performed with ....", and then point out that your findings would have been the same.

359. Interpreting significance of interactions

windy August 4th, 2012, 2:48pm: Dear Rasch-stars,

I performed an interaction analysis using Facets and I am interested in determining whether rater severity interacts with item type. Based on my understanding of the interactions in Facets, the t-statistics are testing the null hypothesis that a rater's severity on a certain type of item is not significantly different than the overall rater severity on that type of item.

However, when I'm looking at t-tests for 15 raters on these two types of items, it seems like it might be appropriate to do a post-hoc adjustment for the t-statistics in order to avoid Type I error. Would a Bonferroni adjustment be appropriate?


Mike.Linacre: Stefanie: a Bonferroni adjustment is appropriate if your hypothesis is "These raters are unbiased." No adjustment is needed if your hypothesis is "This rater is unbiased." And, of course, we choose the hypothesis before looking at the results of the statistical tests.
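
For the joint hypothesis ("These raters are unbiased"), the Bonferroni adjustment divides the overall significance level by the number of tests. The sketch below assumes 15 raters x 2 item types = 30 tests, as in the question, and uses a normal approximation to the t-distribution for the critical value, which is reasonable only when the degrees of freedom are large. The function names are hypothetical.

```python
from statistics import NormalDist

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the familywise level at alpha."""
    return alpha / n_tests

def two_sided_critical_z(alpha):
    """Two-sided critical value under a normal approximation to t."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

n_tests = 15 * 2  # 15 raters, 2 item types
per_test_alpha = bonferroni_alpha(0.05, n_tests)

print("per-test alpha:", round(per_test_alpha, 5))
print("unadjusted critical value:", round(two_sided_critical_z(0.05), 2))
print("adjusted critical value:  ", round(two_sided_critical_z(per_test_alpha), 2))
```

For the single-rater hypothesis ("This rater is unbiased"), the unadjusted critical value applies.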

360. "ITEM: dimensionality" question

qoowin August 3rd, 2012, 7:29am: Hello every one:

I need some help.
I use the Winsteps function called ITEM: dimensionality.
But I don't know how to set the number of factors.
Could somebody help me?
Thank you very much.

Mike.Linacre: Thank you for your post, qoowin.

Are you asking about Winsteps Table 23? This always reports the first five components of an orthogonal, unrotated PCA.

If you would like to perform a different factor analysis, please output the correlation table (ICORFILE=). Then use a general-purpose statistical package, such as SAS or SPSS, to do the analysis using that correlation table.

qoowin: Thank you for your response, Mike.Linacre

yes, i use table 23.
i will try it (ICORFILE=).

so... "orthogonal, unrotated PCA" maybe have not enough "the total explanation of variance" (<50%)?

thank you a lot.

Mike.Linacre: Qoowin, please look at https://www.rasch.org/rmt/rmt221j.htm - this predicts the amount of variance in the data that will be explained by Rasch measures.

361. SPSS & Table 3.1 Comparison

uve July 23rd, 2012, 11:09pm: Mike,

I am continuing my comparison of the advantages of the Rasch model as compared to CTT using SPSS with the same survey data in my previous post. This time I ran 4 separate reliablity analyses for two comparisons:

1) Comparison of the control group versus the treatment group on all 17 items
2) Comparison of items 1-11 versus 12-17 on all persons

SPSS Cronbach's Alpha:

1) .81 .82
2) .83 .79

Winsteps Cronbach:

1) .97 .81
2) 1.00 1.00

Winsteps Model Person Reliability, All Persons

1) .85 .66
2) .81 .81

Again, I am wondering why there is such a great difference between the two. I realize that model reliability is a different calculation and so may not be directly comparable, but others are confusing me. Thanks for taking the time as always to look this over.

Mike.Linacre: Thank you for these computations, Uve.

Based on their theories, we expect Cronbach Alpha to be greater than Rasch reliability, a situation many analysts have noticed in practice.

As an independent check on Cronbach Alpha (=KR-20 for dichotomies), I analyzed Guilford's Table 17.2. Here it is:

Title = "Guilford Table 17.2. His KR-20 = 0.81"

He reports KR-20 = 0.81. Winsteps reports Cronbach Alpha = 0.82. The difference is probably computational precision and rounding error.
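
For readers who want to check such a value by hand, KR-20 for complete dichotomous data can be computed in a few lines. This is a sketch using the textbook formula, not the Winsteps or SPSS code, and the small data set is a made-up Guttman pattern, not Guilford's table.

```python
def kr20(responses):
    """KR-20 for complete 0/1 data: rows are persons, columns are items.

    KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores),
    using the same (population) variance formula for items and totals.
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)

# Hypothetical data: 5 persons, 4 items.
data = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(round(kr20(data), 3))
```

The function assumes no missing responses; how missing data are handled is exactly where programs can differ, as the rest of this thread shows.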

Uve, what does SPSS report?

uve: Mike,

SPSS reports .815

To be honest, for my purposes I would not be that concerned over a .01 or .02 difference. What concerned me more was that Winsteps reported a perfect correlation of 1.00 for items 1-11 as well as items 12-17 while SPSS reported these as .83 and .79 respectively, which is a significant difference. Even the overall Cronbach in SPSS of .88, which I didn't mention before, is significantly lower than the Winsteps Cronbach of .93. I can't help but think I've done something seriously wrong here or am interpreting a technique incorrectly.

Mike.Linacre: Uve, good. It looks like SPSS and Winsteps agree about the basic computation.
Here are some areas where there can be differences:
1. Missing data.
2. Polytomous data.
3. Weighted data.
4. Rescored/recounted data.
Do any of these apply to your situation?

uve: Yes: 1, 2 and 4. The data was attached in my previous SPSS and Table 23 comparison.

Several students did not respond to all items.

There are 4 options for each item.

None of the options are weighted differently

I received the initial file in SPSS with items 12-17 scored 1-4 for some reason while items 1-11 were scored 0-3. I simply changed all student responses to these last 6 items as opposed to rescoring them, which would probably have been easier.

Mike.Linacre: Ouch, Uve! Both polytomous and missing data! So we need to isolate the difference.

1) Please choose a subset of polytomous data lines with no missing data. How do Winsteps and SPSS compare? They should agree.

2) Please choose a dichotomous dataset, such as exam1.txt. Make some observations missing. How do Winsteps and SPSS compare? They may differ. Winsteps skips missing observations in its computation, but SPSS may use case-wise deletion.

uve: Thanks Mike. That must be it. I chose a sub-sample of 21 students who responded to all 17 items and compared both Winsteps and SPSS Cronbach. They were identical at .78.

I then took a closer look at the original SPSS output and found this message:

"Listwise deletion based on all variables in the procedure."

There were a total of 121 respondents, but 3 chose no options to any of the items. I deleted these in the Winsteps file but kept them in the SPSS file. SPSS deleted them as well along with 18 others for a total of 21. So I guess it was working off of 100 valid cases.

There were also a few students who responded to only two or three items and I deliberated for quite some time as to whether I should remove them from the Winsteps file but decided against it.

So with a listwise deletion process, are both SPSS and Winsteps treating missing data in the same manner?

Thanks again as always!

Mike.Linacre: Uve, listwise deletion drops the data record if even one observation is missing. Winsteps does not do this. Winsteps computes the Cronbach variance terms using all the non-missing observations. For example, on a CAT test, listwise deletion would omit every person. Winsteps keeps every person.

So, if SPSS does automatic listwise deletion, and we deliberately delete records with missing data from a Winsteps analysis (using IDELETE= etc.), then the two computations should be the same.
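
The effect of listwise deletion is easy to demonstrate. In this sketch the data are made up, `None` marks a missing observation, and the function names are hypothetical; it only counts what each approach keeps and does not reproduce either program's reliability computation.

```python
def listwise_delete(records):
    """Keep only the records that have no missing observations."""
    return [row for row in records if all(value is not None for value in row)]

def count_observations(records):
    """Count the non-missing observations across all records."""
    return sum(1 for row in records for value in row if value is not None)

# Hypothetical survey data: two of the four respondents skipped an item.
survey = [
    [3, 2, 1, 0],
    [2, None, 1, 1],
    [1, 1, 0, 0],
    [None, 3, 2, 2],
]

# Hypothetical CAT-style data: each person sees only some of the items.
cat = [
    [1, None, 0, None],
    [None, 1, None, 0],
    [1, 1, None, None],
]

print(len(listwise_delete(survey)), "of", len(survey), "survey records kept")
print(len(listwise_delete(cat)), "of", len(cat), "CAT records kept")
print(count_observations(cat), "CAT observations available without deletion")
```

With the CAT-style data, listwise deletion leaves nothing to analyze, while an approach that uses every non-missing observation still has six responses to work with.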

uve: Mike,

Continuing the discussion of missing data but for a different purpose, my data had about 4% missing responses from various respondents to various items. With 118 respondents and 17 items with 4 Likert options each, can one use a percentage as a rough guide for when it might be best to switch from Mantel-Haenszel probabilities to Welch t-tests for DIF?

Also, I understand that DIF is primarily intended to pick up on individual item idiosyncrasies and one can have DIF without multi-dimensionality. But is the reverse true?

Mike.Linacre: Uve, MH is based on cross-tabs with one cross-tab for each ability stratum. MH is implemented in Winsteps using "thin" slicing (one stratum for each raw-score level).

Look at Winsteps Table 20.2; this will show you how many respondents there are at each score level (FREQUENCY). Generally we would need at least 10 respondents in each cell of the cross-tab, so that would be at least 40 respondents in each stratum. If many strata have fewer than 40, then increase MHSLICE= to encompass two or more ability strata.
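
The same check can be made outside Winsteps from a list of raw scores. This sketch only counts respondents per raw-score level and flags the thin ones, using the 40-respondent guideline above; it does not reproduce what MHSLICE= does, and the function name and scores are hypothetical.

```python
from collections import Counter

def thin_strata(raw_scores, minimum=40):
    """Return the raw-score levels that have fewer than `minimum` respondents."""
    counts = Counter(raw_scores)
    return sorted(score for score, n in counts.items() if n < minimum)

# Hypothetical raw scores: 40 people at score 10, 5 at 11, 39 at 12.
scores = [10] * 40 + [11] * 5 + [12] * 39
print(thin_strata(scores))
```

With a sample of about 118 respondents spread over many raw-score levels, most one-score strata will be flagged, which is the situation in which wider slices or another DIF statistic would be considered.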

Multidimensional items (e.g. a geography item in an arithmetic test) without DIF would indicate that the ability distribution on geography matches the ability distribution on arithmetic for both the focal and reference groups.

362. Model Specification & Bias Adjustment

melissa7747 July 26th, 2012, 8:25pm: Hello,

I'm attempting to adjust for bias between rater-items as indicated on p.245-6 in the manual.

I have 5 facets: examinees, prompts, raters, bias adjustment & items.

There are 7 raters, 3 bias adjustments (1=adjust for rater 5; 2=adjustment for rater 6; and 3=Everyone Else, 0) and 6 items all scaled 1-6

I specify 3 models:

?,?,5,1,4,R6; adjust for bias for rater 5 on item 4
?,?,6,2,4,R6; adjust for bias for rater 6 on item 4
?,?,?,3,?,R6; no-adjust for everyone else - including raters 5 & 6 for all items other than item 4

The rater-item combination I've identified are being read, as are the data from other raters. However, the reported Responses not matched in Table 2 equals the number of responses for raters 5 & 6 for all other items (i.e., items 1,2,3,5,6).

I thought my final model above would include these items for these raters. I also specified the final model before the 'bias adjustment' models thinking the order might be incorrect. Do I need to specify a model for each rater-item combination for rater 5 and 6 even if they don't require adjustment?

Thank you,

Mike.Linacre: Thank you for your questions, Melissa.

Here are some suggestions:
1. Let's use the same rating scale structure for all three model specifications.
2. Let's put everything not in models 1 and 2 into model 3.

?,?,5,1,4,MyScale; adjust for bias for rater 5 on item 4
?,?,6,2,4,MyScale; adjust for bias for rater 6 on item 4
?,?,?,?,?,MyScale; no-adjust for everyone else - including raters 5 & 6 for all items other than item 4
Rating scale = MyScale, R6, General ; all three model specifications share the same rating scale structure

melissa7747: Hi Mike,

Thanks very much for your reply. I had to add an additional model (?,?,?,3,?,MyScale) after the 3 models listed below so all data were matched. After doing so, three disjointed sets emerged along the lines of the rater-bias adjustment combinations. I'm thinking that group anchoring raters by their bias-adjustment groups might aid this issue. Is this advisable?


Mike.Linacre: Melissa, please check that raters 5 and 6 rated all 6 items, and that their ratings on items 1,2,3,5,6 are in adjustment group 3. Otherwise those ratings will not match any model unless you specify

melissa7747: Hi Mike,

Yes, Raters 5 & 6 (as well as all other raters in the study) scored each student on all 6 items. However, your second comment seems to indicate the data need to be reformatted in order to apply the bias adjustment in this specific situation.

The data were originally (and currently) are in the following format: student id, task, rater #, bias adjustment and item scores. Since each essay was read twice, there are 2 lines of data per student. i.e.,


The bias example in the manual was between a student and rater, which works out nicely in that the bias adjustment need only applied to the corresponding line of data. It seems my example of the item-rater example requires a different data format because only item element 4 is being 'read' given the format above, while the remaining item scores are not matched (since there they are not explicitly assigned to group 3). It sounds like I need to reformat the data so only 1 item per line is listed. Using the first data line from above, I assume the following is way I'd want to reformat the data:


While this clearly assigns rater 5's scores on all traits, except 4, to adjustment group 3, how does Facets distinguish between the 6 individual writing traits despite the fact they are in one 'column?' Might dvalues= aid me here?

Thank you,

Mike.Linacre: Melissa: your data format is correct.
Your models must be:
?,?,5,1,4,MyScale; adjust for bias for rater 5 on item 4
?,?,6,2,4,MyScale; adjust for bias for rater 6 on item 4
?,?,?,?,?,MyScale; no-adjust for everyone else - including raters 5 & 6 for all items other than item 4
Rating scale = MyScale, R6, General ; all three model specifications share the same rating scale structure
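
Why the third model matters can be illustrated with a toy matcher. This is only a sketch of the idea, not Facets code. It assumes that each observation is matched against the model specifications in the order listed, with "?" accepting any element, and that rater 5's data lines carry bias-adjustment element 1 for every item. The element numbers in the observations are hypothetical.

```python
def matches(model, observation):
    """True if every model position is "?" or equals the observed element."""
    return all(m == "?" or m == e for m, e in zip(model, observation))

def first_match(models, observation):
    """Return the 1-based number of the first matching model, or None."""
    for number, model in enumerate(models, start=1):
        if matches(model, observation):
            return number
    return None

# Facet order: examinee, prompt, rater, bias adjustment, item.
original = [
    ("?", "?", "5", "1", "4"),
    ("?", "?", "6", "2", "4"),
    ("?", "?", "?", "3", "?"),  # only matches adjustment element 3
]
revised = original[:2] + [("?", "?", "?", "?", "?")]

# Hypothetical observations for examinee 17, prompt 1, rated by rater 5.
rater5_item2 = ("17", "1", "5", "1", "2")
rater5_item4 = ("17", "1", "5", "1", "4")

print(first_match(original, rater5_item2))  # unmatched under the original models
print(first_match(revised, rater5_item2))
print(first_match(revised, rater5_item4))
```

Under these assumptions, rater 5's ratings on items other than item 4 match nothing in the original specification, which is consistent with the unmatched responses reported in Table 2.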

melissa7747: Mike,

Do you mean my original data format is correct or the data re-format that I suggest above? I assume it's the latter of these two.

Additional question: I've been working with other data files where two separate types of bias emerged (i.e., prompt-rater and item-rater). Being new to adjusting for bias, I'm wondering exactly how this works. I assume I'd need to create 2 bias adjustment facets - one for each type and add additional models so all data might be read. Is this correct?

Many thanks,

melissa7747: Mike,

I've been thinking further about your earlier comment, and I think you meant my original data format is correct. I.e.,


Where I'm unclear is how exactly to group rater's items so that "their ratings on items 1,2,3,5,6 are in adjustment group 3." I think it might be a combination of grouping raters in labels=

3, raters
1=Melissa, ,1
2=Sue, , 2
3=Jane, 3,
4=Joe, 3

And then possibly using dvalues to assign an element from one facet (i.e., item 4) to selected groups in another facet (i.e., rater 5 & 6) (using pp. 110-111 in the manual as a guide). Yet, it seems the situation I have is not simply to specify that one facet element is to be assigned to selected groups in another facet but that a subset of those different groups is to be assigned to a 'new' group (i.e., specific items for selected raters are to grouped according to adjustment groups). How to do this smoothly is what is causing me some confusion.

Any clarification would be greatly appreciated,

Mike.Linacre: Melissa, start by making the data as simple as possible:
Facets = (how many?)
Data =
element of facet 1, element of facet 2, ....., element of last facet, observation

Then you can use short-cuts such as element ranges and dvalues=, but they do not change the models.

So, the first step is to define the facets
The second step is to define the elements in each facet
Third step is to construct the data:
Data =
element of facet 1, element of facet 2, ....., element of last facet, observation
The fourth step is to construct the Models= specifications
The fifth step is to define the rating scales
At this point, an analysis can be done ....

melissa7747: Thank you, Mike,

To ensure the bias adjustment has functioned correctly, is it best to rerun the bias analyses again and reexamine it, or is there another way?


Mike.Linacre: Melissa: after your Facets analysis,
Menu bar
"Output Tables" - https://www.winsteps.com/facetman/index.htm?outputtablesmenu.htm
"Table 12-13-14"
Bias between raters and items.

There should be no bias reported between raters 5, 6 and item 4.

363. control bands

davis July 23rd, 2012, 8:29am: Hi Mike,

How can I draw two-standard-error control bands in Excel?
Could you propose a way?

Mike.Linacre: Davis, Winsteps uses Excel to do this in "Plots", "Scatterplot".
1. Plot the 95% confidence points matching each plotted point
2. Tell Excel to draw a smooth curve through the 95% confidence points
This is done in the Worksheet and Plot produced by Winsteps (or Ministep). The worksheet is shown at the bottom of https://www.winsteps.com/winman/index.htm?comparestatistics.htm

364. SPSS & Table 23 Comparison

uve July 17th, 2012, 7:03pm: Mike,

I was recently given a 4-option 17-item survey to analyze. The question about multidimensionality came up. I was given the raw file in SPSS and so decided I would start with the more common factor analysis of raw scores, then gradually move into how the Rasch model is more effective. I generated the Winsteps residual matrix and had SPSS run it. I then ran table 23 in Winsteps as a double check and compared the two data sets. What is odd is that for the first 3 of the 4 components, both data sets are virtually identical except that the loading signs seem randomly switched for various items. The 4th component is completely different. I've included the residual matrix I used, the SPSS syntax and the SPSS output. I selected PCA and no rotation, so I'm puzzled. I'm wondering where I went wrong. I've attached the Winsteps control and data file as well. I'd greatly appreciate any help you could provide.

Mike.Linacre: Uve, thank you for sharing your investigation with us.

Factors 1 and 2. The usual convention in Factor Analysis is to report the biggest loading as positive. (The actual sign of the correlation with the hypothetical variable is unknown.) SPSS has coded them negative. Strange!

Factor 3. Agreement.

Factor 4. We may be experiencing rounding error. Suggestion: tell SPSS to extract 6 factors and see if Factor 4 changes.

uve: Thanks Mike. I was more concerned with having done something wrong with one or both analyses. I did try extracting 6 factors per your suggestion, but the 4th factor was still much different in SPSS than Winsteps.

Mike.Linacre: Thanks, Uve. This difference looks like a loss of precision perhaps in the 4-decimal-place correlations.

The Winsteps correlation matrix to 6 decimal places in an Excel file can be downloaded from: www.winsteps.com/a/icorfil6.zip

365. Converting Item Information to Error

uve February 10th, 2012, 11:35pm: Mike,

I still have a bit of difficulty conceptualizing the item and test information totals that appear on the y axis of the graphs. I'm not a big Bilog fan, but I do like how it shows information relative to the inverse of information which is I guess essentially standard error. I've attached a graph. Is there a way to display something similar in Winsteps or perhaps calculate and graph it?

Mike.Linacre: Yes, Uve. The standard error of an estimate is the inverse square-root of the statistical information at that estimate.

In Winsteps, output the SCOREFILE= to Excel and then plot S.E. against Measure.
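
The relationship Mike states can be sketched in a few lines of Python. This is only an illustration for dichotomous Rasch items, where each item contributes information p*(1-p) at a given measure; the function names and the item difficulties are hypothetical and are not Winsteps output.

```python
import math

def rasch_probability(measure, difficulty):
    """Probability of success on a dichotomous Rasch item."""
    return 1.0 / (1.0 + math.exp(difficulty - measure))

def total_information(measure, difficulties):
    """Sum of the item informations p*(1-p) at the given measure."""
    total = 0.0
    for d in difficulties:
        p = rasch_probability(measure, d)
        total += p * (1.0 - p)
    return total

def standard_error(information):
    """S.E. is the inverse square-root of the statistical information."""
    return 1.0 / math.sqrt(information)

# Hypothetical item difficulties in logits (illustration only).
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
for measure in (-3.0, 0.0, 3.0):
    info = total_information(measure, difficulties)
    print(f"measure={measure:+.1f}  information={info:.3f}  "
          f"S.E.={standard_error(info):.3f}")
```

Plotting the S.E. column against the measure gives the U-shaped error curve that mirrors the information curve.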

uve: Mike,

I believe I replicated what I needed but I would like your opinion about something. I chose two approaches.

1) Select the TIF graph in Winsteps and copy the data to Excel, then calculate the inverse square root of the information column and produce the error for each point.

2) Do as you suggested and graph the data from the score file output of Winsteps

Both produce virtually identical data, except that the TIF data produce points beyond what we see students scoring. I assume this comes from the points used by the TIF to produce measures from -7 to 7 logits for a total of 201 points. The lowest and highest person measures were +/- 5.31. The mean person measure was .35 and the stdev was 1.26. The mean and stdev of the error were .36 and .14.

What I really want to get from one or both of these charts if possible is the range of "acceptable" error. It's tempting to choose the point at which the curves cross, but as you can see, both graphs are different though both report almost the exact same information. Picking the second chart, at about -/+2.61, the error is about .53. Choosing where the curves cross seems arbitrary and the 2nd chart would give the impression of greater range of accuracy.

Perhaps I'm asking too much from this type of data. Can I use the mean and stdev error to produce error z-scores and plot this against the TIF instead of the error so as to better pick the range of acceptability or can I use what I have to pick the range?

Mike.Linacre: Uve, the observable standard errors (and the matching item information) are in Table 20. The TIF interpolates and extrapolates from the observable points. Differences are probably due to the precision of the computation. Please set these really tight to obtain identical numbers:

The acceptable precision (standard errors) depends on the intended use of the Rasch measures. It is the same situation with measuring length. We require less precision for carpentry than we do for metalwork. In general, 0.5 logits is a practical upper limit on precision. Any precision less than 0.1 logits is overly precise. It is almost certainly less than the natural variation in what is being measured.

Jing: I'm interested in comparing two sets of items on the same test using item information. I want to investigate whether one item type contributes more information than another item type. Has anyone done something like that on WINSTEPS?

Mike.Linacre: Jing, under the Rasch model, all items with the same number of scored categories have the same total information. So ...
All dichotomies (2 categories) have the same amount of item information
All trichotomies (3 categories) have the same amount of item information
All tetrachotomies (4 categories) have the same amount of item information
and so on ....

However, at any point on the latent variable, the item information from each item differs depending on its location relative to the point and other factors. The sum of these item informations is the "test information".

Jing: Thank you , I wanted to try to find out if dichotomously scored innovative item types contributed more information than the traditional MC items administered on the same test, but as your answer states, all the items contribute the same amount of information under the Rasch model.
Any other way at getting at this(or something similar)under Rasch model?

Mike.Linacre: Jing, the usual comparison would be based on test reliability. Did you have the same sample, or equivalent samples, respond to both the traditional and the innovative item types?

If so, find the "test" (= person sample) reliability of both tests. If the tests have different numbers of items, then adjust them to the same length using the Spearman-Brown Prophecy Formula. You can then compare the two reliabilities directly.
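
The length adjustment can be sketched with the standard Spearman-Brown Prophecy Formula. The function names and the reliabilities and item counts in the example are hypothetical, chosen only to show the arithmetic.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

def adjust_to_length(reliability, current_items, target_items):
    """Predicted reliability of a test of current_items items at target_items items."""
    return spearman_brown(reliability, target_items / current_items)

# Hypothetical example: 10 innovative items (reliability 0.70) compared
# with 20 traditional items (reliability 0.80), both put on a 20-item basis.
innovative_at_20 = adjust_to_length(0.70, current_items=10, target_items=20)
traditional_at_20 = adjust_to_length(0.80, current_items=20, target_items=20)
print(f"innovative, adjusted to 20 items: {innovative_at_20:.3f}")
print(f"traditional, 20 items:            {traditional_at_20:.3f}")
```

Once both reliabilities refer to the same test length, they can be compared directly, as described above.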

366. Levels and items

davis July 13th, 2012, 8:39am: Thank you for your question, Davis.

At pre-test, there are categories 0,1,2,3
but at post-test there are categories 0,1,2,3,4
This difference makes comparisons difficult.

Please look at https://www.rasch.org/rmt/rmt101f.htm - that Research Note discusses this situation.

Mike.Linacre: Thank you for your plots, Davis.

It is easier to make inferences from this type of plot if they include the confidence bands. Winsteps can do this with the "Plots", "Scatterplot" function. https://www.winsteps.com/winman/index.htm?comparestatistics.htm

Looking at the plots, the points appear to be statistically collinear. To help with interpretation I have drawn in some lines.

PART 1: there is probably a ceiling effect on the Pre-test that causes the 3 points to the right and the weird x-y scaling. Suggestion: omit items with extreme scores from PART 1 plot and make the x-axis and y-axis ranges the same.

PART 2: we can see that there are two trend lines. This is typically a treatment effect. Items that are taught become relatively easier than the other items. We expect that items along the blue arrow were a focus of the treatment.

PART 3, 4, 6: the points approximate the expected trend (blue line). We would know better if we had confidence bands. I have circled two points in red. These may indicate floor and ceiling effects or merely be accidents.

PART 5: this shows a small change in discrimination (red arrow), probably not enough to have substantial impact on inferences. But it may require a choice. Which is decisive? The pre-test item difficulties (medical applications where treatment decisions are made at admission) or post-test item difficulties (educational applications where pass-fail decisions are made at the end of the "treatment").

Mike.Linacre: Davis: There are two considerations here.

1. How do the items interact with the treatment? This is what your plots show. We expect that items directly related to the treatment will get relatively easier. Do they?
Is the treatment intended to make your person performance more uniform across the items? Remediation does this. Then we expect the items to become closer together in difficulty at post-test.
Is the treatment intended to make person performance more different across the items? Athletic training does this. Then we expect the item difficulties to spread out more.

2. If you want to measure how much the persons change due to the treatment, then you must decide which is the crucial time-point. Pre? or Post? Then anchor the items at the item difficulties for that time-point at both time-points. Measure everyone at the two time-points, and then use conventional plots and statistics to compare the two sets of measures.


davis: Mike,
I determined person levels for the pre-test, and then for the post-test. I used Andrich Thresholds for the levels. The intervention group and the control group were statistically the same at pre-test. The intervention group was better than the control group at post-test. I focused on the intervention group. Some items became relatively easier than at pre-test for the intervention group. That's why the item difficulties differ between pre and post (not constant). Items fall at different levels at pre and post.

I want to see how much the persons change due to the treatment, so I must decide which is the crucial time-point. Pre? or Post? I haven't decided. I don't know which is more important (pre or post). This study is about the education of middle school students. Suppose we chose one of them (pre or post). How do I anchor the items for the pre- and post-test? Should I do it as in the attachment?

Mike.Linacre: Davis, for educational applications, post-test is usually the more important because we don't care where the students are when they start, but where they end is crucial.

So, in Winsteps, we would analyze the post-test data:
output IFILE=if.txt and SFILE=sf.txt

Then analyze the pre-test data:
input the anchor files: IAFILE=if.txt and SAFILE=sf.txt

367. table 3.2

davis July 3rd, 2012, 10:20pm: Davis, your attachment is Table 23.2.

Do you mean Table 3.2 or Table 23.2 ?

davis: I am sorry

I sent wrong file. I am sending again.

Mike.Linacre: Davis, Table 3.2 contains details of how the rating scale functions.

There is a brief description of each number in Winsteps Help: www.winsteps.com/winman/index.htm?table3_2.htm

If you need more understanding of the concepts, please look at the tutorials: www.winsteps.com/tutorials.htm - particularly Tutorial 2 and Tutorial 3.

Do those help?

davis: Dear Mike,

Thank you. I want to ask a question in attached file about table 3.2

Mike.Linacre: Davis, the "Structure Calibrations" are the Andrich Rating scale thresholds (parameters of the Rasch model).

The "Category Measures" are the best measures corresponding to an observation in each category. Add this to the item difficulty to answer the question: "What person ability measure corresponds to a score of 2 on item 6?"

"Structure Calibration is not ordered"
Andrich Thresholds relate directly to the probability of observing each category. The probability of observing each category relates directly to its frequency in the data.

So let's look at the frequencies. We see that category 3 has only 3 observations (very few) with high misfit (mean-square = 1.84). Category 4 has only 2 observations but with expected fit (mean-squares near 1.0).

If we do not have a strong substantive reason for maintaining a 5-category rating scale, then these data suggest that categories 3 and 4 should be collapsed into one 5-observation category.

BTW, my guideline, www.rasch.org/rn2.htm , is that we need at least 10 observations in a category to obtain estimates that are likely to be stable across datasets. This suggests that categories 2+3+4 should all be combined.

davis: Thank you Mike.
Can we use Table 3.2 in the partial credit model to determine levels?

Mike.Linacre: Yes, Davis. For each Partial Credit item, Winsteps reports a Table 3.xx. For item 4, there will be Table 3.5.

davis: Dear Mike,
I have two groups. I administered a pre-test and a post-test. The test items suit the PCM. I want to determine the levels of the intervention group at pre-test and post-test, so that I can see the change in that group. There are six parts to my test. I want to compare persons on each part by level. I determined the levels from the Andrich Thresholds at pre-test and post-test. The item difficulties are not constant between pre-test and post-test. What should I do?

Mike.Linacre: Davis: you have the usual situation. We expect things to change between pre-test and post-test. So we must choose. Which is most important for decision-making? Pre-test or post-test or the average of pre-test and post-test? We choose the most important set of estimates, and then anchor all the estimates at that important set.

Operationally with Winsteps:
from the "important" analysis, output IFILE=if.txt and SFILE=sf.txt

Then in all the analyses, anchor the items and the threshold structures:
IAFILE=if.txt and SAFILE=sf.txt

368. Table 20 values for non-existent raw scores

uve July 10th, 2012, 8:18pm: Mike,

How are measures and standard errors created in Table 20 for non-existing raw scores? For example, on one assessment we had 8 students get a raw score of 8 out of 50 which was a logit value of -1.80 and an SE of .40, then one student scored 6 which was a logit value of -2.16 and a SE of .45. So no one scored a 7, but in Table 20 the logit value for this score was -1.97 with a SE of .42.

Mike.Linacre: Uve, Table 20 is computed by applying https://www.rasch.org/rmt/rmt102t.htm with every possible raw score (R in the mathematical formula).
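
As an illustration of how a measure and S.E. can be obtained for any raw score, including scores that nobody in the sample achieved, here is a minimal Python sketch. It assumes dichotomous items with known difficulties and uses a Newton-Raphson maximum-likelihood iteration; it is not claimed to reproduce the exact procedure in the linked note, and the function name and the item difficulties are hypothetical.

```python
import math

def measure_for_raw_score(raw_score, difficulties, tol=1e-6, max_iter=100):
    """Return (measure, S.E.) in logits for a non-extreme raw score."""
    n = len(difficulties)
    if not 0 < raw_score < n:
        raise ValueError("extreme scores (0 or maximum) need a separate adjustment")
    # Starting value: mean item difficulty plus the log-odds of the raw score.
    theta = sum(difficulties) / n + math.log(raw_score / (n - raw_score))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        expected = sum(probs)                       # expected raw score at theta
        information = sum(p * (1.0 - p) for p in probs)
        step = (raw_score - expected) / information
        step = max(-1.0, min(1.0, step))            # damp very large steps
        theta += step
        if abs(step) < tol:
            break
    # Recompute the information at the final estimate for the S.E.
    probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
    information = sum(p * (1.0 - p) for p in probs)
    return theta, 1.0 / math.sqrt(information)

# Hypothetical 50-item test with difficulties spread evenly from -2 to +2 logits.
difficulties = [-2.0 + 4.0 * i / 49 for i in range(50)]
for raw in (6, 7, 8):
    measure, se = measure_for_raw_score(raw, difficulties)
    print(f"raw score {raw}: measure={measure:.2f}, S.E.={se:.2f}")
```

The estimate depends only on the raw score and the item difficulties, so a score of 7 receives a measure whether or not any student obtained it.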

369. Separation

christianbrogers July 10th, 2012, 1:59am: Hello,
I am currently working on my dissertation data and have run into some confusion. I have two groups of 66 people who took an online training course. One group took it face-to-face and another online. At the completion of module 1, they took a 12 item survey. They did the same survey after four subsequent modules. The items are on a rating scale of four thresholds. I am measuring one construct and that is whether they were engaged in the course.

After reviewing the first module data my person separation is quite low at 1.54 and the reliability is .7. I looked at infit mean square with a range of .6 to 1.4 and ended up removing 24 people and the separation only went up to 1.66 with a reliability of .73.

I am wondering if I could be doing something wrong in that I cannot get the separation higher.

I was able to remove three items and increase my item separation to 2.5. Any help would be appreciated.

Mike.Linacre: Christian:

Person reliability and separation increase when there are more items, and also when there is a wider person ability range.

Item reliability and separation increase when there are more persons, and also when there is a wider item difficulty range.

Christian, why do you need a higher item separation? An item Separation of 2.5 indicates that, statistically, we can distinguish high difficulty items from middle difficulty items from low difficulty items. This is usually enough for most measurement purposes and also for confirming the construct validity (= item difficulty hierarchy) of the items.

Similarly, a person Separation of 2.5 indicates that, statistically, we can distinguish high ability persons from middle ability persons from low ability persons. This is usually enough for most measurement purposes and also for confirming the predictive validity (= person ability hierarchy) of the persons.
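
The separation and reliability figures quoted in this thread are tied together by the standard relationship reliability = separation^2 / (1 + separation^2). A small Python sketch, with hypothetical function names, shows the conversion in both directions using the values reported above.

```python
import math

def reliability_from_separation(separation):
    """Reliability implied by a separation index."""
    return separation ** 2 / (1.0 + separation ** 2)

def separation_from_reliability(reliability):
    """Separation index implied by a reliability (must be below 1)."""
    return math.sqrt(reliability / (1.0 - reliability))

# Separations mentioned in the thread.
for separation in (1.54, 1.66, 2.5):
    print(f"separation {separation:.2f} -> reliability "
          f"{reliability_from_separation(separation):.2f}")
```

Because the two indices carry the same information, raising one necessarily raises the other; separation simply stretches the upper end of the reliability scale.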

christianbrogers: Mike,

I believe that my item separation is where it should be but am concerned about my person separation. The study has already been conducted and I am receiving my data as secondary. Thus, I cannot add more items to my survey to increase person separation. I may not need to be concerned about this but I wanted to confirm. This is my first use of Rasch.

Mike.Linacre: Christian, Rasch is probably the only psychometric methodology that reports Item Separation (-> Item Reliability). All other methodologies report Test Reliability = Person Reliability -> Person Separation.

The general rule is: if your item separation is low, then you need a bigger sample of persons. This is helpful for deciding whether we have collected enough data to produce statistically valid findings.

christianbrogers: Mike, This is helpful information. Obviously I cannot change the people I test but to know my Item Separation is high enough is helpful. Thank you Mike.

370. Invariance Viewed Thru 2.2, 2.5 & 2.7

uve July 9th, 2012, 8:46pm: Mike,

I'm hoping you could shed some light on the application of Table 2.7 versus 2.2. We've spoken about this before, but my concern is the stability of measures I am getting. One advantage of the Rasch model is the issue of invariance. Most of the samples I use range from about 1,000 to 7,000 respondents. I am always amazed at the significant difference between the sample free 2.2 and sample dependent 2.7 in all of my data. If I am to set cut points for performance levels, then I would have a hard time basing this on 2.2 and would look more to 2.7, which always seems to be much closer to the empirical data in 2.5. So, it seems if I assume the invariance of my measures, then I should be able to apply them to other samples knowing full well there will always be differences, but none to the extent that the fit to these different samples would reveal dramatic differences. Or put differently, what good are my item calibrations given a stable and large enough sample upon which they are based if I can't rely on their invariance across other populations? I've included a typical example.

If invariance has been achieved, it would seem to me that these would be much closer, especially if the empirical data in 2.5 are similar to 2.7.

I'd greatly appreciate your thoughts on this.

[Attached plots: three panels in the layout of Winsteps Table 2. The first panel is headed "EXPECTED SCORE: MEAN (Rasch-score-point threshold, ":" indicates Rasch-half-point threshold) (BY CATEGORY SCORE)". Each panel locates categories 1 to 4 for every item on a logit axis running from -3 to 3.]

Mike.Linacre: Correct, Uve. Table 2.7 is much closer to the empirical data.

The problem is inference ....

Past-sample-dependent inference: If you expect the future person to be from a person sample exactly like the past sample (distribution, ability level, misfit, etc.) , then base your decisions on Table 2.7.

Past-sample-distribution-dependent inference: If you expect the future person to be from a person sample similar to the past sample (distribution, ability level, but not misfit) , then base your decisions on Table 2.5.

Past-sample-independent-as-possible: If you have no information about the future person, except that the latent variable will be the same as for the past sample, then base your decisions on Table 2.2.

Does the future repeat the past? Or is the future something entirely new? The truth is usually somewhere between these extremes.

371. Structure Measure & Andrich Thresholds 3.2

uve July 8th, 2012, 9:53pm: Mike,

I understand that when GROUPS is selected, the Structure Measure for a category is the sum of the item measure and the Andrich Threshold for that category. I'm just not sure what the differences are in interpretation. For example, what information does Structure Measure provide that I can't glean from Andrich Threshold?

Also, when GROUPS is not selected then the two sets of data are identical. This doesn't make sense to me in light of the fact that Structure Measure for a category is a sum of two data sets. I'd greatly appreciate any additional information you could provide.

Mike.Linacre: Uve, Winsteps includes considerable duplication of numbers, but from slightly different perspectives. No one uses all the numbers. In fact, surely most people (including me) ignore most of the numbers. We focus on the numbers that are important to us.

Probably the focus here is on the different ways that the estimates for the Rating-Scale model and the Partial Credit model are presented. This is largely driven by historical accident. Initially, those models had different motivations and different audiences. They were conceptualized to be fundamentally different models. Only later was it realized that the differences between the models are algebraically trivial. But it was too late!

So now, for the Rating-Scale model, the convention is to report the Andrich thresholds relative to the item difficulties. Since these are the same for all items, only one set of numbers is reported, often called "taus". So "Structure measures = Andrich thresholds"

For the Partial Credit model, the Andrich thresholds, which differ for each item, are usually combined with the item difficulties and often called "deltas". Winsteps also reports the thresholds relative to item difficulties. So "Structure measures" = (item difficulty + Andrich threshold).

Does this help?

uve: Thanks Mike, that helps. With a hammer I hammer nails and with a saw I cut wood. To me, each of these data sets is a tool. So with your explanation in mind, at this point I'm trying to distinguish the different uses/purpose of the Structure Measure tool from the Andrich Threshold tool when using the partial credit analysis.

Mike.Linacre: Uve, the difference between those two is really your audience. Do they think in terms of (item difficulty) + (Andrich threshold) [the original "Rating Scale" way of thinking] or (item difficulty + Andrich threshold) [the original "Partial Credit" way of thinking] ?

372. DIF contrast Interpretation

kulin June 7th, 2012, 12:39pm: Dear forum members,

I used WINSTEPS for purposes of estimating DIF contrasts for items from an achievement test (CONTEXT: I compared two groups of students; mean ability of students from group A was much lower than the mean ability of students from group B).


For item X, the obtained DIF contrast amounted to 1 logit (and favored students from group A).
Can we interpret the described finding as follows:

If we compare equally able students from group A and group B, then the odds of solving correctly the item X is 2.7 times higher for students from group A than for correspondent students from group B.

OR (the same):

The odds of solving correctly item X, is 2.7 times higher for students from group A than for students from group B (provided that we control for ability).

I would highly appreciate any help.


Mike.Linacre: Yes, that looks correct, Kulin.

Most audiences have trouble with this type of arithmetic, so it may be easier to say something like:

if a student from group B has a 50% probability of success, then an equally able student from group A will have a 73% probability of success.
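
Both ways of reporting a 1-logit DIF contrast can be checked with a short Python sketch. The function names are hypothetical; the arithmetic is the usual logit-to-odds and logit-to-probability conversion for a dichotomous item.

```python
import math

def odds_ratio(dif_contrast):
    """Odds ratio corresponding to a DIF contrast in logits."""
    return math.exp(dif_contrast)

def shifted_probability(reference_probability, dif_contrast):
    """Success probability for the favored group, given the other group's probability."""
    reference_logit = math.log(reference_probability / (1.0 - reference_probability))
    return 1.0 / (1.0 + math.exp(-(reference_logit + dif_contrast)))

print(f"odds ratio for 1 logit: {odds_ratio(1.0):.1f}")
print(f"group A probability when group B is at 50%: "
      f"{shifted_probability(0.5, 1.0):.0%}")
```

The odds ratio is the same at every ability level, whereas the probability difference depends on the reference probability chosen, which is why the 50% case makes a convenient illustration.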

kulin: Thank you very much, Mike! :)

solenelcenit: Hi Mike... I'm trying to do the same as Kulin, however in my case I'm trying to compare differences between item locations after item-linking. How can I express in terms of probability a difference of 0.44 logits, when the group A item location is 0.72 logits and the group B item location is 0.28 logits?

Is... Group A found 61% more complexity compared to the 61% of the groupB.
I reached this by: 0.72-0.28=0.44 (or 61%prob)



Mike.Linacre: Luis, imagine a person of overall ability 0.28 logits in each group. Then, assuming the items are dichotomies, the person in group B has a 50% chance of success (0.28 ability - 0.28 difficulty = 0 logits). The person in group A has a 39% chance of success (0.28 ability - 0.72 difficulty = -0.44 logits).

You may want to choose a person with ability at the overall average of everyone as the reference person, instead of 0.28 logits.
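
The two percentages in Mike's reply follow from the dichotomous Rasch model. A minimal Python sketch, with a hypothetical function name, reproduces them and lets the reference ability be changed.

```python
import math

def success_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of success, measures in logits."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

reference_ability = 0.28   # the reference person used in the reply above
print(f"group B item (0.28 logits): "
      f"{success_probability(reference_ability, 0.28):.0%}")
print(f"group A item (0.72 logits): "
      f"{success_probability(reference_ability, 0.72):.0%}")
```

Replacing `reference_ability` with the overall average person measure gives the comparison for the alternative reference person suggested above.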

373. cross plots

Beag_air_Bheag June 7th, 2012, 1:31pm: I am following the 'advice to novice analysts' on this page:


where it says:
'Cross-plot the two sets of person measures. Are there any noticeable changes that matter to you?'

Is there an easy way to make Winsteps do this cross-plot, or is it a case of exporting to Excel?

Many thanks for any help you can give,


Mike.Linacre: BaB, please try the Winsteps "plots" menu, "Scatterplot". This produces an Excel scatterplot automatically. https://www.winsteps.com/winman/index.htm?comparestatistics.htm

Beag_air_Bheag: Thanks for your help Mike.

cwale: I'm sure this is a silly question, but what changes would I be looking for in the crossplot?

Mike.Linacre: Cwale, usually we expect the points on a cross-plot of Rasch statistics to lie along a diagonal line (with a little random scatter due to the probabilistic nature of the data).

So, if we notice that there are off-diagonal points, or that the diagonal is a curve, we usually want to investigate whether there is something happening that matters to us.

For instance, in cross-plotting item difficulties or person abilities between Time 1 and Time 2, we may see that some unexpected changes have occurred. For instance, an item may have become much easier or much harder. We would want to investigate: "What caused this change?" Is it a good change or a bad change?

cwale: Thanks again. I am trying to provide rationale for keeping or eliminating items. Although four items show misfit, the crossplot still falls along the diagonal. I have attached two of the crossplots (deleting one item or deleting 4 items) with all the items.

Is there anything else I can do to support deleting or keeping the items?

Mike.Linacre: Cwale, what has been cross-plotted? The x-axis and the y-axis points both appear to be based on a score of 18.

cwale: I was attempting to cross plot person measures of the full data (20 questions) with 18 items, 16 items, etc. I think this one should be the full data crossplotted with 18 items. Where do you see what score they are based on?

Mike.Linacre: Cwale: usually the number of strata, horizontally or vertically, corresponds to the maximum raw score + 1, horizontally or vertically. So I expected to see different numbers of strata in the two directions. There appear to be 19 strata in both directions, which usually indicates 18 items.

But we can see that the measures for the 1145 persons are almost the same in the two directions, and that differences are much smaller than the standard errors of measurement. Statistically, the two sets of measures are the same.

In this situation, we need to focus on pass-fail decisions or on the highest and lowest performers. For instance, if the pass-fail point is 0 logits (the average item difficulty), then we see (by looking at the Excel worksheet) that there are 10 persons who pass with one set of items but fail with the other set.
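
The count of changed decisions can also be obtained outside Excel. This Python sketch assumes that "pass" means a measure at or above the cut-point, which is an assumption rather than something stated in the thread; the function name and the example measures are hypothetical.

```python
def decision_changes(measures_a, measures_b, cut_point=0.0):
    """Indexes of persons who pass on one set of measures but fail on the other."""
    changed = []
    for index, (a, b) in enumerate(zip(measures_a, measures_b)):
        # Assumption: a person passes when the measure is >= the cut-point.
        if (a >= cut_point) != (b >= cut_point):
            changed.append(index)
    return changed

# Hypothetical person measures (logits) from the full test and the reduced test.
full_test = [0.50, -0.20, 0.10, 1.30, -1.10]
reduced_test = [0.40, 0.10, -0.30, 1.25, -1.00]
print(decision_changes(full_test, reduced_test))
```

Persons flagged this way are the ones whose records deserve a closer look before items are dropped.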

cwale: Mike,
Thank you so much for answering my elementary questions. Yes, I was able to see the 10 persons. Because the questions cover content needed in the math assessment, then could I be justified in keeping them? Is there anything else I should examine? Is the correlation on the Excel worksheet the correlation between the two measures?

Mike.Linacre: Cwale: if the item content matches the required content, then we need strong reasons for omitting items. For instance, on an arithmetic test, word problems misfit with the symbolic problems, but we must keep both.

The Excel worksheet shows the raw correlation between the two sets of measures, but this is somewhat misleading because they have many observations in common. We would really need to correlate the measures on the omitted items against the measures on the kept items, and then disattenuate for measurement error.
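
Disattenuation for measurement error uses the standard correction: divide the observed correlation by the square root of the product of the two reliabilities. The Python sketch below uses a hypothetical function name and made-up numbers purely for illustration.

```python
import math

def disattenuated_correlation(observed_r, reliability_x, reliability_y):
    """Correlation corrected for measurement error in both sets of measures."""
    return observed_r / math.sqrt(reliability_x * reliability_y)

# Hypothetical values: measures on the omitted items against measures on the kept items.
corrected = disattenuated_correlation(0.60, reliability_x=0.80, reliability_y=0.70)
print(f"disattenuated correlation: {corrected:.2f}")
```

A corrected value near 1.0 suggests that the two sets of items measure the same thing apart from measurement error; values above 1.0 can occur and are usually reported as 1.0.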

solenelcenit: Dear all:

This thread is very interesting. I'm messing trying to understand the excel output from the cross-plot menu.
After cross-plotting the measures of two sets of similar items, I have obtained an empirical cross-plot with all points within the confidence lines, however on the identity plot all points are outside... how can I interpret it?

The t-statistic for the significance of the difference between the two measures should be interpreted with degrees of freedom... do I need to compute a Levene test to study the variance and then choose between n1+n2-2 or the df formula for different variances?



Mike.Linacre: Luis: the identity-plot is investigating the hypothesis: "These pairs of measures are the same except for a constant difference". The empirical-plot is investigating the hypothesis: "These pairs of measures are the same except for a constant difference and a scaling constant."

The Excel plots are a convenient smoothed approximation. If you need exact t-tests, then please refer to the Tables of measures. Please make the adjustments that match the hypothesis you are testing. If your findings depend on those adjustments, then your findings are insecure. Findings really need to satisfy Joseph Berkson's Inter-Ocular Traumatic Test (as seen in the Excel plot). https://www.rasch.org/rmt/rmt32f.htm

solenelcenit: Mike:

I'll keep with me the quote from Joseph Berkson's Inter-Ocular Traumatic Test... for sure I'll use when I tell the MD and RN residents why we should pay attention to raw data structure and graphics before any complicated and not easy to understand statistical test.

Well... from the data I analysed, could I write: "The plot with Empirical lines shows there was a constant scaling and difference between all pairs of items; all items were within the 95% confidence lines or almost on the empirical line; therefore the identity plot shows there was a constant difference in both constructs on all pairs of items; this was reached by accounting for both S.D. In this case, as all cross points were outside the 95% confidence lines, it was proved that there were differences in how each pair of items was understood by both samples. Therefore we can speculate whether the latent trait construct has the same characteristics or differs qualitatively for sample 1 and sample 2"?

Is that right?... if it is I think there is a trouble, in other words: pseudo-item linking proved that no linking is affordable because construct difference.

Your comments will point the way.


Mike.Linacre: Luis, your reasoning may be correct, but I got lost :-(

Perhaps https://www.rasch.org/rmt/rmt163g.htm will help.

solenelcenit: Hi Mike:

Thanks for your help... I attach the Winsteps Plot. I'm sorry I got you lost... I think I'm going to make things simpler and stay with the first script I designed.


Mike.Linacre: Yes, Luis. We can see clearly that the Profs discriminate much more strongly between the items than the users.

solenelcenit: Dear Mike:

Thank you so much for your comments.

You are right, professionals discriminate better than users; this is very visual on the Item Map, where I found floor and ceiling effect...


374. Item infit and outfit correction

Mokusmisi July 4th, 2012, 7:34pm: Dear Mike,
I'd like to ask you about item fit statistics. I understand that to arrive at infit and outfit values if done by hand, one would first need to calculate the difference between the observed score (which is a 0 or a 1 if it's a dichotomous item) and the modeled probability.
Now, my question concerns this, in order to arrive at the modeled probability, we'd need to know the true values of item difficulty and person ability. If I'm not mistaken, when calculating the fit statistics, Facets uses the parameter estimates. Is there a correction implemented?
It's not really a big deal, I'm just interested.

Mike.Linacre: Mokusmisi, unfortunately we never know the true values of anything. We can only use estimates. Facets uses the parameter estimates when computing fit statistics. These fit statistics are never exact, but they are good enough for practical purposes.
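
For readers who want to see the hand calculation Mokusmisi describes, here is a minimal Python sketch of the usual infit and outfit mean-squares for one dichotomous item, computed from estimated (not true) person abilities and item difficulty. The function name and the data are hypothetical, and this is not a description of how Facets is implemented internally.

```python
import math

def item_fit(responses, abilities, difficulty):
    """Return (infit mean-square, outfit mean-square) for one dichotomous item."""
    squared_residual_sum = 0.0
    variance_sum = 0.0
    standardized_square_sum = 0.0
    for x, ability in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(difficulty - ability))  # modeled probability
        variance = p * (1.0 - p)                          # model variance of the response
        residual = x - p                                  # observed minus expected
        squared_residual_sum += residual ** 2
        variance_sum += variance
        standardized_square_sum += residual ** 2 / variance
    outfit = standardized_square_sum / len(responses)     # unweighted mean-square
    infit = squared_residual_sum / variance_sum           # information-weighted mean-square
    return infit, outfit

# Hypothetical estimates: five persons responding to an item of difficulty 0.5 logits.
responses = [1, 0, 1, 1, 0]
abilities = [2.0, -1.0, 0.5, 1.0, 3.0]   # the last person fails unexpectedly
infit, outfit = item_fit(responses, abilities, difficulty=0.5)
print(f"infit mean-square={infit:.2f}, outfit mean-square={outfit:.2f}")
```

The unexpected failure by the most able person inflates the outfit more than the infit, because outfit is not weighted by the model variance.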

Mokusmisi: Thank you. Like I said, just out of curiosity. (I'm not a statistician anyway).

375. table 23.2

davis July 6th, 2012, 5:55pm: Dear Mike,

I can produce Winsteps files but I need some examples to evaluate them.
For example how can we use table 23.2 ?
What does the table say to us? Could you explain ?

Mike.Linacre: Davis, there is some useful material in Winsteps Help. For instance, https://www.winsteps.com/winman/index.htm?principalcomponents.htm

376. RSM step calibration

solenelcenit July 3rd, 2012, 9:45am: Dear colleagues:

Mike Linacre, in "Optimizing rating scale categories", after a mathematical demonstration, advised that step calibrations for a five-category polytomous scale should advance by at least 1.0 logits to give valid, inferentially useful measures. This article is cited many times in studies which used five-point categories, but they took the advice given in the table of Linacre's paper, which states a difference of 1.4 logits for three-point categories. I'm not clear whether these step calibration guidelines are also valid for the Rating Scale Model, as it seems that the article was prepared for the Partial Credit Model. Is this right?

Mike.Linacre: Luis, the mathematical properties of RSM and PCM are the same. They differ conceptually, historically, and functionally. Since the "optimizing" paper is about mathematical properties, it applies equally to RSM, PCM and GRSM (the rating-scale model applied to groups of items, not to all the items).

solenelcenit: Hi Mike:

Thanks again!!


377. Test performance test

solenelcenit July 2nd, 2012, 2:13pm: Dear Mike:

After checking for non-uniform DIF in a test validation, I would like to check the overall test performance, in other words whether there is uniform DIF between groups across all items. Something like computing person locations for all groups (e.g. young, adults, elders) and then computing an ANOVA... Is there any table in Winsteps which informs me about this Different Test Performance between groups?



Mike.Linacre: Luis, this is Differential Group Performance. Winsteps Table 33.3 - https://www.winsteps.com/winman/index.htm?table33_3.htm

Sorry, Winsteps doesn't do an ANOVA, but the numbers are all there.

solenelcenit: Thanks Mike:

I'm trying to understand Table 33.3. In the item class column a set of letters appears... where do they come from? And secondly... Table 28.1 compares group subtotals pairwise. Is that a good way to report different group functioning once it has been shown that there is no DIF?


Mike.Linacre: Luis, in Table 33 and Table 28 and similar tables, the letters are the codes you place in the item and person labels to indicate the person or item grouping.

For example:
In the person labels, M = male, F = female
In the item labels, A = arithmetic items, G = Geography items

Then in Table 33, a Differential Group Functioning of "gender" x "item content" will have four gender-by-content combinations (M-A, M-G, F-A, F-G), one for each of the four gender-by-content DGF sizes.

Table 28 reports, for example, how Males perform and how Females perform. These subtotals are often useful.

solenelcenit: Hi Mike:

Your answer is a light to my Winsteps understanding.


378. cannot copy the graphics to clipboard

javzan June 19th, 2012, 5:11pm: Hello, I have been working on my data and just now I couldn't copy the graphics to Excel. What should I do? Do I need to update my program or..? I bought Winsteps two months ago.
I have another request. I am a non-native speaker and it is very difficult to understand your manual. And I couldn't find any examples of how to interpret data results (in words). Could you suggest any related resource on how to interpret the data? I am currently interested in the Rasch 1PL on dichotomous variables. And it is truly hard to understand which one is the Rasch score. Table 10 shows all the data in the table, and the variables are generally named entry, measure, score etc. As you named it measure, is it the item difficulty score or not? I hope someone will help me and give me an answer. Thanks

Mike.Linacre: Thank you for your post, Javzan.

The "Rasch score" is the "Measure". It can be the person ability (in Table 6) or the item difficulty (in Table 10).

Your attachment is part of a Winsteps Output Table. It contains a huge amount of information. Sorry, the Winsteps manual explains how to use Winsteps. It does not explain how to interpret Rasch data results. There are many explanations on www.rasch.org/memos.htm and also I enthusiastically recommend the book "Applying the Rasch Model" by Bond & Fox.

What is your language? Perhaps there is also an explanation in your language.

You wrote: "I couldn't copy the graphics to excel"

Reply: This is usually a Windows problem, not a Winsteps problem.
First verify that the Clipboard is working correctly. Can you copy-and-paste from a .txt file into Excel?

javzan: Thank you very much.

javzan: Dear Prof Linacre, thank you very much for getting back to me with an answer. I followed your suggestion and I am reading Bond and Fox's book. I am very sorry that I posted that I couldn't understand your Winsteps guidelines. Bond's book helped me understand Rasch analysis. By the way, I am afraid that it is not too easy to find your software guidelines in the Mongolian language.

Thank you

Mike.Linacre: Good, Javzan. Sorry, Mongolian will not be soon :-(

Perhaps some of these authors can help you: http://repository.ied.edu.hk/dspace/handle/2260.2/151

Also, have you seen the Winsteps tutorials at www.winsteps.com/tutorials.htm ?

379. Negative point-biserial correlation

miet1602 June 28th, 2012, 3:45pm: Hi, it's been a while since I've done any Rasch analyses or posted on here. I have now had to analyse a data file and am trying to get my head around as to why I got so many negative point biserial correlations.
The data are judgments of 6 experts on how demanding certain test items are on a scale of 1 (less demanding) to 5 (more demanding). They all rated all the items.
I analysed the data in Facets though it is only a two-facet design.
The fit stats, reliability etc. are generally fine, and point biserial correlations for the judges are fine. The rating scale seems to have functioned well too.
However, there are lots of negative correlations for the items. I don't think it is the reversal problem, but am not sure what else it could be.
I attach the output file so you can see for yourself.

Thanks very much for your suggestions!

Mike.Linacre: Yes, Milja, this is a surprising report. The judges and the rating scale are functioning well. The surprise is only in the items. The problem may be item MCQ2. This has huge misfit (mean-squares 2.7+).

1. Please rerun the analysis without this item (comment it out).

2. Please try Ptbiserial = measure. This will report both the point-measure correlation and its expectation. Then we can see whether the correlations are near to their expectations.

3. This is a small data set, so it could be analyzed with Ministep, the free version of Winsteps. Does that give the same report?

miet1602: Hi Mike,
Thanks for the speedy response.
I attach the Facets analysis with MCQ2 excluded and with expected pt biserial. Correlations are a bit better without MCQ2 but there are still several negative ones...
Does this suggest anything to you?

I will do the Winsteps analysis tomorrow from work.

Best regards,

Mike.Linacre: Thank you, Milja. The output makes sense, surprisingly!

The problem is that 3 of the judges have the same measure. 2 more judges also have the same measure, close to the 3. Only one judge is far from the other judges. Consequently, for the item point-correlations, the one outlying judge is very influential.

If outlying judge 6 rates lower than the average of the other judges, then the item point-correlation will be positive. If Judge 6 rates higher than the average of the other judges, then the item point-correlation will be negative. We see that judge 6 has an outfit mean-square of 1.47, so gives unexpected ratings more often than expected.

How about trying a "judge style" model?
Model = #,?,DEMANDS,1
This will show how each judge used the rating scale.

miet1602: Thanks for this, Mike.
Please see attached output.
Not sure if this is important, but in the specification I have positive=2 (items), whereas judges should be negative.
I think here judge 6 is rating lower than average actually - so you would expect positive correlations then? Or am I misunderstanding something?

In any case, do you think this data set is too small to be analysed in this way without getting these sorts of anomalies? It is maybe unexpected to see that 3 judges had exactly the same measure (especially as they were not standardised in using this rating scale). So it is perhaps just by chance.

Thanks for all your help!

Mike.Linacre: Your # analysis is informative, Milja.

Facets reports correlations so that the expected correlations are always positive. We do not need to remember whether the underlying facet is positive or negative.

Please look at Table 8 for the raters, in particular the "Average Measure" column for each judge. Judges 3, 5, 6 all have disordered average measures (average measure does not increase with rating scale category). This is probably partly due to the small sample size. Based on these data, only judge 2 is a reliable judge. The other 5 judges all have idiosyncrasies.

380. Separation and Reliability

cwale June 21st, 2012, 10:31pm: Thank you for your question, C.

"Reliability" means "Are the measures for the persons (or items) reliably different across the sample", or, in other words, "Can we reliably place the persons (or items) on the latent variable relative to each other."

We usually talk about "test reliability" = "person sample reliability". This is usually in the range 0.5 to 0.95.

In Rasch analysis, we also report the equivalent statistic for the items. The "item reliability". If the item reliability is less than 0.9, then it may indicate that the person sample size is too small for accurate placement of the items on the latent variable (= construct validity). Your item reliability is 1.00, so your person sample size (1145) is certainly big enough to reliably place the items on the latent variable.

cwale: Mike,

Thanks for your response. Would you be concerned about the person separation of 1.65?

Mike.Linacre: Cwale, a person separation of 1.65 indicates that the test is strong enough to discriminate between high and low performers in the person sample, but no more. This may be enough for what you want to do.
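
As a rough illustration of how a separation index relates to reliability, the sketch below uses the standard relationship reliability = separation^2 / (1 + separation^2). The function names are hypothetical; the value 1.65 is the person separation quoted in this thread.

```python
def reliability_from_separation(separation):
    """Reliability implied by a separation index: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)

def separation_from_reliability(reliability):
    """Inverse relationship: G = sqrt(R / (1 - R)). Requires R < 1."""
    return (reliability / (1.0 - reliability)) ** 0.5

person_separation = 1.65
person_reliability = reliability_from_separation(person_separation)
print(f"separation {person_separation} -> reliability {person_reliability:.2f}")  # 0.73
```

A separation of 1.65 therefore corresponds to a person reliability of about 0.73, inside the 0.5 to 0.95 range mentioned above.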

cwale: Thanks again.

381. CAT Itembank Calibration

bennynoise June 13th, 2012, 1:09pm: Hello everybody!

I am currently trying to implement a CAT for reasoning. After having read a lot, I am still not sure about some of the steps for item bank calibration, so I thought i could get some advice here.

What I did:
- developed items using an item generator
- set up a (sequential) online test to collect initial data from the target population
- collected data in 10 blocks each consisting of 16 different items and 4 anchor items
- cleaned the data (no missing values, no click-throughs etc.)
--> for every block there is data from about 100 to 400 participants

What I plan on doing now:
- test for rasch model fit in every block
- test for DIF and remove bad items from every block
- chain-link the remaining items using a common-item non equivalent groups design

This procedure should result in a calibrated item bank that I could then feed to my CAT, right?
Would you do it the same way or is something wrong or missing? What would be suitable variables for testing for DIF? I thought about gender, age and level of education.

Thank you very much for your advice!

Mike.Linacre: Thank you for your post, bennynoise.

Your procedure is very thorough.

In your chain-link step, how about doing a concurrent calibration of all your data? The Winsteps MFORMS= command may help with this. You can then do an (item x form) DIF analysis to identify forms in which the relationship between the anchor items differs.

382. Z-score from bias/interaction analysis

suns74 June 13th, 2012, 3:26pm: Hello,
I'm struggling to know how to obtain Z-score from Bias Report. As far as I know, z-scores greater than +2 or smaller than -2 are considered to display significant bias. But, I can't find them in the Bias report other than Bias size and t-values. I'd really appreciate if you could tell me how to get Z-scores using Facets.

Mike.Linacre: Suns74, Z-scores are unit-normal deviates reporting the significance of a hypothesis test. They are calculated with the assumption that there are a very large number of observations. Since this is often not true, Winsteps and Facets compute t-statistics. These are more accurate than Z-scores for small samples, and are the same as Z-scores for very large samples.

If you are not familiar with t-statistics, please see http://en.wikipedia.org/wiki/T-statistic

383. using Rasch person data as input to other tests

Michelle June 11th, 2012, 10:37pm: Hi there,

I'm undertaking Rasch analysis, using Winsteps, on two separate scales, using survey data. This is for a conference presentation next month (ACSPRI/RC33 in Sydney), and I hope to eventually publish the results in a journal article. The first scale I am working with is dichotomous true/false data with 14 items. It's going well, except that I can't work back from the person measure to the expectation for each item (using the output observation file XFILE data). The person measure seems to be very close to the mean of the log(expectation/(1-expectation)) values for the items for each person, but not exactly the same.

I would love to be able to say which output values can be substituted for the original True/False values to run further analyses, say structural equation modelling or factor analysis, on the results. But these require item-level results for each person and not a total score. Do I just divide the log(expectation/(1-expectation)) by the number of items to get the score to use for each individual item?

Any comments appreciated. This is just for extension for the audience, and in case I get questions.

Mike.Linacre: Michelle, the Winsteps XFILE= contains lots of information about each observation.

It sounds like you want a Rasch person measure corresponding to each observation. This is XFILE column PPMEAS, "Predicted person measure from this response alone".

Michelle: Thanks for the reply Mike, I have been pondering it.

I've gone through the person measures for 4 subjects, and I can see that the value for a "correct" response on the same item is the same for each person, and the value for an "incorrect" response is also the same.

Is there somewhere I can read up on how one gets from the Measure Difference to the Predicted Person Measure? There appears to be a negative correlation between the two, from the small sample of data I have looked at.

I've attached a subset of my data, the first 5 of 1127 subjects. I've used colour coding so I can see where subject sets of data end and start.

Mike.Linacre: Michelle, the "measure difference" is the difference in logits between the ability of the person and difficulty of the item.

The "predicted person measure" is the same for every person who responds in the same way to the same item. It is an estimate of the ability of the person based on the response to that one item.

If you are looking for the person's ability based on the responses to all the items, then that is the "MEASURE" in Table 18. If you are looking for the person's ability based on only one item, then that is the "predicted person measure....".

Are you looking for something else?
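
A small sketch of the relationship between the measure difference and the expectation under the dichotomous Rasch model, which may help with the log(expectation/(1-expectation)) question earlier in this thread. The person measure and item difficulties are hypothetical values, and the function names are not Winsteps names.

```python
import math

def expected_score(person_measure, item_difficulty):
    """Dichotomous Rasch expectation for a measure difference in logits."""
    measure_difference = person_measure - item_difficulty
    return 1.0 / (1.0 + math.exp(-measure_difference))

def logit(p):
    """Log-odds of a probability: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

person_measure = 1.2                      # hypothetical ability, in logits
item_difficulties = [-1.0, 0.0, 0.4, 1.1]  # hypothetical difficulties, in logits

log_odds = [logit(expected_score(person_measure, d)) for d in item_difficulties]

# Each log-odds value recovers the measure difference for that item.
for d, lo in zip(item_difficulties, log_odds):
    print(f"difficulty {d:+.2f}: measure difference {person_measure - d:+.2f}, log-odds {lo:+.2f}")

# The mean log-odds equals the person measure minus the mean item difficulty.
mean_log_odds = sum(log_odds) / len(log_odds)
mean_difficulty = sum(item_difficulties) / len(item_difficulties)
print(f"mean log-odds {mean_log_odds:.3f} = {person_measure} - {mean_difficulty:.3f}")
```

Under these assumptions the mean of the log-odds matches the person measure only when the mean difficulty of the items answered is zero, so small differences between the two are to be expected otherwise.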

384. Specification file

lostangel June 6th, 2012, 10:18pm: Dear Prof. Linacre,
I am a Facets beginner and am working on analyzing rater agreement for a project. The main purpose of my Facets analysis is to calculate a single composite score out of the four ratings. Below is my input file. But I failed to run it in the Facets program. Can you help me take a look at it to find out the problems? Thanks!

Title = Oral; the report heading line
Facets = 3 ; examinees, Dimensions, raters
Converge = .5, .1 ; converged when biggest score residual is .5 and biggest change is .1
Positive = 1, 3 ; examinees are able, raters are lenient
Noncenter = 1 ; examinee measures relative to dimensions, raters, etc.
Arrange = F,m ; tables output in Fit-ascending (F) and Measure-descending (m) order
Usort = U ; sort residuals by unexpectedness
Vertical = 1A, 3A, 2A ; Display examinees, raters, texts by name
Model =
?,?,?,Game ; basic model: raters share the 9-point rating scale
Rating scale = Game,R9; 9 category scale
1 = did not complete at all
9 = complete all steps;
1, Examinees
1, G-21_03-06-2012_06-32
13, G-13_03-06-2012_06-06
2, Dimensions
1, Integrity
5, Adherence
3, Judges
1, Evaluator 1
2, Evaluator 2
3, Evaluator 3
4, Evaluator 4

Mike.Linacre: Lost angel, that looks good so far, but what does your data look like? We expect to see something like:

2, 1-5, 4, 5, 7, 3, 9, 8 ; Examinee 2 is rated on 5 dimensions by evaluator 4. Dimension 1 was rated 5; dimension 2, 7; dimension 3, 3; dimension 4, 9; dimension 5, 8

lostangel: Dear Mike,
My data looks like:
1,1,1-4,2,4,5,7; Evaluators 1-4 awarded scores 2,4,5,7 to Examinee 1 on Dimension 1
1,2,1-4,3,4,7,6,;Evaluators 1-4 awarded scores 3,4,7,6 to Examinee 1 on Dimension 2


Mike.Linacre: Lostangel, that looks correct.

What was the Facets error message or symptom that Facets failed?

lostangel: The error message is like this:
>Duplicate specification ignored: Labels=
Specification not understood:

Specification is: 1, Examinees
Error F1 in line 17: "Specification=value" expected
Execution halted
Analyzed in time of 0: 0:50

Mike.Linacre: Lostangel, Facets is telling us that there are two specification lines starting Labels=

There should only be one Labels= line in your Facets specification file. There is only one Labels= in your first post in this topic.

lostangel: Mike, but in my input file I have only one Labels= line. I can't figure out what's wrong with my file.

Mike.Linacre: Lostangel, please email your Facets specification and data file to me, so I can investigate: mike~winsteps.com

385. Sample size requirements

ajimenez June 8th, 2012, 5:37pm: Everyone,

Here is my situation and I would love some advice or direction. A test was given this year and had three forms with say 50 items, 12 of which were common items. The total sample size was around 2800. Moving forward, and assuming the total sample size will remain stable, can the next administration have four, instead of three, forms? I read that for high stakes testing the sample size recommendation is between 250 and 20*Number of items. I would not be near the upper level, but rather somewhere near the middle of that range. Does anyone know of a study that I could draw from further or have any suggestions of what you may do?

Thanks for looking.

Mike.Linacre: Ajimenez, you are probably looking at https://www.rasch.org/rmt/rmt74m.htm

You will probably need to do a simulation study that matches your test design in order to discover the minimum sample sizes, etc. for your situation.
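
Before any simulation, the arithmetic in the question can be laid out as below. This is only a sketch: it assumes the 2800 persons are split evenly across forms and that each form has 50 items, and it uses the 250 to 20 x items range quoted in the question. The function name is hypothetical.

```python
def persons_per_form(total_sample, n_forms):
    """Persons answering each form, assuming an even split across forms."""
    return total_sample / n_forms

total_sample = 2800
items_per_form = 50                  # "say 50 items" in the question
lower_bound = 250                    # lower end of the quoted range
upper_bound = 20 * items_per_form    # upper end: 20 * number of items

for n_forms in (3, 4):
    n = persons_per_form(total_sample, n_forms)
    inside = lower_bound <= n <= upper_bound
    print(f"{n_forms} forms: about {n:.0f} persons per form, "
          f"within {lower_bound}-{upper_bound}: {inside}")
```

With four forms each form-specific item would be answered by about 700 persons, while the common items would still be answered by the whole sample.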

386. Two dimension questionnaire

max May 23rd, 2012, 4:03pm: Hi Mike,

How would you advise analyzing a questionnaire that is intended to give two different scores? I'm looking at a modified version of the brief pain inventory (BPI), which has 4 pain severity questions and 7 pain interference questions, and if I do an analysis of all 11 questions together the principal analysis shows 2 very clear dimensions corresponding to the 2 suggested domains. So would you suggest simply analyzing each domain separately? Is it worth formally confirming the two dimensions through Rasch analysis before splitting them?

Mike.Linacre: Max, this is a challenging situation. The two components of the BPI only have a few questions each, so we really don't want to split them.

It is always a good idea to Rasch-analyze all the data together, then look at Winsteps Table 23.

In Table 23.2, do the items stratify vertically in accordance with the suggested 2 domains? In Table 23.0, is the eigenvalue of the first contrast noticeably above 2.0 ? If so, two separate analyses are indicated. Then cross-plot the two sets of person measures (Winsteps Plots menu, Scatter plot). Are the points trending along the diagonal (indicating two strands of one domain) or are the points a cloud (indicating two dimensions)?

max: Hi Mike,

I already did a summary analysis showing that the two domains do seem separate; specifically, the items stratify incredibly clearly on table 23.2 based on the suggested domains (the 4 in one domain all load >=0.55, the 7 in the other all load <=-0.06), and the eigenvalue in table 23.0 is 2.7. I suspect (though I haven't looked yet) that the two domains are positively correlated, but should I really analyze them as two "strands" of one domain just because they're correlated? The 4 question domain also seems to be "easier", and I also noticed that people with higher overall scores tend to misfit more, so I worry that analyzing them together will give me less accurate measures for precision, targeting, fit, threshold order, etc, due to a mixing of two measures that not only have different measurement properties, but that are also specifically stated in the user manual to be different domains...


Mike.Linacre: Max, yes, the domains will be correlated. With only 4 items in one domain, accuracy and precision are weak. The best that this instrument can do is to give strong indications.

The decision becomes a qualitative one. In practice, are two measures more useful than one? Think of arithmetic. School administrators only want one "arithmetic" measure for each child, but school psychologists may want separate measures for "addition" and "subtraction". "Subtraction" is known to interact with the socio-economic status of the child's family.

max: Hi Mike,

How do you perform the principal component analysis? When I use tables 23.x I get more contrasting results than if I perform a principal component analysis on the residuals extracted from the observation file (XFILE=) in SAS, though they have the same eigenvalue. In Winsteps one dimension has loadings from 0.41 to 0.74 and the other has from -0.07 to -0.64, but in SAS they range from 0.27 to 0.49 and from -0.05 to -0.42, respectively. The results from Winsteps look better (worse for the validity of the whole measure, but more consistent with the stated 2 dimensions), but I'm hesitant to use them without being able to replicate the results. I've tried both raw and standardized residuals without them making much difference at all, and haven't played too much with different options in SAS yet.


Mike.Linacre: Max, your PCA results from Winsteps and SAS have the same pattern, so there is probably some technical difference.

Compare the Winsteps PCA of the raw observations (PRCOMP=Obs) with an equivalent R analysis (see the attachments). The sizes of the PCA Components (Contrasts) are the same. The R loadings are standardized, but the Winsteps loadings are not.

Please try the same analysis with SAS. What do you see?

max: Hi Mike,

I get the exact same results as R when I use that example in SAS. Do you think it is valid to investigate the dimensionality of the measure via cluster analysis of the principal component scores? That's why I'd like to be able to do this in SAS instead of just using the output from Winsteps... In this case it's really obvious, but if I had 3 clusters of loadings for instance (highly positive, highly negative, and near zero), would it be reasonable to read that as 3 apparent dimensions?


Mike.Linacre: Thank you, Max.

So the difference between SAS and Winsteps is that SAS is using standardized loadings, but Winsteps is using raw loadings. From a Rasch perspective, standardized loadings are misleading because they give the impression that every component is equally influential.

I will amend the Winsteps output to clarify that Winsteps outputs "raw" loadings.

Winsteps also does a cluster analysis of the loadings. It is the second column of numbers under the heading "Contrast". See https://www.winsteps.com/winman/index.htm?table23_3.htm

In each contrast (component) plot, the top and bottom are at opposite ends of the same dimension (or different dimensions if you rotate). Between the top and the bottom are usually a composite of all the other dimensions, not a single other dimension.

max: Hi Mike,

I reproduced Winsteps' results perfectly now when I multiplied the item loadings by the standard deviation of the person loadings.

Regarding the cluster analysis, when I run your program for the tree data it shows 5 clusters, but if I close the table then reopen it manually by selecting Table 23 it shows 2 clusters. Is there a reason these are different? Everything else is identical other than the program definition stuff at the very bottom of the file. When I do a cluster analysis myself on my own data there are several suggested cutpoints I could choose based on pseudo F and t^2 criteria, which is why I'd like to be able to do it myself so that I know when the cutpoints are more (preferably) or less (as in my data) well defined.


Mike.Linacre: Max, please do your own specialized cluster analysis.

I will investigate the Winsteps cluster analysis. This is only intended to report two clusters as an aid to those who want to do a high-low split on the item loadings.

387. DIF contrast and equivalent Mplus estimate

Jane_Jin June 3rd, 2012, 5:46am: Hello Dr. Linacre,

I am trying to run a DIF analysis under the Rasch model using Mplus. Under Mplus, some of the threshold parameters have to be constrained equal for model identification purposes (especially when there is a group mean difference). I also ran my data using Winsteps. It seemed that Winsteps didn't apply constraints (or maybe there are some constraints behind the scenes that I am not aware of). If there are any, would you please tell me what they are?

The constraints I have for my Mplus code are:
1. factor loadings are equal across groups.
2. group one's factor mean is fixed to 0.
3. group two's factor mean is freely estimated.
4. I have 10 items, the first nine items' thresholds are all estimated freely for both groups.
5. The last item's thresholds are constrained equal between the two groups.
This is more like a multiple group CFA. The last item serves as an anchor item. I guess what I want to figure out eventually is how similarly Mplus and Winsteps estimate the DIF contrast under the Rasch model.

Thank you for your time.


Mike.Linacre: This is intricate, Jane.

In its main estimation, Winsteps constrains the mean item threshold (= mean item difficulty) to be zero. Group membership is ignored. No constraints are placed on the estimates of the individual members of each group.

Then in the DIF analysis, for each item (called the target item), all the other items are anchored (fixed) at their estimates from the main analysis, and also all the individual members of each group are anchored at their estimates from the main analysis. The target item is split into separate items for each group, and the threshold (difficulty) of the target item for each group is estimated.

Thus for each item, we have two thresholds (with standard errors). So,
DIF size = difference between the thresholds
DIF significance as a Student's t deviate = DIF size / (joint standard error of the pair of thresholds).

It looks like Mplus is actually doing a Differential Test Functioning (DTF) analysis rather than a Differential Item Functioning (DIF) analysis. For a DTF analysis in Winsteps, we would do two separate analyses, one for each group, and then compare the item thresholds. See https://www.rasch.org/rmt/rmt163g.htm
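
The DIF size and significance described above can be sketched as follows. The joint standard error is taken here as the square root of the sum of the two squared standard errors, which is an assumption about what "joint standard error" means in this context; the function name and the example values are hypothetical.

```python
import math

def dif_contrast(threshold_a, se_a, threshold_b, se_b):
    """Return (DIF size, joint SE, t) for one target item and two groups.

    threshold_a, threshold_b: item difficulty estimated separately for each group
    se_a, se_b:               standard errors of those two estimates
    """
    dif_size = threshold_a - threshold_b
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)  # assumed form of the joint SE
    t = dif_size / joint_se
    return dif_size, joint_se, t

# Hypothetical estimates for one item in two groups, in logits.
size, joint_se, t = dif_contrast(threshold_a=1.0, se_a=0.3,
                                 threshold_b=0.0, se_b=0.4)
print(f"DIF size = {size:.2f} logits, joint SE = {joint_se:.2f}, t = {t:.2f}")
```

In this sketch the same function would be applied to each target item in turn, with all other items and all persons held at their estimates from the main analysis.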

Jane_Jin: Thank you, Dr. Linacre, for the explanations. I understand better now.
So if I freely estimate one item at each time (fixing the other 9 items equal between groups) in Mplus, then Mplus is somewhat similar to Winsteps with respect to DIF contrast estimate?


Mike.Linacre: Yes, close enough, Jane.

As a general rule, we expect different, but reasonable, methods of detecting DIF to report the same items as having DIF. If the methods report different items, then the DIF for some items is method-dependent, and so likely to be due to statistical accidents.

Jane_Jin: Thank you.


388. Paired Criterion

uve March 24th, 2012, 11:24pm: Mike,

I've attached a modification I created from the Winsteps paired Excel analysis output of a test containing 45 items and over 800 students. While examining a pair of examinees who got the same 20 items wrong by choosing the same distractor options (Table 35.4), I discovered that the pair did not have the same teacher. How would it be possible for two different students to take the same test in two different classrooms and get almost half of the questions wrong by marking the same distractors? :o

Perhaps they were communicating by means of some electronic device, like a phone. Teachers are extremely vigilant during testing to check for this and monitor the class, but I think it is very easy to underestimate the ability of some students to use such devices even under the strictest circumstances. I know of one situation where the match occurred because the site data entry person scored the same test under two different student ID numbers by mistake. So there can be many other factors that might explain something like this. :-/

However, if we assume no data errors occurred and no communication occurred between these two examinees, then it seems we are only left with the conclusion that this was mere chance. This chance was extremely low; they likely had a better chance of winning the lottery. So I know that I can ignore this data because the examinees did not have the same teacher. It was then just a freak circumstance of chance, like someone being struck by lightning or a meteorite.

Now, if you choose the Same Teacher = Yes in the table filter, you'll notice that there is one pair of examinees that has 22 matching incorrect items. I think the next question is obvious: how sure can I be that this isn't chance as well? ??)

I guess I want to be more sure about avoiding false positives, but I'm a bit shaken by the data that I see when I choose Same Teacher = No. Perhaps my probability calculations used to determine the odds of the matches occurring are incorrect in some manner. Could you confirm that my math is right? I would rather this be the result of poor math than the nature of probability, which I have a feeling is how you're likely to answer. :)

I had to use Excel 2010 because there are about 90,000 rows. If you are using 2003, it will open but you will not be able to manipulate the pivot table and it will clip off all rows after 65,000.

Thanks as always for your comments.

Mike.Linacre: Uve, please try this: https://www.rasch.org/rmt/rmt61d.htm - when we first did that analysis, it was ground-breaking :-)

uve: Yes, I do like this resource and have read through it before.

DNA evidence in a murder trial usually supports conviction because the odds exceed Earth's population, i.e. 1 in 60 billion match. Evidence tampering, identical twin, lab errors, etc. ruled out, it is impossible for someone else to have the identical DNA.

Now suppose the Earth's population exceeds 100 trillion, the same odds could no longer help support conclusive conviction and the test would carry much less weight.

I guess that's where I'm going with my question. Given the large number of items and the very large number of respondents, a much greater than 1 in 1 million chance in my data seems to be a very freak, yet fairly conclusive false positive if we rule out communication between classrooms or a clerical error.

Like my example of the DNA test, perhaps the empirical data needs another measure of some kind. Something that better balances odds with total population.

Mike.Linacre: Yes, we have to be careful about being "fooled by randomness" (Nassim Nicholas Taleb). The "Hitchhiker's Guide to the Galaxy" is based on the premise that, if we could discover the exact probability of an unlikely event, then we could engineer the situation where that probability becomes reality!

This was part of the original motivation for https://www.rasch.org/rmt/rmt61d.htm . Another statistical consultant presented a probability analysis, but it was found to be confusing, rather than convincing. Expert witnesses could not refute the argument: "this is an unlikely event, but it can happen, and this is the instance where it did happen." A non-probabilistic approach was required.

As you remark, DNA evidence may also find itself in this same situation before too much longer. My own analysis of DNA datasets makes me skeptical of the huge improbabilities of coincidences that are claimed. My suspicion is that the DNA sample spaces for the probability computations include many DNA combinations that cannot be observed.

uve: Mike,

I was trying to find a response to an earlier thread I posted related to this one but couldn't. I just wanted to confirm the calculations for determining the probability in a situation in which two respondents have 12 items marked wrong using the same distractor (TWOLOW in the paired comparison Excel output table) on an MC test with four distractors total. Here's an example. Is the formula correct?

((0.25^2)+(0.25^2)+(0.25^2))^12 = 0.000000001888058

Also, if all is correct, I could interpret the output as a less than 1 in 10 million probability of this happening by chance. Would that also be correct?

Mike.Linacre: Corrected: Uve, assuming the distractors are equally probable, then the chances that two persons would choose the same distractor for one item = 4*(0.25)^2, and so for 12 items = (4*(0.25^2))^12

uve: Thanks!

Mike.Linacre: Uve, please see my correction. I was thinking of selecting the same specific option, e.g., Option 1, but, of course, the agreement can be between any of the 4 options for any of the 12 items.

Probabilities are slippery to compute. It is easy to imagine the wrong sample space :-(
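
The two formulas in this exchange can be checked numerically. The following is a minimal Python sketch, assuming (as in the correction above) four equally probable options and independent choices across the 12 items; the function name and its parameters are illustrative and are not part of any Rasch software.

```python
def same_option_probability(n_options: int, n_items: int, n_matching: int) -> float:
    """Chance that two independent responders agree on every one of n_items items,
    where agreement may occur on any of n_matching equally likely options
    out of n_options (an assumption: all options equally probable)."""
    p_option = 1.0 / n_options                      # chance of one specific option
    p_agree_one_item = n_matching * p_option ** 2   # both pick the same option
    return p_agree_one_item ** n_items              # agreement on all items

# Formula as first posted: agreement on any of 3 options, 12 items
posted = same_option_probability(n_options=4, n_items=12, n_matching=3)

# Corrected formula: agreement on any of the 4 options, 12 items
corrected = same_option_probability(n_options=4, n_items=12, n_matching=4)

print(f"posted formula:    {posted:.3e}")
print(f"corrected formula: {corrected:.3e}")
print(f"corrected odds: about 1 in {1 / corrected:,.0f}")
```

Under these assumptions the corrected value is 0.25^12, that is 1 in 4^12 = 16,777,216, which is a larger probability than the one produced by the formula as first posted.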

389. Help with conversion from likert scale

willsurg May 30th, 2012, 5:01pm: Hi, I need help understanding whether I can use Rasch to convert my Likert items/scales data.
I have an instrument with 42 items, divided into 12 dimensions (scales), each item on a 5-point Likert scale.
My n = approx 23,000

If I understand this correctly, then, given all the other constraints, I should be able to take my data, which look like the following:

Record# Q1 Q2 Q3 Q4 etc up to 42
A 1 3 2 5
B 2 2 3 4
C 1 2 4 1

and convert them to look something like the following:
Record# Q1 Q2 Q3 Q4 etc up to 42
A 1.2 3.1 2.3 5
B 2.2 2.1 14 5
C 1.3 3.1 1.3 5

Correct? If so, is there a paper or something that describes this so that a non-statistician can understand?

thanks in advance

Mike.Linacre: Thank you for your post, Willsurg. The process you want was done for the Functional Independence Measure (FIM) and then the numbers plotted in https://www.rasch.org/memo60.htm - the underlying math requires Rasch software.

willsurg: Thank you for getting back to me about my post on converting from a Likert scale. Unfortunately, it seems I have misunderstood Rasch. I thought it could convert a Likert scale into interval data so that means, SDs, correlations, etc. could be run. I don't follow how what you reference, the Instantaneous Measurement and Diagnosis, is similar to what I was hoping to do with my data (I apologize, statistics is not my speciality).

If I may trouble you once again, perhaps I did not explain my problem correctly? I want to transform each respondent's survey from its current Likert scale into interval data. It is currently in the form that most medical researchers use, with values between 1 and 5 corresponding to definitely disagree all the way to definitely agree. Instead of using the standard substitution values, I want to base the substitution on a "better statistical principle" such as Rasch.

Thank you for trying to help!


Mike.Linacre: Thank you for your post, Willsurg.

Rasch does not convert individual observations from their ordinal values into interval values. Rasch infers interval measures on the latent variable for each respondent and each Likert item that would produce the observed ordinal Likert responses. The values for each Likert category plotted in the "Instantaneous Measurement" Figures are the nearest approximation to interval measures corresponding to the Likert ratings, but the interval value for each Likert category is only approximate (good enough to put on a picture, but not good enough to use for computations such as correlations).

If you want to do your computations at the level of individual items, then the Likert-scale observations, though ordinal, are usually interval enough for practical purposes.

390. Table 20 Conversion Data

uve May 23rd, 2012, 10:11pm: Mike,

I followed the UMEAN and USCALE data reported in Table 20 for rescaling the logits to appear closer to the raw scores. I'm assuming there are some limitations to this, but as you can see the person raw score and scale score vary significantly (Total Score & Measure respectively). The same thing happens when I attempt to rescale the test to 0-100. This was a 14-item non-anchored 4-option survey scored 1-4, so the lowest score would be 14 and the highest would be 56. Could the low cap of 14 instead of the usual zero score for lowest response category have something to do with this?

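Entry  Total Score  Count  Measure  S.E.  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD  Corr.  Exp.  Obs%  Exp%
1      45           14     57.96    3.69  1.02          .2        1.20           .6         .22    .33   50.0  38.1
2      44           14     56.91    3.58  2.07         2.6        2.01          2.1         .16    .34   14.3  35.2
3      39           14     52.31    3.27   .24        -4.0         .27         -3.3         .64    .39   57.1  30.0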

Mike.Linacre: Uve, please use the Winsteps Help menu, Scaling Calculator.


Enter two of the reported Table 20 measures as the Current Measures. Then enter the Desired measures you want to see. Click on Compute New. The revised values of UIMEAN= and USCALE= will display. Click on Specify New to adjust the measures in the current analysis.

The Output Tables menu, Table 14, should show the measure range you want.

uve: :-/

Here is the Table 20 output:


I entered these values in Current, but I must be misunderstanding something very basic because I'm not sure what I would put in Desired in order for the measures to go from 0-100. I would assume the new Desired mean would be 50, but not sure what to do with the new deviation.

Mike.Linacre: Uve, those UMEAN= and USCALE= values set the measure range for Table 20 to 0-100.

UMEAN= (same as UIMEAN=) sets the average item difficulty at 49.0793
USCALE=12.5664 specifies that one logit = 12.5664 user-scaled units
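
The arithmetic behind this kind of rescaling can be sketched in a few lines of Python. This is only an illustration, assuming the linear relationship user-scaled measure = UIMEAN + USCALE × logit, with logits measured from the mean item difficulty; the function names are hypothetical, and the logit endpoints below are invented values chosen to be roughly consistent with the numbers quoted in this thread, not output from Winsteps.

```python
def rescale_settings(low_logit, high_logit, low_target=0.0, high_target=100.0):
    """Return (uimean, uscale) such that low_logit maps to low_target and
    high_logit maps to high_target under: scaled = uimean + uscale * logit."""
    uscale = (high_target - low_target) / (high_logit - low_logit)
    uimean = low_target - uscale * low_logit
    return uimean, uscale

def to_user_scale(logit, uimean, uscale):
    """Convert a logit measure to user-scaled units."""
    return uimean + uscale * logit

# Hypothetical logit measures for the minimum and maximum possible scores
low_logit, high_logit = -3.9056, 4.0521

uimean, uscale = rescale_settings(low_logit, high_logit)
print(f"UIMEAN = {uimean:.4f}, USCALE = {uscale:.4f}")
print(to_user_scale(low_logit, uimean, uscale))   # lower end of the 0-100 range
print(to_user_scale(high_logit, uimean, uscale))  # upper end of the 0-100 range
```

Because raw scores and measures are related by a curve rather than a straight line, a 0-100 rescaling of the measures will not in general reproduce the raw scores themselves.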

391. Table 30.1 & 30.4

uve May 20th, 2012, 2:04am: Mike,

I've attached a Word document detailing the first two items of a 50 item high school earth science exam. I'm particularly puzzled by the fact that Table 30.1 does not seem to reflect any gender DIF for item 1, while Table 30.4 clearly reports this exists. Perhaps 30.4 is over sensitive to the possible non-uniform functioning of item 1 for lower performers, while 30.1 does not seem to see this as significant. What are your thoughts? I've included item 2 only for contrast as an item for which both tables seem to agree.

Mike.Linacre: Uve, Table 30.1 is well authenticated.

In Table 30.4, we are applying large-sample statistics to only two data-points. The standard errors of the reported values are probably large. This is a problem with nearly all reported statistical results. The precision of the computation (often reported to 4 or 6 decimal places) is much higher than the precision of the statistical estimates (their standard errors).

If in doubt, always take the more conservative finding, i.e., that the null hypothesis is not rejected. In this case, Table 30.1 is more conservative.

Looking at Table 14.1, we can diagnose a likely problem. The item difficulty of Item 1 is anchored with a displacement of 0.8 logits. Our fit statistics, DIF computations, etc. are formulated based on free (unanchored) estimates. So please do fit and DIF investigations before anchoring item difficulties or person abilities.

uve: Unanchoring did the trick. Both tables now report no significant DIF. Thanks!

Mike.Linacre: Thanks for telling us, Uve. It is good to know that theory and practice agree in this case :-)

392. FACET

hm7523 May 10th, 2012, 3:55am: Hi,

I am learning many-facet model now.

In my model, I have items, raters, rater's gender, rater's occupation. Can I put all of these factors into my Many-faceted model? Some references said the background information such as gender cannot be put into the model, since it refers to the groups instead of the facet. Is that true?

I want to know whether the male rater and female rater are significantly different or not. If I didn't put this facet into my model, how can I know ?

In addition, what kind of "factor" can be put into the many-faceted model?


Mike.Linacre: Welcome, Hm7523.

In a Facets model there are two types of facet.
1. Facets hypothesized to generate the observation.
Examples: person ability, item difficulty, rater severity

2. "Dummy" facets hypothesized to be a cause of interactions and/or to be used for model selection.
Examples: gender, ethnicity, grade level

So, in your situation:
Facets = 5 ; students, items, raters, rater's gender, rater's occupation
Models = ?,?,?,?,?,R ; all facets can interact
Labels =
1, Students
1 = Jose
2 = Maria
2, Items
1 = My day at the zoo
2 = Looking at the Moon
3, Rater
1 = Mr. Smith
2 = Ms. Aquano
4, Rater gender, D ; notice the D for dummy or demographic facet
1 = Female
2 = Male
5, Rater occupation, D
1 = Teacher
2 = Specialist

Anything can be a "factor" but it must be identified with a list of elements. So, if "age" is a factor then

6, Age, D
1 = 1-20 ; age ranges identified as elements
2 = 21-40


hm7523: Hi, Mike,

I followed your instructions and set up my program like this, but the output shows
"Warning (6)! There may be 11 disjoint subsets" and it keeps iterating unless I click the Finish Iterating button.
What does that mean, and how can I solve this issue? Thanks.

; Angoff.txt
Title = Angoffround3
Score file = Angoffround3
Facets = 6 ; facets: judges, sex, location, status, round, items
Inter-rater = 1 ; facet 1 is the rater facet
Arrange = m,2N,0f ; arrange tables by measure-descending for all facets,
; 2N = element number-ascending for facet 2,
; and 0f = Z-score-descending for facet 0 (bias interactions)
Negative =1 ; the examinees have greater creativity with greater score
Non-centered = 0 ; examinees and items are centered on 0, judges are allowed to float
Unexpected = 2 ; report ratings if standardized residual >=|2|
Usort = (5,1,2,3,4,6)
Vertical = 2A,3A,4A,1A,5A,6* ;define rulers to display and position facet elements
Zscore = 1,2 ;report biases greater in size than 1 logit or with z>2
Model = ?,?,?,?,?,?,R

Labels= ;to name the components
1,judge ;name of first facet
1-32 ;names of elements within facet
;these must be named, or they are treated as missing

6, items
Data= round3.xls ; Facets can read in an Excel data file

Mike.Linacre: hm7523, assuming you are using a recent version of Facets, please code the demographic "Dummy" facets with D -
2,gender, D
3,location, D
4,status, D
5,level, D

393. Dimensionality

student May 16th, 2012, 3:31pm: Hello Prof Linacre

I am validating a scale measuring emotional intelligence using the Rasch measurement model. This scale consists of seven dimensions. The analysis included the test of dimensionality, item fit statistics, response categories and reliability. This analysis was applied to each subscale and to the overall scale.
The dimensionality results showed that each subscale is unidimensional. However, the test of dimensionality for the overall scale also indicated unidimensionality.
How can this result be interpreted? Is this scale multidimensional or unidimensional?
Can this result be interpreted in the same way as a result obtained from factor analysis?

Mike.Linacre: Student, what do you mean by "dimension"?

For instance, in an arithmetic test, there are often 4 dimensions: addition, subtraction, multiplication and division. For an arithmetic teacher, these are clearly different dimensions, each requiring the students to learn different cognitive operations. For an educational administrator, these are not dimensions, they are 4 strands of the arithmetic dimension.

It looks like you have the same situation. Whatever statistical technique you choose, the decision about dimensionality requires the same thought-process by the analyst. What do these data mean?

student: Thank you Prof Linacre for your clear answer

394. EAP Ability Estimates in ConQuest

May 12th, 2012, 4:15am: The columns in the output file of EAP ability estimates on p. 70 of the ConQuest 2.0 manual, Figure 7-5, are:

ID (or sequence number if none provided)
EAP estimate
EAP Posterior SD (effectively = S.E.)
EAP Reliability (see http://www.ets.org/Media/Research/pdf/RR-10-11.pdf )

Courtesy of Ray Adams

395. Simulated Data

uve May 11th, 2012, 7:06pm: As I mentioned in the other post, I still have a question or two about the simulated data and thought it would be better to start a separate post. Here is a quote of yours:

Mike.Linacre wrote: "Uve, the simulated responses are generated according to the Rasch model with generating parameter values equal to the Rasch estimates from the current data. Consequently, the simulated data are what the current data might look like if the current data fit the Rasch model perfectly."

This makes sense to me. However, what I am not understanding is how the random generator can create responses such that these random responses fit the model as perfectly as possible given the measures for the persons and the items.

Mike.Linacre: Uve, please see https://www.rasch.org/rmt/rmt213a.htm

Remember that "perfect fit" to a Rasch model, is not "observed data = expected data". Perfect fit is "observed data = expected data + expected randomness".
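
The idea of "expected data + expected randomness" can be illustrated with a minimal Python sketch for dichotomous items. It assumes the standard dichotomous Rasch model, in which the probability of success depends only on ability minus difficulty; the function names, the seed, and the ability and difficulty values are invented for illustration and do not reproduce the Winsteps simulation routine.

```python
import math
import random

def rasch_probability(ability, difficulty):
    """Model probability of a correct (1) response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def simulate_responses(abilities, difficulties, seed=0):
    """Generate 0/1 responses: each response is a random draw whose chance of
    being 1 equals the Rasch model probability for that person and item."""
    rng = random.Random(seed)
    data = []
    for ability in abilities:
        row = []
        for difficulty in difficulties:
            p_success = rasch_probability(ability, difficulty)
            row.append(1 if rng.random() < p_success else 0)
        data.append(row)
    return data

# Hypothetical generating values (in logits)
abilities = [-1.0, 0.0, 1.5]
difficulties = [-0.5, 0.0, 0.5, 1.0]

simulated = simulate_responses(abilities, difficulties, seed=1)
expected = [[rasch_probability(b, d) for d in difficulties] for b in abilities]

for observed_row, expected_row in zip(simulated, expected):
    print(observed_row, [round(p, 2) for p in expected_row])
```

The simulated responses are random, but not arbitrary: each draw is governed by the model probability, so the data scatter around the expected values by the amount the Rasch model predicts.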

396. Between Class MNSQ & ZSTD

uve May 5th, 2012, 10:35pm: Mike,

Are Between Class MNSQ and ZSTD in Table 30.4 to be interpreted the same as Outfit MNSQ and ZSTD found in Table 13? In other words, would a Between Class ZSTD above 2 be significant for underfit and less than -2 be significant for overfit?

For MNSQ, it seems the values are between 0 and 1 for all the items on all the dichotomous tests I've investigated so far. It may be a coincidence, but since the MNSQ values are so much different across classes than for individual items, I am second guessing my interpretation of them and the usual very general guidelines I see given at times for establishing acceptable limits. Thanks again as always.

Mike.Linacre: Thank you for your question, Uve.

ZSTD statistics (z-scores) are unit-normal deviates. So they are always hypothesized to have two-sided 95% confidence intervals at +-1.96. In other words, values outside 1.96 are considered to be statistically significant. 1.96 is rounded up to 2.0 due to the inevitable imprecision in the parameter estimation and subsequent computation.

MNSQ is a chi-square statistic divided by its degrees of freedom. Values less than 1.0 indicate overfit of the model to the data. This is probably due to over-parameterization. In other words, the Rasch model parameterized the data correctly, so that the addition of grouping parameters over-parameterizes the model.

But, but, but I am not an expert on Between Class statistics. These are advocated by Richard M. Smith and David Andrich. Please see their writings for more accurate information about Between Class statistics.
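
To make "a chi-square statistic divided by its degrees of freedom" concrete, here is a minimal Python sketch of the ordinary outfit mean-square for one dichotomous item. It is not the Between-Class statistic discussed in this thread: it assumes the usual definition of outfit as the average squared standardized residual, with the number of responses standing in for the degrees of freedom, and the abilities and responses are invented for illustration.

```python
import math

def outfit_mean_square(responses, abilities, difficulty):
    """Average squared standardized residual for one dichotomous item.
    Each squared residual is (observed - expected)^2 / model variance."""
    chi_square = 0.0
    for observed, ability in zip(responses, abilities):
        expected = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
        variance = expected * (1.0 - expected)
        chi_square += (observed - expected) ** 2 / variance
    return chi_square / len(responses)  # chi-square divided by its (approximate) d.f.

# Hypothetical person abilities (logits) and an item of difficulty 0
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Deterministic, Guttman-like pattern: low performers fail, high performers succeed
guttman_like = outfit_mean_square([0, 0, 0, 1, 1], abilities, difficulty=0.0)

# Reversed pattern: low performers succeed, high performers fail
reversed_pattern = outfit_mean_square([1, 1, 0, 0, 0], abilities, difficulty=0.0)

print(round(guttman_like, 2), round(reversed_pattern, 2))
```

In this toy example the too-predictable pattern gives a mean-square below 1.0 (overfit) and the reversed pattern gives a value well above 1.0 (underfit), which is the direction of interpretation described above.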

uve: Yes, Richard Smith & Christie Plackner have a very nice article I just read in Advances in Rasch Measurement Vol I, "The Family Approach to Assessing Fit in Rasch Measurement." Very nicely done. Interpretation appears to be the same as for total fit statistics, but in Winsteps the Between Class fit output looks very different from total fit. This may just be a function of the formula, but the article doesn't elaborate.

Mike.Linacre: Yes, Uve. Richard Smith is an enthusiastic advocate for Between-Class fit statistics. He has found them very useful. The Between-Class statistic should summarize the changes in slope of the empirical ICC (Winsteps Graphs menu). Do you see that in your analyses?

uve: Mike,

I'm not quite sure how to relate the visual of the slope to the Between Fit data. In fact the Between Fit data look very different from the total item fit data. So I'm not sure how to summarize the data in the manner you mention.

However, I did attach two item ICCs and fit data. What I typically do with ICCs is find a couple of items that have a MNSQ as close to 1 and a ZSTD as close to zero as possible. I make the assumption that I should be able to find empirical data that follow one of those curves very closely. I'll work with the empirical interval adjustment until I find that best-fit visual. This becomes my standard against which to compare all other ICCs.

Using that as my guide, you can see that Item 22's fit stats are reflected nicely in the visual. Though not as strong, Item 5's fit stats are nice as well, but the visual does not seem to confirm this. Moving to the Between Fit for Item 5, the ZSTD is below -2. I'm assuming then that Between Fit is revealing the potential misfit that the total fit is not detecting but which is detected visually in the ICC.

The problem is that with total fit, a negative ZSTD would indicate significance of overfit. All of the 50 items for the 93 persons in this test had Between Fit MNSQs between 0 and 1, and all ZSTDs were negative. In fact, the lower the MNSQ, the further the ZSTDs are from zero, which appears to be the opposite of what we see with total fit. So I'm a bit uncomfortable making the statement that the Between Fit ZSTD is indicating significant misfit between classes, though the ICC seems to confirm this somewhat. It's clear that the Between Fit ZSTD is not indicating overfit, or perhaps it is. Please let me know what you think.

Again, I'm not sure how to interpret these stats. By the way, I noticed that for most of the dichotomous tests so far, all the Between Fit MNSQ's are also between 0-1 and negative ZSTD. However, for polytomous instruments it's different. Very odd. :-/

Mike.Linacre: Thank you for your report, Uve.

Item 22 appears to have considerable overfit from every perspective. The empirical ICC (blue line) is tracking much too close to the red line (model ICC). I am amazed that mean-squares are reported slightly above 1.00. Do you have LOCAL=YES ?

Item 5: the empirical ICC is much as expected. It is a little noisy (blue line a little outside the gray confidence intervals) so mean-squares of 1.2 make sense.

For the between-fit statistic, we need to adjust the empirical interval so that the number of x's on the blue-line matches the number of class-intervals for the between-fit statistic. So, for instance, if your instruction is
DIF = $MA3
to stratify the sample into three ability groups, then we would set the empirical interval to produce 3 x's.

uve: Mike,

Local is set to N. There were five classes for Between Fit. I ran the ICC's again and adjusted the interval until I have the smallest that will provide 5 points just before going to 6. I've added these to the original for your review. Should I switch Local to Yes and start this over?

Mike.Linacre: Uve, Local=Yes will do the opposite of what we want!

The numbers and pictures continue to be contradictory ...

Suggestion: use the Winsteps "simulate data" function. Then analyze the simulated data. How do the numbers and plots for items 5 and 22 compare with the original numbers?

uve: Mike,

Yes, I kept Local = N.

I created the simulated file from my dataset. This seems to have reversed things somewhat. I kept the interval the same and added this to the existing document. What is the intent of running a simulated file in this case? Also, how does matching the number of points (5) on the ICC relate to the 5 classes for Between Fit?

Mike.Linacre: Thank you, Uve.

The reason for simulating data is so that we know what numbers to expect when the data do fit the Rasch model. So the simulation provides numbers and pictures with which to compare our empirical findings.

My understanding is that the Between Fit statistic stratifies the sample by ability into Classes. The x's on the empirical ICC do the same thing.

If the numbers produced by Winsteps continue to be inexplicable, please email your Winsteps control and data file to me, so that I can verify the computations.
mike /-/ winsteps.com

uve: Thanks Mike. I greatly appreciate your willingness to take a closer look and will send the file off to you.

I am still a bit confused by your comment, "the simulation provides numbers and pictures with which to compare our empirical findings." Perhaps I misunderstand how the simulation file works. If we use the measures from the original file, then I thought the simulation run creates random responses starting with the seed number. These random responses are then compared to the probability of failure given the ability and item difficulty. For dichotomous items, if the randomly generated response is below the probability of failure, then it is considered an incorrect response and coded as wrong or zero, and as 1 otherwise. How can random responses provide us a picture of what to expect? To me that sounds like a best-case scenario when it would be closer to a worst case: random noise, etc. I thought the simulation file was random data and that, in order for the empirical data to be meaningful, it should be significantly different from the simulation file; otherwise the empirical data are no better than random. But apparently I am off in my thinking about this. Could you please elaborate a bit more or perhaps point me in the direction of some more resources?

Mike.Linacre: Uve, the simulated responses are generated according to the Rasch model with generating parameter values equal to the Rasch estimates from the current data. Consequently, the simulated data are what the current data might look like if the current data fit the Rasch model perfectly.

Thank you for emailing your Winsteps control and data file to me. I notice that the dataset is relatively small. Most Rasch fit statistics are based on asymptotic theory (the values become more accurate as the size of the dataset increases). This may explain some of the discrepancies. More soon :-)

Mike.Linacre: Uve, the Winsteps analysis reports coherent findings for your items.

Using DIF=$MA4

1) The worst two fitting items, 27 and 15 in Table 10.1, have two of the worst between-class fits in Table 30.4

2) The worst between-class fit is for item 42. It has overfit in Table 10.1, but a distinctive empirical ICC. For the lower half of the ability sample, the item is highly discriminating between ability levels. For the upper half, that item does not discriminate between ability levels. This also explains the overfit. Item 42 is behaving like a composite of two items.

3) The smallest between-class mean-square is for item 25. It has average infit and outfit mean-squares. In the Graphs window, its empirical ICC tracks the model ICC closely except at the top end where there are only 3 of the sample of 93 persons.

In the Excel DIF plot for "DIF Size", we can see the relative difficulty of each item for each class. The between-class fit statistics summarize this plot for each item.

uve: Thanks for your great insight Mike. I've attached the output for you and others to review if needed.

I do have a few clarifying questions as usual. :)

In regard to 1: Item 15 does seem off, but item 16 seems to be the second worst next to item 27. Did you mean 16 or is there something about 15 that makes it worse that I'm not seeing? Also, items 27 and 16 do have the worst Summary DIF Chi-Square, but not the worst between class fit. Were you meaning this, or again, is there something about the between fit I'm not seeing?

In regard to 2: Yes, I do see what you mean.

In regard to 3: Yes, I do see what you mean, but how did you determine the number included in the top end?

Finally, you say the between class summarizes the DIF Size, but between class is not a DIF calculation to my understanding. On the last page of the attached document are fit formulas I created in Word from the Smith article. Again, these don't appear to be DIF formulas. Were you referring to Summary DIF Chi-Square in Table 30.4? If I remember correctly, those are the Table 30.2 DIF t-values squared and summed. If so, then it would seem to me that the graph we would want would be the DIF t-value that would provide the data for the DIF Summary Chi-Square we see in Table 30.4.

I still have a question or two about the simulation data but will post that separately.

I greatly appreciate all your help.

Mike.Linacre: Uve, sorry for the typo. Item 16 not item 15.

Between-Class fit is conceptually a non-uniform DIF computation. DIF=$MA4 defines four focal groups, one for each ability level.

Yes, the t-values are in the Excel "DIF t" plot.

To see the number in the top end, I looked at the ability distribution in Winsteps Table 1. There are 3 outliers at the top.

397. Weighted Sub-Strands

djanderson07 May 10th, 2012, 6:07pm: Hi, I have a bit of a conundrum that I'm hoping someone out there can help me solve. I'm working with a test that has sub-strands. Each of these sub-strands has an unequal number of items, yet the agency for whom the test has been developed would like each strand to contribute equally to the respondent's total score.

In the past, this issue has been dealt with by simply applying a weighting factor to the total raw sub-strand score (e.g., if Strand A had 5 items and Strand B had 10 items, then the total sub-strand score of Strand A would be multiplied by 2). However, we have since moved away from the raw score reporting and are trying to move toward a scale score via the Rasch model.

So, that leads me to my question... Is there a way we can use the Rasch model to scale persons and items, yet still weight these sub-strands so they each contribute equally to the respondents' total scale score despite the unequal number of items within each sub-strand?

Thanks in advance to any who have thoughts on the issue.

Mike.Linacre: Sure yes, dja.

First, do the unweighted analysis. This verifies that everything is working correctly. This analysis also reports the true fit statistics and reliabilities.

Second, weight the items to match the test plan. In Winsteps this would look like:
1 2 ; item 1 is in sub-strand A, so weighted 2
4 2 ; item 4 is also in sub-strand A
This will give the weighted raw score, and measures corresponding to it.
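
The effect of such weights on the raw score can be sketched in Python. This is only an illustration of the weighted raw score, assuming the example from the question (Strand A with 5 items weighted 2, Strand B with 10 items weighted 1) and dichotomous 0/1 scoring; the function name and the response string are hypothetical, and the sketch does not estimate Rasch measures.

```python
def weighted_raw_score(responses, weights):
    """Sum of item scores, each multiplied by its item weight."""
    return sum(weight * score for score, weight in zip(responses, weights))

# Hypothetical 15-item test: items 1-5 are Strand A (weight 2),
# items 6-15 are Strand B (weight 1)
weights = [2] * 5 + [1] * 10

# Maximum weighted contribution of each strand to the total score
max_strand_a = sum(weights[:5])
max_strand_b = sum(weights[5:])
print(max_strand_a, max_strand_b)

# One invented respondent: 3 of 5 correct on Strand A, 6 of 10 on Strand B
responses = [1, 1, 1, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(weighted_raw_score(responses, weights))
```

With these weights each strand can contribute the same maximum to the weighted raw score, which is the equal-contribution property the test plan asks for.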

djanderson07: Thank you Mike! I can't tell you how much this helps. You are a life saver.

398. reporting results

student May 7th, 2012, 3:34pm: Hello Prof Linacre

I am doing a study to examine the psychometric properties of an achievement test, a self-efficacy scale and an emotional intelligence scale using the Rasch measurement model. I conducted four types of analysis: dimensionality, fit statistics, response categories, and item and person reliability. I applied these analyses to each subscale in these instruments and to each overall scale. When I reported the results, I found myself repeating the same tables and the same information for each instrument. My question is: how can I improve the reporting and avoid repeating the results?

Mike.Linacre: Student, is this the layout of your report?

Type of analysis
Type of analysis

If so, why not:

Types of analysis
Types of analysis

Remember your audience! If you find yourself repeating something, then you have already made your point. No need to say it again. :-)

dachengruoque: Dr Linacre, could you please provide a citation for a research paper that serves as a classic example of presenting results from applying the Rasch method? Thanks a lot!

Mike.Linacre: Dachengruoque, Rasch experts have been unable to agree on a "classic example". Rasch-based research reports are heavily content-specific and audience-orientated. However, there are good Rasch-based papers published in most commonly-encountered content areas. For instance, in Physical Therapy https://www.rasch.org/rmt/rmt102c.htm

dachengruoque: I got your point. Probably it is because I was distracted by those SPSS manuals and books instructing researchers to report this and that. Thanks a lot!

Mike.Linacre: Dachengruoque, you have encountered the quantitative vs. qualitative chasm. The quantitative folks (SPSS) focus on what numbers to report. The qualitative guys focus on what meaning to report. Rasch tries to straddle the chasm.

399. Unexplained Variance Question

rab7454 May 7th, 2012, 7:56pm: Hi Mike,

I'm confused by the following paragraph from the WINSTEPS website regarding dimensionality:

There is a paradox: "more variance explained" → "more unidimensional" in the Guttman sense - where all unexplained variance is viewed as degrading the perfect Guttman uni-dimension. But "more unidimensional" (in the stochastic Rasch sense) depends on the size of the second dimension in the data, not on the variance explained by the first (Rasch) dimension. This is because most unexplained variance is hypothesized to be the random noise predicted by the Rasch model, rather than a degradation of the unidimensionality of the Rasch measures.

I'm specifically confused by the statement, "...most unexplained variance is hypothesized to be the random noise predicted by the Rasch model..."

How is this possible?

Also, from the WINSTEPS output, how does one obtain an estimate of the % variance that is explained by random noise?

Any thoughts would be most appreciated.

Best wishes,


rab7454: ****Correction to my second question***

My second question should read:

Also, from the WINSTEPS output, how does one obtain an estimate of the % of the unexplained variance that is hypothesized to be the random noise predicted by the Rasch model?

Thanks again,


Mike.Linacre: Thank you for your questions, Ryan.

The Rasch model is a probabilistic model. It requires randomness in the data. The Rasch model is not a deterministic model. The Guttman model is a deterministic model.

The randomness in the data supports the construction of additive measures from ordinal data. The deterministic Guttman model constructs ordinal "measures" from ordinal data.

In Winsteps Table 23.0 https://www.winsteps.com/winman/index.htm?table23_0.htm there is a column labeled "Modeled". This shows the expected variances if the data fit the Rasch model (in the way the Rasch model expects).

rab7454: Mike,

Thank you for your quick and informative response. Out of curiosity, how would you interpret the following results with respect to dimensionality?:

Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                           -- Empirical --          Modeled
Total raw variance in observations    =   103.4   100.0%            100.0%
Raw variance explained by measures    =    53.4    51.6%             51.6%
Raw variance explained by persons     =    18.1    17.5%             17.5%
Raw Variance explained by items       =    35.2    34.1%             34.0%
Raw unexplained variance (total)      =    50.0    48.4%   100.0%    48.4%
Unexplned variance in 1st contrast    =     1.5     1.4%     2.9%
Unexplned variance in 2nd contrast    =     1.4     1.4%     2.9%
Unexplned variance in 3rd contrast    =     1.4     1.3%     2.8%
Unexplned variance in 4th contrast    =     1.4     1.3%     2.7%

Again, thank you for sharing your thoughts with me.


Mike.Linacre: Ryan, multidimensionality is usually indicated by the size (eigenvalue) of the "Unexplned variance in 1st contrast" being greater than 2.0. Here it is 1.5, about what we would expect for data that fit the Rasch model.
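
The reading of this table can be sketched in Python as a small check. It uses the eigenvalues quoted in the post above and the 2.0 guideline for the first contrast mentioned in the reply; the variable and function names are illustrative, and the percentages are recomputed from the rounded eigenvalues, so they may differ slightly in the last digit from the values Winsteps reports.

```python
# Eigenvalues copied from the table quoted in this thread
total_variance = 103.4
explained_by_measures = 53.4
unexplained_total = 50.0
contrast_eigenvalues = [1.5, 1.4, 1.4, 1.4]

FIRST_CONTRAST_GUIDELINE = 2.0  # rule of thumb from the reply, not a strict test

def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

explained_pct = percent(explained_by_measures, total_variance)
unexplained_pct = percent(unexplained_total, total_variance)
first_contrast = contrast_eigenvalues[0]
suggests_second_dimension = first_contrast > FIRST_CONTRAST_GUIDELINE

print(f"explained by measures: {explained_pct:.1f}%")
print(f"unexplained (total):   {unexplained_pct:.1f}%")
print(f"1st contrast eigenvalue {first_contrast} exceeds guideline: {suggests_second_dimension}")
```

Here the first contrast has about the strength of 1.5 items, below the 2.0 guideline, which matches the interpretation given in the reply.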

dachengruoque: Is the eigenvalue threshold (2.0) the only criterion for detecting the dimensionality of a data set? Or are there any other indicators of dimensionality in Winsteps? Thanks a lot!

Mike.Linacre: Dachengruoque, "the only criterion for detecting dimensionality"?

This depends on how we define "dimensionality". For instance, guessing on a multiple-choice test is off the Rasch dimension. But is it on another dimension?

If "multi-dimensional" means "clusters of inter-item correlations in the Rasch residuals", then "eigenvalue" is the place to look. This aligns with McDonald's "independent clusters" http://nth.wpi.edu/PapersByOthers/multidimensoal-item-repoce%20theory.pdf

But if "dimension" means something else, then we must look elsewhere.

400. Curve Differences and PCM

uve April 24th, 2012, 7:18pm: Mike,

If I recall correctly, you mentioned that a good way to determine whether PCM should be used, or whether items can all share the same rating scale, is to group the items according to what you think best using PCM, then pick an item from each group and compare the ICCs. Items whose curves don't differ greatly could likely all be grouped together. I've attached a plot. One curve represents 6 items and the other 4. Using PCM, I picked one item from each group. The low-rated and mid-rated respondents seem to perform slightly differently, while high-rated respondents seem to rate identically. The problem I have is making a judgment about whether the differences are significant or not. There were about 560 respondents.

Is there a way to determine the differences between the two curves using perhaps the data from the plots which are also in the attachment? If so, what criteria would I be using to support the visual?

Thanks as always.

Mike.Linacre: Uve, if everyone responds to every item, then the different curvatures of the PCM ICCs will have a slight caterpillar effect on the person measures on the latent variable, but will not alter the rank order of the persons.

In general, we would be concerned if the differences in the curvature of the ICCs caused a change in the rank-order of the item difficulties that impacted construct validity. We would need to ask ourselves, "Which do we trust more: the current data or the construct theory?" Physical scientists would reply "The construct theory, unless the data clearly and convincingly contradict it. In which case, the construct theory needs to be revised."

Suggestion: analyze the items using group-level rating scales as far as possible. Then simulate a dataset of the same size as the current dataset. Analyze the simulated dataset with PCM. How does the variation in ICCs compare with a PCM analysis of the original data?

uve: Mike,

I did as you said. Item 1 of my survey had three choices and the other 13 items had 4. Though category 2 occupied an extremely narrow range for the latter group of 13 items and I did collapse it with category 1, I still ran the analysis giving item 1 its own group and items 2-14 the second group. I then created a simulated file and ran that at the lowest PCM level I felt was reasonable, 4 groups in all. I then picked 1 item from each of those groups from the simulated file and compared their order to the same 4 items from the first file. Though I did notice curve shape changes, the items seemed to be in the same order when choosing the Absolute X-axis option.

Since the order seems to be the same, I'm assuming that the original 2 group option is viable and that the 4 groups is not needed. However, had the 4 group PCM showed a different order, then I'm assuming it would have been the better choice. Would that be correct?

Mike.Linacre: Uve, "had the 4 group PCM showed a different order, then I'm assuming it would have been the better choice. Would that be correct?"

The quantitative guys would say, "Of course!" The qualitative guys would say, "But that's only one piece of the evidence."

There's an interesting lesson in the recent Kaggle-Grockit competition (predicting the correctness of students' responses to test items). The winner of the competition reported that he looked at the meaning behind his predictions. The runner-up reported that he merely looked at the numbers and ignored their meaning. This is the same challenge that the data-miners are facing: http://gizmodo.com/5906204/the-problem-with-big-data-is-that-nobody-understands-it

uve: Thanks Mike.

Always so much more to think about :)

I wonder if PCM and RSM would be analogous to viewing A Sunday Afternoon on the Island of La Grande Jatte by Georges Seurat ( http://www.most-famous-paintings.org/A-Sunday-Afternoon-on-the-Island-of-La-Grande-Jatte---2-large.html ) a mere inch away, then 5 feet away. The meaning of the painting would certainly change.

401. DIF contrast

ani May 2nd, 2012, 2:15pm: Hi Mike,

How can I explain, in terms of raw data, what a DIF contrast of 0.5 logits, or another of -0.70 logits, means after the DIF analysis?


Mike.Linacre: Ani, if you are looking at Winsteps output, then the DIF sizes are also shown as the "Obs-Exp Average" difference in the observations for this sample of persons. https://www.winsteps.com/winman/index.htm?table30_1.htm

Generally, for a person at the mean ability of the sample, compute the difference between the expected score at (mean ability - item difficulty + half the contrast) and the expected score at (mean ability - item difficulty - half the contrast), using a logit-to-expected-score table such as https://www.winsteps.com/winman/index.htm?whatisalogit.htm
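
As a rough illustration of that computation for a dichotomous item, the sketch below converts a DIF contrast in logits into a difference in expected score (probability of success) for a person at the sample mean ability. The mean ability and item difficulty used here are made-up values, and the function names are hypothetical; for a polytomous item the logistic function would have to be replaced by that item's expected-score curve.

```python
import math

def expected_score(logit):
    """Expected score (probability of success) on a dichotomous Rasch item."""
    return 1.0 / (1.0 + math.exp(-logit))

def dif_in_score_units(mean_ability, item_difficulty, contrast):
    """Expected-score difference implied by a DIF contrast (in logits)."""
    base = mean_ability - item_difficulty
    return expected_score(base + contrast / 2) - expected_score(base - contrast / 2)

# Hypothetical values: the item is targeted exactly at the sample mean.
for contrast in (0.5, -0.70):
    diff = dif_in_score_units(0.0, 0.0, contrast)
    print(f"contrast {contrast:+.2f} logits -> {diff:+.3f} score points per response")
```

With these illustrative values, a contrast of 0.5 logits corresponds to a difference of about 0.12 score points per response; the difference shrinks as the item moves away from the person's ability.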

402. Differing Group Dimensionality

uve April 24th, 2012, 7:32pm: Mike,

We recently received the results from a survey attempting to measure parent satisfaction with their child's school. I did not design the survey but have been asked to report on the findings. About 80% of our students are classified as economically disadvantaged and about 35% have a primary language other than English spoken at home. Our communications department wanted to provide parents and guardians with the option to take the survey online or use a paper scan sheet because many of our families simply can't afford a computer.

I've attached the results with the items in misfit order as well as some very puzzling results with regard to dimensionality. At the end of the attachment are the results from the t-test. With extreme scores included, there is clearly a difference in responses between those families using online versus paper. But once the extreme scores are removed, there does not seem to be a significant difference.

Wanting to investigate further, I calibrated the items using all students (566), then only those that took it online (193) then those that handed in paper scan sheets (373). The dimensionality tests for each run are very different, though all three seem to indicate a 2nd dimension of some kind. But each reports this as something different.

For example, when all parents were used or only those using paper scans, the 2nd dimension seems to have something to do with "This School" items. However, when only the online parent data were used, the 2nd dimension seems to be centered on staff.

I would be very interested in your thoughts about this.

Mike.Linacre: Uve, this is an interesting sociological investigation :-)

Let me speculate ....

Thinking of parents I know, the online version would appeal to younger, more technological, more affluent, more "English" families. The paper-and-pencil version would appeal to parents with opposite characteristics.

From your findings, this suggests that "paper" parents see the "School" as a special topic, but the online parents see the "Staff". Perhaps the difference is that more traditional parents want their children to go to a good school (the teacher is secondary), but more modern parents want their children to be taught by a good teacher (the school is secondary).

My own parents were "modern" for their time, and they focused on good teachers ahead of good schools. Their choice for me has been justified. The school with good teachers is now ranked ahead of the good school they would have chosen!

Uve, undoubtedly you can do better than these speculations ....

uve: Thanks Mike. For me personally, it's often hard to be objective about my own data, which come from the world in which I'm surrounded. But yes, I agree with your observations.

For me there's an awkward crossroads of sorts. When we engage in dimensionality analysis and get different results depending on whether we are looking at all respondents or primary subgroups, then I want to make sure as best as is possible that I am not running down too many rabbit holes. In other words, in this example, I feel reality changes significantly depending on how the data are disaggregated. Decisions about what to do or how to modify the survey could be contradictory based on which hole I chase the rabbit down. :)

Again, thanks for taking your valuable time and going over this.

404. CFILE shortcuts

uve April 24th, 2012, 9:08pm: Mike,

Is there a shortcut of sorts in Winsteps for setting up labels in CFILE for PCM options? For example, if items 1-6 used Sometimes, Often and Always and items 7-18 used Maybe, Agree, Strongly Agree. Something like this maybe:

1-6+A Sometimes
1-6+B Often
1-6+C Always
7-18+A Maybe
7-18+B Agree
7-18+C Strongly Agree

In a situation like this, here's what I've been doing:

A Maybe
B Agree
C Strongly Agree
1+A Sometimes
1+B Often
1+C Always
Then I repeat what I did for item 1 for items 2-6. Letting the first A-C apply to items 7-18 means I do not have to enter them for the majority of the items. Then I have fewer items to do one by one that use other, less common categories. Hope that makes sense. Thanks.

Mike.Linacre: Uve, if you are using
ISGROUPS = 111111222222222222
then you only need CFILE= for one item in each item group.

Your shortcut approach is the best we can do at present.

Allowing an item range is a useful suggestion. I will investigate its feasibility, for example:
1-6+A Sometimes
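
Until an item range is supported, the repeated item+category lines can be generated outside Winsteps and pasted into the control file. The sketch below does this for the example in this thread, using the item+category format from the posts above; the function name and the dictionary layout are hypothetical.

```python
def expand_cfile(range_labels):
    """Expand {(first_item, last_item): {category: label}} into item+category lines."""
    lines = []
    for (first, last), labels in range_labels.items():
        for item in range(first, last + 1):
            for category, label in labels.items():
                lines.append(f"{item}+{category} {label}")
    return lines

# Items 1-6 use their own labels; items 7-18 rely on the default A-C labels.
specific = {(1, 6): {"A": "Sometimes", "B": "Often", "C": "Always"}}
defaults = ["A Maybe", "B Agree", "C Strongly Agree"]

cfile_lines = ["CFILE=*"] + defaults + expand_cfile(specific) + ["*"]
print("\n".join(cfile_lines))
```

The printed block starts with CFILE=* and ends with *, with the three default labels first and then one line per item and category for items 1-6.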

uve: Mike,

Yes, I use ISGROUPS in conjunction with CFILE. Thanks for considering adding more flexibility to CFILE.

405. item-measure correlations - corrected

oosta April 23rd, 2012, 12:06am: Mike:

Several months ago, I posted a thread (which I cannot find now) about item-total correlations. You said that Winsteps can compute item-remainder correlations for the total score (number correct) but not for the item-measure correlations. As you know, in an item-remainder correlation, the item's score is excluded when computing the total score. You also said that you would consider adding to Winsteps the capability to compute item-remainder correlations for the item-measure correlations.

Are you planning to do so? Do you have any idea when it might be implemented in a new version of Winsteps?

Mike.Linacre: Thank you for your question, Oosta.

Here are correlations that Winsteps can compute.
They can be for each person across the item scores, or each item across the person scores:

PTBIS = Yes or Exclude
Correlation between marginal (total scores) less the current response with the current responses

PTBIS = All or Include
Correlation between marginal (total scores) including the current response with the current responses

PTBIS = No or Measure
Correlation between the item or person measure and the current response. For items the sign of the difficulty measure is reversed.

oosta: Okay. So, it looks like the item-measure correlations always include the current item in the computation of the person's measure (total) score. Would you consider putting it on your program enhancements wish list to compute the item-measure correlation where the current item is NOT included in computation of the person's measure? I have no idea whether this is possible, easy to program, or difficult to program.

Mike.Linacre: Oosta, now I remember your earlier post.

A point-measure correlation for an item, omitting the responses to the item from the person measures, can be estimated:

For each item, correlate the observations with (person measure - item observation*(person S.E.)^2) in logits.

This has not been included in Winsteps because computing a more exact observed value of this correlation, along with its expected value, requires considerable additional computational effort. But I will be happy to include this computation if you can report that the benefit outweighs the cost in a meaningful way.
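
The approximation above can be tried outside Winsteps from exported person measures and standard errors. The following is only a sketch of that formula as stated in this post: the function names are hypothetical and the responses, measures and S.E.s are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corrected_point_measure(observations, measures, standard_errors):
    """Correlate item observations with person measures adjusted to omit the item.

    Uses the approximation from the post:
    adjusted measure = measure - observation * S.E.^2 (all in logits).
    """
    adjusted = [
        b - x * se ** 2
        for x, b, se in zip(observations, measures, standard_errors)
    ]
    return pearson(observations, adjusted)

# Hypothetical values for one item: responses, person measures and their S.E.s.
obs = [1, 1, 0, 1, 0, 1]
measures = [1.2, 0.4, -0.8, 2.1, -0.1, 0.9]
ses = [0.55, 0.48, 0.52, 0.70, 0.47, 0.51]

uncorrected = pearson(obs, measures)
corrected = corrected_point_measure(obs, measures, ses)
print(f"uncorrected {uncorrected:.3f}, corrected {corrected:.3f}")
```

For these made-up numbers the corrected correlation is somewhat lower than the uncorrected one, as would be expected when the item's own contribution is removed from the person measures.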

oosta: Thanks. your formula helps me tremendously.

I assume that "computational effort" refers to the CPU time rather than the time it would take you to write the program code. If that's the case, then I certainly don't mind waiting a few seconds (up to 30 seconds, for example) for it to run. Currently, all my analyses take no more than a few seconds.

The *expected* value of the correlation is not useful to me. However, the value of the correlation would be extremely useful. Perhaps excluding the expected correlation reduces computational effort considerably.

The uncorrected correlations can be misleadingly high and also lead to unfair comparisons of item-total correlations between tests of different lengths. That's probably why it's common practice (in the test development literature and reports that I read) to report only corrected item-total correlations.

Therefore, I have resorted to using the corrected point-biserial correlations rather than the (uncorrected) item-measure correlations. I would much rather use corrected item-measure correlations (than corrected point-biserial correlations)--particularly when there is missing data.

So, it would benefit me a great deal if Winsteps could output corrected item-measure correlations, but I have no use for the expected values.

Mike.Linacre: Thanks for the input, Oosta.

The "expected correlations" were introduced because novices were making obviously incorrect inferences based on the point-biserial correlations. For instance, they were eliminating items with low point-biserial correlations in situations where it was impossible for the point-biserial correlations to be high. In fact, without an "expected" reference point, it can be impossible to identify whether a reported correlation is too high, about right or too low.

It would seem that for tests of different lengths we would first need to compare the items' "expected" correlation values, no matter which correlation we chose to report. This would provide the baseline for discussion about the idiosyncrasies of each item in each test.

The computational effort for exact point-measure correlations (omitting the current observation) is several more iterations through the data and a memory array. Yes, for small datasets this is trivial. The challenge lies in scaling this up for the ever-larger datasets that are being analyzed with Winsteps.

oosta: Okay, so that's what the expected correlations are. They are very useful, then.

Could one approach be to provide the option but warn the user that it will take a long time to run for large data sets-- or perhaps might not run at all due to memory limitations? Then you would not have to figure out a complex algorithm for large data sets that overcomes the memory/speed limitations.

Mike.Linacre: Thank you for your suggestion, Oosta. I have noted it in my Winsteps wish-list. :-)

406. Simple FACETS specifications problem..

ImogeneR April 19th, 2012, 12:48am: Hello,
I have created a number of Facets control files using this format of specifications, and am not sure why I am getting:

Specification is: 2, Station; station numbers
Error F1 in line 152: "Specification=value" expected

Clearly something wrong with file but it is beyond me at the moment.. any tips please?

Mike.Linacre: Thank you for your question, Imogene.

Please remove the line "CO" before 2, Station

Facets thinks that "CO" ends the Labels= specification, so that it expects another specification immediately after CO.

ImogeneR: Ah. many thanks Mike. I now have the problem that my model statements include a rating scale for each item (e.g.R7 or R21) but the data contains numbers with decimal points eg 5.33 and I am getting an error message: "Check (2)? Invalid datum location: 60,141,111013194,6.33 in line 9. Datum "6.33" is too big or not a positive integer, treated as missing." and it's missing about half the data...

Mike.Linacre: Yes, Imogene, Facets requires integer data.

It looks like your data contain fractions of 1/3 and 2/3.

1. In Excel, please multiply all your data by 3 and convert to integers.
2. Adjust the range of the R to the new highest data values.
3. Weight each model statement 0.33333 to return the data to its original values:

Models = ?,?,?,R60,0.3333
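
Step 1 can also be done with a short script instead of Excel. The sketch below assumes the fractional scores are recorded in thirds to two decimals (as in "6.33"); the function name is hypothetical, and the values other than 6.33 are made up.

```python
def to_integer_score(value, multiplier=3):
    """Scale a score recorded in thirds (e.g. 5.33, 6.67) to an integer."""
    scaled = float(value) * multiplier
    rounded = round(scaled)
    # Guard against values that are not close to a multiple of 1/multiplier.
    if abs(scaled - rounded) > 0.05:
        raise ValueError(f"{value!r} is not close to a multiple of 1/{multiplier}")
    return rounded

# "6.33" appears in the error message above; the other values are made up.
for raw in ["5", "5.33", "5.67", "6.33"]:
    print(raw, "->", to_integer_score(raw))
```

The upper limit of the R scale in the model statement has to be rescaled in the same way, and the weight of 0.3333 then returns the scores to their original units.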

ImogeneR: Thanks Mike, trying it now. Also noticed there will be scores in the scale not represented so is the 'K' in the model statement advised?

ImogeneR: Hi again Mike,
I used the model as suggested although it appears to me that from Table 6.0 the data don't seem to have been returned to the original values. I am also very curious about the poor reliability for Examiners and Students and non significant chi square result for student Facet. ..

Mike.Linacre: Imogene,

Good news: the data are being analyzed as fractional scores.

1) please only multiply by 3 and weight by .3333 those items which have fractional scores, such as 171, and not those that do not, such as 272.

2) please be sure to tell Facets that unobserved fractional scores are real scores using the Keep instruction:
For example:
Model = ?,171,?,R21K, 0.3333

407. Checking Unidimensionality using Winsteps

Thomas April 19th, 2012, 10:31pm: I developed a scale with 21 items to measure study skills of students. I wish to check the unidimensionality of the scale. I judged from the following facts:

1. After deleting two items, all items had infit and outfit statistics within the reasonable range for observations (i.e., between 0.6 and 1.4).
2. The PCA analysis listed below. (Unexplned variance in 1st contrast = 2.7 )

Does this mean that the unexplained variance in the 1st contrast is too large, implying that there is another dimension in the scale (i.e., it is not unidimensional)?

Could anyone please advise on the steps for checking unidimensionality? Thanks.

TABLE 23.0 2012 Study skills ZOU675WS.TXT Apr 19 22:03 2012
INPUT: 65 Person 21 Item REPORTED: 64 Person 19 Item 76 CATS WINSTEPS 3.74.0

Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
-- Empirical -- Modeled
Total raw variance in observations = 30.2 100.0% 100.0%
Raw variance explained by measures = 11.2 37.1% 37.2%
Raw variance explained by persons = 4.5 14.7% 14.8%
Raw Variance explained by items = 6.8 22.4% 22.5%
Raw unexplained variance (total) = 19.0 62.9% 100.0% 62.8%
Unexplned variance in 1st contrast = 2.7 8.9% 14.1%
Unexplned variance in 2nd contrast = 2.2 7.2% 11.5%
Unexplned variance in 3rd contrast = 1.9 6.4% 10.2%
Unexplned variance in 4th contrast = 1.6 5.3% 8.4%
Unexplned variance in 5th contrast = 1.5 5.0% 8.0%

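2012 Study skills
Summary statistics (columns: total score, count, measure, S.E., infit MNSQ, infit ZSTD, outfit MNSQ, outfit ZSTD)
Persons:
| MEAN 52.3 18.9 .58 .42 1.01 -.2 1.01 -.2|
| S.D. 9.1 .5 1.42 .28 .69 1.9 .68 1.8|
Items:
| MEAN 176.2 63.6 .00 .19 1.00 .0 1.01 .1|
| S.D. 18.7 .6 .54 .02 .15 .8 .16 .8|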

Mike.Linacre: Thomas, let's see what your Table tells us.

Your scale has 19 active items. 63% of the variance is unexplained. This suggests that the item difficulties and person abilities are central. In fact, the items (22%) are more dispersed than the persons (15%). This may be good or it may be bad.

For instance, for nurses at the end of their training, we expect them to have very similar competencies. We want to see very little dispersion in the nurse ability estimates. So, if the study skills of your sample are good, we are happy to see that there is little dispersion of the skill "abilities".

The eigenvalue of the 1st contrast is 2.7. This suggests that there is a secondary artifact with the strength of about 3 items (out of 19) in your scale. Is this a dimension or an accident or what? To investigate, we need to look at the content of the items.

Your Table 23.2 has an interesting vertical distribution in its Figure. We can see that items A, B, C are off on their own. Which items are they? Table 23.3 tells us ss9, ss10, ss5. Now we need the text of those items. (It is easier when the text is summarized in the item labels.) What do those items have in common that the other items don't have? If there is something, then it is a secondary dimension or merely a content strand (like addition vs. subtraction in arithmetic). If there is nothing, then the eigenvalue of 2.7 may be an accident.
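
The arithmetic behind these readings of Table 23.0 can be checked directly. The sketch below takes the eigenvalue units from the table in this thread and expresses each contrast as a share of the total and of the unexplained variance; the helper name is hypothetical. Because the printed eigenvalues are rounded to one decimal, the recomputed percentages can differ from the table in the last digit.

```python
# Eigenvalue units copied from Table 23.0 above.
total_variance = 30.2
unexplained_total = 19.0  # equals the number of active items
contrasts = {"1st": 2.7, "2nd": 2.2, "3rd": 1.9, "4th": 1.6, "5th": 1.5}

def percent(part, whole):
    """Share of `whole` taken by `part`, as a percentage."""
    return 100.0 * part / whole

for name, eigenvalue in contrasts.items():
    print(
        f"{name} contrast: strength of about {eigenvalue:.1f} items, "
        f"{percent(eigenvalue, total_variance):.1f}% of total variance, "
        f"{percent(eigenvalue, unexplained_total):.1f}% of unexplained variance"
    )
```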


408. calculate Item-trait interaction using Winsteps

Thomas April 14th, 2012, 10:33am: I wish to check the unidimensionality of a new scale. Many years ago, I was taught to check this by using the item-trait interaction (of RUMM2020). If the total chi-square probability of the item-trait interaction is larger than 0.01, there is no significant interaction between the items and the trait, and therefore unidimensionality is confirmed.

Now, I am learning to use Winsteps. Can someone tell me how to obtain such information (item-trait interaction) to check unidimensionality? Thanks.

Mike.Linacre: Thomas, how about this ... www.rasch.org/rmt/rmt211k.htm

Thomas: Mike, Thank you.
And, how can I find out the probability after adding the square of t together (ChiSq), say ChiSq=20.05 at DF = 3?
Thanks :)

Mike.Linacre: Thomas, you can use a chi-square calculator such as http://www.swogstat.org/stat/public/chisq_calculator.htm
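
As an alternative to an online calculator, the upper-tail probability can be computed with a few lines of code. This is a minimal sketch using only the standard library, valid for whole-number degrees of freedom; the function name is hypothetical.

```python
import math

def chi_square_p_value(chi_sq, df):
    """Upper-tail probability of a chi-square statistic for integer df >= 1."""
    if df < 1 or chi_sq < 0:
        raise ValueError("need df >= 1 and chi_sq >= 0")
    half = chi_sq / 2.0
    if df % 2 == 0:
        a, prob = 1.0, math.exp(-half)             # Q(1, x) = exp(-x)
    else:
        a, prob = 0.5, math.erfc(math.sqrt(half))  # Q(1/2, x) = erfc(sqrt(x))
    # Recurrence: Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / gamma(a + 1)
    while a < df / 2.0:
        prob += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return prob

print(f"p = {chi_square_p_value(20.05, 3):.5f}")
```

For ChiSq = 20.05 at DF = 3 this gives a probability of roughly 0.0002.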

Thomas: Mike, thanks a lot.

409. CBT and Rasch

harmony April 14th, 2012, 5:20am: Hi all:

I am currently working on a computer based test (CBT) and have had trouble finding authoring software that is compatible with Rasch. It seems that most readily available applications do not produce raw data files and even though they give many CTT test statistics, they fail to provide needed information about option responses that is crucial for test development. Any recommendations?

Also, in the case of adaptive testing, the question has arisen of how large a sample of examinees is needed for stable item parameters. I have heard that anywhere from 200 to over 1,000 are needed. Can someone provide some insight?


Mike.Linacre: Thank you for your questions, Harmony.

It is amazing that CBT software would not produce output that shows examinee id, item id and response. This is usually required for legal reasons, if nothing else. This is also suitable for Winsteps (EDFILE=) or Facets (Data=). FastTEST claims to be good.

For sample sizes, https://www.rasch.org/rmt/rmt74m.htm

harmony: Once again, thanks for your prompt and informative reply!

The information provided in your link was both informative and clear. It is also much as I had understood it to be. However, I have a colleague who insists that CATs require 1000 examinees for proper calibration. Is there ever a situation in which 1000 examinees are required for proper calibration, and, if so, what would it be?

Mike.Linacre: Harmony, some IRT models (not Rasch) require 1,000 or more examinees. Another reason for many examinees is to verify that the items are DIF-free for important target groups (genders, ethnicities, etc.)

harmony: Many thanks Mike. You may be the most helpful expert on the net! :)

410. Selecting Multiple Sites

uve April 4th, 2012, 8:24pm: Mike,

I've gone through the Winsteps Help menu but can't seem to find a solution that is close to my need at the moment. I am trying to select all students who took a survey at only some of our school sites. Let's say the sites are coded with three digits and there are 4 sites out of, say, 20 that I want to select:


How would I select just those 4 sites after calibrating the data for all 20? I assume I would use PSELECT in Specifications. Will your solution also work in Extra Specifications so that I could calibrate items based only on those sites?

Mike.Linacre: Uve, PSELECT= will work at Extra Specifications or in your control file, as well as for report selection through the Specification menu box.

PSELECT= is not intelligent, but it can do your choice:

PSELECT= has trouble selecting lists of 2 digits, so how about adding an extra one-character code to the person labels? For instance, a single-letter code for each of the 20 sites?

uve: I oversimplified my example. The site codes come from our database. I suppose I could convert them, but there are over 50 sites and we give many assessments, so it would be a bit impractical.

A site code could be something like 123 or 731, etc. Your solution would work if the sites all had the same last two digits but that's often not the case.

Mike.Linacre: Yes, Uve, the multi-digit codes are impractical for PSELECT=

But putting a one-character site code in the person label (in addition to the multi-digit site code) is practical. I have often done it. There are 200+ one-character codes available, e.g., the ranges from "A" to "Z" and "a" to "z" are 52 codes. http://www.asciitable.com/
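
Adding the one-character code does not have to be done by hand. The sketch below assigns a letter to each distinct multi-digit site code and prefixes it to the person label; the label layout ("site code, space, student id"), the site codes and the function name are all assumptions for illustration.

```python
import string

def one_character_codes(site_codes):
    """Map each distinct multi-digit site code to a single letter."""
    alphabet = string.ascii_uppercase + string.ascii_lowercase  # 52 codes
    sites = sorted(set(site_codes))
    if len(sites) > len(alphabet):
        raise ValueError("more sites than available one-character codes")
    return dict(zip(sites, alphabet))

# Hypothetical site codes and person labels ("<site> <student id>").
labels = ["123 S0001", "731 S0002", "123 S0003", "405 S0004"]
codes = one_character_codes(label.split()[0] for label in labels)
recoded = [codes[label.split()[0]] + " " + label for label in labels]
print(recoded)
```

The single character at the start of each recoded label can then be matched with PSELECT=, while the original multi-digit code stays in the label for reference.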

uve: Mike,

I found a rather simple solution using Excel.

1) Copy the data portion of the file into Excel
2) Use Text to Columns option to separate the variable of choice into its own column while keeping the original pasted data as is.
3) Add the filter option and use it to select all the groups for the variable in question
4) While filtering, the original data pasted is also filtered
5) Simply copy the original pasted data into a text file
6) Copy the control file information from the original text file

That's it. Took me about 2 minutes.
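
The same filtering can be scripted, which helps when it has to be repeated for many assessments. This is a sketch under the assumption that the site code occupies fixed columns in each data line; the data lines, column positions and function name are hypothetical.

```python
def select_sites(data_lines, wanted_sites, start, width):
    """Keep data lines whose site code (fixed columns) is in wanted_sites.

    start is the 0-based column where the site code begins; width is its length.
    """
    wanted = set(wanted_sites)
    return [line for line in data_lines if line[start:start + width] in wanted]

# Hypothetical data lines: 3-digit site code, a space, then the responses.
data_lines = [
    "123 3241324",
    "731 2213432",
    "405 4433221",
    "123 1122334",
]
subset = select_sites(data_lines, {"123", "405"}, start=0, width=3)
print("\n".join(subset))
```

The selected lines can be written to a new text file and used as the data for the site-specific analysis.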

Mike.Linacre: Well done, Uve. Thank you for telling us this procedure.

If we use DATA= in the control file, then there is only one control instruction to change, which can be done at Extra Specifications. So there would be no need to copy the control information.

411. DIF Analysis

tmhill April 4th, 2012, 3:49pm: Hi Mike and Rasch professionals,

Is there a tutorial or instructions on how to run and evaluate the outputs for a DIF analysis?


Mike.Linacre: Tara, Winsteps Help has several topics about DIF, such as https://www.winsteps.com/winman/index.htm?difconcepts.htm

If you want something more detailed, then a text book such as:
Differential item functioning by Paul W. Holland, Howard Wainer

tmhill: Thank you!

412. Category Probabilities (Table 3.2)

Kognos April 4th, 2012, 2:37pm: I am working with a rating scale where each item is scored from 0-10. The character graph of the category probabilities in Table 3.2 is impossible to read, as there are too many curves for the available resolution. Is it possible to get at the data behind the graph, so I can plot the curves as lines in my graphics package?


Mike.Linacre: Certainly, Kognos.

1. the Winsteps Graphs menu.
Category probability curves
Copy data
Paste into Excel (or your graphs package)

2. GRFILE= output file
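
If the threshold estimates are at hand, the curves can also be recomputed and plotted in any package. The sketch below assumes the Andrich rating-scale model, with the category probabilities written in terms of (person measure - item difficulty); the function name is hypothetical and the ten thresholds for the 0-10 scale are invented, so the reported estimates would need to be substituted.

```python
import math

def category_probabilities(measure, thresholds):
    """Rating-scale category probabilities at (person measure - item difficulty).

    thresholds are the Rasch-Andrich thresholds for categories 1..m, in logits.
    """
    # Cumulative sums of (measure - threshold); category 0 has log-numerator 0.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (measure - tau))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    numerators = [math.exp(v - peak) for v in log_numerators]
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical thresholds for a 0-10 scale (10 thresholds, summing to zero).
thresholds = [-2.7, -2.1, -1.5, -0.9, -0.3, 0.3, 0.9, 1.5, 2.1, 2.7]
for step in range(-40, 41, 20):
    measure = step / 10
    probs = category_probabilities(measure, thresholds)
    print(measure, " ".join(f"{p:.2f}" for p in probs))
```

Each printed row holds the 11 category probabilities at one point on the latent variable; a finer grid of measures gives smooth curves.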

413. Noise increased after anchoring items

chong April 3rd, 2012, 3:41am: Dear Mike,

I've been dealing with the misfitting responses (and items) seriously flagged by the fit indicators. Their presence seems to shorten the item hierarchy in such a way that some items expected to be easier become harder while some challenging items become relatively easier. After several checks, I confirmed that most of these underfitting problems emanated from the (random) guessing made on the less-targeted hard items.

So I've followed the three-stage procedure suggested in https://www.rasch.org/rmt/rmt244m.htm to trim off the misfitting responses by performing CUTLO = -1.4 at stage II. Now I can see items are well separated in the Wright map and the theoretically easier (harder) items are indeed easier (harder). I save these 'good' estimates to be anchored at stage III.

At the end of stage III, while I believed the 'good' item calibrations have been used to measure the persons more appropriately, the following changes are observed:
(1) the person reliability increases by 0.03 (which is good) but the item reliability decreases by 0.02, and S.E.s of both person and item measures increase.

(2) both the item and person MNSQs inflate considerably and the difference between empirical and modeled explained variance also increases:

Original version
Variance explained by measure = 65% (empirical) vs 69% (modeled)

After anchoring
Variance explained by measure = 70% (empirical) vs 83% (modeled)

(3) the magnitude of DISPLACE value ranges from 0.03 to 1.48 with 32% of these values larger than 0.5

My questions on the post-anchoring procedure:
(1) Could all these be the signs of introducing more noise to the original data set?

(2) What should we do about it when reporting the findings (final set of measures) to our audience? Or would we feel safe because the better estimates of person measures have been attained?

Thanks for always helping me out.

Mike.Linacre: Chong, anchoring the items is expected to make the fit of the current data worse.

If your purpose is to optimize the fit of the model to your data, then please use an estimation method that matches your target fit statistic. For instance, a minimum-chi-square estimation method would optimize fit as reported by a global chi-square statistic.

Usually our purpose is to optimize the meaning of the parameter estimates, so that the Rasch measures have the maximum construct validity and predictive validity. This requires us to make the data fit the model in the most meaningful way. You did this by trimming off the bad data (misfitting data). But we are often required to report all the data. So we keep (anchor) the meaningful parameter estimates and reinstate the bad data. The misfit of the data to the model almost always increases because the bad data contradicts the anchored estimates.

414. Observed Scores Table 2.5

uve March 30th, 2012, 8:31pm: Mike,

In a week or so I will be meeting with a team to go over some survey results. I'm going to give them my rationale for setting performance levels. I will likely recommend using the Thurstone thresholds but I want to explain several possibilities to them using the tables under 2.0. I need to do this in a way that makes sense to them. I'm now realizing that I have a poor conception of Table 2.5 and the observed scores.

I'm tempted to say that this represents the average measure of respondents who chose the category in question. But that would be sum(B)/observed scores in the category, and that's wrong. It is sum(B - D)/observed scores in the category. It's the "D" in the equation that's throwing me off a bit. I know that this refers to item difficulty but I am at a loss as to what B-D means when put on the logit scale in Table 2.5. I conceptualize B-D as a probability, not a logit measure.

If in Table 2.5 I see category "B" under logit value 1, can I say that on average respondents with a score of 1 chose category B?

I have read through the help guide but I'm still struggling.

Thanks as always.

Mike.Linacre: Uve, in Table 2.5 it would be sum(B)/N for each category. This shows the average ability of the persons who chose each category for each item. This is highly sample dependent, but it should confirm our theory that "higher ability => higher category" and "higher category => higher ability".

uve: Thanks Mike. I think I am confusing several things. Here is the excerpt from the Help page:

"OBSVD AVERGE is the average of the measures that are modeled to produce the responses observed in the category. The average measure is expected to increase with category value. Disordering is marked by "*". This is a description of the sample, not the estimate of a parameter. For each observation in category k, there is a person of measure Bn and an item of measure Di. Then: average measure = sum( Bn - Di ) / count of observations in category."


"OBSVD AVERGE is the average of the (person measures - item difficulties) that are modeled to produce the responses observed in the category. The average measure is expected to increase with category value."

The title for Table 2.5 is Observed Average Measures, for which the equation above in the Help menu seems to contradict the formula you gave in your reply.

So it appears I'm trying to reconcile several things:

1) Average Measure (sum( Bn - Di ) / count of observations in category): is this the same as average ability?
2) Average Ability (Tables 10.3, 13.3, 14.3 and 15.3)?
3) Observed Average, either from your reply (sum(B)/N for each category) or from Winsteps Help ("average measure = sum( Bn - Di ) / count of observations in category"): which is Table 2.5?

1 and 2 appear to be the same thing to me. If so, how does their application or interpretation differ from #3?

Thanks again for your patience and help.

Mike.Linacre: Sorry, Uve.

The numbers reported in Table 3.2 (describing the rating scale) are somewhat different from the numbers reported in Tables 2.5 and 14.3 (describing the items). In Tables 2.5 and 14.3, the observed average is sum(Bn)/count(Bn).

Am correcting Winsteps Help to clarify this.

In Table 2.5, 14.3 and similar Tables describing the items, the "observed average" measure is: sum (person abilities) / count (person abilities) for each response option or rating-scale category.

In Table 3.2 and similar Tables describing the response structures, the "observed average" measure is: sum (person abilities - item difficulties) / count (person abilities) for each response option or rating-scale category.
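
The two definitions can be compared side by side on a small example. In the sketch below the records are invented and the function name is hypothetical; for a Table 2.5-style average one would pass the observations of a single item, and for a Table 3.2-style average all observations that share the rating scale.

```python
from collections import defaultdict

def observed_averages(records, subtract_difficulty):
    """Average measure per category.

    records: (person_measure, item_difficulty, category) for each observation.
    subtract_difficulty=False gives sum(Bn)/count, as for Tables 2.5 and 14.3;
    subtract_difficulty=True gives sum(Bn - Di)/count, as for Table 3.2.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ability, difficulty, category in records:
        sums[category] += (ability - difficulty) if subtract_difficulty else ability
        counts[category] += 1
    return {cat: sums[cat] / counts[cat] for cat in sorted(sums)}

# Hypothetical observations: (Bn, Di, observed category).
records = [
    (-1.0, 0.5, 0), (-0.5, 0.5, 0),
    (0.0, 0.5, 1), (0.5, -0.5, 1),
    (1.5, -0.5, 2), (2.0, -0.5, 2),
]
print(observed_averages(records, subtract_difficulty=False))
print(observed_averages(records, subtract_difficulty=True))
```

In both versions the averages rise with the category, but the values differ whenever the item difficulties are not balanced within a category.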

uve: Mike,

Thanks for the clarification.

So it seems to me that Table 2.5/14.3 is saying that if our data fit the model, we expect to see persons of higher ability choosing higher categories.

Table 3.2 seems to be saying that if the categories function according to the model, then we expect to see higher measures in the higher categories. My problem with the term "measures" here is that I think of this as either Bn or Di. When the two are subtracted from each other, I think of probability. Again, terminology seems to get me every time :B

What I noticed is that when ISGROUPS=0 is selected, the middle categories in Table 3.2 are identical to 14.3. The extreme categories still differ. Interesting.

Mike.Linacre: Yes, Uve. When ISGROUPS=0, there is only one item for the rating scale, so this is applied in Table 3.2 etc., but it is noted at the top of each subtable:

uve: Mike,

So to summarize:

Table 3.2 OBSVD AVRG = model observed average measures
Table 3.2 SAMPLE EXPECT = model expected average measures
Table 2.2 = model expected responses and sample-free
Table 2.5 = observed responses and sample dependent
Table 2.7 = expected responses and sample dependent

Please correct me if I am off here or feel free to provide more detail if needed.

1) Winsteps Help seems a bit obscure as to how Table 2.7 is calculated. Could you please elaborate?
2) Trying to reconcile 2.7 and 2.2. What are the differences in application between the two?

Thanks as always

Mike.Linacre: Uve, yes, unfortunately some of the Winsteps documentation is skimpy.

Table 2.2 is as independent as possible of the person sample distribution. It is computed directly from the parameter estimates.

Table 2.7 is Table 2.2 applied to the observed distribution of person measures. We could say that it is Table 2.2 with a Bayesian prior of the person sample distribution.

415. separation index in Faceted

hm7523 March 26th, 2012, 4:57am: Hi,

I have a question related to the separation index in FACETED.
In my specified model, I have three facets: students, raters and items.
There are 3 raters and 4 constructed-response math items. Each item is a 3-point item, which can be rated 0, 1, 2 or 3. Each item is rated by all three raters.
The separations for students and raters are around 2 to 3, which seem reasonable to me. But for the item facet, the separation is around 20. I am not sure how to interpret the following output.

Model, Populn: RMSE .07 Adj (True) S.D. 1.31 Separation 17.92 Reliability (not inter-rater) 1.00
Model, Sample: RMSE .07 Adj (True) S.D. 1.52 Separation 20.70 Reliability (not inter-rater) 1.00

I have 4 items, each of them a 3-point item, so the maximum should be 16 levels. Why would my separation be 20? How should I interpret this result?


Mike.Linacre: Thank you for your email, hm7523.

The "separation" index (like most statistical indices) is a number based on assumptions. For "separation", an assumption is that the observed items are a random sample from a normal distribution of items with the same characteristics as the observed items. "Separation" then tells us how many statistically distinguishable strata of items could exist in this normal distribution. Consequently, it is quite possible for a facet with only two elements (such as "male" and "female") to have a separation of 20.

If you need to know how many strata exist empirically, then please do pairwise t-tests between the items. Essentially, this means:
Start with the easiest item. This is the first stratum. Add 3 times its standard error to its item difficulty.
The next more difficult item above that value is the second stratum. Add 3 times this item's standard error to its item difficulty.
The next more difficult item above that value is the third stratum. (And so on ....)
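
The stepping procedure just described can be sketched in Python. This is a minimal illustration, not Facets output; the function name and the four item values (difficulty, standard error) are hypothetical.

```python
def count_strata(items, width=3.0):
    """items: list of (difficulty, standard_error) pairs in logits.

    Walks up from the easiest item. Each time an item lies above
    (stratum starter's difficulty + width * its standard error),
    that item starts a new stratum.
    """
    strata = []
    threshold = None
    for difficulty, se in sorted(items):
        if threshold is None or difficulty > threshold:
            strata.append((difficulty, se))
            threshold = difficulty + width * se
    return strata

# Hypothetical item estimates, for illustration only.
items = [(-1.5, 0.07), (-1.4, 0.07), (0.2, 0.07), (1.6, 0.08)]
strata = count_strata(items)
```

Here the second item sits within 3 standard errors of the first, so these four items form only three empirical strata, however large the reported separation may be.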

416. Response Probability of 67%

uve March 12th, 2012, 12:12am: Mike,

In reviewing a lot of literature recently on cut point setting for standard setting, it appears that many testing contractors provide the hierarchy of item b-values based on successful response probability for dichotomous items at 67%, not necessarily 50%. I'm not sure how this is handled in constructed response or polytomous items. Anyway, can Winsteps adjust and report item difficulty values based on this?

Mike.Linacre: Yes, uve. This is called "pivot anchoring". We choose our own pivot point for the item difficulty instead of the default.

In Winsteps:
0 0
1 -0.71 ; logit equivalent to 67%
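
The 0.71 value is the log-odds of a 67% success rate. A quick Python check of that arithmetic under the dichotomous Rasch model (the function names are chosen for this example only):

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

def success_probability(ability, difficulty):
    """Dichotomous Rasch model: P(success) for a person-item pair, in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

offset = logit(0.67)                        # about 0.71 logits
p_at_offset = success_probability(0.71, 0.0)  # person 0.71 logits above the item
```

So a person whose measure is 0.71 logits higher than a dichotomous item's standard difficulty has about a 67% chance of success on it.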

uve: Mike,

Do you see anything wrong with calibrating items in the normal default manner but still using the response probability of 67% for cut point setting?

Mike.Linacre: Uve, different probability levels are used for different purposes :-)

For instance, in a CAT test, 50% gives the feeling of failing to examinees. 75% probability of success gives examinees a better feeling and also reasonably efficient measurement.

In mastery testing, the value may be 80% or more.

uve: I guess my question has more to do with matching calibration of the measures to the mastery level criterion, not so much which mastery criterion to use. In other words, if I instruct a team to use, say, 67% as the mastery criterion for setting proficiency, and provide them with the items in order of measure difficulty, should the calibration of those items have been created using the process you provided:

0 0
1 -0.71 ; logit equivalent to 67%

or can I simply have Winsteps calibrate items in the usual manner while still instructing staff to use the 67% criterion?

Mike.Linacre: Uve, the results are the same from a Rasch perspective, so please choose the approach that will work best for your audience.

1. Move the item difficulties
0 0
1 -0.71 ; logit equivalent to 67%
then an item shown adjacent to a person on the person-item map will be at the 67% probability level

2. Standard item difficulties
then an item 0.71 logits below the person on the person-item map will be at the 67% probability level

417. equating subtests

Gary_Buck March 21st, 2012, 7:27pm: I wonder whether I could have the benefit of your Winsteps wisdom.


I have to develop a number of short tests on each one of the 23 languages of the European Union (4 to 6 tests in each language for a total of 83 tests). My task is the same for each language, and the procedure will be repeated for each language.

There are four short tests in Language X. These tests are designed to the same specification, and are meant to be alternative forms. All four tests have been developed with 16 items each. Each test is an item set (one passage with 16 comprehension items), so items cannot be moved from one test to another. The tests were piloted on a small sample of test takers, with all test takers taking all tests (there is some missing data, but this is the pilot study design). I carried out a concurrent analysis of these four tests, 64 items, and deleted the 4 poorest items on each test, and kept the 12 best ones, as per the spec.

My Task

I will re-run the concurrent calibration of the Language X item pool including only the 48 'good' items. I need to get two things from that:

1. I need to get descriptive statistics for each of the separate tests, but with all the items being calibrated on the same scale. It would be nice to be able to do that in one analysis, but if I delete items from, say, 3 tests and leave just one of the tests, I assume that they will be deleted from the calibration.

2. I need to determine raw score cut points for each of these tests, again such that they are all equated on the same scale. I have enough information to determine a cut point on the Rasch scale for the whole pool of Language X (by comparing 'known' people to items). The Winsteps table that gives a raw score to measure conversion (Table 20) would give me the raw score equivalent of that point with all the items, but I need it separately for each test of 12 items. Again, I would like to get all that from one analysis.

Perhaps I could run a concurrent calibration for the whole language, get the item measures, and then run a separate analysis for each test but with the items anchored on the common scale; I think that would give me what I need. But that seems like a lot of work; for the whole project of 23 languages, that would be 106 analyses rather than just 23.

Any suggestions would be greatly appreciated.

Kind regards,

Mike.Linacre: Gary, let's start from here:
1. code each item with its test code (let's call these tests "testlets").
2. analyze all the testlets together
3. Winsteps item subtotals (Table 27) gives summary statistics for the items in each testlet
4. Select the items in a testlet
Winsteps "specification" dialog box
ISELECT = (testlet code)
5. Produce a score-to-measure table for the testlet: Winsteps Table 20

This should give you a good start :-)
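
To see why a per-testlet score table stays on the pooled scale, here is a rough Python sketch. It assumes dichotomous items and takes the item difficulties from the combined calibration as fixed; the labels and difficulty values are hypothetical, and the bisection search is just one simple way to solve "expected score = raw score". Extreme scores (zero and perfect) are left out because they need special handling.

```python
import math

def expected_score(ability, difficulties):
    """Expected raw score on dichotomous Rasch items at a given ability."""
    return sum(1.0 / (1.0 + math.exp(d - ability)) for d in difficulties)

def score_to_measure(difficulties, lo=-10.0, hi=10.0, tol=1e-6):
    """Measure (logits) for each non-extreme raw score, found by bisection."""
    table = {}
    for raw in range(1, len(difficulties)):
        a, b = lo, hi
        while b - a > tol:
            mid = (a + b) / 2.0
            if expected_score(mid, difficulties) < raw:
                a = mid
            else:
                b = mid
        table[raw] = (a + b) / 2.0
    return table

# Hypothetical pooled calibration; the first character of each label is the testlet code.
pool = {
    "A item 1": -1.0, "A item 2": 0.0, "A item 3": 1.0,
    "B item 4": -0.5, "B item 5": 0.5, "B item 6": 1.5,
}
testlet_a = [d for label, d in pool.items() if label.startswith("A")]
testlet_b = [d for label, d in pool.items() if label.startswith("B")]
table_a = score_to_measure(testlet_a)
table_b = score_to_measure(testlet_b)
```

Because the difficulties are never re-estimated, the same raw score maps to a higher measure on the harder testlet B than on testlet A, which is what equating the forms requires.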

Gary_Buck: Mike, thank you, this is very helpful.

Can I just confirm one thing? I want to make sure that when I specify ISELECT it will NOT recalibrate the test and hence set the item mean at 0, but will give me the score-to-measure conversion with the testlets all calibrated on the original whole-pool calibration. I.e., I want to make sure that the testlets will be equated through this procedure.

And does this happen when I run it in a command file, or must this be done in the interactive mode? (Since I have to repeat this many times, the command file would be more efficient.)

Kind regards,

lostandfound: Hi Mike.
What do you mean by "1. code each item with its test code (let's call these tests "testlets")"?
What is the control variable to do this?

Mike.Linacre: Gary:

Instructions given to Winsteps from the "Specification" dialog box do not change the calibrations. They only change the reporting.

"1. code each item with its test code" means "put a code in each item's label to specify which test it belongs to". For instance, if the tests are A, B, C, D, then the Winsteps control file will contain something like:
A item 1
A item 2
A item 3
A item 4
B item 5
B item 6

Gary_Buck: Thank you Mike, the coding is very clear.

And can I just confirm that when I specify ISELECT at step 4 of your earlier answer, it will NOT recalibrate the test and hence set the item mean at 0, but will give me the score-to-measure conversion with the testlets all calibrated on the original whole-pool calibration?

I.e. I want to make sure that the testlets will be equated through this procedure?

Kind regards,

Mike.Linacre: Yes, Gary. ISELECT= in the "Specification" dialog box is after estimation/calibration has been done, so it will not change the calibrations.

418. Vertical Linking

uve March 11th, 2012, 11:49pm: Mike,
I would like to begin reporting linked vertical cohort growth for our students both within a given year and across multiple years. But the more I think of this the more complex it becomes. I'm hoping you can guide me in the right direction.

Limiting the question to English in grades 3, 4 and 5, here would be my approach: each of the elementary grades takes 6 unit assessments throughout the year, each having 50 items. My thoughts were to have 5 Unit1 questions on Unit2, and 5 Unit2 questions on Unit3 that were not on Unit1, etc. to accommodate within-year growth. This gives me 5 linking items between any two units in order to measure growth from unit to subsequent unit, or from the first unit to the last for one full year of within-year growth. In addition, each Unit6 test would have 5 questions from the corresponding Unit6 from the previous grade, i.e. Grade 3 Unit6 would have 5 questions from Grade 2 Unit6, and Grade 4 Unit6 would have 5 questions from Grade 3 Unit6 not found on Grade 2 Unit6, etc. to accommodate growth across multiple years and grade spans. So I would have a total of 5 items with which to link growth across these spans.

I assume I would be using concurrent groups design to achieve the linkage in Winsteps.

1) Is the number of linking items enough?

2) How can I still maintain and report score information for each test? For example, when linking Unit1 and 2 together, I would have 100 items; however, when reporting score information, Table 20, how do I maintain the calibration of items to the 50 on just that one particular test?

3) How do I also incorporate anchoring items? Each unit test of 50 may get replacement items, so while I'm linking 100 items, I have to be able to anchor common items for the individual 50-item unit tests. I imagine this is a multi-step process.
This is very complicated for me, but the pressure to begin showing true growth is becoming greater. I need to be able to provide my supervisors with a manageable solution.

Thanks as always for your invaluable assistance and guidance.

Mike.Linacre: Uve, the general answer to this is:
1. Have a separate data file for each test administration.
2. Put codes into the person labels to identify everything for which you may want a report.
3. Put codes into the item labels to identify the tests to which they belong.
4. Conceptualize a combined data file in which every different item has its own unique column.
5. Use MFORMS= to map the items in the separate data files onto the conceptual combined file of all the items.
6. Do the combined analysis based on the MFORMS= file.
7. Use PSELECT= to select the group of persons to report.
8. Use ISELECT= to select the items for each Table 20.
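
The "conceptual combined file" in the steps above can be pictured with a small Python sketch. This is not MFORMS= syntax, only the idea behind it: every distinct item gets one column, and each form's responses are placed into the columns of the items it contains. The item identifiers, person labels and the use of "." for not-administered items are assumptions made for the illustration.

```python
def combine_forms(forms, all_items):
    """forms: {form_name: (item_ids_in_column_order, [(person_label, responses)])}

    Returns (label, response_string) rows in which every distinct item
    has its own column; "." marks an item the person was not given.
    """
    column = {item: i for i, item in enumerate(all_items)}
    rows = []
    for form_name, (item_ids, records) in forms.items():
        for person_label, responses in records:
            row = ["."] * len(all_items)
            for item_id, response in zip(item_ids, responses):
                row[column[item_id]] = response
            rows.append((form_name + " " + person_label, "".join(row)))
    return rows

# Hypothetical layout: item U1c appears on both forms, so it links them.
all_items = ["U1a", "U1b", "U1c", "U2a", "U2b"]
forms = {
    "Unit1": (["U1a", "U1b", "U1c"], [("P01", "101")]),
    "Unit2": (["U1c", "U2a", "U2b"], [("P01", "011")]),
}
rows = combine_forms(forms, all_items)
```

The form name is prefixed to each person label here so that a later selection (in the spirit of PSELECT=) can pick out one administration for reporting.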

Linking items: 5 is the bare minimum but there appear to be enough cross-links to prevent the estimates from becoming misleading.

uve: Mike,

At what point in the process are the multiple unit tests anchored? For example, at the beginning of the year we give Unit1, we then anchor the common items to Unit1 last year and produce our measures. Then 6 weeks later, we give Unit2 which has 5 items from Unit1.

I'm assuming we anchor Unit2 this year to Unit2 last year exactly as is done with Unit1, but then how is this incorporated into the step where the two tests are combined using MFORMS?

Or is the anchoring done after Unit1 and Unit2 are combined using an anchor file that is a combination of all the common items from the two units?

Mike.Linacre: Uve, the anchoring decision-point is decided by the reporting point. As soon as we have reported some numbers, then we have decided that those item difficulties are definitive. Accordingly we anchor them at those definitive values for subsequent analyses. If the item difficulty of an anchored item drifts too much, then it has become a new item and is analyzed as such (new item bank number, unanchored).

uve: Mike,

What about linking item choice? Should linking items be selected randomly, or based upon a wide dispersion of difficulty range, or should they be items that are within 1 standard error of a key point of the test such as proficiency?


Mike.Linacre: Uve, linking items in vertical equating are a challenge. We need them to be within the ability range at each level, but spanning ability levels. Generally speaking, one year's growth = 1 logit, so there is about a 2 logit span at each level that is on-target (from success rate of about 45% to success rate of about 90%).


uve: Thanks for the resources Mike.

"I would only use the students' answers to items that were within a reasonable range of each student's estimated position on the scale."

This statement seems to suggest thinking about well defined boundaries of sorts. We test thousands of students on each of over 100 assessments, so there are practical limitations. However, we have identified 4 critical cut points dividing 5 performance levels for each assessment. My thoughts were in addition to the statement above, I could use items that are very close to the two most important levels: Basic and Proficient. Basic seems to be the largest area under the normal distribution. In addition, including items that represent difficulty levels would seem to also make sense.

Forgive me if I have wrongly used some terminology, but the "linking" items I'm suggesting are not anchor items but merely items common to both different grade level assessments. The idea would be that these items should appear with lower calibrations on the next higher grade level assessment which would be an attempt to measure growth.

So I have two complex issues occurring at the same time. I must equate Grade 4 Form A last year to Grade 4 Form B this year, using common item anchoring, so that we can compare how our grade 4 program is doing each year (knowing full well some of the change can be attributed to different student populations). The other is that I must be able to longitudinally track the improvement of a single cohort using the Grade 4 assessment last year and the Grade 5 assessment this year through the use of very carefully chosen linking items that are not among the items used in anchoring.

Based on this would the process you detailed earlier still apply?

Mike.Linacre: Yes, Uve. I was understanding "anchoring" to mean "linking". In practice, linking items used across years become anchored items, because we want to use their item difficulty calibrations from last year wherever possible. If we don't, we find ourselves having to re-report last year's results. This is often an impossibility.

Your method of item selection looks good. We are not so concerned about the person distribution (norm-referenced testing). We are much more interested in the crucial points on the latent variable (criterion-referenced testing). My suggestion of a two-logit range may be too narrow for your application. You can see the reasonable range by looking at the standard errors of the item difficulties. We want those standard errors to be much less than the acceptable drift (change in item difficulty) from test session to test session. Then we can separate real item drift from measurement error.
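
The comparison of standard errors with acceptable drift can be sketched as follows. The 0.5-logit drift limit, the "half the limit" precision rule and the item values are all assumptions chosen for the illustration; the thread itself gives no numeric thresholds.

```python
import math

def flag_drift(anchored, current, max_drift=0.5):
    """anchored, current: {item: (difficulty, standard_error)} in logits.

    For each common item, reports the shift in difficulty, the joint
    standard error of the two estimates, whether that joint error is
    small relative to max_drift (assumed rule: under half of it), and
    whether the shift exceeds max_drift.
    """
    report = {}
    for item, (old_d, old_se) in anchored.items():
        if item not in current:
            continue
        new_d, new_se = current[item]
        shift = new_d - old_d
        joint_se = math.sqrt(old_se ** 2 + new_se ** 2)
        report[item] = {
            "shift": shift,
            "joint_se": joint_se,
            "precise_enough": joint_se < max_drift / 2.0,
            "drifted": abs(shift) > max_drift,
        }
    return report

# Hypothetical values: last year's anchored estimates vs. this year's free estimates.
anchored = {"I1": (0.20, 0.08), "I2": (1.00, 0.10)}
current = {"I1": (0.25, 0.09), "I2": (1.80, 0.12)}
report = flag_drift(anchored, current)
```

An item flagged as drifted would, following the earlier advice in this thread, be treated as a new item and left unanchored.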

uve: Thanks Mike. I guess I'm still a bit unclear if the items common to both years should retain their measures (anchored) or not. Am I really trying to put the Grade 5 items on the Grade 4 test? It seems not. If so, that would mean Grade 6 would be anchored to Grade 5 which is anchored to Grade 4 which then puts Grade 6 on Grade 4's scale, etc. That doesn't make sense to me.

In my mind, if there are vertical links connecting each grade level assessment to the one below, then all of the items can be calibrated at the same time providing a "super test" if you will. Then I can see if the percentage of persons scoring at or above the Proficient mark has increased for the persons and the items they took on the Grade 4 test versus that same group the next year on Grade 5. If so, growth towards proficiency may have occurred.

If I'm not mistaken, the California English Language Development Test (CELDT) is one such test that is vertically scaled. So there are testing vendors doing this, I just don't know their processes.

Mike.Linacre: Uve, perhaps I have misunderstood. My perception was that you are doing something like: https://www.rasch.org/rmt/rmt61e.htm

This is used to track student progress across years and compare grade-level performance, all on one measurement scale. In principle, the student abilities become the same as the student heights. They can be compared and tracked across all grade levels for all years.

uve: My apologies. My lack of clarity is the result of my unfamiliarity with common person equating.

Yes, the link you provided explains much of what I wish to do. Are there more resources of any kind on this process you would recommend? If we use my situation, Form 7 could be the first quarter exam, say English. The CPS90 could be the 2nd quarter perhaps. So providing this common person link would hopefully allow me to show growth from the beginning to midway through the school year. In Figure 1 this would be the horizontal links. For example, going from B to C.

But far more important is showing growth across years for a group. In Figure 1 this would be the diagonal links, B to D for example.

What I have to be able to continue to do is the non-equivalent groups equating design so that, say, B this year can be compared to B last year.

Please correct me if I'm wrong here, but using the item anchoring process I must continue to provide common items of B this year and B last year to compare how different groups are performing because item replacement occurs frequently on our exams. This allows us to assess performance of a specific grade level over time.

I must then also anchor common items between B and C to allow me to show growth within the year for a common group of students, and finally, I must anchor common items between B and D to show growth of those students between grade levels (across years).

If I am correct up to this point, then I need to make sure I follow the correct steps in Winsteps. I assume your directions posted on the 12th are still valid. Please let me know if I need to modify this.

Thanks again as always for your help and patience. This is a mammoth undertaking for me :B

Mike.Linacre: Yes, Uve. We expect item replacement. Portland (Oregon) Public Schools (POPS) have been following this process for about 30 years. Every student is measured in one common frame of reference. Students can be tracked across years. Cohorts can be compared across grade-levels.

In this type of design, usually all the equating is done with common items, because students are too variable. If there are common students, then these can be used as a cross-check. The items are conceptualized as an "item bank" from which the items in each test form are selected.

For a resource, "Probability in the measure of achievement" by George Ingebo. This is based on his experience with POPS. www.rasch.org/ingebo.htm

419. Winsteps recoding question

Saturn95 March 20th, 2012, 6:51pm: Greetings members,

I have a recoding question. My data were collected in two parts, such that each item has a score of 00 (neither part correct), 01 (only the second part correct), 10 (only the first part correct), or 11 (both parts correct). I want to rescore this data so that only scores of 11 are "correct," and all other possibilities are "incorrect." In other words: 00, 01, and 10 should become 0, and 11 should become 1.

I have tried unsuccessfully to do this with the rescoring commands in Winsteps. I've also tried introducing a scoring key, but this didn't work either. Most of the rescoring examples in Winsteps use different keystrokes for the different score categories (AABBCCDD or 01020304), rather than my codes of 00, 01, 10, and 11.

Is this possible? I have specified XWIDE=2.

Thanks in advance!

Mike.Linacre: Thank you for your diligence, Saturn95.

This should work:
NI = (half the original number of items)
CODES = 00011011 ; all possible responses to 2-column item pairs
NEWSCORE = 00000001 ; scores for the item pairs
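
The intended mapping can be checked outside Winsteps with a few lines of Python. This only mimics the effect of pairing each two-column code with its new score; it is not how Winsteps processes the control file, and the function name and sample records are invented for the example.

```python
# Two-column response codes and the scores they should receive.
CODES = ["00", "01", "10", "11"]
NEWSCORE = [0, 0, 0, 1]
SCORE = dict(zip(CODES, NEWSCORE))

def rescore(record, width=2):
    """Split a response string into width-column codes and score each one.

    Codes not listed in CODES come back as None (treated as missing).
    """
    pairs = [record[i:i + width] for i in range(0, len(record), width)]
    return [SCORE.get(pair) for pair in pairs]

scored = rescore("00011011")  # four items, one of each possible code
```

Only the "11" pair receives a score of 1. Note that an eight-column record holds four items, which is why the number of items is half the number of data columns.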

Saturn95: Hi Mike,

Thank you so much for your reply. I had tried that combination of commands, but here is the message I got:


I'm using CODES and NEWSCORE, but none of the other variables. Any ideas why it is not working?

Thanks so much! :)

Mike.Linacre: Saturn95, it appears that the Winsteps control instructions do not match the data file. Please email me your Winsteps control and data file(s): mike \at\ winsteps.com

420. FACETS does not read some of the data

bahrouni March 17th, 2012, 3:58am: I am using FACETS to find out which of 2 sets of rating scales yields more reliable and valid results to assess students' writing.
Six (6) raters scored 10 reports twice with a week in between the 2 rating sessions. The first scoring was done using the existing set in current use in the institution, while the second scoring was done using a new set that I have just developed. There are 4 categories on each of the rating scales set:
Set 1: Content, Organization, Grammar, and Resources, all carry 10 points each.
The model I have used for this set is: ?,?.#, R10
?,?.#, R10
?,?.#, R10
?,?.#, R10
This is Ok.
Set 2: Content (12 pnts), Organization (12 pnts), Grammar (12 pnts), and Resources (4 pnts). As you see, the weight is different and unequal for the last category.
The model I have used for this set is: ?,?,#.R12
Two problems:
1) When I run FACETS with these models for the second set, I get correct count but the distribution curves are reversed, that is the high abilities are in the negative side of the logits while the low abilities are in the positive side

2) I changed the model to: ?,?,#,R
the scale on the last category changes from 4 to 12. Secondly, the observed count is much less than the entered data, which means that some data are not read by the software
Any idea to solve this problem?

Mike.Linacre: Thank you for your question, Bahrouni

1) Please be sure that Positive= is used for the ability facet.

2) ?,?,#,R is the same as ?,?,#,R9 so this is not correct for your data.

bahrouni: Thank you very much, Mike.
Since ?,?,#,R = 9, any score above that is not recognized. Therefore, I have no choice but to specify my scales and compare the 2 sets of rubrics on their categories one by one, commenting out what is not needed every time I run FACETS.
As for the Positive, it is indeed for the abilities (facet 1) but still the high values are on the left of the probability curves and at the bottom of the ogive curves as well. I could not fix this.

Mike.Linacre: Bahrouni:

Please remember that Facets matches models sequentially.

Everything matches the first model, so this is the same as:

What are you trying to do?

bahrouni: Thank you for your prompt reply. Please bear with me.
Here's what I'm trying to do. I am trying to find out which of 2 rubric sets is functioning better and yields more objective results in writing assessment. The 1st set is the old one; it has 4 categories: Content, Organization, Grammar and Sources. Each of these categories has 10 points. I've had no problem running FACETS with this, and the results are crystal clear.
The 2nd set, now, also has the same 4 categories with different number points and with unequal scale for the last category: Content 12 pnts, Organization 12 pnts, Grammar 12 pnts, while Sources only 4 pnts thinking that this last category can't be as important as the other 3. The results I get from this model
are not accurate.
I'm thinking of multiplying the 4 by 3 to make it equal with the other categories just for the statistics, on the other hand I'm worried about distorting raters' intentions.

Mike.Linacre: Bahrouni:

This does not function as you intend:

Facets understands this as:
?,?,#.R12 ; everything matches this model

Please be specific about your models, for instance:
?,1,#,R12 ; only data with element 1 of facet 2 matches this model
?,2,#,R12 ; only data with element 2 of facet 2 matches this model


Mike L.
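
Sequential matching is easy to picture with a toy matcher in Python. This is an illustration of the "first model that matches wins" idea only, not Facets code; treating "?" and "#" as wildcards and using element 4 of facet 2 for the 4-point category are assumptions made for the example.

```python
def first_matching_model(models, elements):
    """models: list of (pattern, scale); pattern holds '?', '#', or an element number as text.

    Returns the scale of the first model whose pattern matches the
    observation's element numbers, or None if no model matches.
    """
    for pattern, scale in models:
        if all(p in ("?", "#") or p == str(e) for p, e in zip(pattern, elements)):
            return scale
    return None

# A general model listed first swallows everything; the R4 model is never reached.
general_first = [(("?", "?", "#"), "R12"), (("?", "4", "#"), "R4")]

# Listing the specific model first lets element 4 of facet 2 use its own scale.
specific_first = [(("?", "4", "#"), "R4"), (("?", "?", "#"), "R12")]

observation = (7, 4, 2)  # hypothetical element numbers for facets 1, 2, 3
scale_general = first_matching_model(general_first, observation)
scale_specific = first_matching_model(specific_first, observation)
```

With the general model first, the observation is modeled on the 12-point scale; with the specific model first, it is modeled on the 4-point scale as intended.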

bahrouni: Wonderful! I didn't know 'ce truc' (Fr.=this hint).
Thank you very much for your patience and for the tremendous assistance you have provided.
Farah Bahrouni

421. Exact Match Exp

uve March 18th, 2012, 11:49pm: Mike,

Looking at Table 13.1, I think I understand "Exact Match: OBS%." This is the percent of observations that matched the expectations based on the model. However, I'm not quite sure I get "Exact Match: EXP%" even after reading about it in the Winsteps Help menu.

Would this be the percent of observations we have the possibility of matching given how the items are scored? If so, I'm not sure what that means exactly or what I would do with that information other than it might define a boundary of sorts on the limitations of what the model can do for me given my data.

Mike.Linacre: Uve, the OBS% reports what happened with these data. The EXP% is the equivalent number we would expect to see if the data fitted the Rasch model perfectly. The EXP% values often surprise people who are new to Rasch analysis. They think that the EXP% should be 100% for perfect model fit.

The logic is the same as the two columns: point-biserial correlation and its expectation.

uve: Yes, the logic seems to dictate that if all was perfect and the model is in fact the ideal and the data fit this ideal, then the model should account for all the data. If EXP refers to this ideal, then yes, I am confused that it would not read 100%. Since this is obviously not the case then EXP is not referring to the ideal and is also not referring to the empirical results. Therefore, I assume it is the ideal that has been affected by error. But I must be wrong about this also since the Winsteps help mentions nothing about error in the explanation of EXP--a matter made more complicated for me by the fact that EXP can be less than OBS in some cases. So I am still scratching my head. :-/

Mike.Linacre: Wait, wait, Uve. :-)

Rasch is a PROBABILISTIC model, not a DETERMINISTIC model.

For a deterministic model, such as Guttman's, 100% match between observed and expected is the ideal.

For a probabilistic model, such as tossing a coin, 50% match between observed and expected may be the ideal.

This was Georg Rasch's huge insight. The construction of additive measurement requires probabilistic models. Deterministic models construct rank ordering at best.

Let's think about coin-tossing. If we can predict the outcome of a coin-toss better than 50%, then we wonder about the coin. If we predict less than 50%, then we wonder about our guessing. In fact, one way that casinos make money is that many gamblers are bad at guessing.
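
A small Python sketch shows why an expected match percentage stays below 100% even when the data fit perfectly. It assumes dichotomous items and assumes that a "match" means the observation falls within 0.5 score points of its model expected value, so that the chance of a match for one observation is the larger of p and 1 - p. The function names are chosen for this example.

```python
import math

def success_probability(ability, difficulty):
    """Dichotomous Rasch model probability of success, measures in logits."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def expected_match_percent(pairs):
    """pairs: iterable of (ability, difficulty), one per dichotomous observation.

    Under the stated assumption, each observation matches its prediction
    with probability max(p, 1 - p); the result is the average as a percent.
    """
    matches = []
    for ability, difficulty in pairs:
        p = success_probability(ability, difficulty)
        matches.append(max(p, 1.0 - p))
    return 100.0 * sum(matches) / len(matches)

coin_like = expected_match_percent([(0.0, 0.0)])  # perfectly targeted item
mixed = expected_match_percent([(0.0, 0.0), (1.0, 0.0)])
```

A perfectly targeted observation behaves like the coin toss above, with an expected match of 50%; off-target observations are more predictable, which raises the average without ever reaching 100%.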

uve: Thanks Mike. So if I understand you correctly and Winsteps knew I was tossing a coin one time for heads, then the EXP would be 50%. If I were rolling a dice one time for a two, then EXP would be 17%.

Mike.Linacre: Yes, that's the idea, Uve :-)
Winsteps knows what you are tossing or rolling "for" from the way the observation is scored. For your die, "2"=1, everything else=0.

422. Link Evaluation

timstoeckel March 17th, 2012, 2:22am: Mike,

In a test-equating project I'm working on (the same one you helped me with a few days ago), I am attempting to follow the "Link Evaluation" steps outlined by Edward Wolfe in "Equating and Item Banking with the Rasch Model".

Wolfe describes item-within-link fit, item-between-link fit, link-within-bank fit, and form-within-bank fit.

For the first of these (item-within-link), Wolfe offers a formula but also indicates that this refers to infit, which I can find in Winsteps output.

Do you (or anyone else on this forum) happen to be familiar with the other three forms of link evaluation described above, and could you tell me whether Winsteps provides information as to the quality of these links?

Wolfe provides rather complicated (for me!) formulae to calculate each form of link, but it would sure be nice if Winsteps could do it for me. ;)

Or, perhaps there are alternatives to the types of link evaluations suggested by Wolfe that you might know of(?)

Your thoughts would sure be appreciated.

Mike.Linacre: Timstoeckel, there are many different link-evaluation methods. Frederic Lord's method (now amended by others) is probably the best known.

My own experiments with automatic methods of link evaluation indicate that they are generally successful, but must always be verified with crossplots.

423. Common person equating 4-axis chart

OlivierMairesse March 17th, 2012, 4:44pm: Hi all,

How about a common person equating/linking chart that includes measures and raw scores on one single plot?

The beauty of this plot is that if you possess two tests that measure roughly similar things (such as fatigue and sleepiness), you can determine whether a person is significantly more sleepy than fatigued, and vice versa, using only the raw scores on both scales.

To do so you actually need to plot measures on a linear scale and raw scores on an unevenly spaced axis for both scales. Raw score coordinates falling outside the confidence limits then allow you to decide whether the person is significantly more affected by one construct or the other. This can be very useful in clinical or other applied settings.

The plotting procedure is somewhat complicated in Excel, but I believe the result looks great.

Anyway, this is my way of giving a little back for all Mike has given me before :)

Procedure to get this kind of graph:

0) Rasch-analyze both scales using Winsteps and go over all the necessary procedures to have well calibrated scales. I am not going to cover this in detail here :)
1) cross-plot person measures from both scales and draw confidence limits and identity line (Y=X) according to the common person linking procedure described in Bond & Fox (2007).
2) define minimum and maximum range for the X and Y axis (e.g. -5; +5 will do for most cases)
3) make X axis cross with Y axis at Y= -5 and Y axis cross with X axis at X=5
4) adjust labels to make them appear outside the axes
Now comes the creative part
5) extract score table SCFILE= for both scales and paste raw scores and measures in Excel
6) to construct the fake top X axis, add series with measures of the first scale on the X axis and a constant (the Y axis max) on the Y axis. In my case it was the FSS measures and +5
7) add data labels and change them manually, i.e. replace the measures with their respective raw scores.
8) add down vertical error bars with a fixed value (in this case "10")
9) format the error bars to your preference, and format data series as "+"es. This should give you the raw scores of the first scale spaced at the Rasch-calibrated measure values.
10) to construct the fake left Y scale, add series with the measures of the second scale on the Y axis and a constant (the X axis min) on the X axis. In my case it was the ESS measures and -5
11) add data labels and change them manually, i.e. replace the measures with their respective raw scores.
12) add right horizontal error bars with a fixed value (in this case "10")
13) format the error bars to your preference, and format data series as "+"es. This should give you the raw scores of the second scale spaced at the Rasch-calibrated measure values.
14) format graph shape outline to make it look like a four-axis plot
15) add two more series to create central axes (e.g. with -3;3 on the X axis and 0;0 on the Y axis, repeat but vice-versa; then add linear trend lines with forward and backward trends)
16) add titles and so on
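
The decision the chart supports ("is this person significantly higher on one scale than the other?") can also be sketched numerically. The following is only an illustration, not the Excel procedure above: it assumes each person has a measure and a standard error on both scales, that any constant offset between the scales is supplied as `shift`, and that approximate 95% limits (|t| > 1.96) are wanted. The function name and arguments are hypothetical.

```python
import math

def differs_significantly(measure_x, se_x, measure_y, se_y, shift=0.0, z_crit=1.96):
    """Compare one person's measures on two linked scales.

    shift  : assumed constant offset between the scales (Y minus X), in logits
    z_crit : critical value for the approximate confidence limits
    Returns (t, flag) where flag is True if the point lies outside the limits.
    """
    diff = (measure_y - measure_x) - shift
    joint_se = math.sqrt(se_x ** 2 + se_y ** 2)   # standard error of the difference
    t = diff / joint_se
    return t, abs(t) > z_crit

# Hypothetical person: 0.0 logits (SE 0.3) on scale X, 1.0 logits (SE 0.4) on scale Y
t, outside = differs_significantly(0.0, 0.3, 1.0, 0.4)
print(f"t = {t:.2f}, outside limits: {outside}")
```

Looking up the measure and standard error for each raw score in the SCFILE= output would give the same decision directly from raw scores.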

I hope this helps. If someone needs a template for this, they can contact me personally omairess@vub.ac.be



424. Scale expansion?

Saturn95 March 14th, 2012, 7:58pm: Greetings,

I have noticed that when I run an analysis in Winsteps with the most misfitting responses removed (using EDFILE), and then compare the results to the full run (i.e., the run that includes all responses), my subset analysis produces a scale that is expanded at both ends. The mean person ability estimate doesn't change much between the full and subset runs, but the person SD does increase. This is a bit counterintuitive to me.

As an example, in my full analysis (with all responses included), I get a scale range of 303-628 (on my user-specified scale); when I remove the most unexpected responses using EDFILE (which amounts to about 3% of the total responses, many of which are likely data entry errors) and re-run, I get a scale range of 220-689. Can anyone explain why this happens?

Mike.Linacre: Saturn95, that is correct. Randomness (unexpectedness, misfit) in the data compresses the measurement range. When the data are all random, the measurement range is almost zero. When the data have no randomness (Guttman data), the measurement range is infinitely wide.

So, removing misfitting responses reduces the randomness in the data and usually increases the measurement range.
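
A toy calculation can make the compression visible. This sketch is not a Winsteps estimation: it simply assumes that a fraction `noise` of responses are replaced by coin flips, and reports the log-odds of success that would then be observed for persons whose true position is -3 and +3 logits relative to an item. The function names are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def observed_logit(true_logit, noise):
    """Log-odds of success when a fraction `noise` of responses are coin flips."""
    p = 1.0 / (1.0 + math.exp(-true_logit))      # Rasch model probability
    p_noisy = (1.0 - noise) * p + noise * 0.5    # mix in purely random responses
    return logit(p_noisy)

for noise in (0.0, 0.1, 0.5, 0.9):
    low = observed_logit(-3.0, noise)
    high = observed_logit(3.0, noise)
    print(f"noise={noise:.1f}  apparent range={high - low:.2f} logits")
```

As the noise fraction rises, the apparent range shrinks toward zero, which is the direction of the effect described above.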

425. ICC Graph - Measure Relative Itm Diff range

NothingFancy March 14th, 2012, 1:10pm: When I reviewed some test data and plotted the ICCs, I noticed that some items have empirical data points located around -4 (see attached Item1 graph), while other items have data points cut off at around -2.50 or so. (see attached graphs).

Shouldn't the range of the data points on the Measure Relative to Item Difficulty axis be the same for all items (since it's based on the person ability)? Or what am I misunderstanding?

When I look at my range of Person ability measures, the lowest one is -2.64. Shouldn't that be the lowest value on my ICC graphs, at least for the empirical data points?

Mike.Linacre: Nothingfancy, please click on the "Absolute" button on the Graphs screen. That will give you the curves plotted the way you want :-)

427. Alternate Equating Methods

uve January 12th, 2012, 5:24am: Mike,

I've been doing quite a bit of research lately and came upon some papers that discuss some alternative methods for the common item non-equivalent groups design. In particular, the Skaggs & Wolfe paper reprinted in "Criterion Referenced Testing: Practice Analysis to Score Reporting Using Rasch Measurement Models," Smith & Stone 2009, critiques the mean/sigma, Stocking & Lord, and the fixed parameter methods--the latter being the method used in Winsteps when we anchor common items to old form values.

The paper mentions some freeware that will perform mean/sigma, Stocking & Lord as well as Haebara and mean/mean methods. It's available at the University of Iowa College of Education--Center for Advanced Studies in Measurement & Assessment:


Of course you first need a program like Winsteps to generate the item measures. The program I downloaded was ST for PC Console. It accommodates 2 and 3 parameter models so when using Rasch measurement, you need to set all the item slopes to .5882 and lower asymptotes to zero. To do the Stocking & Lord and Haebara methods, it requires either 40 or more quadrature points, which you can get from the test information function, or it will accept person measures.

Below is the output from ST. The idea is that you now transform all new form item measures using the slope and intercept, but I would like to do this in Winsteps so that the new form measures have already been transformed. Would this be USCALE? If so, how? Thanks again as always.

            Stocking-Lord   Haebara     Mean/Mean   Mean/Sigma
Intercept   -0.024277       -0.024335   -0.025106   -0.025139
Slope        1.000316        1.000541    1.000000    1.001441

Mike.Linacre: Thank you for this research, Uve. It looks like all these are "Fahrenheit-Celsius" equating methods, choosing different lines of commonality.
So, the baseline logits would be unchanged.
The equated logits would be rescaled using UIMEAN= and USCALE=.

In this example, it looks like we need, for Stocking-Lord:
USCALE = 1 / 1.000316
UIMEAN = +0.024277 / 1.000316

Suggestion: rescale the equated logits, then reapply the Stocking-Lord method. If everything is correct, it will report Intercept 0, Slope 1.

uve: Mike,

Clarification: why did you change the sign on the intercept used to calculate UIMEAN?

Original: -0.024277 from Stocking-Lord

UIMEAN = +0.024277 / 1.000316

Mike.Linacre: Uve, signs and slopes depend on whether we are equating "x to y" or "y to x". Perhaps I became confused about this while doing the calculation :-(

Because it is so easy to become confused, please rerun the equating using the equated values instead of the original values in order to verify that the equating constants in the re-equating analysis are reported as intercept 0 and slope 1. :-)

uve: Mike,

I was reading your response again and thought the following might clarify. The PIE program needs all the new form item parameters (X) and the old form item parameters (Y) that are common to both. I'm assuming this will create the slope and intercepts that will put new form X on the old form Y scale. So based on the Stocking-Lord formula I assume in Winsteps this would be:

USCALE = 1 / 1.000316
UIMEAN = -0.024277 / 1.000316

Mike.Linacre: Uve, you are probably correct.
Suggestion: apply
USCALE = 1 / 1.000316
UIMEAN = -0.024277 / 1.000316
to the analysis of the X data. Then the newly computed equating constants should be 0 and 1.
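
The "re-equate and expect intercept 0, slope 1" check can be sketched outside Winsteps. The example below uses the mean/sigma method only because it is the simplest of the four to write down; the item difficulties are made-up numbers, and the code does not settle how USCALE= and UIMEAN= should be signed, which is exactly what the check is meant to verify in practice.

```python
from statistics import mean, pstdev

def mean_sigma(x_common, y_common):
    """Mean/sigma constants that place form X difficulties on the form Y scale."""
    slope = pstdev(y_common) / pstdev(x_common)
    intercept = mean(y_common) - slope * mean(x_common)
    return slope, intercept

def rescale(values, slope, intercept):
    return [slope * v + intercept for v in values]

# Hypothetical common-item difficulties (logits) on new form X and old form Y
x = [-1.2, -0.4, 0.1, 0.7, 1.5]
y = [1.05 * v - 0.30 for v in x]

slope, intercept = mean_sigma(x, y)
x_on_y = rescale(x, slope, intercept)

# Re-equating the rescaled values should report slope 1 and intercept 0
check_slope, check_intercept = mean_sigma(x_on_y, y)
print(f"slope={slope:.4f} intercept={intercept:.4f}")
print(f"check: slope={check_slope:.4f} intercept={check_intercept:.4f}")
```

If the check reports anything other than (approximately) 1 and 0, the direction of the transformation or the sign of the intercept has been mixed up.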

uve: Mike,

I've attached a spreadsheet graphing the different linear equating methods and plotting the new item measures. I've also compared the linear equations against each other. I've been reading a lot lately and there seems to be an argument against the rationale of using the fixed item method. Based on what you see here, do you believe any of the linear methods would provide better calibration?

Also, is there a way to generate a function in Winsteps that comes close to estimating the fixed item method? The reason I ask is that I was hoping to somehow graph the fixed item method and compare it graphically to the 4 linear methods. Right now I only have the item measures as a visual comparison.

Mike.Linacre: Uve, is the "fixed item" method the same as the "anchored item" method?

Anchored items are supported in Winsteps using IAFILE=. The "displacement" column in Table 14 indicates how closely the anchored (fixed) item values match the current data set.

This paper may be helpful; http://www.ncbi.nlm.nih.gov/pubmed/17215567

uve: Mike,

Yes, by fixed I mean anchored. By the way, I left off the most important part of the spreadsheet. I've attached an updated version. The reason why I am questioning the methods has to do with decisions that need to be made for cut points. I have set our cut points to equal those of the state assessments. For Proficient, this is 350.

If you select the Scoring tab and scroll down to line 42, you will see that a score of 40 on this test in the base year equated to 350. This year using the common item anchoring, virtually nothing has changed. However, as you look at each method, suddenly this changes. Had I used the Mean/Mean or Mean/Sigma methods, I would have adjusted the raw cut point from 40 down to 38 because this raw score is the closest to 350. I can tell you from experience that a one raw score point adjustment can greatly change the number of students classified. Two points really makes a very large difference.

I guess I was shocked by the significant difference in the last two methods. I thought there would be more stability. The Mean/Sigma method is used by ETS to equate the exams. I use the anchoring method in Winsteps because it is very convenient. To use the others, I have to supply item calibrations from both years, as well as quadrature points from each year and input that into ST. That's a lot of work considering all the assessments we give.

I'll read through the article, but my initial response would be to disagree with any statement that it really doesn't matter what method you use. My data shows quite the contrary.

Mike.Linacre: Uve, the history of scientific advancement indicates that alternative reasonable methods should report equivalent findings. When they disagree, then there is definitely something else going on that interacts with the method chosen.

"quadrature points from each year" - this sounds like a step backward toward the equipercentile equating method of Classical Test Theory. If so, this approach might appeal to more traditionally-minded testing agencies.

So the differences between the methods may be: do we assume that the items have stayed the same? Or do we assume that the examinee ability distribution has stayed the same in some way?

BTW, I prefer "anchored" to "fixed", because "fixed" can sound like the items have been repaired (fixing a leaking tap) or tampered with (fixing a horse race).

uve: Mike,

Yes, I agree. The different methods should agree. BTW: I dislike the use of subjunctive cases when dealing with investigations of this nature because "should" and reality are often at odds.

Quadrature points and their weights are needed in ST for the Stocking-Lord and Haebara calculations. I simply used the TCC data from Winsteps, which if I'm correct, generate about 200 of these points and weights. So the idea in the final version of my Excel file was to see how the different linear functions affected overall score tables.

The outcome for a raw score of 40 was a range from .51 - .68, which when transformed to the state assessment scale went from 350 to 358. The raw score of 40 shifts down to 39 starting with the Haebara method, then down two for both Mean methods. I know only 11 questions were replaced, but virtually all items were shuffled (much to my dismay and though I have warned staff against this, it continues). There were also an additional 6 items which could not be considered common any longer. Still, out of 65 questions I would think this would have little effect, and the anchored method confirms this. But as I mentioned previously, I was shocked by the results of the other methods.

Additional info: Person mean and stdev were -.03 and .94 respectively in the first year. Unanchored in year 2 they were: .40 and .96. Year 1, 807 examinees and year 2 it was 762. Using p-values they were .49 and .58 respectively. So had I used the Mean/Sigma method, I would have lowered the cut score to 38 for a test that apparently was easier for students in the 2nd year. So how could 11 items influence the need to lower the cut score while at the same time, students are doing much better on the exam? Very odd, but of course I'm not ruling out that somehow my process was flawed in how I captured the data and ran it through ST properly.

I have a tendency to mix "fixed" and "anchored" too much but my references here are always meant to mean "anchored". I also make this mistake with theta, which is supposed to refer to ability while "b" is supposed to refer to item difficulty or measures. But I am learning :)

Mike.Linacre: Uve, the Stocking-Lord Procedure matches the TCCs of the common items in the two tests. This is complex with 3-PL IRT (hence the quadrature points), but much simpler with Rasch. With Rasch it is effectively the same as the Fahrenheit-Celsius equating procedure applied to the difficulties of the common items.

Please use whatever terminology, symbols, etc. that will communicate your message to your audience most effectively :-)

428. 2nd Dimension Mystery Pt. 2

uve March 9th, 2012, 12:20am: Mike,

I was asked to analyze a survey created at one of our sites and given to the teachers. If I had to explain it I would guess it is attempting to measure the level of teacher satisfaction for its student-achievement-focused school culture. I was surprised to see 7.9 for the 1st contrast unexplained variance eigenvalue. I ran a simulated data set and an acceptable value was 2.1.

I've attached the PCA of residuals and a bit of other information. I had to do a PCM because there were 3 distinct sets of categories depending on the question. It's obvious that the 2nd component explains something that divides the issue between teachers and administration, but after the Rasch component is removed, I'm still struggling to define what this might be. I'd greatly appreciate any of your comments.

I would gladly be open to comments from anyone else out there as well. Thanks!

Mike.Linacre: Uve, the "How many teachers ..." items form a strong cluster of about equal "difficulty". My first suspicion is there may be some response deficiency (such as range restriction). Please try analyzing those items by themselves. They may tell a story.

uve: Mike,

Attached is the ISFILE. The three highest loading items certainly appear to have a narrower range than most but other items with even lower range are not reported loading as high.

Mike.Linacre: Uve, ISFILE= may indicate range-restriction indirectly. The category frequencies in Table 14.3 would be more informative. But I am only guessing at a diagnosis of why the "How many teachers ..." cluster together. Perhaps a subset of the respondents have a "response set" to them.

uve: Thanks Mike,

I've attached 14.3. There does seem to be a great deal of centralized response around items 12, 13 and 14, which were the three highest loading items. But I'm not sure how range restriction would lead me to a 2nd dimension or construct. It seems that lies more in the wording of the items than the frequencies of each response of the items themselves.

Mike.Linacre: Yes, Uve, definitely the wording of the items is the commonality, but I was guessing about how the similarity in the item wording would show itself in the response strings.

429. Polytomous Example

uve February 17th, 2012, 6:14pm: Mike,

I have been attempting to better understand the math behind how probabilities are calculated for polytomous items. I exported the Observation File for a survey we recently gave and was attempting to see if I could recreate the expectation value output for one of the items. However, I can't seem to do this. I was wondering if you could direct me to an example in which the step calibrations, ability level and item difficulty are given and plugged into the model so I can see where I've gone wrong. I've attached the spreadsheet I've been using in case you're interested.

Mike.Linacre: Uve, here is a spreadsheet that does polytomous estimation for a dataset: Winsteps done by an Excel spreadsheet!


In your computation, the Rasch-Andrich threshold values are omitted.

uve: :-/
Cells A8-A11 of my spreadsheet contain the step calibrations from Table 3.2. I thought these were the threshold values. I do have the poly spreadsheet, but at this point I am more interested in the final calibrations than I am in the iterative process. So once I have the final measures/calibrations, I'm still not sure what version of the model will allow me to insert these. I guess what I'm asking for is an expanded version of the model that will allow me to plug in the final calibrations of my measures so I can see how the elements work together to produce the probabilities for each of the 4 categories of my survey for this particular item. Then, when I multiply each probability by each category and sum them, it should equal the expected value produced by the Winsteps observation file for that item. I was using the versions of the model given on the bottom of page two of the attachment (5.5, 5.6, 5.7), but I need to be able to modify it for a 4 category scale.

In case you haven't figured it out by now, I'm a very slow learner. The simpler and more explicit the example, the better.

Mike.Linacre: Uve, the formulas are all in the spreadsheet. The spreadsheet is for a 3 category rating scale. Type in your values as the initial estimates. All the numbers (probabilities, expected values, etc.) will update. This will show you how the computation works.

uve: Mike,

I think I'm getting closer. So using the poly spreadsheet as the guide, I've pasted the equations from it along with my attempt at a 4th category: I added "EXP(3*($F29-J$19))" to each of the original 3 equations and modified the 3rd to come up with the 4th. I've attached my revised spreadsheet just in case you want to see how my formulas function.

F29 = ability measure
J19 = item difficulty
B24= first step difficulty

0 1/(1+EXP($F29-J$19-$B$24)+EXP(2*($F29-J$19))+EXP(3*($F29-J$19)))
1 EXP($F29-J$19-$B$24)/(1+EXP($F29-J$19-$B$24)+EXP(2*($F29-J$19))+EXP(3*($F29-J$19)))
2 EXP(2*($F29-J$19))/(1+EXP($F29-J$19-$B$24)+EXP(2*($F29-J$19))+EXP(3*($F29-J$19)))
3 EXP(3*($F29-J$19))/(1+EXP($F29-J$19-$B$24)+EXP(2*($F29-J$19))+EXP(3*($F29-J$19)))

Original formulas from Poly

0 1/(1+EXP($F29-J$19-$B$24)+EXP(2*($F29-J$19)))
1 EXP($F29-J$19-$B$24)/(1+EXP($F29-J$19-$B$24)+EXP(2*($F29-J$19)))
2 EXP(2*($F29-J$19))/(1+EXP($F29-J$19-$B$24)+EXP(2*($F29-J$19)))

Mike.Linacre: Uve, good going. Here is the algorithm:

B = Ability
D = Difficulty
Thresholds for categories 0 to m are: 0, F1, F2, F3, ..., Fk, ..., Fm

T0 = exp(0) = 1
T1 = exp((B-D) - F1)
Tk = exp( k(B-D) - (sum(Fj) j=1,k))
Tm = exp( m(B-D) - (sum(Fj) j=1,m))

P0 = T0 / (sum(Tj) j=0,m)
Pk = Tk / (sum(Tj) j=0,m)
Pm = Tm / (sum(Tj) j=0,m)

Expected value:
E = (sum(jPj) j=0,m)

Model variance:
V = (sum(j**2 Pj) j=0,m) - (sum(jPj) j=0,m)**2

These are equivalent to https://www.rasch.org/rmt/rmt122q.htm equations 3, 4, 5
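
The algorithm above translates almost line for line into code. This is a minimal sketch: `ability`, `difficulty` and `thresholds` stand for B, D and F1..Fm, and the example values are made up rather than taken from the attached spreadsheet.

```python
import math

def category_probabilities(ability, difficulty, thresholds):
    """Probabilities of categories 0..m, given thresholds [F1, ..., Fm]."""
    terms = [1.0]                        # T0 = exp(0) = 1
    cumulative = 0.0
    for k, f in enumerate(thresholds, start=1):
        cumulative += f                  # sum of Fj for j = 1..k
        terms.append(math.exp(k * (ability - difficulty) - cumulative))
    total = sum(terms)
    return [t / total for t in terms]    # Pk = Tk / sum(Tj)

def expected_and_variance(probs):
    expected = sum(j * p for j, p in enumerate(probs))
    variance = sum(j * j * p for j, p in enumerate(probs)) - expected ** 2
    return expected, variance

# Hypothetical 4-category item: B = 1.0, D = 0.5, thresholds -1, 0, +1
probs = category_probabilities(1.0, 0.5, [-1.0, 0.0, 1.0])
e, v = expected_and_variance(probs)
print([round(p, 4) for p in probs], round(e, 4), round(v, 4))
```

The residual for an observation is then the observed category minus `e`, and the standardized residual divides that by the square root of `v`. Passing a single threshold of 0 gives the dichotomous case.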

uve: Thanks Mike. This worked out very well. In addition to the problems I initially had with the calculations, I had also mistakenly used the category measures instead of the thresholds. I've made the corrections and attached the file. I'm assuming my residual and zscore don't match the xfile output due to rounding errors.

Also, could you point me in the direction of how the variance in the xfile is calculated for a dichotomous item? Thanks again as always.

Mike.Linacre: Uve, did you notice www.winsteps.com/a/polyprob.xls ?
For a dichotomy the thresholds are 0 and 0.

The dichotomous variance simplifies to prob*(1-prob)

uve: Thanks Mike. This is a fantastic resource, much better for my needs than the polyprob spreadsheet.

430. Scoring 1st Category

uve March 8th, 2012, 7:02pm: Mike,

I've been a bit confused as to the best way to score the first category on a survey. For example:

Valid Codes: ABCD
Option 1 1234
Option 2 0123

I know the 2nd option is the more popular, but I've always had a hard time with zero being a score for choosing a distractor--just doesn't make much sense in my brain :)

Then I began to wonder: In option 2, if an item receives all A's, then naturally it will be removed from the process of calibration. Would the same thing happen in option 1?

Mike.Linacre: Uve, in Winsteps and Facets, the lowest category number is the lowest value observed in the data (unless ISRANGE= is active).

If the original data are, say, 1-6, and then we transform them to 0-5, the Rasch measures and fit statistics for the two datasets will be the same. Only the reported raw scores will differ.
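
One way to see why the two scorings are equivalent: shifting every category label by a constant shifts the expected score by the same constant, so residuals and model variances are untouched. The sketch below does not run a Rasch estimation; the category probabilities are made-up values used only to show the arithmetic.

```python
def moments(labels, probs):
    """Expected score and variance for category labels with given probabilities."""
    expected = sum(x * p for x, p in zip(labels, probs))
    variance = sum((x - expected) ** 2 * p for x, p in zip(labels, probs))
    return expected, variance

probs = [0.1, 0.2, 0.4, 0.3]                 # hypothetical probabilities for A, B, C, D

e0, v0 = moments([0, 1, 2, 3], probs)        # Option 2 scoring: ABCD -> 0123
e1, v1 = moments([1, 2, 3, 4], probs)        # Option 1 scoring: ABCD -> 1234

# A person who chose C is scored 2 under one scheme and 3 under the other
resid0 = 2 - e0
resid1 = 3 - e1
print(e0, e1, v0, v1, resid0, resid1)
```

Only the raw scores differ between the two schemes, by one point per answered item.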

Recommendation: use the category-scoring that will make the most sense to your audience.

uve: Thanks Mike. Going with 0 on the lowest category has a tendency to confuse our staff as well, so the lowest categories will be scored 1 from now on.

431. Changing Table sort order

lostandfound March 7th, 2012, 7:34pm: I am trying to change the sort order of table 10.3: ITEM CATEGORY/OPTION/DISTRACTOR FREQUENCIES: MISFIT ORDER.

I want this table to be displayed in ITEM entry order.

My task involves generating many tables via the data control file.


Looking at the ISORT control variable documentation, it seems I can change the sort order for just Table 10.3 by editing the TFILE to:


Can someone please give me some more insight into alternative methods for sorting or what I am doing wrong with this method? Thanks.

Mike.Linacre: lostandfound, Winsteps Table 10 in item-entry order is Table 14.

Table 14.3 is Table 10.3 in item entry order.

Or are you looking for something else?

lostandfound: Thanks Mike. That was exactly what I was looking for. I don't know how I missed it.

Also, just out of curiosity, is it possible to resort the item data using something like:

Mike.Linacre: Lost and found:

10.3,-,-,-,96 ; Sorry, parameter not supported here, see www.winsteps.com/winman/index.htm?tfile.htm
* ; on a separate line at the end of the list

lostandfound: Thank you Mike

432. Distractor analysis in multiple-choice model

chong March 7th, 2012, 9:23am: Hi Mike,

I recently read an article about using the multiple-choice model to analyze the distractors of multiple-choice items and my attention is quickly drawn to the ICC, where I found the response function is generated and displayed for each of the distractors besides the correct option. The consequence is that I can simultaneously see in what way the distractors may behave with respect to the correct response and hence: (a) it helps explain the unexpectedness (e.g., sudden dip in performance) of the response made by examinees, and (b) further actions can be taken to revise the item by replacing or modifying poor distractors.

Since the ICC in Rasch model only shows the probabilities of getting the correct response, I wonder if it is legitimate (in Rasch model) to graph the probabilities (mean score) of choosing the distractors in a similar manner (though the jagged curves would be jumbled in the screen), or is there any way that Winsteps could offer to graphically display the response pattern for each distractor of every item?

Mike.Linacre: Thank you for your question, chong.

A smoothed graph of the frequencies of the distractors, stratified by ability level, may be what you want. In Winsteps, these are shown in the "Graphs" window, "Empirical option curves".
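
The idea of "frequencies of the distractors, stratified by ability level" can be sketched as follows. This is not the smoothing Winsteps applies in its Graphs window; it is an assumed, simplified version that splits persons into equal-count ability strata and tabulates the proportion choosing each option for one item. The function name and data are hypothetical.

```python
from collections import Counter, defaultdict

def option_proportions(abilities, responses, n_strata=3):
    """Proportion of persons choosing each option, by ability stratum (0 = lowest)."""
    order = sorted(range(len(abilities)), key=lambda i: abilities[i])
    strata = defaultdict(Counter)
    for rank, i in enumerate(order):
        stratum = rank * n_strata // len(order)   # equal-count strata by rank
        strata[stratum][responses[i]] += 1
    result = {}
    for s in sorted(strata):
        total = sum(strata[s].values())
        result[s] = {opt: n / total for opt, n in sorted(strata[s].items())}
    return result

# Hypothetical person measures and their chosen options on one item (key = "C")
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
responses = ["A", "B", "B", "C", "C", "C"]
for stratum, props in option_proportions(abilities, responses).items():
    print(stratum, props)
```

Plotting each option's proportion against the mean ability of its stratum gives one empirical curve per option.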

chong: I got it. Thanks a lot, Mike.

433. Thresholds and Structures in 3.2

uve March 5th, 2012, 2:43am: Mike,
I am a bit puzzled by the fact that when the standard rating scale is used, or even the partial credit model with items put into different groups, the Andrich Thresholds reported in Table 3.2 of Winsteps are identical to the Structure Measures. I don't understand this given what I see in Tables 2.1 and 2.4. In these tables, the items have identical spacing but are shifted depending on their difficulty. It seems to me that the Threshold would be the same for every item when setting the upper and lower category at zero, but the Structure would be different because it adds the Threshold to the item measure. So why are the Thresholds and Structures reported the same when the standard rating scale is used?

Mike.Linacre: Uve, with the Rating Scale Model, Winsteps Table 3.2 is generic for all items. Please output ISFILE= to see the exact values of "item difficulty + Andrich threshold" for each threshold of each item. They are the "MEASURE" columns, see https://www.winsteps.com/winman/index.htm?isfile.htm
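
The arithmetic behind those per-item values is just a sum. A minimal sketch, with made-up item difficulties and Andrich thresholds and a hypothetical function name:

```python
def structure_measures(item_difficulty, andrich_thresholds):
    """Item difficulty + Andrich threshold, for each threshold of one item."""
    return [item_difficulty + f for f in andrich_thresholds]

andrich_thresholds = [-1.0, 0.0, 1.0]          # shared by all items under the Rating Scale Model
item_difficulties = {"item1": 0.5, "item2": -0.25}

for name, d in item_difficulties.items():
    print(name, structure_measures(d, andrich_thresholds))
```

The thresholds are identical across items; only the item difficulty shifts them, which matches the identical spacing seen in Tables 2.1 and 2.4.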


baywoof February 27th, 2012, 5:57pm: Mike,

I hope you are doing well.

I'm working on a project with Judy Wilkerson. We have several judges scoring two different but parallel instruments (ETQ=10 items; SRA=20 items). The items are constructed on 10 INTASC principles (Interstate New Teachers Skills) and the scores are rated on a Krathwohl scale (an affective version of Bloom's Cognitive Taxonomy). We are interested in our judge training and scoring guide.

The control file is attached. Facets are teachers, judges, instruments(2), principles(10), Krathwohl rating(6).

Obviously, real data is messy, and we will add more scores to the file as we need more higher ratings in the sample.

My questions:

First, the iterations don't stop. I know that this comes up from time to time, but I don't really understand what parameter or number of iterations is appropriate for "convergence". Should I use Convergence=?,? or Iterations=?. Perhaps something else is wrong?

Second, we'd like to see a graph of the rating thresholds by judges. In other words, not just the judge on a ruler but also each judge's overlapping differences in thresholds. Has anyone proposed that before? Is it a bad idea? How would we generate that from FACETS?

Mike.Linacre: Thank you for your questions about Facets, baywoof. You are doing well, but your analysis needs some tweaking.

1. Lack of convergence. You can stop Facets at any time (Ctrl+F) because the correct estimates are undefined. Did you notice this message?
Warning (6)! There may be 2 disjoint subsets
It is telling us that the elements in the Facets are nested. You need to make a decision about the relationship between the different nests. An immediate one is:
5,instrument, A ; anchor this Facet at 0 so that it becomes a dummy facet

2. If you want separate graphical output for each judge, please model accordingly:
model = ?,?,#,?,?,Krathwolh

Then look at Table 8 and also the Graph screens.

3. model = ?,?B,?B,?B,?B,Krathwolh specifies a 4-way bias analysis. This is unlikely to make much sense. Please do not model the Bias directly. Instead please do two-way bias analysis from the Output Tables menu.

baywoof: Thanks, Mike.

I found the warning once you pointed it out. I missed that the first time.

There are two affective instruments:
ETQ is a questionnaire (10 items)
SRA is a thematic apperception test (20 items)

Both are scored on Krathwohl's taxonomy.
Both are constructed with items from the 10 INTASC principles.

I guess we wanted to see how the different instruments compared after scorer training. I thought we had connectivity in our judge plan, so I'll have to figure out what is "nested". This is something that I haven't considered before. I suppose that some students only took one instrument and others took both, so that would be a cause for the disconnect.

I've never used a dummy facet before either. I see examples, "League Baseball" and "Flavor Strength of Gels" so I'll review them and see if I can learn more about using them.


Mike.Linacre: baywoof, the disconnection in your data is between ETQ and SRA. The problem is the two disjoint sets of items.

Facets does not know how you want to apportion the difficulties of the ETQ items between the ETQ element and its individual item elements. Similarly for SRA.

Anchoring ETQ and SRA at 0 apportions none of the item difficulty to the classification elements ETQ and SRA.

435. Items with multiple keys

Devium February 23rd, 2012, 5:46pm: Hello,

I'm trying to analyze a data set that contains items with multiple keys, and I'm having trouble figuring out how to set up the data and control files.

One group of items has two keys, and is worth one point for choosing both key options. (No points for only one key.)

Another group has three keys, with one point for all three key options. (No points for two or one.)

I think I have to use KEYn= to specify the keys, e.g., for a three-key item followed by a two-key item, followed by a single key item I'd use:

KEY3= D**

But where I get confused is how to use KEYSCR. If I specify:


Does this mean each key gets a point -- so the above items are scored 3, 2, 1 respectively? What do I specify so that all three are scored as 1?

My apologies for being so dense! And my thanks in advance for your assistance.


Mike.Linacre: Devium,
KEY3= D**

item 1: A or C or D are scored 1. B is scored 0
item 2: B or D are scored 1. A or C are scored 0
item 3: B is scored 1. A or C or D are scored 0.

Is this what you want?

Devium: Hello Mike,

Thank you for your swift reply.

What I'd like is:

Item 1: The set A,C,D is scored 1 (must choose all three); all other choices are 0
Item 2: The set B,D is scored 1 (must choose both); all other choices are 0
Item 3: Only B is scored 1; all other choices are 0

Where I'm getting confused is how to set KEYSCR=

Your help is greatly appreciated.


Mike.Linacre: Devium, this scoring is possible, but complicated, in Winsteps.

I recommend scoring the data before Winsteps, and then entering 0/1 (right/wrong) for each item in the data for the Winsteps analysis.
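
Pre-scoring of this kind is straightforward outside Winsteps. The sketch below assumes each response is recorded as a string of the selected option letters (for example "ACD"); that data layout, and the function names, are assumptions for illustration only.

```python
def score_multi_key(selected, key):
    """1 if exactly the keyed options were selected (in any order), else 0."""
    return 1 if set(selected) == set(key) else 0

def score_person(responses, keys):
    return [score_multi_key(r, k) for r, k in zip(responses, keys)]

# Item 1 needs A, C and D; item 2 needs B and D; item 3 needs B only
keys = ["ACD", "BD", "B"]

print(score_person(["ACD", "BD", "B"], keys))   # all three sets match
print(score_person(["AC", "B", "B"], keys))     # partial selections score 0
```

The resulting 0/1 strings can then be entered as ordinary dichotomous data.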

Devium: Hi Mike,

I think that's the best plan. Thanks for your help and advice.


436. Odd Measure to Raw Score Relationship

uve February 21st, 2012, 6:48pm: Mike,

Attached is the person measure Table 17.1 output of a 4-option Likert survey. Category 1 was scored 1 and so on. I find it interesting that the raw scores (2nd column) decrease then increase again while the person measures (4th column) decrease continuously. Why is this happening? I also have the TIF graph and am wondering if this is part of the problem.

Mike.Linacre: Uve, in your measure Table, scores are monotonic with measures except for person 11 who responded to fewer items.

The TIF concurs with the standard errors: high TIF = low standard error.
Notice that the trough around 2 logits corresponds to a local peak in the size of the standard errors.

uve: Mike,

:-/ Please look at person 11. The person before 11 is 22 whose score is 77, then comes 11 with a score of 55, then person 12 with a score of 76. Perhaps my definition of monotonic is off.

But you'll notice this person only answered 16 of the 22 questions and was the only one to do so. Since most of the tests I analyze are multiple choice dichotomous instruments, Winsteps "Missing" is set to 0. However, in the control files of all my surveys, I have this set to -1. So perhaps Winsteps is calculating the total score from only 16 items for this person and is not penalizing him (as it should be) for not answering the other six items. So though he has a much lower raw score than the respondents before and after, based on the choices made for those 16 that were answered, his logit measure is appropriate. That's just my guess.

My next question then is: if the test scores are monotonic, then how can I explain the TIF?

Mike.Linacre: Thank you for your question, Uve.

Yes, the person who responded to only 16 items is being measured on those 16. If you want Winsteps to analyze that person's missing responses as 0, then please specify:

The TIF is related to the statistical precision of the estimates. It is higher where the item density is higher. The relationship between raw scores (on all the items) and measures is shown by the TCC (Test Characteristic Curve). This is always a monotonic ogive.

You may also like www.winsteps.com/a/polyprob.xls - this computes the expected score on an item when the person ability, item difficulty and Andrich thresholds are known.
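As a rough illustration of that computation, here is a minimal Python sketch of the expected score on one polytomous item under the Rasch rating-scale model. The function names are hypothetical, categories are assumed to be scored 0..m (add 1 for a survey scored 1-4), and this is not a reproduction of the spreadsheet itself.

```python
import math

def category_probabilities(ability, difficulty, thresholds):
    """Rating-scale model probabilities for categories 0..m.

    thresholds holds the Andrich thresholds F_1..F_m in logits (assumed input).
    """
    # Log-numerator for category 0 is 0; category k adds (B - D - F_k).
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (ability - difficulty - f))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(ability, difficulty, thresholds):
    """Expected rating = sum over categories of (category * probability)."""
    probs = category_probabilities(ability, difficulty, thresholds)
    return sum(k * p for k, p in enumerate(probs))

# Illustrative values only: one item at 0 logits with three thresholds.
curve = [expected_score(b, 0.0, [-1.0, 0.0, 1.0]) for b in (-2, -1, 0, 1, 2)]
```

Evaluating `expected_score` over a range of abilities traces the item's expected-score ogive; summing these ogives across items gives the kind of monotonic TCC described above.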

437. Common person equ, anchoring, and odd person meas

tkline February 22nd, 2012, 10:24pm: Hello Dr Linacre,

I have an analysis that you helped me get started (THANK YOU) and I was able to get everything to run correctly (I believe). Now I am running into some odd person estimates (unusually high) when I look at the person files, particularly for persons who had a score of zero with missing data. I glanced through the other posts, but didn't see anything exactly like this, so hopefully I am not repeating anything.

Let me give you the full story.

I have data on two tests (Test A & Test B) with different questions on the same subject. One group of students took Test A, another group took Test B, and a third group took both Test A and Test B. So it looks something like...

P0 1 0 0 . . .
P1 1 1 0 . . .
P2 1 . 1 . . .
P3 1 1 0 1 0 .
P4 1 1 0 1 1 1
P5 1 1 0 1 1 1
P6 1 1 0 1 0 .
P7 . . . 1 1 0
P8 . . . 1 0 1
P9 . . . 0 0 .

I followed the procedures we learned in class that is 'Step 1' of the common person equating process on just those who took both tests (I did not have the other data files then) and the two tests seem eligible for equating (empirical slope was very close to 1 with common persons). I have tried a couple different things, and there are odd estimates. I know I will get questions on this, so I am hoping that you can please help me sort this out.

1) Concurrent estimation - every item and every person into one analysis. Rescaled the estimates using UMEAN and USCALE and output a PFILE
2) Established difficulty on the sample that took both tests, rescaled using UMEAN and USCALE, anchored and used UASCALE=0 to estimate the students who only took Test A on those items, and the students who only took Test B on the Test B items.

The issue is, in the PFILE for both methods, I have students who have missing data, and have a 'score' of zero (from the PFILE) but have a high measure 1.8 or even 3.2 on a scale of 0-5.

Those estimates don't make sense to me, especially when there are students that have a 'score' of 2 that have an estimate of 2.4, which aligns with Table 20 in the output.

Please let me know what I did incorrectly.
Thank you so much in advance,

Mike.Linacre: Good to meet up with you again, tkline.

Your data are thin, so the equating will only be approximate.

Omit from your list of "common persons" everyone with an extreme score in either test or fewer than 5 observations in each test. Their measurement is much too imprecise. We really need at least 30 observations for each common person in each test, but it looks like that is too demanding.
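The screening rule above could be sketched in Python along these lines. The data layout (a dictionary from person label to a list of 0/1 responses, with `None` for items not administered) and the function name are assumptions made for illustration, not a Winsteps feature.

```python
def usable_common_persons(responses_a, responses_b, min_obs=5):
    """Return common persons with a non-extreme score and enough observations in both tests."""
    def usable(responses):
        observed = [x for x in responses if x is not None]
        if len(observed) < min_obs:
            return False  # too few observations for a stable measure
        score = sum(observed)
        return 0 < score < len(observed)  # drop zero and perfect scores

    common = responses_a.keys() & responses_b.keys()
    return sorted(p for p in common
                  if usable(responses_a[p]) and usable(responses_b[p]))

# Illustrative data only.
test_a = {"P3": [1, 1, 0, 1, 0], "P4": [1, 1, 1, 1, 1], "P5": [1, 0, None, 1, 0]}
test_b = {"P3": [1, 0, 1, 0, 0], "P4": [1, 0, 1, 1, 0],
          "P5": [1, 0, 1, 1, 0], "P9": [0, 1, 0, 1, 1]}
kept = usable_common_persons(test_a, test_b)
```

Here P4 is dropped for a perfect score on Test A, P5 for having only four observations on Test A, and P9 because that person did not take Test A.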

Draw a scatterplot of the pairs of measures for the common persons to verify that the equating makes sense. You can use a plot like https://www.winsteps.com/winman/index.htm?comparestatistics.htm

tkline: Thank you so much for all your help with this task.

I was hoping that we had enough data, but going back, it seems that quite a few cases have extreme scores. Certainly more than I remember.

Thank you and have a wonderful day!

438. negatively and positively stated items

Sander_de_Vos February 20th, 2012, 8:27pm: Dear Mr. Linacre,

It is nice to see the extensive body of knowledge I can find on this forum about Rasch analysis. It is very helpful!

I have a question about a specific problem. In a partnership with two universities from the Netherlands we are conducting research on the process of burnout among nursing staff.
For this research we use a Dutch translation of the Maslach Burnout Inventory (MBI).
It consists of three subdimensions. Two are negatively stated (emotional exhaustion and depersonalization) and one is positively stated (personal accomplishment).

Though I reversed all the answers of the subdimension personal accomplishment (7-point Likert scale) before analyzing with Winsteps, it still shows contrast on the residual loadings for items (contrast 1) compared to the other two subdimensions. It seems that the unexplained variance in the 1st contrast (eigenvalue 7.5, 16.5%) is mainly explained by the personal accomplishment items, indicating multidimensionality.

Or is there something that I'm missing here? Maybe there are statistical problems with combined positively and negatively stated items in one questionnaire (all 7 point likert scale)?

Kind regards,


Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                         -- Empirical --      Modeled
Total raw variance in observations   =   45.6  100.0%          100.0%
Raw variance explained by measures   =   25.6   56.2%           54.3%
Raw variance explained by persons    =    3.2    7.0%            6.8%
Raw Variance explained by items      =   22.4   49.1%           47.5%
Raw unexplained variance (total)     =   20.0   43.8%  100.0%   45.7%
Unexplned variance in 1st contrast   =    7.5   16.5%   37.5%


------------------------------------------------- ------------------------------------------
|------+-------+-------------------+------------| |-------+-------------------+------------|
| 1 | .85 | -.93 1.08 1.12 |A 4 BPA1 | | -.72 | .68 1.58 1.42 |a 18 BEE8 |
| 1 | .83 | -.93 .71 .73 |B 15 BPA4 | | -.71 | .28 1.07 1.01 |b 8 BEE5 |
| 1 | .78 | -.93 .83 .86 |C 16 BPA5 | | -.66 | -.25 .66 .67 |c 1 BEE1 |
| 1 | .77 | -.93 .64 .70 |D 7 BPA2 | | -.66 | .62 1.26 1.13 |d 12 BEE6 |
| 1 | .75 | -.98 .81 .83 |E 17 BPA6 | | -.61 | -.02 .87 .85 |e 2 BEE2 |
| 1 | .73 | -.96 .95 .98 |F 9 BPA3 | | -.49 | .54 1.19 1.19 |f 11 BDP3 |
| 1 | .70 | -.87 .91 .98 |G 19 BPA7 | | -.49 | -.25 .84 .88 |g 3 BEE3 |
| | | | | | -.44 | .73 .92 .85 |h 6 BEE4 |
| | | | | | -.38 | .07 1.12 1.20 |i 13 BEE7 |
| | | | | | -.29 | .57 1.43 1.31 |j 10 BDP2 |
| | | | | | -.24 | 1.23 1.69 1.92 |J 20 BDP5 |
| | | | | | -.23 | 1.87 1.08 1.12 |I 14 BDP4 |
| | | | | | -.22 | .44 .99 .99 |H 5 BDP1 |
------------------------------------------------- ------------------------------------------

Mike.Linacre: Thank you for your question, Sander.

Please look at the Winsteps Diagnosis menu, A. Polarity.
If all the correlations in the first Table are positive, then all the items are in the same direction. If not, please use NEWSCORE= or IVALUE= to reverse the scoring.
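Outside Winsteps, a quick raw-score check of item direction might look like the following Python sketch. It uses item-rest correlations as a stand-in for the point-measure correlations in the Polarity table (an assumption, since the two are related but not identical), and the function names and data are hypothetical.

```python
import math

def reverse_score(value, low=1, high=7):
    """Reverse a rating on a low..high scale, e.g. 1<->7, 2<->6 on a 7-point scale."""
    return low + high - value

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_rest_correlations(data):
    """Correlate each item with the total score on the remaining items."""
    n_items = len(data[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in data]
        rest = [sum(row) - row[i] for row in data]
        result.append(pearson(item, rest))
    return result

# Illustrative data: four persons, four items; the last item is worded in the opposite direction.
raw = [[1, 2, 1, 7], [3, 3, 2, 5], [5, 4, 5, 3], [7, 6, 6, 1]]
before = item_rest_correlations(raw)
fixed = [row[:3] + [reverse_score(row[3])] for row in raw]
after = item_rest_correlations(fixed)
```

A negative value in `before` points to an item that still needs reversing; once every correlation is positive, remaining contrast between item clusters reflects dimensionality rather than scoring direction.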

The Contrast Table confirms that there are different dimensions within the items.

Our biggest concern is the small variance explained by the person measures, only 7%. This suggests that something has gone wrong.

439. Developing a reporting scale

kjd February 16th, 2012, 9:42am: Hi, I am reasonably familiar with Rasch modelling techniques, and as such have been asked to contribute to a project being run by some colleagues who are working on developing a test of English language knowledge involving several sections: grammar, vocab, reading etc.

They have asked me to recommend what they have called a reporting scale, by this I understand that I should somehow use the information I have about relative item difficulties etc. in order to create a mark scheme/reporting system that would reflect the functioning of the different items. They are also expecting that it could be used to reflect different levels of proficiency in a stepped scale. They are not intending on conducting a Rasch analysis to gain ability estimates for every cohort of students.

This is not something I have any experience of, and I am struggling to find any relevant references which may help me; it seems that common papers on the subject address the issue of recording the raw scores, or translating ability estimates derived from the Rasch model into a more digestible scale for people not familiar with the technique. Neither of these is what I am hoping to be able to do.

The first question I have, therefore, is whether it makes sense to use Rasch difficulty estimates from a sample of students (circa 200) to develop a marking scheme to be applied in other situations? And, if so, do you know of any papers that discuss this? Or have any advice on how to approach this?

Many thanks in advance.

Mike.Linacre: Yes, kjd, communicating our findings in a useful way is one of our major challenges (often the major challenge).

A sample of 200 relevant students should certainly be enough to construct a meaningful picture (map) of the latent variable.

Please look at www.rasch.org/rmt/rmt84q.htm - WRAT

http://www.merga.net.au/documents/MERJ_18_2_Wu.pdf - Figure 1

This is also mentioned in www.winsteps.com/winman/keyform.htm

kjd: Hi, thanks for the prompt reply and interesting references!

One thing I have not quite gleaned is whether -- following the initial analysis -- it makes sense to allocate more marks to more difficult items (i.e. can I devise a mark scheme where the raw scores are weighted according to difficulty?)

I am thinking here of cases where students undertake the test subsequently and teachers who don't have the resources to carry out Rasch analysis will be marking the tests.

Mike.Linacre: Kjd, "more marks for more difficult items" is not a Rasch procedure.

If subsequent students will take the whole test, then please give the teachers your score-to-measure table (Winsteps Table 20), then they will not need their own Rasch analysis.

If students will respond to selected items, then give the teachers Keyforms so that they can measure the students themselves. The pioneer in this is Richard Woodcock's original KeyMath diagnostic profile produced for American Guidance Service.
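To show what a score-to-measure table contains, here is a minimal Python sketch for dichotomous items whose difficulties are already calibrated. It solves "expected score = raw score" for each non-extreme raw score by Newton-Raphson. The function name and the example difficulties are hypothetical, and extreme scores (zero and perfect), which need a special adjustment, are simply omitted here.

```python
import math

def score_to_measure(difficulties, tol=1e-6, max_iter=100):
    """Map each non-extreme raw score to (measure, standard error) in logits."""
    n = len(difficulties)
    mean_difficulty = sum(difficulties) / n
    table = {}
    for raw in range(1, n):
        # Starting value: log-odds of the raw score, centred on the items.
        theta = mean_difficulty + math.log(raw / (n - raw))
        for _ in range(max_iter):
            probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
            information = sum(p * (1.0 - p) for p in probs)
            step = (raw - sum(probs)) / information
            step = max(-1.0, min(1.0, step))  # damp large steps
            theta += step
            if abs(step) < tol:
                break
        # Standard error from the test information at the final estimate.
        probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        information = sum(p * (1.0 - p) for p in probs)
        table[raw] = (theta, 1.0 / math.sqrt(information))
    return table

# Illustrative difficulties only.
lookup = score_to_measure([-1.0, 0.0, 1.0])
```

Teachers would only need the finished table: look up the student's raw score and read off the measure, with no estimation software involved.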

kjd: OK, thanks, that's pretty much what I thought, which is why I was worried about what I am being asked to do -- I will find out more about keyforms and hopefully they will help with a constructive approach to this. Thanks again for your help :)

440. Sample size?

weblukester January 8th, 2012, 12:16am: I am writing a dissertation about language testing, and would like to explore the reliability of raters and the rating scale regarding an academic speaking test. The trouble is, I only have a dvd of 8 performances for raters to mark... This would mean that any analysis would have to be restricted to 8 examinees, 10 categories (the scoring scale) and, up to 8 raters.

I would like to use facets but don't know if the results will be useful so I just want to know if it is worth a go. I'm very new to Rasch/FACETS!

Mike.Linacre: Weblukester: certainly worth a go - and you can do your analysis with the free student/evaluation version of Facets called Minifac: www.winsteps.com/minifac.htm

Of course, your findings will not be definitive, but they will certainly be indicative - is it likely to work well with a large sample? That is probably good enough for a dissertation :-)

weblukester: Thanks Mike, I have downloaded Minifac, and have read through the user manual. My brain hurts!

I am trying to make a dummy run using Excel (with person/rater/item facets), but am having trouble with the rating scale - there are 10 categories: 8 are scored 0-10, but 2 are scored 0-5. The idea is that the scores total 100, which can be reported as a percentage to students.

I can't work out how to enter the 0-5 data into Excel, or the model= to use.

Any help would be much appreciated!

Mike.Linacre: Here's how to do it, Weblukester.
Let's assume that each row in Excel has:
1 person number, 1 rater number, "1-10a" as the item numbers, then 10 item scores
Items 1-8 are scored 0-10, items 9-10, are scored 0-5
so one Excel row looks like
6 5 1-10a 7 8 4 5 6 9 10 8 3 5
and the Facets models look like:
?, ?, 1-8, R10
?, ?, 9-10, R5


weblukester: Yes, thanks Mike, now I have the fun of trying to interpret the analysis!

Wish me luck!

weblukester: I want to study rater effects, but I can't work out how to create z scores on FACETS. Can anybody help?

Mike.Linacre: Weblukester, Facets gives you the size of each rater effect = the rater measure, also the standard error of each rater effect = the precision of the rater measure.

Please define the z-score that you want.

weblukester: I don't understand... I want to replicate a study where a z score of +2 or -2 indicated bias. Is this what you mean?

Mike.Linacre: OK, weblukester, It sounds like that study is doing a significance test on the bias size. This would be the t-test in Facets Table 13.
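As a hedged illustration of that significance test, the Python sketch below divides each bias size by its standard error and flags values at or beyond ±2. Treating the resulting t as an approximate z-score is an assumption (reasonable only with many observations per bias term), and the labels and numbers are made up for the example.

```python
def flag_bias(bias_terms, critical=2.0):
    """bias_terms: list of (label, bias size in logits, standard error).

    Returns the terms whose t = size / S.E. is at least `critical` in absolute value.
    """
    flagged = []
    for label, size, standard_error in bias_terms:
        t = size / standard_error
        if abs(t) >= critical:
            flagged.append((label, round(t, 2)))
    return flagged

# Hypothetical rater-by-item bias terms.
terms = [("Rater 1 x Item 8", 0.9, 0.3),
         ("Rater 2 x Item 8", 0.2, 0.3),
         ("Rater 3 x Item 9", -0.7, 0.25)]
flagged = flag_bias(terms)
```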

weblukester: Hello, Mike, I think I have to forget about bias for the moment, because something seems to be amiss with the basic analysis.

Remember my 2-tier rating scale? (see below)

?, ?, 1-9, R10
?, ?, 10-11, R5

Well, test takers are given a score which includes 30% group score - for 8,9,10,11 - this means that each group of 4 students has the same score for these 4 categories.

Is there a way for FACETS to account for this?

Mike.Linacre: Weblukester, please explain what is wanted more precisely.
For instance:
The 4 items 8,9,10,11 are performed by a group of 4 students. A total score is obtained, e.g., 80.
Then each student is given 25% of the score, and each item is given 25% of that, so each of items 8,9,10,11 for each student is given a score of (80/4)/4 = 5

weblukester: Mike, the test is a group presentation - 4 students per group. 7 categories are individual marks (1-10), 4 categories are group marks (all students in that group get the same mark), but 2 of the group categories are scored 1-10 and the other 2 are scored 1-5. The total mark on each student's scoring sheet adds up to 100, but 30 marks are shared with their group members.

I think this is throwing my bias analysis out, because when I try to do a rater/item bias analysis, infit mean-squares and z scores (t in Table 8) go wild, but only for the group scores.

Is there a way to limit this but still get reliable information?

I have sent you an email with the scoring sheet attached.

Mike.Linacre: Weblukester, this suggests that, for the purposes of validating fit, two separate analyses are done.
1) Items 1-7 by individual
2) Items 8-11 by group

Then the datasets are combined for the final measures of the individuals
models =
?, ?, 1-7, R10
?, ?, 8-9, R10
?, ?, 10-11, R5

weblukester: Hi Mike, does this mean that if I want to study rater effects, I would have to analyse the individual and group scores separately?

Or I do 2 separate analyses for individuals and groups, then combine the two somehow? Does this involve anchoring?

Mike.Linacre: Weblukester, your data are a composite of two different rating situations. So we need to specify how rater behavior for a "group" relates to rater behavior for an "individual". The easiest way is to analyze group items separately from individual items. Then rater attributes for the two situations can be cross-plotted to identify differences in rater-behavior.

If we want to measure individuals, and hypothesize that each individual contributes equally to the individual's group, then we can simply analyze all the data together.

weblukester: Hi Mike, I would like to return to my original question about sample size (January 8th).

I've managed to get 20 raters to mark 8 examinees on 11 items.

Earlier, when I had only 8 raters, you said that although not definitive, the data would be indicative that facets is likely to work well with a larger sample.

I was just wondering whether 20 raters had changed things?

I have read that a sample of 30 is the minimum needed to avoid excessive measurement bias/unstable fit statistics, but I don't know whether this refers to examinees, raters, items or all three. I want to examine rater effects (central tendency, bias, halo, leniency/severity, and inconsistency) - does this mean 30 raters is the minimum to study this?

Mike.Linacre: Weblukester, the basic principle is that we need 30 observations of anything to have some level of statistical certainty. Of course, we are often satisfied with much less. How many tosses of a coin would convince you that it is fair? Most people would be satisfied after 3 or 4 tosses. For a roulette wheel, it would probably require hundreds of spins to establish fairness. So there are considerations, such as "how much is at stake?" and "how much would it take to convince ourselves?"

442. Automated Table Output

uve February 16th, 2012, 10:51pm: Mike,

Is there a way to tell Winsteps, in the control file, to output certain tables automatically as it runs the analysis? For example, suppose I wanted Tables 10.1 and 2.1. Thanks as always.

Mike.Linacre: Uve:

In your Winsteps control file:


443. Item Calibration - Low and High Performers

NothingFancy February 1st, 2012, 2:12pm: I'm helping a friend at a school district conduct some item selection & calibration on various academic tests they give out. We are both somewhat self-taught on Rasch.

But he brought up that he has started looking at various scenarios, such as breaking up a dataset by separating out the high performing students and low performing students. The item difficulty estimates change a fair amount between the two groups. Besides the logit estimates, even the order of items (in terms of difficulty) shifts a fair amount.

What I read says that item estimation is sample independent, so I think I need some clarification.

For the All group, the person separation and item separation indices are much better than either of the smaller sub-datasets.

I've attached the WINSTEPS control file with all of the students. If you separate this by selecting only the high performing students (1 SD above the mean) and low performing students (1 SD below the mean), the item difficulty calibrations are different.

My more practical concern is that in the future we may be calibrating some items that only honors students take initially. Will that affect how we select certain items, especially if these items might end up on a test taken by students with less academic ability (but still covering the same subject material)?

Mike.Linacre: OK, NothingFancy, thank you for the post.

The easy one first:
"For the All group, the person separation and item separation indices are much better than either of the smaller sub-datasets."

Comment: Yes, that is exactly what is expected. The person separation depends on the person ability variance (smaller in the two sub-samples than in the "all" sample) and the item separation depends on the precision of the item difficulty estimates. The precision (standard errors) depends on the person sample size. This would be larger for the "all" sample, so the item difficulty S.E.s would be smaller and item separation larger.
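The dependence of separation on both spread and precision can be seen in a small Python sketch of the usual computation: "true" variance is the observed variance of the measures minus the mean-square measurement error, separation is true S.D. divided by root-mean-square error, and reliability is true variance divided by observed variance. The function name and example values are hypothetical, and the population (n-denominator) variance is an assumption.

```python
import math

def separation_and_reliability(measures, standard_errors):
    """Return (separation, reliability) for a set of measures and their S.E.s."""
    n = len(measures)
    mean = sum(measures) / n
    observed_variance = sum((m - mean) ** 2 for m in measures) / n
    mean_square_error = sum(se ** 2 for se in standard_errors) / n
    # Variance not attributable to measurement error, floored at zero.
    true_variance = max(observed_variance - mean_square_error, 0.0)
    separation = math.sqrt(true_variance) / math.sqrt(mean_square_error)
    reliability = true_variance / observed_variance
    return separation, reliability

# Same precision, wide spread versus narrow spread (illustrative values).
wide = separation_and_reliability([-2, -1, 0, 1, 2], [0.5] * 5)
narrow = separation_and_reliability([-1, -0.5, 0, 0.5, 1], [0.5] * 5)
```

With identical standard errors, the narrower sample yields the lower separation, which mirrors what happens when the dataset is split into high and low performers.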

The high-low split is more important. Is there misbehavior among the high performers (carelessness) or low performers (guessing)?

In an "All" analysis, look at Table 5. We can see a peak of misfit (outfit mean-squares) among the low performers. We can also see a cluster of unexpected "1" answers at the bottom right of Table 10.6

We can trim the low-probability-of-success responses with CUTLO=-1. This produces a much more homogeneous dataset. Use these item difficulties (IFILE=if.txt) as the anchor values in a complete "all" analysis: IAFILE=if.txt
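Conceptually, that trimming treats a response as missing when the person's ability is more than 1 logit below the item's difficulty. A minimal Python sketch of the idea follows; it assumes preliminary ability and difficulty estimates are already available, and the function name and the exact handling of the boundary are illustrative assumptions rather than a description of Winsteps internals.

```python
def trim_low_probability(responses, abilities, difficulties, cutlo=-1.0):
    """Set responses to None (missing) where ability - difficulty < cutlo.

    responses: one row per person, one column per item.
    """
    trimmed = []
    for row, ability in zip(responses, abilities):
        trimmed.append([None if ability - difficulty < cutlo else x
                        for x, difficulty in zip(row, difficulties)])
    return trimmed

# Illustrative values: one low performer, three items of increasing difficulty.
cleaned = trim_low_probability([[1, 1, 0]], [0.0], [-1.0, 0.5, 1.5])
```

Responses most exposed to lucky guessing (hard items met by low performers) then no longer influence the item difficulty estimates.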

NothingFancy: So is it fair to say, even with Rasch, you really do have to have a wide variance in ability for accurate item analysis?

Mike.Linacre: NothingFancy, yes. If we need to know how an item operates for high performers, then it must be administered to some high performers, and similarly for low performers. In general, we assume that an item behaves the same way for all ability levels, but this is an assumption. As you say, we need a wide range of ability for accurate item analysis.

There are situations, such as adaptive-testing, where we aim an item at a narrow ability range. If so, we need only investigate its performance for that range.

444. Beginner question on combining item SEs

ct800 February 10th, 2012, 1:33pm: Hi
As a newbie to Rasch, I'd appreciate some help - I (will) have data from a number of linked tests (from MCQs, so binary results) which I can use in a one parameter Rasch model to estimate item difficulties. What I would then like to do is find the mean difficulty of each of the individual tests - which so far I have done by finding the mean item difficulty across the items in each individual test. I would then like to find a 95% confidence band (or a similar measure of uncertainty) for the mean item difficulty, which involves combining the SEs for the items included in an individual test.
I am sure there is a simple way of doing this but unfortunately my skills are not up to the job! Many thanks.

Mike.Linacre: Thank you for your email ct800.

Assuming that your sample sizes are reasonably large, then
1. random measurement error of each item difficulty estimate is already included in the reported item difficulties
2. the standard error of the mean follows the usual formula:
S.E. (mean) = S.D.(item difficulties)/square-root(n): http://en.wikipedia.org/wiki/Standard_error_%28statistics%29#Standard_error_of_the_mean

3. 95% confidence intervals are 1.96*S.E.(mean) away from the mean.
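Steps 2 and 3 can be sketched in a few lines of Python. The function name and the example difficulties are hypothetical, and the sample (n-1) standard deviation is an assumption about which S.D. is intended.

```python
import math
import statistics

def mean_difficulty_interval(difficulties, z=1.96):
    """Return (mean, S.E. of the mean, (lower, upper)) for a test's item difficulties."""
    n = len(difficulties)
    mean = statistics.mean(difficulties)
    sd = statistics.stdev(difficulties)  # sample S.D. (n - 1 denominator)
    se_mean = sd / math.sqrt(n)
    return mean, se_mean, (mean - z * se_mean, mean + z * se_mean)

# Illustrative item difficulties (logits) for one test.
mean, se_mean, interval = mean_difficulty_interval([-1.0, 0.0, 1.0])
```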

ct800: Hi

Now I feel slightly stupid! I'm working with pilot data at the moment, but the 'real' data will include approx. 150 items sat by 150 examinees in each of 30 tests, with a total of 200 'linked' items used in approx 8 tests each. I am assuming this is a sufficient sample size. Your help is much appreciated.

PS What text would you recommend to help me better understand point 1 in your answer?

Mike.Linacre: Ct800, since we always measure imprecisely, i.e., with error, we need to know whether we need to compensate for that error in our computations. In this situation we do not need to compensate for error because the error is already included in the observed values of the set of estimates.

If we are looking at individual estimates, i.e., point-estimates, then we would need to compensate for error.

ct800: This helps my work no end! Many thanks for your help.

445. rating condition question

wantednow February 7th, 2012, 11:25am: I had the same group of raters rate two sets of essays (50 in set A and 10 in set B) twice using two rating scales (holistic and analytic). While set A was rated silently, set B was rated while thinking aloud. I want to examine the effect that thinking aloud might have on rater severity and self-consistency, as well as their interactions with rating criteria. How should I proceed? Should I run separate FACETS analysis of the four sets of data? Or should I combine them across conditions or scales or both conditions and scales? Many thanks!


Mike.Linacre: Thank you for your questions, Lee.

Start by analyzing all the data together. Use elements of dummy facets to indicate the different situations. You can then look at the fit of the dummy elements to see if there is any overall change in consistency. Interactions between dummy elements and raters or essays will tell you about changes in severity.

If this doesn't work, then it is easy to do four separate analyses using the one combined dataset. Merely comment out the dummy elements for the situations you want to exclude from each separate analysis.

wantednow: Thank you very much for your input! I tried combining all the data, and the session (rating condition) measurement report shows that there is no difference between the two conditions. There are, however two bias interactions between raters and sessions out of a total of 18. No bias interactions were reported for examinee-session interactions.

446. Judging plans

Chris February 3rd, 2012, 4:19am: Mister Linacre,

I would like to use the Rasch model to analyze rater effects with the following instruments and judging plan. Could you please tell me if the linkage between elements will be sufficient?

The plan will take place over 16-20 days of tests and will involve about 12 raters. n candidates are going to take an oral expression exam. Each candidate will be evaluated by a pair of raters. Every candidate will perform two tasks based on two random prompts/topics. The prompts/topics will basically change from one day of testing to another. Normally, in a given day, the same pair of raters will rate about 4 or 5 candidates. The rating will be done with a standardized rating scale which, previous experiments have shown, lends itself to a high percentage of inter-rater agreement.

Over the 20 days of testing, pairs of raters should mostly remain the same. Considering that plan, will the linkage provide enough data to use the Rasch model to study rater effects?

Thank you,

Mike.Linacre: Chris, this judging plan sounds good, provided that:
1. The raters change their partners occasionally, so that they form a linked network of raters.
2. It sounds like each day has a different pair of tasks. This will require us to say that each pair of tasks has the same average difficulty. It would be better if the tasks also changed their partners.
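Whether a planned set of rater pairings forms one linked network can be checked before any data are collected. The Python sketch below treats raters as nodes and each pairing as an edge, then lists the connected subsets; more than one subset means some raters cannot be compared with others. The function name and pairings are hypothetical, and the same check could be applied to task pairings.

```python
from collections import defaultdict

def rater_subsets(pairs):
    """Group raters into connected subsets given a list of (rater, rater) pairings."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen = set()
    subsets = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search collects everyone reachable from this rater.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        subsets.append(sorted(component))
    return subsets

# Fixed partners give disjoint subsets; one partner swap links them.
fixed = rater_subsets([(1, 2), (3, 4), (5, 6)])
linked = rater_subsets([(1, 2), (3, 4), (5, 6), (2, 3), (4, 5)])
```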

Chris: Mister Linacre,

Concerning your second observation, each testing day has, indeed, a different pair of tasks.

However, if G-Theory has already been used to establish the fact that tasks/prompts all share, pretty much, the same level of difficulty, the problem disappears, yes?

As for your first observation, we will try to form different pairs of raters for different testing days.

Thank you very much for taking the time to answer this. It is as if you are the scientific advisor extraordinaire of the forum's members!

Mike.Linacre: Thank you for your comment, Chris:
"if G-Theory has already been used to establish the fact that tasks/prompts all share, pretty much, the same level of difficulty, the problem disappears, yes?"

Can G-Theory really establish that tasks share the same degree of difficulty? The reason that many early adopters switched from G-Theory to Rasch was because of G-Theory's failure in this respect. Probably we can say, "G-Theory findings indicate that tasks are similar in difficulty, hopefully not different enough to have any substantive impact on decisions based on the candidate measures."

Suggestion: look at the G-Theory variances for the Task facet. Imagine two tasks at the high end of the difficulty range. Also imagine two tasks at the low end. What would be the differences in the average ratings on the pairs of tasks? What would be the differences in the candidate measures on the pairs of tasks? Do these differences matter?

Chris: I think that it is reasonably safe to say that if the interaction between task (prompt) and candidates explains a small percentage of the total variance and that the task facet explains a very small percentage of the error variance, then we can say that G-Theory has helped us to determine that tasks share the same difficulty.

But I could be wrong. As for your questions, I am going to think.

Thank you,

Mike.Linacre: Chris: "Small percentage of the total variance" is somewhat meaningless. In many examinations, the difference between "passing" and "failing" is also a small percentage of the variance!

Also, surely it is not the interaction term but the "task variance" term itself that is important! We need to compare the putative "Task S.D." with the "Candidate S.D." (assuming these will be correct for the actual examination). We can then estimate what proportion of the Candidates are likely to be misclassified, relative to a fixed pass-fail point at the mean-candidate-ability, if Task S.D. is ignored.
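One simplified way to put a number on that misclassification, offered only as a sketch: assume candidate abilities and task effects are independent, normally distributed, and expressed in the same units, with the pass-fail point at the mean candidate ability and no other source of error. Then a candidate is misclassified when the ability and the ability-plus-task-effect fall on opposite sides of the cut, which for two correlated normal variables has probability arccos(rho)/pi. The function name is hypothetical.

```python
import math

def misclassified_proportion(candidate_sd, task_sd):
    """Proportion of candidates on the wrong side of a cut at the mean ability.

    Assumes independent normal abilities and task effects (illustrative model only).
    """
    # Correlation between true ability and ability contaminated by the task effect.
    rho = candidate_sd / math.sqrt(candidate_sd ** 2 + task_sd ** 2)
    rho = min(1.0, rho)  # guard against rounding above 1
    return math.acos(rho) / math.pi

no_task_effect = misclassified_proportion(1.0, 0.0)
equal_sds = misclassified_proportion(1.0, 1.0)
small_effect = misclassified_proportion(1.0, 0.2)
```

When each candidate performs two tasks, the relevant task S.D. would be that of the pair average, which is smaller than the S.D. of single tasks.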

Starting with Prof. Edgeworth's paper in the 1890s, "The Element of Chance in Competitive Examinations", the goal has been to eliminate sources of variance that can seriously impact pass-fail decisions. In your design, "Task difficulty" could be one of those.

Chris: Disclaimer : I am in no way, shape or form an expert on G-Theory.

What I wrote earlier is what I have read in at least 3 distinct articles. All those articles basically affirm that a small task*candidate percentage of total variance translates as "tasks have a similar level of difficulty". And it makes sense intuitively, no? My reading and understanding of these articles could, however, account for a large percentage of the total error in what I wrote!

We know, of course, that the difficulty levels are not identical and I agree that, sometimes, that variance will result in failing a candidate who would otherwise have passed the exam (or the opposite). But I am not investigating the validity of the exam (which does not have a pass/fail mark: it is an exam that is designed to assess the level of proficiency). I am simply using it to study rater effects. So my question becomes: is the G-Theory result good enough to let me assume that task difficulties are close enough that they will not, if all set to 0 on the logit scale, invalidate my experimental design?

Mike.Linacre: Chris: we often have to make compromises in judging plans (as in other parts of life), but it is wise to know what the effect of the compromise could be.

For our purposes, we are not really interested in the percent of variance explained. We are really interested in the S.D. of the task facet in raw score terms (assuming that the original G-Study study was well designed). This will tell us the average size of the bias, in raw scores, that is introduced into the candidate ability measures by treating all the tasks as equally difficult. We can compare this with the candidate raw score S.D. to see the effect of the Task variance.

For instance, imagine a situation, such as a test of nursing proficiency at the end of nurse training. The trainee nurses are all at almost the same level of proficiency (as intended). They are each asked to perform a different nursing task and are rated on that one task. Result? The nurse who happens to be assigned the easiest task will have the highest ratings, and so will be awarded the prize as "best new nurse". (A situation similar to this actually happened!)

Mike.Linacre: A further thought, Chris. Our discussion has highlighted the major difference between G-Theory and Rasch measurement.

G-Theory focuses on partitioning raw score variance. The aim is to minimize unwanted variance relative to wanted variance = greater generalizability/reliability

Rasch measurement focuses on constructing additive measures on a latent variable. The aim is to maximize the accuracy, precision and meaning of the estimated measures = greater validity

447. Which Rasch model?

bluesy January 31st, 2012, 5:14pm: Hello,

So far I have been a WINSTEPS user and I have a question about which Rasch model and software should I use in the following situation.

For example, imagine I wanted to measure the arithmetic (addition) ability of children. I researched a theory on addition and wanted to put it to the test. For instance, it is a developmental stage theory of addition.

Typically, for multiple-choice items, only one response is requested of the respondee. So, in the case of this example multiple-choice item, x=5+5...x= a)6 b)10 c)9 d)7, the respondee would answer only one response from these choices. Using WINSTEPS I could run either dichotomous or polytomous (RSM or PCM) models.

However, if I wanted to test this addition stage theory and extract maximum information from the 4 choices, wouldn't it be better to ask the respondees to rank the 4 choices (in the form of 4 responses) in regard to their level of correctness?, i.e. in the above example, a)0 b)3 c)2 d)1 would be correct.

If this the case, which Rasch model would you recommend that I try?
And/or, should I be trying this analysis in FACETS instead of WINSTEPS?

Mike.Linacre: You are on a voyage of discovery, bluesy :-)

If the data are dichotomous (right/wrong, yes/no), then it doesn't matter what Rasch model you specify. The data will be analyzed with a Rasch dichotomous model.

If you have more than one scored category of answer, e.g., (right=2, partially right=1, wrong=0) or (best=1, good=2, bad=3, worst=4), then a Rasch polytomous model would be used. The choice is between the "rating scale model", in which all items are conceptualized to share the same rating-scale structure, or the "partial credit model" in which items are conceptualized to have different rating-scale structures.

More response categories for each item produce more information, but also more opportunity for chaos! If you apply PCM, ISGROUPS=0, then Winsteps will report in Tables 3.2 and 14.3 the functioning of the rating scale for each item.
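The two polytomous models described above differ only in whether the threshold structure is shared across items. A minimal Python sketch, assuming the usual Andrich parameterization (person ability B, item difficulty D, thresholds F_k, all in logits). The function name and the threshold values are illustrative and are not Winsteps output:

```python
import math

def category_probabilities(ability, difficulty, thresholds):
    """Probability of each category 0..m under a Rasch polytomous model.

    thresholds holds F_1..F_m. Under the rating scale model the same list is
    used for every item; under the partial credit model each item has its own.
    """
    # Log-numerator for category x is the sum over k <= x of (B - D - F_k);
    # category 0 has an empty sum, i.e. 0.
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (ability - difficulty - f))
    top = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

shared_thresholds = [-1.0, 1.0]  # hypothetical 3-category structure
probs = category_probabilities(0.5, 0.0, shared_thresholds)
print([round(p, 3) for p in probs])
```

With a single threshold at 0 the function reduces to the dichotomous Rasch model, which is why the choice of model does not matter for right/wrong data.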

Definitely use Winsteps in preference to Facets unless the data cannot be expressed as a simple rectangle with rows=persons and columns=items.

bluesy: Hi,

Thank you very much for your quick reply. I apologize in advance if I am "barking up the wrong tree." I will try to clarify a bit more about my current situation.

I have already run the pilot study with the respondees answering in a typical way, that is, only one response to each 4-option multiple choice item. Therefore, using a simple rectangular matrix with a 1-line column and a 1-line row. I have already run that data through WINSTEPS using dichotomous and polytomous models.

In contrast, I am thinking of extending the scope in the main study. Instead of asking the respondees to respond to only one choice of the multiple choice items, they would be requested to rank/rate all four choices. So, this matrix would be more complex. For instance, in the case I gave above in my first post, Person A responded to Item 1 as 0321. So, even though I have not used it yet, this more complex matrix might look like a FACETS file. That is, a matrix with a 4-line column for each item and a 1-line row for each person.

So, my question is how do I use Rasch model analysis using 4-option multiple choice items in an atypical way (that is, with 4 responses required)? As opposed to the typical way (with only 1 response required).

Which Rasch model (is it a rank model?) and which software (FACETS?) would you suggest?

Mike.Linacre: Thank you for the further explanation, Bluesy.

It would probably be easiest to analyze these data using Winsteps.
Each row is a respondee.
Each column is one option of an item
Each item label contains the item number and the option code.

Then use the rating-scale model. There is little advantage in using models that include the inter-rank dependency: Journal of Applied Measurement 7:1 129-136.

To report item summary statistics, specify ISUBTOT=(item number code in item label) and output Table 27.
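The layout described above (one row per respondee, one column per option, item number and option code in each label) can be sketched in Python. The input dictionary, the option letters and the label format "01a" are hypothetical choices for illustration; only the row/column arrangement comes from the reply:

```python
# Hypothetical raw data: for each person, one 4-digit rank string per item,
# e.g. "0321" means option a=0, b=3, c=2, d=1.
responses = {
    "PersonA": ["0321", "3012"],
    "PersonB": ["0312", "3021"],
}
option_codes = "abcd"

def build_labels(n_items, codes=option_codes):
    # Each label starts with a 2-digit item number followed by the option
    # code, so item-level subtotals can key on the first two characters.
    return [f"{item:02d}{code}" for item in range(1, n_items + 1) for code in codes]

def build_rows(data):
    # One data row per person: the rank digits for every option of every item.
    return {person: "".join(ranks) for person, ranks in data.items()}

labels = build_labels(2)
rows = build_rows(responses)
print(labels)
for person, row in rows.items():
    print(person, row)
```

Each person row then has as many response columns as there are labels, which is the simple rectangle Winsteps expects.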

448. Bulk Variables

Gene_Muller January 29th, 2012, 9:53pm: Hi Mike,

I took your on-line course about 8 months ago and found it to be very informative. I have since applied the techniques discussed to a set of worker observation data, in which each behavior is rated by supervisors along a scale ranging from "good performance" to "neutral" to "bad performance". Given the nature of the rating scale I'm using, it does not seem appropriate to me to reverse the items as one would do for negative statements on an attitude scale. But in the course discussion forum, you suggested to me another possible way of analyzing such data that involved separating items describing "good" and "bad" behavior, and then combining them into a bulk variable (like height plus weight). I've been searching your site and other references to see if I could find an example of such an analysis, but I haven't been able to locate anything. Would you be able to direct me to any sources on this? Any advice you can offer would be greatly appreciated.

Yours truly,
Gene Muller

PS: Thanks again for a great course. I've recommended it to a number of colleagues.

Mike.Linacre: Glad you have found the Rasch Course helpful, Gene. It is now being taught by Prof. Everett Smith who is even better qualified than I am :-)

An approach would be to analyze all the "good" behavior items and produce a "good" measure for everyone. Similarly for the "bad" behavior items, producing a "bad" measure for everyone. It is then a policy decision as to how the "good" measure is to be combined with the "bad" measure.

A similar situation arises in the evaluation of the performance of medical staff. We want them to do the "good" things. We don't want them to do "bad" things. The rule is "First, do no harm!", so a possible combination could be:
Overall measure = "good" measure - 2 * "bad" measure

Gene_Muller: OK! Thanks very much, Mike.

dachengruoque: Thanks a lot, Dr Linacre! Is a formula like 'Overall measure = "good" measure - 2 * "bad" measure' a rule of thumb accepted across the Rasch community, or can it be found in a mainstream textbook on Rasch? (Sorry for my not so good questioning.)

Mike.Linacre: Dachengruoque, there is no rule-of-thumb about the relationship between "good" and "bad". Policy-makers must decide for their own areas how to combine "good" and "bad" into one number. For instance, in the search for the "top quark", one good result was more important than millions of bad results.

dachengruoque: Thanks a lot, Dr Linacre!

449. scale transformation & anchoring simultaneously

uve January 27th, 2012, 6:30pm: Mike,

When I used UMEAN and USCALE, I got very close to the transformation I needed. I realized the small difference had to do with the fact that I forgot to anchor to last year's measures. However, when I attempt to do common item anchoring simultaneously with UMEAN and USCALE, the result is way off. Is it possible to do this? I got it to work by running the IAFILE command first and then doing separate commands using Specification, but I thought I could do it all at once.

Mike.Linacre: Uve, UMEAN= and IAFILE= do not cooperate well together!

Also, please use UASCALE= to specify the scaling of the anchor values in the IAFILE=

Suggestion, proceed as follows:
1. In the first anchored analysis, IAFILE=, UASCALE=
2. Compute the values you want to apply for the rescaled output.
3. In the second anchored analysis, IAFILE=, UASCALE=, and also, USCALE=, UAMOVE=

See Example 2 at https://www.winsteps.com/winman/index.htm?uamove.htm

uve: Mike,

In our first year, I developed the measures then transformed the logit values to a specified transformation scale similar to the one used by our state department of education. For example, raw score 37 = logit 1.14 = target transformation score 350. Raw score 26 = logit .09 = target transformation score 300. Applied to the logit value of 1.14 for raw score 37: ((350-300)/(1.14-0.09)*1.14)+300-((350-300)/(1.14-0.09)*0.09)= 350 rounded. Identical transformation is accomplished in Winsteps by setting UMEAN = 295.71 and USCALE = 47.62.

In the second year, I used the common item non-equivalent form method to anchor the new form logits to old form. Once these new measures were created I then applied the same transformation equation from the prior year in Excel. So after anchoring let's say raw score 37 has now increased to 1.20 logits =((350-300)/(1.14-0.09)*1.20)+300-((350-300)/(1.14-0.09)*0.09) = 353 rounded. Essentially 37 (350) last year is now 353 this year. To duplicate this same output in Winsteps, I first had to run the IAFILE then run UMEAN and USCALE in Specification box, 295.95 and 45.05 respectively.

I guess I'm not sure why the anchor items need to be treated differently using UASCALE and UAMOVE.
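The first-year arithmetic in this post can be checked with a short Python sketch. It assumes only the two (logit, scaled score) pairs quoted above and treats UMEAN as the scaled value at logit 0 and USCALE as scaled units per logit; the function name is illustrative:

```python
def linear_rescale(logit_low, score_low, logit_high, score_high):
    """Return (intercept, slope) of the line through two (logit, score) points."""
    slope = (score_high - score_low) / (logit_high - logit_low)
    intercept = score_low - slope * logit_low
    return intercept, slope

# Points from the post: logit 0.09 -> 300 and logit 1.14 -> 350
umean, uscale = linear_rescale(0.09, 300, 1.14, 350)
print(round(umean, 2), round(uscale, 2))   # 295.71 47.62

# First-year raw score 37 (1.14 logits) and its anchored second-year value (1.20 logits)
print(round(umean + uscale * 1.14))        # 350
print(round(umean + uscale * 1.20))        # 353
```

The same intercept and slope reproduce both the 350 and the 353 quoted in the post, which is consistent with reusing the first-year scaling in the anchored run.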

Mike.Linacre: Uve:
In the anchored run, please set:
IAFILE= (anchor values from first year)
USCALE= (USCALE= value from first year)
UASCALE= (USCALE= value from first year)
Please omit UMEAN= since the scale origin is set by the IAFILE= values
Please omit UAMOVE= since there is no shift in scale origin relative to the IAFILE= values


uve: Thanks Mike. It appears I still have a long way to go towards understanding how equating and scale transformation work together and how to best use Winsteps to achieve this. So to make sure I understand your response, USCALE value in the 2nd year comes from the first year equation which was 47.62. I also use this for 2nd year UASCALE. Is that correct? The reason I ask is that when I attempted to put in UASCALE Winsteps responds with Not Done, so I'm not sure what I did wrong.

Mike.Linacre: Uve, "not done" usually indicates that a command that should be implemented before estimation (usually by inclusion in the control file) is being done after estimation (usually from the Specification dialog box). Since anchor values (IAFILE=) are implemented before estimation, UASCALE= (which tells Winsteps the scaling of the anchor values) must also be implemented before estimation.

450. Reliability

dma1jdl January 27th, 2012, 4:40pm: Hi,
I'm trying to reproduce the Winsteps person reliability measures.

All seems to go well following:

However, if some people have got every item right or wrong, I don't seem able to reproduce the figures. Does anyone know of any more information on how Winsteps proceeds?



Mike.Linacre: John, when computing Reliability, Winsteps uses the same formula whether including or excluding extreme scores:

EV= Error variance = average of (squared (standard error of each measure))
OV = Observed variance = observed variance of the set of measures

Reliability = (OV-EV) / OV
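The formula above can be sketched in Python. The measures and standard errors are made-up numbers, and the use of the population variance for OV is an assumption (the reply does not say whether the population or sample variance is meant):

```python
from statistics import fmean, pvariance

def separation_reliability(measures, standard_errors):
    """Reliability = (OV - EV) / OV for one set of measures."""
    # EV: average of the squared standard errors of the measures
    error_variance = fmean([se ** 2 for se in standard_errors])
    # OV: observed variance of the measures (population variance assumed here)
    observed_variance = pvariance(measures)
    return (observed_variance - error_variance) / observed_variance

# Hypothetical person measures (logits) and their standard errors
measures = [-1.0, 0.0, 1.0, 2.0]
standard_errors = [0.5, 0.5, 0.5, 0.5]
reliability = separation_reliability(measures, standard_errors)
print(round(reliability, 3))  # 0.8
```

To compare runs with and without extreme scores, the same function would simply be given the two different sets of measures and standard errors.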

451. Misfit Too Optimistic?

uve January 27th, 2012, 6:02am: Mike,
I recently finished reading "The Rasch Model and Additive Conjoint Measurement" by George Karabatsos, reprinted in Introduction to Rasch Measurement, 2004. It's a bit overwhelming for a Rasch novice such as I, and I'll likely need to reread it numerous times before it becomes much clearer for me. However, I will say I found its basic premise to be very intriguing: that outfit and infit fit statistics can be overly optimistic in measuring item misfit. It offers the axiomatic-based additive conjoint model as a supposedly superior substitute. One thing I can say for sure is that this would not be a practical method to use given my limited abilities. But it has started me thinking about a possible alternative.

I'm wondering if the discrimination index couldn't be used as an additional measure to make the final determination whether fit is overly optimistic or not. Perhaps a simple fit/discrimination ratio might work. For example, if an item has a relatively decent fit such as 1.26 but its discrimination index value is .73, then 1.26/.73 = 1.726. We would now have an item marked for possible misfit that would otherwise have slipped by. Of course, I have no idea whether such a measure has any validity or not, but I would be most interested in your opinion.

Mike.Linacre: Yes, Uve, we are always hunting for better fit statistics. Many fit statistics have been proposed for Rasch models. Most test the hypothesis "These data are perfect". The challenge is to formulate fit statistics which test the hypothesis "these data are useful". We are rarely in the situation where we want to eliminate an item or person because of misfit. We are often in the situation where we want to construct the best measures that we can out of misfitting data.

So far in my experience (based on both theory and practice), Infit and Outfit Mean-squares are the most constructive. They can be summarized as:

>2.0        Distorts or degrades the measurement system.
1.5 - 2.0   Unproductive for construction of measurement, but not degrading.
0.5 - 1.5   Productive for measurement.
<0.5        Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.

Depending on the situation, this summary is optimistic, pessimistic or correct.
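The summary above can be applied to a list of items with a small Python helper. The handling of values exactly on a boundary (1.5 and 0.5 are treated as "productive", 2.0 as "unproductive") is an assumption, since the ranges as listed share their endpoints:

```python
def classify_mean_square(mnsq):
    """Map an Infit or Outfit mean-square to the summary category above."""
    if mnsq > 2.0:
        return "distorts or degrades the measurement system"
    if mnsq > 1.5:
        return "unproductive for construction of measurement, but not degrading"
    if mnsq >= 0.5:
        return "productive for measurement"
    return "less productive for measurement, but not degrading"

# Hypothetical item mean-squares, including the 1.26 discussed in this thread
for item, mnsq in {"item01": 1.26, "item02": 2.4, "item03": 0.35}.items():
    print(item, mnsq, classify_mean_square(mnsq))
```

As the reply notes, the cut-points are a guide rather than a test, so the labels are best read alongside the substantive content of each item.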

Conventional item-discrimination statistics, such as point-biserial correlations, are highly inversely correlated with Infit Mean-squares. See Figure 2 in https://www.rasch.org/rmt/rmt142a.htm

uve: Mike,

If we rarely want to eliminate an item or person because of misfit, what other options do we have to ensure our measures are the best they can be?

Mike.Linacre: Uve, yes, definitely eliminate items that really are bad, but a well-designed test usually has few, if any of these. The problem is more often idiosyncratic behavior by test-takers. If so, how about a procedure such as https://www.rasch.org/rmt/rmt234g.htm - in which as much as possible is reinstated at the last step when the good items are anchored at their best difficulties.

452. Table 1.2

uve January 23rd, 2012, 6:23am: Mike,

In Table 1.2 what is the criterion for items appearing in the same row?

Mike.Linacre: Uve, in Winsteps Tables 1, 12 and 16, items and persons with approximately the same measure appear in the same row. If there are too many for one row, then the overflow continues in the next row of the Table, but without a "|" row indicator.

uve: Mike,

The reason for my question is that I'm trying to use these tables to determine if we might have too many items targeted at specific levels. If I see three or four items on the same row, it may warrant removing some and replacing them with harder/easier ones for levels not initially targeted well by the test.

So that's why I was asking about the criterion.

Mike.Linacre: Yes, Uve. Ben Wright used to do the same thing :-)

453. Facets and Further courses??

solenelcenit January 20th, 2012, 2:45pm: Dear Mike:

I was hoping to attend the Facets courses this year. However, somebody from statistics.com has told me that the Facets courses (including the further one) are not scheduled, and they don't know if they will be offered again. Do you know anything about that? Where can I learn how to use the Facets software? Is there a web page where you have uploaded the easy-to-understand two-column notes?



Mike.Linacre: Luis, sorry, I am not available to teach those online courses. Have you looked at the Winsteps and Facets Help files? They contain a huge amount of material and more is added with every software update. I will try to add more instructional PDFs to this website.

solenelcenit: Dear Mike:

Thanks for answering me. I'll follow your advice, though I must say that your alumni will miss these highly pedagogical courses, and of course having you as a teacher.


Mike.Linacre: Luis, have acted on your request:

Facets Tutorial PDFs:

455. Unidimensionality & Sample Size

uve January 15th, 2012, 10:02pm: Mike,

Below are two dimensionality analyses for the same test. One was done at a point in time when there were only 176 scores in our data system, then the second was done after all school sites had finished testing, giving us about 762 scores. It seems the dimensionality test is highly sensitive to sample size. I had run a simulation file but can't seem to find the results. If I recall, the 1st contrast eigenvalue was about 2.2 or less. The simulation on the larger file is now 1.6. My question is why is there such a large discrepancy? How do I know if the eigenvalues I see are truly representative and not inflated/deflated due to sample size?

"Detecting and Evaluating the Impact of Multidimensionality Using Item Fit Statistics and Principal Component Analysis of Residuals" E. Smith reprinted in Introduction to Rasch Measurement, 2004 mentions this effect to some degree. In it, different ratios of two different dimensions/components are introduced into a simulated study. Component ratios on a 30-item test were 25:5, 20:10 and 15:15. Common variance between the two components ranged from 0 - 90%. The ability of the PC of residuals to correctly classify items belonging to each component is reduced sharply as the ratio becomes more balanced and the common variance increases. I'm wondering if you have any thoughts about this and if it even applies to my situation below.

INPUT: 176 Person 65 Item REPORTED: 176 Person 65 Item 2 CATS WINSTEPS 3.73

Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                         -- Empirical --     Modeled
Total raw variance in observations     = 82.3  100.0%          100.0%
  Raw variance explained by measures   = 17.3   21.1%           22.4%
    Raw variance explained by persons  = 10.3   12.5%           13.4%
    Raw Variance explained by items    =  7.0    8.5%            9.1%
  Raw unexplained variance (total)     = 65.0   78.9%  100.0%   77.6%
    Unexplned variance in 1st contrast =  4.4    5.3%    6.8%
    Unexplned variance in 2nd contrast =  3.1    3.8%    4.8%
    Unexplned variance in 3rd contrast =  2.6    3.1%    3.9%
    Unexplned variance in 4th contrast =  2.5    3.0%    3.8%
    Unexplned variance in 5th contrast =  2.3    2.8%    3.6%

INPUT: 762 Person 65 Item REPORTED: 762 Person 65 Item 2 CATS WINSTEPS 3.73

Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                         -- Empirical --     Modeled
Total raw variance in observations     = 81.5  100.0%          100.0%
  Raw variance explained by measures   = 16.5   20.2%           20.9%
    Raw variance explained by persons  = 10.7   13.1%           13.6%
    Raw Variance explained by items    =  5.8    7.1%            7.3%
  Raw unexplained variance (total)     = 65.0   79.8%  100.0%   79.1%
    Unexplned variance in 1st contrast =  2.4    2.9%    3.6%
    Unexplned variance in 2nd contrast =  2.2    2.7%    3.4%
    Unexplned variance in 3rd contrast =  1.9    2.3%    2.9%
    Unexplned variance in 4th contrast =  1.7    2.1%    2.6%
    Unexplned variance in 5th contrast =  1.7    2.1%    2.6%

Mike.Linacre: This is an awkward finding, Uve. It reinforces the need for both statistical and substantive reasons for taking action. Otherwise we are controlled by accidents in the data.

One investigation would be to analyze the 176 persons separately from the (762-176) persons. Compare the person distributions, and then cross-plot everything for the items (difficulties, fit, etc.). Also compare the plots in Table 23.2 to identify which items are problematic.

I am eager to hear about any progress you make ...

uve: Mike,

Well, this took a bit, but I did find some unusual issues. The original file of 176 persons came from just one site (site B in the attachment). Once all the data were in, about another 150+ scores from this site were added. The three other high schools submitted their data as well. So I ran Table 23 on each of the four sites, including another run for site B with the additional scores, as well as the district total, and finally, the district total minus the original 176. I lost one student score in this whole process, so the totals are not perfect.

I plotted the measures and outfit. The correlations are low and again, it seems that the lower the number of students included in the analysis, the greater the first contrast. I am puzzled.

Mike.Linacre: This is interesting, Uve.
The outfit plot is basically random (as predicted by the model), but the measure plot is idiosyncratic. The item distribution for 176 is unimodal (much as we would expect). The item distribution for 587 is bimodal (not what we would expect). There appears to be something at work in 587 that is causing a split in the item difficulties. Is this a flexilevel or similarly tailored test?

uve: If you mean an adaptive test like CAT, then no. This is a fixed form scan sheet multiple-choice 4 distractor dichotomously scored instrument.

Mike.Linacre: Uve, then something strange has happened in 587. It is unusual for an item-difficulty distribution to be bimodal.

Perhaps there is a "streaming" effect in the person sample. For instance, perhaps low ability students are not taught what the "/" sign means, but high ability students are taught that. This lack of knowledge would make all division items very difficult for low ability students.

Other problems that can cause this type of problem are time-limits or the way the items are printed or the instructions given to the students (for instance, "guess if you don't know", or "work slowly and carefully", or "be sure to double-check all your answers".)

uve: Thanks Mike. I'll keep plugging away at it.

456. Inter-rater agreement/Bias/Rating scale

Susan_Tan January 17th, 2012, 8:02am: Hi,
I would appreciate some assistance in understanding output data from FACETS.

I had a benchmarking essay test that was scored by 35 raters for 30 students (35 x 30). Each essay was scored 1 - 3 for three criteria Content, Organisation, Language representing three items.

I wanted to check inter-rater agreement and this was in the output data:

Inter-Rater agreement opportunities: 71400 Exact agreements: 39105 = 54.8% Expected: 37746.5 = 52.9%

My questions are:
1) How were the inter-rater agreement numbers arrived at? What does exact agreement mean?

2) I am also trying to understand what Bias is in Table 13/14? My understanding is that the rater is putting more (positive bias) or less (negative bias) weight on the item. Am I right?

3) In my test, Item 3 (Language) is weighted double relative to Items 1 & 2 (Content & Organisation).
For example I have a 3 band rating scale and three items for scoring the essay. So an essay could be scored:

C = 2
O = 1
L = 2 (weighted twice)

The observed rating is therefore: (2 + 1 + 2 x 2) / 4 = 7/4 = 1.75 (Essay Obs score)

By writing this weight in the specifications file, FACETS automatically uses the partial credit scale instead of the rating scale. Am I right to say that it is not possible to use the rating scale if there are uneven weights for the items?

Thank you very much.


Mike.Linacre: Thank you for your questions, Susan.

1. "Exact agreement" - please see

2. Bias: this depends on your setting of "Biassign="
Look at the "Obs-Exp Average" in Table 13. If this is positive, then the rater is more lenient than expected. If this is negative, then the rater is more severe than expected.

3. Weighting: there are three methods of weighting the data in Facets:
For instance, you could use this:
?,1,?,R ; 3-facets: model for item 1
?,2,?,R ; 3-facets: model for item 2
?,3,?,R,2 ; 3-facets: model for item 3, with a weight of 2
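The figures quoted in the question can be checked with a little Python arithmetic. The percentages follow directly from the reported counts. The reconstruction of the 71400 opportunities (every pair of raters, every essay, with the double-weighted item counted twice) is only an assumption that happens to reproduce the reported number; it is not taken from the Facets documentation:

```python
# Counts reported in the Facets output quoted above
opportunities = 71400
exact_agreements = 39105
expected_agreements = 37746.5

exact_pct = 100 * exact_agreements / opportunities
expected_pct = 100 * expected_agreements / opportunities
print(round(exact_pct, 1), round(expected_pct, 1))  # 54.8 52.9

# Assumed counting: pairs of raters x essays x weighted items (1 + 1 + 2)
n_raters, n_essays = 35, 30
item_weights = {"Content": 1, "Organisation": 1, "Language": 2}
rater_pairs = n_raters * (n_raters - 1) // 2
reconstructed = rater_pairs * n_essays * sum(item_weights.values())
print(reconstructed)  # 71400

# Weighted average rating for the example essay (C=2, O=1, L=2)
ratings = {"Content": 2, "Organisation": 1, "Language": 2}
weighted_average = sum(item_weights[k] * ratings[k] for k in ratings) / sum(item_weights.values())
print(weighted_average)  # 1.75
```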

457. Effect of Nested Facet Design on Interpretation

melissa7747 January 4th, 2012, 7:19pm: Hello,

I plan on using Facets to analyze responses where students are nested within writing tasks. I anticipate group anchoring the tasks so that differences between tasks are the result of student ability. After reading the nested and subset connectedness sections in the manual, it seems group anchoring serves as a control (or otherwise holding it constant) across all other variables in the model.

If so, are measures then interpreted accordingly? I.e., Would I be able to conclude that student X earns a higher or lower overall measure than student Y regardless of the writing task? Also, if group anchoring does act like a control, is it correct to assume that variation in writing task is not used to construct student measures, as well as other facet measures? What would the equation for such a model look like exactly?

Many thanks (from a beginner Facets user),

Mike.Linacre: Melissa, when we have a nested design (anywhere in a statistical analysis), we must make assumptions. We must decide how much of each person's performance is due to the person and how much due to the nesting situation. The usual choices are:
1. All tasks have the same difficulty. All the differences in person performances are due only to the person.
2. The samples of persons in each nest (responding to each task) have the same average ability. Mean differences in person performance are due to differences in task difficulty.

Obviously, our choice alters the interpretation of our numerical findings.

The mathematical model is the same for every choice. It is the constraints placed on the estimates that differ. For instance "all tasks are estimated to have the same difficulty", or "all nested subsets of persons are estimated to have the same mean ability."
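Why a constraint is needed can be seen in a short Python sketch. It assumes a simple dichotomous Rasch model with only persons and tasks (no raters or items) and uses made-up abilities; the point is that shifting every person in a nest and that nest's task by the same amount leaves every model probability unchanged, so the data alone cannot separate the two:

```python
import math

def p_success(ability, task_difficulty):
    """Dichotomous Rasch probability for a person attempting a task."""
    return 1.0 / (1.0 + math.exp(-(ability - task_difficulty)))

# Hypothetical nest: three students who all responded to the same task
abilities = [0.2, 1.0, -0.5]
task = 0.0

# Alternative description: everyone is 0.7 logits more able
# and the task is 0.7 logits harder
shift = 0.7
original = [p_success(b, task) for b in abilities]
shifted = [p_success(b + shift, task + shift) for b in abilities]

for a, b in zip(original, shifted):
    print(round(a, 6), round(b, 6))
```

Choice 1 above fixes the shift by setting all task difficulties equal; choice 2 fixes it by setting the mean ability of each nest equal.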

melissa7747: Mike,

Thank you for your reply. This helps me prepare for how to best analyze the data so the results are most meaningful to our various audiences.


melissa7747: Hi Mike,

I've been thinking more about your earlier reply and the issue of anchoring. My study is keenly interested in understanding the differences in student ability (i.e., we do not assume students have the same average ability). Thus, my reasoning for thinking I should anchor the writing task.

There is a concern, however, that the writing tasks will vary in terms of difficulty and influence student observed scores. So, I'm struggling to think of ways in which I can capture all sources of variation in the student measures - i.e., I'd like the student measures to take into consideration judge severity, item difficulty and prompt difficulty. Might the following approach address this: run the analysis anchoring the prompts; next examine bias between student and task; if significant bias is found, adjust those measures for bias? Or due to the nested nature of the student-task relationship, must the assumption of tasks having the same average difficulty be maintained? Any suggestion on how to tackle this issue are greatly appreciated.


Mike.Linacre: Melissa, if I understand your study correctly, the students are nested within tasks. So we might guess that one task is easier than another task, but that would only be a guess. We have no hard evidence.
Suggestion: conduct a small study in which some students respond to two or more tasks. This will provide the evidence for anchoring the task difficulties at different values. OK?

melissa7747: Hi Mike,

Unfortunately, assigning students different tasks is not possible for this specific component of the study. Rather, various instructors will independently construct and assign their students different writing assignments, which will be scored by independent raters. My task is to construct linear measures from these scored assignments (several items, each scored on a scale from 1-6) so they might be analyzed.

While it is an assumption that the tasks will vary in difficulty, it remains a concern among many that task difficulty may contribute to students' scores. Am I correct to assume that this design restricts my ability to construct measures that take task difficulty into consideration (i.e., I'll need to simply make the assumption that all tasks are equally difficult)? Or is there some other way to address task difficulty (or at least identify/quantify it) that I'm unaware of?


Mike.Linacre: Melissa, you have a nested design. Suppose one group of students have a higher average score than another group. We don't know whether one group of students is more able than the other group, or one group of assignments is easier than the other group.

It looks like "equal assignment difficulty" is the most reasonable option ... :-(

melissa7747: Hello again, Mike,

I really appreciate your patience and feedback on this matter. Being new to MFRM, I really want to ensure that my understanding is (as well as measures are) as solid as possible.

So, equal task difficulty it must be. However, I was thinking more yesterday afternoon about what you said concerning using a small study where students respond to two or more tasks. With the same nested design, would the following be legit? Or does it yield something entirely different that I've yet to think about?

- Group anchor students (by task) at 0 to determine which tasks are more difficult/easier.
- Use results to anchor task difficulties with different values (i.e., other than 0)
- Rerun analysis anchoring task with these 'new' values
- Student measures then have taken task difficulty into consideration (?)

I'm assuming your answer will still be the same - need to anchor tasks to obtain student ability measures. But, I thought it might be worth asking :).


Mike.Linacre: Sorry, Melissa, you can't get more out of the data than was included originally (without information outside of the data). So your proposed procedure will not solve the problem :-(

melissa7747: Hi Mike,

That's what I figured, but I thought it might be worth asking :). Thank you again for all your input. It has not only provided me with great guidance but also lets me know I'm on solid ground, in terms of analytic approach....Melissa

458. Difference in measure scores for two sets of items

SueW January 9th, 2012, 2:45pm: Hi Mike (or anybody on here)

I have two sets of items (each a different inventory from the other). One set of items has 12 items and the other 24. I want to do a test which looks at the difference in measure scores (between the two items). I also want to test for heterogeneity of variance in order to test for the difference in spread between the two sets. Could you tell me how I might do this?


Sue Wiggins

Mike.Linacre: Sue, it sounds like you administered two tests, one of 12 items and one of 24 items. Were they to the same people? What was the experimental design?

"difference in measure scores (between the two items)" - which two items are we talking about?

"heterogeneity of variance" - variance of what? Item difficulties? Person ability measures? Raw scores?

SueW: Yes, the two tests were to the same people. Experimental design -it wasn't an experiment - the Relational Depth Inventory (RDI) was administered post-therapy and the Working Alliance Inventory administered during. Not really sure what the experimental design is!
Sorry meant to say Difference in measure scores between the two sets of items
Heterogeneity of variance - I meant the spread of scores


Mike.Linacre: Sue, this is difficult:

"the Relational Depth Inventory (RDI) was administered post-therapy and the Working Alliance Inventory administered during" - so they were the same patients in name, but not in status.

We can certainly compare the spread of scores, but the comparison is definitely apples compared with oranges.

We could imagine that the persons have not changed, then we could compare the variances of the raw scores. A statistical test is an F-test comparison of the two variances.

SueW: Mike, I do apologise, I have actually got that wrong: the two tests were administered at the same time (same stage of therapy).
All I want to be able to do is see whether the two measures assess the same levels of depth of relationship - in theory the Relational Depth Inventory should assess 'higher' experiences than the Working Alliance would.
Hope this makes sense.

Mike.Linacre: Sue, it sounds like you can analyze all 36 items together. Include in each item label a "test" code. You can then do item summary statistics by "test" code, also you can look at the dimensionality of the items to see whether the items classify neatly by test.

SueW: Thanks Mike - I have done it and it makes sense doing them together; it makes sense theoretically.

459. Point system for essays

uve January 3rd, 2012, 5:10pm: Mike,

What is the best way to set up the file for Winsteps where the scoring is based on a point system rather than a scoring rubric? For example, one of the essays is scored from 0-28, instead of a Likert type rubric. I'm not sure what teachers are using as a rationale that would, say, define giving a 10 versus giving a 22. This seems really odd to me. I haven't done this before in Winsteps so I'm not sure what to do. Thanks as always for your help.


Mike.Linacre: Uve, the challenge with scoring scales like 0-28 is the unobserved categories. How do we want them to be modeled? There are three options in Winsteps:

1. Only observed categories are used. They are renumbered from lowest observed category upwards. For example: the observed categories: 21,23,25. The renumbered categories: 21,22,23. Control instruction: STKEEP=NO

2. Observed categories and intermediate unobserved categories are analyzed. For example: observed: 21,23,25. Analyzed: 21,22,23,24,25. Control instruction: STKEEP=YES

3. The specified range of categories is analyzed. For example, observed: 21,23,25. Analyzed: 0-28. Control instructions: ISRANGE=, SFUNCTION=. see https://www.winsteps.com/winman/index.htm?sfunction.htm
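The three options above can be mimicked in Python to show which categories end up in the analysis. This only illustrates the category bookkeeping for the example in the reply (observed 21, 23, 25 on a 0-28 scale); the function names are hypothetical and nothing here calls Winsteps:

```python
observed = [21, 23, 25]

def renumber_observed_only(categories):
    """Option 1: keep observed categories only, renumbered upward from the lowest."""
    ordered = sorted(set(categories))
    lowest = ordered[0]
    return {original: lowest + i for i, original in enumerate(ordered)}

def keep_intermediate(categories):
    """Option 2: observed categories plus unobserved ones lying between them."""
    return list(range(min(categories), max(categories) + 1))

def full_range(lowest, highest):
    """Option 3: every category in the specified range, observed or not."""
    return list(range(lowest, highest + 1))

print(renumber_observed_only(observed))   # {21: 21, 23: 22, 25: 23}
print(keep_intermediate(observed))        # [21, 22, 23, 24, 25]
print(len(full_range(0, 28)))             # 29
```

The number of analyzed categories grows from 3 to 5 to 29 across the options, which is why the treatment of unobserved categories matters so much for a 0-28 point scale.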

uve: Mike,

I must be missing something basic here. I'm having a bit of difficulty setting up the control file. Since there is only one item, an essay scored 0-28, there doesn't seem to be a need for labels.

When I run the control file below, the error message I get is: HAS LENGTH OF 59. THIS DOES NOT AGREE WITH XWIDE=2

When I change to XWIDE = 59 I get this message: NONE OF YOUR RESPONSES ARE SCORED 1 OR ABOVE.

TITLE = ELA 10 Essay
PERSON = Person
ITEM = Item
ITEM1 = 11
NI = 1
NAME1 = 1
Codes = "07 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28"
997581263 22
342178226 16
128947272 05

Mike.Linacre: Uve, we have two concerns here.
1. XWIDE=2, so the CODES= should be:
Codes = "0710111213141516171819202122232425262728"

2. NI=1. This does not give Winsteps information about the spacing of the rating scale categories. Winsteps needs at least NI=2.
Suggestion: add two dummy items and anchor them at 0. Everyone succeeds on one dummy item and fails on the other dummy item.

TITLE = ELA 10 Essay
PERSON = Person
ITEM = Item
ITEM1 = 11
NI = 3
NAME1 = 1
XWIDE = 2
Codes = "00010710111213141516171819202122232425262728"
ISGROUPS = EDD ; the essay is in one group, the two dummy items share another
IAFILE = * ; dummy items anchored at 0
2 0
3 0
*
&END
Essay item
Dummy success
Dummy failure
END LABELS
997581263 220100 ; dummy items always scored 01 (success) then 00 (failure)
342178226 160100
128947272 050100
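
For anyone preparing such a file from a spreadsheet, here is a small Python sketch of the two formatting steps discussed above: building the CODES= string with no separators so that every code is exactly XWIDE characters wide, and appending the two dummy responses to each data line. The helper names are hypothetical. The sketch also shows that the space-separated version of the original 20 codes is 59 characters long, which matches the length reported in the error message.

```python
def codes_string(scores, xwide=2):
    """Concatenate zero-padded codes with no separators, so that each
    code occupies exactly `xwide` characters."""
    return "".join(f"{s:0{xwide}d}" for s in scores)

def data_line(person_id, essay_score, xwide=2):
    """Person label, a space, then the essay score followed by the
    dummy success (1) and dummy failure (0) responses."""
    responses = [essay_score, 1, 0]
    return f"{person_id} " + codes_string(responses, xwide)

observed = [7] + list(range(10, 29))              # 07, 10, 11, ..., 28

spaced = " ".join(f"{s:02d}" for s in observed)   # the original, space-separated form
print(len(spaced))                                # 59

print(codes_string(observed))                     # codes for the essay item alone
print(codes_string([0, 1] + observed))            # codes including the dummy responses
print(data_line("997581263", 22))                 # 997581263 220100
```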

uve: Thanks Mike. That worked out well. It added one raw score point to everyone but I suppose that doesn't really matter.

I would like to make my first attempt at a Facets analysis by possibly using this data set. The only rated data currently in use in my district is the scoring of these essays. They occur three to four times a year, so there will likely never be more than one item rated at a time. Unfortunately, the use of this point system will continue. I'm hoping they will change to a category rubric method instead. In the meantime, I'm assuming we can continue to add the two dummy variables. What would the Facets model look like?


Mike.Linacre: Uve, if you want to correct the raw scores:
Codes = "00010710111213141516171819202122232425262728"

The Facets model would look the same as the Winsteps model:
2, Item, A
