Old Rasch Forum - Rasch on the Run: 2007

88. Classification Consistency

Seanswf July 20th, 2007, 10:24pm: Hello All,
I am looking for a method to calculate a classification consistency index for criterion referenced tests. What does the Rasch model have to offer? How can I use winsteps output to obtain such an index?

MikeLinacre: Thought-provoking question, Seanswf.
Under the Rasch model, every person measure has a standard error. So for each member of the person sample, one can compute the probability that the person will pass or fail relative to any criterion measure level. If the criterion level itself has a standard error, then the person S.E. can be inflated by the criterion S.E. Let's call the statistically averaged S.E. across the sample relative to the criterion level its CRMSE.
A possible sample-based classification index would be Sum(probability of being correctly classified relative to the criterion level for each person)/N.
The literature seems to like Reliability-type consistency indices. So we could formulate one based on the observed variance of the sample measures around the criterion level (COV). Then
Consistency reliability = (COV - CRMSE*CRMSE)/COV
Is this helpful, Seanswf?

Seanswf: Hi Mike thanks for the reply.

I am still a bit confused. On your first suggestion for a sample based index how would I calculate the "probability of being correctly classified relative to the criterion"?

For the Reliability-type index how would I obtain the COV? Would I calculate the variance for persons within 2 S.E. of the criterion?

Maybe an example would help, are there any articles on this subject?

Thanks for the clarification

MikeLinacre: Seanswf, certainly:
1. Probability of being correctly classified. Since we don't know an individual's "true" measure, we base this on the estimated measure, so:
Probability of being correctly classified for a person = normal probability [(observed measure of person - criterion level difficulty)/(person measure standard error)]

2. COV = sum ((observed person measure - criterion level)**2)/(count of persons) for entire sample (if you want the classification consistency for the entire sample).
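
The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the person measures and standard errors are made-up numbers, the criterion is treated as a point value, the absolute distance from the criterion is used so that persons below the cut-point are scored on being correctly classified as "fail", and CRMSE is taken as the root-mean-square of the person S.E.s. The function names are hypothetical, not Winsteps output fields.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def classification_indices(measures, ses, criterion):
    """Return (mean probability of correct classification, COV, CRMSE, reliability)."""
    n = len(measures)
    # Assumption: a person is classified by which side of the criterion the
    # estimated measure falls on, so the distance is taken as an absolute value.
    p_correct = [normal_cdf(abs(m - criterion) / se) for m, se in zip(measures, ses)]
    mean_p = sum(p_correct) / n
    # Observed variance of the measures around the criterion level.
    cov = sum((m - criterion) ** 2 for m in measures) / n
    # Root-mean-square of the person standard errors.
    crmse = math.sqrt(sum(se ** 2 for se in ses) / n)
    reliability = (cov - crmse ** 2) / cov
    return mean_p, cov, crmse, reliability

# Hypothetical person measures (logits) and their standard errors.
measures = [-1.2, -0.3, 0.4, 1.1, 2.0]
ses = [0.5, 0.4, 0.4, 0.5, 0.7]
mean_p, cov, crmse, rel = classification_indices(measures, ses, criterion=0.5)
```

In practice the measures and S.E.s would be read from the Winsteps person output rather than typed in.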

Seanswf: Great thanks Mike!
1. Now I think I understand- this equation results in a z score which I use to find the probability under the normal curve. So for a single person the interpretation would be "Person [n] has a probability [x] of being classified at the criterion on a repeated attempt". I can average across persons for a sample average.

2. Just to be clear COV = sum ((observed person measure - criterion level) raised to the power of 2) / count of persons.

Is this the same interpretation as other reliability indexes; a ratio of "true score" or "correct classification" to "error" or "incorrect classification"? For example I obtained an index of .56 so I can say 56% correct classification and 44% incorrect classification for the sample?

I am also interested in calculating a reliability-type index for subgroups within the sample (for instance all those who obtained a certain raw score). Would I use the same formula?

I really appreciate your help!

MikeLinacre: Seanswf: The classification reliability is formulated like Cronbach Alpha and similar reliabilities. It is the ratio of "true variance around the cut-point" to "observed variance around the cut-point".
But the classification reliability is not the "proportion of correct classification". For instance, a classification reliability of zero about the cut-point would be like a coin-toss, so it could result in 50% correct classification.
And yes, for sub-groups the same mathematics applies.

chong: Hi Mike,
I've been looking for the same thing and I find the discussion very useful. Though from your first reply to Seanswf I'm still clueless to determine the 'criterion measure level' on Rasch scale and transfer it to my situation.

My case is that there are 4 subtests (from a 20-item MCQ test) where:
(i) each 5 items per subtest measuring a mastery level k, i.e.,
q1-q5 --> Level 1
q6-q10 --> Level 2
q11-q15 --> Level 3
q16-q20 --> Level 4
(ii) each person is considered to pass each subtest (and hence said to have mastered that level k) if he or she obtains the raw score of at least 4 out of 5.

The levels are hierarchical, i.e., a person is classified into a Level K if he or she has mastered every k < K and k=K and fails every k > K. For instance, an individual belongs to Level 2 if his or her scores for first and second subtests are '(>=4)/5' while score '(<4/5)' for the rest of the subtests.

In a single administration of such criterion-referenced test, I'm interested in knowing the extent to which each member of a sample is to be classified consistently into a single level. My questions are:

1. With those item measures in Table 13.1, how do I determine the 'criterion measure level'?

2. With those S.E. in Table 17.1, how do I calculate the S.E. across the sample relative to the criterion level?

Mike.Linacre: Thank you for your email, Chong.

1. "how do I determine the 'criterion measure level'?"
Chong, the criterion for Level 1 is "at least 4 out of 5".
So, take the item difficulties for q1-q5, and compute the ability measure corresponding to a score of 4.
In Winsteps, after analyzing all the items,
"Specification menu":
Table 20.

2. "the S.E. across the sample relative to the criterion level?"
Chong, sorry, the S.E. of what? First we need to compute the relevant statistic, and then we can compute its S.E.

chong: Thanks for your guidance, Mike.

(1) I found the measures for score '4' of each subtest (in Table 20.2) according to my understanding of the direction provided:

Measure = -.55; S.E. = 1.46

Measure = .39; S.E. = 1.14

Measure = 3.66; S.E. = 1.43

Measure = 3.98; S.E. = 1.18

My question: Is the procedure above correct in getting the Measure= 'criterion measure level'?

(2) I apologize for not making it clear earlier. Put it another way, now I have all S.E. of person measure from Table 17.1 and S.E. of criterion measure just obtained. Since you said "the person S.E. can be inflated by the criterion S.E.", how do I proceed to obtain the "averaged S.E. across the sample relative to the criterion level" (CRMSE)?

Many thanks in advance.

Mike.Linacre: Chong, usually the criterion level is set as a point-estimate. It has no standard error. It is like saying "the criterion level is 2 meters". We are not concerned about the precision of the "2 meters".

We are only concerned about the S.E. of the person measure if our decisions require us to be certain that a person is above (or below) the criterion level. If so, we need to set the decision certainty. For instance, "persons in Level 4 must have a 90% certainty of being in level 4", then we need that person to have a measure of at least 3.98 + (1.3 * person measure S.E.)

The statistical average of S.E.s is their root-mean-square.
average S.E. of N measures = square-root ((SE1^2 + SE2^2 + ..... + SEN^2)/N)
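
As a rough sketch of the two calculations in this reply, assuming made-up standard errors and a hypothetical person S.E. of 0.5 logits (the 3.98 cut-point and the 1.3 multiplier are the values quoted above):

```python
import math

def rms_se(ses):
    """Root-mean-square average of a list of standard errors."""
    return math.sqrt(sum(se * se for se in ses) / len(ses))

def required_measure(criterion, person_se, z=1.3):
    """Lowest measure that clears the criterion with the chosen certainty.

    z=1.3 corresponds to roughly 90% one-sided certainty, as in the post.
    """
    return criterion + z * person_se

avg = rms_se([0.3, 0.4, 0.5])            # hypothetical S.E.s
threshold = required_measure(3.98, 0.5)  # hypothetical person S.E. of 0.5
```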

Michael: Hi there, you are moving in the right way, guys. I've looked through your comments and just wonder why you put + when dealing with positive numbers, like here: idelete=+11-15
Measure = 3.66; S.E. = 1.43 There is no sense to be so peculiar about this. I usually do it like this: idelete=11-15

Mike.Linacre: Michael, + means "do not delete", so
idelete=+11-15
means "delete all items except for 11, 12, 13, 14, 15"

idelete=11-15
means "delete items 11, 12, 13, 14, 15"

idelete= 11-15, +12
means "delete items 11, 12, 13, 14, 15", but do not delete item 12

idelete=+11-15, 12
means "delete all items except for 11, 12, 13, 14, 15", but do delete item 12

and we can have more complex deletions, such as:
idelete=10-90, +32-48, 39
which is the same as:
idelete=10-31, 39, 49-90
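
The rules described above can be mimicked with a small Python sketch. This is not Winsteps' own parser: `idelete_items` is a hypothetical helper, and it assumes that a list starting with a "+" entry first deletes every item and that entries are then applied left to right.

```python
def idelete_items(spec, n_items):
    """Return the set of deleted item numbers (1-based) for an idelete-style list."""
    tokens = [t.strip() for t in spec.split(",") if t.strip()]
    deleted = set()
    # Assumption: a leading "+" entry means "delete everything, then reinstate".
    if tokens and tokens[0].startswith("+"):
        deleted = set(range(1, n_items + 1))
    for tok in tokens:
        reinstate = tok.startswith("+")
        lo, _, hi = tok.lstrip("+").partition("-")
        items = set(range(int(lo), int(hi or lo) + 1))
        if reinstate:
            deleted -= items   # "+" means do not delete
        else:
            deleted |= items
    return deleted

# The two complex lists from the post delete the same items.
same = idelete_items("10-90, +32-48, 39", 100) == idelete_items("10-31, 39, 49-90", 100)
kept = set(range(1, 21)) - idelete_items("+11-15, 12", 20)
```

Here `same` is `True`, and `kept` contains items 11, 13, 14 and 15.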

696. person and item logit scores

david_maimon July 15th, 2007, 11:25am: Hi Mike,
I am trying to create a delinquency score for 8000 respondents using the Rasch model (I have 15 relevant items). This score will be used as a dependent variable in a further analysis. I am debating between two approaches to constructing this score. Should I use the individual ability scores as delinquency proxies, or maybe I should use the item difficulty scores and construct a delinquency score for each respondent by summing the item logit scores? Is it possible to sum the item logit scores and generate such a delinquency score? Is there a way to include both ability and item severity in such an analysis (I am using Winsteps)? Thanks, David

MikeLinacre: Interesting question, David. Rasch logit ability estimates are adjusted for the difficulties of the items encountered. As an alternative, you could add the Rasch logit item difficulties of the items someone succeeded on, and subtract the item difficulties of items that were failed. This would give an approximation to the logit ability estimate, but it would often not be a good estimate. For instance, suppose that every item on the test is equally difficult, and the person has 10 successes and 5 failures. Then a change of local origin by one logit would change the ability estimate by 5 logits, not the one logit that it should. In summary, the Rasch ability estimate is the relevant number.
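
The origin-shift example in this reply can be checked with a short sketch. The function name is hypothetical, and the item difficulties (all 0.0, then all 1.0 after moving the local origin by one logit) are chosen only to reproduce the example.

```python
def summed_score(difficulties, responses):
    """Add the difficulties of items succeeded on and subtract those failed."""
    return sum(d if x == 1 else -d for d, x in zip(difficulties, responses))

responses = [1] * 10 + [0] * 5                    # 10 successes, 5 failures
before = summed_score([0.0] * 15, responses)      # every item at 0 logits
after = summed_score([1.0] * 15, responses)       # local origin moved by 1 logit
shift = after - before                            # 5.0, not the 1.0 it should be
```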

david_maimon: Thanks Mike, this is really helpful. Another issue I have to deal with in this study is the longitudinal nature of my data. Normally, when creating a delinquency score from, say, 7 items, you get the same max and min values for the relevant variable across the years. This allows an investigation of how individuals' behavior patterns change over time. However, I suspect that when estimating individuals' ability scores for each wave, the criminal ability logit score will have different min and max values and thus will not allow a meaningful comparison of individuals' criminality levels over time. How should I address this issue?

MikeLinacre: David, yes, if each year is an independent analysis then there will be differences in logit ranges. So this becomes an equating/linking problem - often encountered with measurement at multiple time-points, pre-post, etc.
The solution depends on which time-point(s) you consider to be the benchmark ones. One approach is to obtain the item difficulties from the first time point, and then anchor (fix) the item difficulties for all subsequent time-points at those values.

david_maimon: Thanks Mike.... Are you familiar with any tutorial or reference that teaches how to run the anchoring procedure in a longitudinal analysis?

MikeLinacre: David: perhaps https://www.rasch.org/rmt/rmt101f.htm

Khalid: Hi, my question is about the distribution of the students' scores (logits) given by Winsteps. I have a test of 12 items which I have calibrated, and the model parameterisation is good.
The distribution of the students' scores (measures) is:

Measure (logit)  Frequency  Percent  Valid Percent  Cumulative Percent
     -.49           162        5.6        5.6              5.6
     -.96           196        6.8        6.8             12.4
    -1.41           285        9.8        9.8             22.2
    -1.89           335       11.6       11.6             33.8
    -2.44           483       16.7       16.7             50.5
    -3.11           523       18.1       18.1             68.6
    -4.08           461       15.9       15.9             84.5
    -5.44           251        8.7        8.7             93.2
      .01           109        3.8        3.8             96.9
      .62            56        1.9        1.9             98.9
     1.51            25         .9         .9             99.7
     2.82             8         .3         .3            100.0
    Total          2894      100.0      100.0

I know that the students' answer patterns fall into many more than 12 categories (e.g., 000011100111 and 111100011000 are different even though both have a raw score of 6/12).
My question is: why do I have only 12 categories of score measures?
Thanks in advance.

MikeLinacre: Khalid, in the Rasch measurement model, raw scores are "sufficient statistics", so there is one measure corresponding to each raw score on a test consisting of the complete set of items.
So, for 12 dichotomous items, there are 11 possible measures corresponding to scores 1-11.
Scores 0 and 12 are extreme scores. The measures for these are modeled to be infinite. In practice, an adjustment is made so that the measure reported for 0 is actually the measure for a score of 0.3, and the measure for 12 is actually the measure for 11.7.
Of course, depending on the response string, the reported measure is a more or less accurate indication of the data. This accuracy is quantified by the fit statistics.
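
To illustrate why each raw score maps to a single measure, here is a minimal sketch that solves for the ability at which the expected score equals the raw score. The item difficulties are made up, the damped Newton-Raphson solver is a generic choice rather than the Winsteps algorithm, and the 0.3 and 11.7 adjustments for extreme scores are the ones mentioned above. The two response strings are those quoted in the question (each contains six 1s).

```python
import math

def expected_score(theta, difficulties):
    """Expected raw score on dichotomous Rasch items at ability theta."""
    return sum(1.0 / (1.0 + math.exp(d - theta)) for d in difficulties)

def measure_for_score(score, difficulties, tol=1e-8):
    """Ability (logits) whose expected score equals the given raw score."""
    theta = 0.0
    for _ in range(100):
        probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        info = sum(p * (1.0 - p) for p in probs)
        step = (score - sum(probs)) / info
        step = max(-1.0, min(1.0, step))   # damp large steps in the tails
        theta += step
        if abs(step) < tol:
            break
    return theta

difficulties = [-2.0 + 4.0 * i / 11 for i in range(12)]   # hypothetical items
pattern_a = [int(c) for c in "000011100111"]
pattern_b = [int(c) for c in "111100011000"]
# Only the raw score enters the estimate, so both patterns get the same measure.
measure_a = measure_for_score(sum(pattern_a), difficulties)
measure_b = measure_for_score(sum(pattern_b), difficulties)

# One measure per raw score, with extreme scores adjusted to 0.3 and 11.7.
scores = [0.3] + list(range(1, 12)) + [11.7]
score_table = [(s, measure_for_score(s, difficulties)) for s in scores]
```

How well each response string agrees with its measure is a separate matter, which is what the fit statistics address.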

Khalid: Hi, Mike, thanks for your answer.
Do the respondents' measures (the measures for students) given by the model depend on the number of test items?

MikeLinacre: Khalid, in principle, the person measures are independent of the number of items administered. The more items, the higher the precision (smaller person measure standard error) and the more powerful the determination of accuracy (fit statistics). So in computer-adaptive testing, a general rule is "the closer someone is to the pass-fail point, the more items will be administered to that person".
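
A small sketch of the precision point, assuming dichotomous Rasch items with made-up difficulties and the usual model standard error (one over the square root of the summed item information). The function name is hypothetical.

```python
import math

def person_se(theta, difficulties):
    """Model standard error of a person measure on dichotomous Rasch items."""
    probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
    info = sum(p * (1.0 - p) for p in probs)
    return 1.0 / math.sqrt(info)

se_12 = person_se(0.0, [0.0] * 12)   # 12 items targeted on the person
se_48 = person_se(0.0, [0.0] * 48)   # four times as many items
```

With four times as many equally targeted items the standard error is halved, while the measure itself is not expected to change.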

Khalid: thank you very much Mike.

Khalid: Hi Mike,
I created ability scores for students in 5 secondary-school levels over 4 years, with an anchoring scheme across the secondary levels and 2 versions (V1 in year 1 and V2 in year 2). So the model I produced for the first 2 years was used to estimate scores for the third and fourth years.
The aim is to use the measures given by the model in a mixed model.
My question is about the linearity of the measures: for the first two years, the distribution of the students' measures is shifted from year 1 to year 2 for all levels. I have 12 categories of measures for year one and the same for year two.
thanks in advance

MikeLinacre: Thanks for your question, Khalid. Rasch measures are always linear, but they don't always match the data, and different sets of measures may not be compatible. Please give us more information about the situation you are describing.

Najhar: Hi Mike..
I have some questions for you.
If I want to do the analysis using Rasch measures in SPSS, how can I transfer the Rasch measures (logit scores) into SPSS? Is there any way to do it? Thanks.

MikeLinacre: Najhar, in SPSS, Rasch measures (logits) are exactly like any other additive variable, such as height or weight. For Rasch measures we also have their precision (S.E.s) which we don't usually have for height and weight. But most SPSS routines are not configured to take advantage of S.E.s.

697. Rasch score in SEM

connert July 5th, 2007, 2:46pm: Is there any reason(s) for not using Rasch scores as measured variables in a structural equation model? If that is possible/reasonable, how do you include measurement error? I use MPlus to do SEM.

MikeLinacre: Connert, Rasch scores (measures) are linear, and so are better for SEM than raw scores. If you don't include standard errors of the raw scores in your SEM, then you don't need to include standard errors of the Rasch measures in your SEM. But, of course, it would be better to include standard errors!
In the MPlus documentation, see "variable measured with error" or similar wording.

Najhar: Hi Mike..
I have some questions for you.
If I want to use the Rasch scores (logit scores) to do SEM (AMOS), how could I transfer the logit scores to SPSS before using AMOS?

connert: Najhar,

The way I do it is to do the original Winsteps analysis with the raw data in your SPSS file where the other SEM variables are. Then in Winsteps, save the person scores to an SPSS file. Copy and paste the person scores from this Winsteps created file into the original SPSS file.


889. Zero person reliability

xinliu March 1st, 2007, 11:46am: Running on dichotomous test data of responses from over 2000 persons to 70 items, Winsteps gave me a warning that "Data may be ambiguously connected to 5 subsets". Then I found that person reliability (and also separation) is zero while item reliability is 1.00. The easiest item has a logit of -46.88. I wonder if anybody could share ideas about what could possibly cause this problem. Thanks a lot.

MikeLinacre: Xinliu, you are reporting a very strange situation. Please email me your Winsteps control and data files. (Remove any confidential person or item labels.)

mmaier: We've had the same issue (person reliability = 0 and item reliability = .94) with a dataset of 150 children and 144 items. Xinliu, did you ever figure out what was causing this problem?

MikeLinacre: Thank you for your question, Mmaier. I don't recall the solution to Xinliu's problem, but from the description here it was probably a technical glitch in setting up or scoring the data.
In general, zero person reliability occurs when the test is very short or the sample is very central around its own mean. If you email me your control and data file, I will give you an exact diagnosis: mike - -at- - winsteps.com

897. Coming in Winsteps and Facets ....

MikeLinacre December 27th, 2007, 10:05pm: You have asked for the output tables to be more publication-friendly. The next releases of Winsteps and Facets will incorporate options for prettier Tables. Either in text format or HTML format. Here is a Table displayed in Lucida Console font:

drmattbarney: Looks beautiful. Do you have a goal for when this release will be ready?


MikeLinacre: Thanks for the comment, Drmattbarney.
This is already released in Facets. www.winsteps.com/facetman/index.html?ascii.htm
Winsteps in Feb.-March.

901. on Facets

happiereveryday December 31st, 2007, 1:50am: Dear Michael,
Thank you very much for your reply. I have two more questions:
1. Winsteps is intended for two-facet analyses (persons and items) with a rectangular dataset. Facets can analyze any number of facets and non-rectangular data designs.
What do "rectangular dataset" and "non-rectangular data designs" mean?
2. I want to keep the judge severity (2 different groups of judges) of two tests the same. My design of an equating study is like this:
2, judges, G ; group anchoring
1=Ben, 1 ; Ben is in group 1
2=Mike, 1
3=Jane, 1
4=Cathy, 2 ; Cathy is in group 2
6=Bluesea, 2
I don't know how to define the other data in these two groups. Would you give an example? Thanks a lot.

MikeLinacre: Here's a little more, Happiereveryday.
1A. Rectangular design: items are columns, persons are rows. There is one or no observation for each person and item.
1B. Non-rectangular design examples: Items, person, judges. More than one observation of each person and item. Paired comparisons between persons.
2. Your equating study ... not sure what you are doing, but you do need to set the logit measures, e.g.,
2, judges, G ; group anchoring: each group has an average severity of 0 logits.
1=Ben, 0, 1 ; Ben is in group 1, group anchored at 0 logits.
2=Mike, 0, 1
3=Jane, 0, 1
4=Cathy, 0, 2 ; Cathy is in group 2
5=John, 0, 2
6=Bluesea, 0, 2
Then the average severity of the Group 1 judges is forced to be the same as the average severity of the Group 2 judges.

902. more questions on Facets

happiereveryday December 29th, 2007, 5:18am: Thank you very much, Michael. Your answers really help a lot. Still I have two questions:
1. You said, "point-biserial is an extension of point-biserial of person and item data. It correlates the current observation with the raw score corresponding to that observation". Does it mean point-biserial examines the correlation between logit (the real ability of the students) and their test scores? What do you mean by "an extension"?
2. In equating or linking process I should set a facet anchored at a certain calibration. Can you give me some examples of the equating or linking process? Should the facet to be anchored in two subsets anchored at the same value?
3. I haven't used Winsteps. Could you please tell me the major differences in their functions?
Thank you for your patience and generous help!

MikeLinacre: Certainly, Happiereveryday.
1. The usual point-biserial correlates the persons' scores on this item with their total scores on the test. In Facets, this is extended to "the point-biserial correlates the observations of this element with the total scores (not measures) of the other elements generating this observation".
2. An example is linking two tests. The measures for the items on Test A are estimated. Then the items on Test B which are also on Test A have their measures anchored at the Test A values. Test B is then analyzed. The measures on Test B will be linked to those on Test A through the anchor values, so that all measures will be directly comparable.
3. Winsteps is intended for two-facet analyses (persons and items) with a rectangular dataset. Facets can analyze any number of facets and non-rectangular data designs. Since the Winsteps data are so well controlled, it has many more output options available.

903. questions on Facets

happiereveryday December 28th, 2007, 2:34pm: Dear Michael,
I'm now reading the Facets manual, but I'm still confused about some points. Would you please help me with these questions?
1. What is the minimal element number for "examinee" facet in a bias analysis?
2. Why are some elements anchored at 0 while others are anchored at their calibrations other than 0?
3. What facets does pt-biserial analysis involve? I guessed it examined the discrimination of items, but the manual said "it is a rough fit analysis".
4. What does "Umean" refer to?
Thanks a lot for your help! Looking forward to your reply.

MikeLinacre: Thank you for your questions, Happiereveryday. Let's see what I can do to help ....
Qu. 1. What is the minimal element number for "examinee" facet in a bias analysis?
Answer: 2 examinees is the technical minimum for Facets program operation, but a serious investigation into bias/interactions/DIF needs at least 30 examinees in each group.
Qu. 2. Why are some elements anchored at 0 while others are anchored at their calibrations other than 0?
Answer: Elements anchored at 0 are usually part of "dummy" facets. These are facets such as gender and ethnicity that are not intended to be part of the measurement model, but are included for bias or fit investigations.
Elements anchored at other values are usually part of an equating or linking process which connects the measures from one analysis with the measures for another analysis.
Qu. 3. What facets does pt-biserial analysis involve? I guessed it examined the discrimination of items, but the manual said "it is a rough fit analysis"?
Reply: All facets. The point-biserial implemented in Facets is an extension of the familiar point-biserial of two-facet (person vs. item) data. The point-biserial correlates the current observation with the raw score corresponding to that observation. In Facets, all facets contribute to the observation and to the raw score.
Qu. 4. What does "Umean" refer to?
Reply: Rasch measures are estimated in logits. The zero point is usually in the center of the item, judge, task, etc. facets. But this means that some examinee measures will have values like -2.34 logits. No mother wants to hear that her child is "-2.34". So Facets allows for linear rescaling of logits into more friendly units. For instance,
Umean = 50, 10, 0
This says that a logit measure "X" is to be reported as (50 + 10*X) with 0 decimal places. So -2.34 is reported as 50 + 10*(-2.34) = 26.6, which rounds to 27. This is equivalent to converting Celsius to Fahrenheit. It does not change the fundamental nature of the measures as linear, additive units.
Hope this helps .....
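
The rescaling described for Umean = 50, 10, 0 amounts to a one-line linear transformation. The sketch below only illustrates the arithmetic; `rescale` is a hypothetical helper, not a Facets function, and Python's `round` is assumed to be close enough to the program's own rounding for this example.

```python
def rescale(logit, umean=50.0, uscale=10.0, decimals=0):
    """Report a logit measure in user-friendly units: umean + uscale * logit."""
    return round(umean + uscale * logit, decimals)

reported = rescale(-2.34)   # 50 + 10 * (-2.34) = 26.6, reported as 27.0
origin = rescale(0.0)       # the logit zero point is reported as 50.0
```

Because the transformation is linear, differences between reported values remain proportional to differences in logits.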

904. Interpretation problems with FACETS vertical ruler

ImogeneR December 19th, 2007, 11:57pm: Hi all,
I've been using a partial credit model for facet=items and displaying the items as a group in one column of the vertical ruler, then their particular rating scales (5-20) appear in a column for each (so 39 questions, 39 columns). However, I've noticed that the rating scale for any particular item doesn't seem to correspond with the individual item in the group column in terms of where they are on the vertical measure. E.g., say a question is placed at about -1.0 logits on the measure scale, and it's one of the 'easiest' items. I would have thought the rating scale column for that item would show a rating scale where the top score was at a lower logit level than the other items, and so on, but this doesn't happen, so I am not sure how to relate the two pieces of information.
Any advice would be much appreciated.

MikeLinacre: Imogene, thank you for your question.
The columns on the Table 6 "rulers" are designed to be additive. So the item location is added to the category location.
You want to reposition the category location to include the item location. This is a good idea, but not currently supported in Facets.
Must think about how to act on your request ...

905. Manipulating file output from WINSTEPS

ImogeneR December 20th, 2007, 12:54am: Hi,
Does anyone know if there is any way we can manipulate or customise file output from WINSTEPS? I.e., we want to import distractor stats from Winsteps output via Excel into our item database, but we would need to transpose the output as it is. I was also wondering if there is any way to 'build' an output file (apart from the observation (XFILE=) file) by selecting variables. The current outputs are great, but we are looking to create an import file for our item performance history database that has variables that are currently in different output files, and we are trying to avoid manually rearranging the data.

MikeLinacre: Imogene, thank you for your request.
1. Transposing and selecting variables.
These can be accomplished in Excel. How about an Excel macro?
"Record macro" - then do what you want with the Excel file (delete columns, transpose, etc.) - then "Stop recording". Save the macro to make it permanently available.
2. Variables that are currently in different output files.
This is more challenging. You could use Excel to massage each output file. Perhaps then the database software could combine them.

906. Winsteps & Excel

Raschmad December 13th, 2007, 4:38pm: Hi Mike,
Thanks for your reply on the sub-measures.
I have another question about Excel.
In the former versions of Winsteps when I got DPF plots and scatter plots with 95% control lines, there were 3 buttons at the bottom of the each plot which gave options such as "plot empirical line", "plot identity line" , "local measures", "worksheet", etc. The plots that I get these days at the bottom have some drawing and shaping options, where can I get those previous stuff? Has Excel changed or Winsteps?
I'm using Excel 2003 and Winsteps 3.64.2.
Thanks a lot

MikeLinacre: Raschmad, on your Excel screen, the "drawing and shape options" are the Excel drawing toolbar. Right-click on the toolbar to see the tool-bar menu. You can uncheck "Drawing". The three buttons are behind the toolbar. You can probably scroll down (right-hand side) to display the buttons.

907. DPF summary stats.

Raschmad December 12th, 2007, 7:47pm: Hi Mike,
Does Winsteps give summary statistics (mean, SD etc.) of different sections of a test?
If you run DPF and obtain, say, 3 measures for each individual on sections A, B & C of a test, can one get three means and SDs to compare students' performances on these sections?

MikeLinacre: Thank you for your questions, Raschmad.
It looks like you want sub-measures estimated for each individual for each section of the test, based on the overall set of item difficulties. Then summarized.
One approach would be:
1. Include a code in the item label indicating test section.
2. Analyze everything, write out an IFILE= if.txt (and SFILE= if polytomies).
3. Analyze first section: IAFILE=if.txt ISELECT=first section. Produce summary statistics.
4. Analyze second section: IAFILE=if.txt ISELECT=second section. Produce summary statistics.
5. similarly for the other sections ....
This approach would be more robust than basing results on a DPF analysis.

908. (Polytomous) Rasch vs. (Ordered) Logit

Jet December 11th, 2007, 2:05pm: Hi dear readers,

I was wondering if the ordered logit model can be seen as the Polytomous Rasch model? I would say they are alike, am I mistaken here? :-/

Thanks in advance,

MikeLinacre: Thank you for your questions, Jet.
Models based on the log-odds of adjacent categories of a rating scale are Rasch models: adjacent logit, transition odds, continuation ratio models.
Models based on the log-odds of accumulated categories are not Rasch models: ordered logit, cumulative logit, graded response models.
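
As an illustration of the adjacent-category property, the sketch below computes category probabilities under an Andrich rating scale parameterization (an assumed choice for this example, with made-up parameter values) and checks that the log-odds of adjacent categories are simply theta - delta - tau_k. The cumulative log-odds are computed alongside for contrast.

```python
import math

def rating_scale_probs(theta, delta, thresholds):
    """Category probabilities 0..m for ability theta, item delta, thresholds tau_k."""
    # Running sums of (theta - delta - tau_k) are the log-numerators.
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + (theta - delta - tau))
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]

theta, delta, thresholds = 1.0, 0.2, [-1.0, 0.0, 1.0]   # hypothetical values
probs = rating_scale_probs(theta, delta, thresholds)

# Adjacent-category log-odds: log(P_k / P_(k-1)) = theta - delta - tau_k.
adjacent_logits = [math.log(probs[k] / probs[k - 1]) for k in range(1, len(probs))]

# Cumulative log-odds: log(P(X >= k) / P(X < k)), the basis of ordered logit models.
cumulative_logits = [math.log(sum(probs[k:]) / sum(probs[:k])) for k in range(1, len(probs))]
```

The adjacent log-odds come out as 1.8, 0.8 and -0.2, exactly theta - delta - tau_k, whereas the cumulative log-odds do not reduce to that additive form under this model.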

909. scatter plot confusion

harmony December 8th, 2007, 5:09pm: Hello all, and thanks Mike for your response to my last post. I'm new at this and attempting to link tests using, in one case, common items and, in another, virtual linking. After creating a crossplot of the selected items in either instance, we are instructed to "draw a useful line." I have interpreted this to mean a line that approximates the range of my data (my data goes below the x-axis on a few points to the left of the y-axis and above the x-axis on many points to the right of the y-axis, and I have drawn a diagonal line that closely resembles and even intersects at least two of those points). From the linked data points I have calculated a slope. The instructions in Winsteps indicate that a best fit line has a slope near 1. In both instances my slope is near 1. I wonder, however, how far away from 1 is a problem?

It also says, however, that: "slope = (S.D. of Test B common items) / (S.D. of Test A common items) i.e., the line through the point at the means of the common items and through the (mean + 1 S.D.). This should have a slope value near 1.0. If it does, then the first approximation: for Test B measures in the Test A frame of reference: Measure (B) - Mean(B common items) + Mean(A common items) => Measure (A)"

At this point everything becomes unclear and I feel compelled to ask a number of questions that will most certainly reveal my ignorance! "slope = (S.D. of Test B common items) / (S.D. of Test A common items)" The SD for common items must be different from the SD for the entire test, right? Any quick/easy methods of calculating this using Winsteps data? "i.e., the line through the point at the means of the common items and through the (mean + 1 S.D.)." This one sets my head spinning! Is there any way to explain the above quotes in a way that is easier to digest? Is my original idea of a useful line truly useful, or is it useless? :-/

MikeLinacre: Harmony, let's see what can be done to clarify matters:

A. "I wonder, however, how far away from 1 is a problem?"
If in doubt, do it both ways and compare the results. If the results don't lead you to any different actions, then there isn't a problem.
Remember the natural statistical fluctuations in the data, so don't become too precise. Next data collection your results will be slightly different, so allow for this.

B. "easy methods of calculating this using Winsteps data?"
If you include a code such as C= "common", U="unique" in column 1 of the item labels, then
1. analyze your test
2. Output Tables menu
3. Table 27 - item subtotals based on column 1: $S1W1
This gives you the mean and S.D. of the C items and the U items and all items.

C. "my original idea of a useful line truly useful?"
All this isn't as complex as it appears. Imagine that you have two sets of measures: the heights of your friends wearing shoes (measured with a tape measure), and the heights of your workmates with bare feet (measured with a laser gauge). You want to discover the relationship between the two sets of heights.
So you identify friends who are also workmates and cross-plot their two heights.
On the plot you draw a line relating to the two sets of numbers.
You expect to see an adjustment for the height of the shoes.
You also expect that the slope of the line will be near 1.0 (tape measure and laser work the same), but the slope may be far from 1.0 if the tape measure is in inches and the laser in centimeters. If the slope is 1.01, the tape measure may be slightly stretched, or the laser gauge miscalibrated, but that is too small a departure to be of concern.
And you would omit as a "common" person anyone who was too far away from the general trend line - obviously something has changed, perhaps the person was slouching on one occasion.
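
The quoted slope and first-approximation formulas can be tried out on a few numbers. In this sketch the common-item difficulties are made up, `link_b_to_a` is a hypothetical helper, and the slope-adjusted branch is an assumed extension for the case where the slope is not close to 1 (the quoted text only gives the first approximation).

```python
import statistics

def link_b_to_a(measure_b, common_a, common_b, use_slope=False):
    """Express a Test B measure in the Test A frame of reference."""
    mean_a = statistics.mean(common_a)
    mean_b = statistics.mean(common_b)
    # slope = (S.D. of Test B common items) / (S.D. of Test A common items)
    slope = statistics.pstdev(common_b) / statistics.pstdev(common_a)
    if not use_slope:
        # First approximation, appropriate when the slope is near 1.0.
        return measure_b - mean_b + mean_a, slope
    # Assumed extension: also rescale by the slope.
    return mean_a + (measure_b - mean_b) / slope, slope

# Hypothetical difficulties of the same four items in each analysis.
common_a = [-1.0, 0.0, 1.0, 2.0]
common_b = [-0.5, 0.5, 1.5, 2.5]
linked, slope = link_b_to_a(1.0, common_a, common_b)
```

With these numbers the slope is 1.0 and a Test B measure of 1.0 becomes 0.5 in the Test A frame. The means and S.D.s of the common items would normally come from the subtotal table mentioned in part B rather than being computed by hand.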

harmony: Thanks again Mike!

I thought I understood, but your response gives me confidence. I actually calculated the SD of both tests manually last night and got a slope of 1.04, which was slightly better than the slope of the line in my crossplot (.92).

Just to be sure that I've truly got this, when I do the virtual linking by setting USCALE = 1/slope, which slope is the better figure to use, and does it really matter? Also, in putting in UMEAN = x-intercept, I used the point where my line on the scatterplot crossed the x-axis. To get this figure I changed the scale of the x-axis in Excel so that the point of intersection was readable, and it came out to .006. (I assume that since I'm using the point where the line with a slope of .92 crosses the x-axis, I should use the slope of this line for the USCALE = calculation, and this is what I've done.) Is this correct?

Using these figures the linking works without any misfitting items. Thanks a lot for your help. It's a really amazing program and I've been having a lot of fun with it!


harmony December 6th, 2007, 11:50am: Hi all:

I have what is probably a very basic question. In equating tests it is suggested to begin with item polarity for each test separately and we are instructed to: "check that all items are aligned in the same direction on the latent variable, same as Table 26."

How do we know that they are aligned in the same direction? Does this mean that the positive and negative measures are not mixed?

MikeLinacre: Harmony, thank you for this question.
"same direction" means "higher score on the item = more of the latent variable".
For instance, on a multiple-choice item, we expect people who succeed on an item to have a higher ability, on average, than people who fail on an item. But if the item is miskeyed, then the reverse will be true.
Or, on a survey, we expect people with higher scores on the rating scale also to have higher scores on the survey. But if some items are written negatively, but not rescored, this will not be true.
The crucial indicator is the point-measure correlation for each item. This captures whether "more on the item = more on the latent variable". When the correlation is negative, then "more on the item = less on the latent variable" meaning that the item is contradicting the other items, rather than contributing to measurement.
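
A minimal Python sketch of the check described above, assuming the point-measure correlation is computed as an ordinary Pearson correlation between the scores observed on one item and the measures of the persons who responded to it. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean


def point_measure_correlation(item_scores, person_measures):
    """Pearson correlation between scores on one item and person measures.

    Missing responses are passed as None and skipped.
    """
    pairs = [(s, m) for s, m in zip(item_scores, person_measures) if s is not None]
    scores = [s for s, _ in pairs]
    measures = [m for _, m in pairs]
    mean_s, mean_m = mean(scores), mean(measures)
    cross = sum((s - mean_s) * (m - mean_m) for s, m in pairs)
    ss_s = sum((s - mean_s) ** 2 for s in scores)
    ss_m = sum((m - mean_m) ** 2 for m in measures)
    return cross / sqrt(ss_s * ss_m)


# Hypothetical data: four persons, ordered from low to high measure.
measures = [-1.0, -0.5, 0.5, 1.0]
well_keyed = [0, 0, 1, 1]   # more able persons succeed
miskeyed = [1, 1, 0, 0]     # more able persons "fail"

print(point_measure_correlation(well_keyed, measures))  # positive
print(point_measure_correlation(miskeyed, measures))    # negative
```

A negative value flags an item whose scoring runs against the latent variable, such as a miskeyed item or a negatively worded item that was not rescored.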

911. measure order

ning December 5th, 2007, 6:55am: Dear Mike,
I have an instrument with polytomous items; higher codes represent higher ability. Could you confirm the measure order of the items in Table 13.1? That is, the item with the highest measure means that respondents tended to endorse the lower categories more frequently; in other words, they tended to have lower ability with respect to the content of that item, correct?

MikeLinacre: Gunny, Table 13 of Winsteps (items in difficulty measure order) is organized so that higher measure = lower score on the item. In other words, higher measure = lower performance by person sample on the item.

912. SPSS

happiereveryday December 3rd, 2007, 5:01am: Dear Mike,
Thank you very much for your prompt reply! One problem still exists. After I specified the "txt" format, FACETS fails to run and the warning reads,
"Specification is: ?B,?,?B,R ; 1=examinee, 2=sex, 3=rater, 4=item, observations are scores
(Facets) Error 11 in line 9: All Model= statements must include 4 facet locations + type [+ weight]
Execution halted"

My specifications in this part are as follows:
?B,?,?B,R ; 1=examinee, 2=sex, 3=rater, 4=item, observations are scores
; look for interaction/bias between examinee and rater
?,?,?B,RB ; look for rater x item interaction
; weighted 0.5 because all data points entered twice

In my data, each examinee is rated in 4 items by 2 raters.
I can't figure out where the problem is. Would you please help me?
Thank you so much!

MikeLinacre: Certainly, happiereveryday!
Is this your conceptual model?
Examinee + (sex) + rater + item = rating

This has 4 facets:
1. Examinee
2. Sex (dummy)
3. Rater
4. Item

Models =
?B, ?, ?B, ?, R
? , ?, ?B, ?B, R

Example data point:
Examinee 27 of sex 2 is rated by rater 14 on item 3 and given a rating of 1
data =
27, 2, 14, 3, 1

You are probably entering all 4 items consecutively, so
data =
27, 2, 14, 1-4, 2,3,1,2 ; 4 ratings on the 4 items

But 1-4 is in all the data records, so we can simplify:
dvalues =
4, 1-4 ; facet 4 is always 1-4 in the data
data =
27, 2, 14, 2,3,1,2 ; 4 ratings on the 4 items
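
To make the range shorthand concrete, here is a Python sketch of how a data line such as "27, 2, 14, 1-4, 2,3,1,2" corresponds to four separate observations. This only illustrates the layout explained above; it is not how Facets itself reads the file, and the function name is hypothetical.

```python
def expand_record(line, n_facets=4):
    """Expand one data line into (element, ..., rating) tuples.

    The first n_facets fields are element numbers; at most one of them
    may be a range such as "1-4". The remaining fields are the ratings.
    Anything after ";" is treated as a comment.
    """
    body = line.split(";", 1)[0]
    fields = [f.strip() for f in body.split(",") if f.strip()]
    elements, ratings = fields[:n_facets], fields[n_facets:]
    range_positions = [i for i, f in enumerate(elements) if "-" in f]
    if not range_positions:
        if len(ratings) != 1:
            raise ValueError("a line without a range carries exactly one rating")
        return [tuple(int(e) for e in elements) + (int(ratings[0]),)]
    if len(range_positions) > 1:
        raise ValueError("this sketch handles one range per line")
    pos = range_positions[0]
    low, high = (int(x) for x in elements[pos].split("-"))
    if high - low + 1 != len(ratings):
        raise ValueError("number of ratings must match the range")
    rows = []
    for element, rating in zip(range(low, high + 1), ratings):
        row = elements[:pos] + [str(element)] + elements[pos + 1:]
        rows.append(tuple(int(e) for e in row) + (int(rating),))
    return rows


for row in expand_record("27, 2, 14, 1-4, 2,3,1,2 ; 4 ratings on the 4 items"):
    print(row)
```

Each printed tuple is (examinee, sex, rater, item, rating), which is the single-observation form shown in the first example data point.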

happiereveryday: Mike,
Thank you very much for your help. Now I have run into another problem. My data specifications look like this:

Facets = 4 ; there are 4 facets
Positive = 1 ; for examinees, greater score, greater measure
Noncentered= 1 ; examinee facet floats


2, sex (dummy)
1= male
2= female
3, rater
4, item
dvalue= 4, 1-5 ; facet 4 in the data is imputed as 1-5
1,2,1,28,40,70,70,75 ; 5 scores on the 5 items

The run reports "Invalid datum location: 1,2,1,1,28 in line 29. Datum "28" is too big or not a positive integer, treated as missing." and no results are shown.
I don't know whether RASCH only supports rating scale data, because my data are scores ranging from 0 to 100 (the best). Would you please help me with this particular data specification?
Thank you very much.

MikeLinacre: Happiereveryday, please specify the length of your rating scale to Facets:

But there is another concern here. Are all values in the range 0-100 observable? Were all those values observed in your dataset?

R100 specifies that all values 0-100 are observable. But your example data indicate that only multiples of 5 are observable. If so, you will need to recode your data. For instance if only multiples of 5 are valid data, then they need to be recoded from 0-100 to 0-20.


Rating scale = recode, R20, General, Keep
0 = 0 , , , 0
1 = 5 , , , 5 ; recode 5 to 1, but label it "5"
2 = 10 , , , 10 ; recode 10 to 2, but label it "10"
20 = 100 , , , 100
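
If the recoding were done before the data reach Facets, it could look like the Python sketch below. It assumes, as in the reply above, that only multiples of 5 between 0 and 100 are valid; any other value (such as the 28 in the example data) is returned as None so that it can be checked by hand. The function name is hypothetical.

```python
def recode_score(raw, step=5, top=100):
    """Map a 0-100 score in steps of 5 onto the categories 0-20.

    Returns None for values that are not valid multiples of `step`.
    """
    if not isinstance(raw, int) or raw < 0 or raw > top or raw % step != 0:
        return None
    return raw // step


scores = [28, 40, 70, 70, 75]
print([recode_score(s) for s in scores])
```

The mapping matches the rating-scale lines above: 0 stays 0, 5 becomes 1, 10 becomes 2, and 100 becomes 20.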

happiereveryday: Mike,
Thank you so much for your generous help! Now RASCH works smoothly.

913. SPSS or excel

happiereveryday December 1st, 2007, 9:41am: Dear Mike,
I'm a new member here and I'm learning FACETS now. The problem is, my data are in SPSS format and I don't know how to put them into FACETS, though I've gone over the FACETS manual. Would you be kind enough to tell me what to do? If FACETS supports SPSS format, do I still need to further specify the data in the "specifications=?"
Thank you very much!

MikeLinacre: Happiereveryday, Facets currently doesn't support SPSS.
You need to write the data from SPSS into a CSV format data file, which will be specified by "data=" in the Facets specification file.
Organize your SPSS file so that each case (row) is in Facets format.
For instance, if your data have 3 facets and 5 responses to the 5 items, then the SPSS case might be:
examinee number, judge number, response to item 1, item 2, item 3, item 4, item 5.
Save this as: datacsv.txt
The Facets specification file will include:
facets = 3
models = ?,?,?,R
1, examinees
(list of examinee numbers)
2, raters
(list of rater numbers)
3, items
(list of 5 items)
dvalues= 3, 1-5 ; facet 3 in the data is imputed as 1-5
data = datacsv.txt
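
For data that are already in memory rather than in SPSS, the same comma-separated layout can be written with a few lines of Python. This is only a sketch of the file layout described above (examinee number, judge number, then the five responses); the function name and the example case are hypothetical.

```python
import csv
import io


def write_facets_csv(cases, handle):
    """Write one line per case: examinee, judge, response 1 ... response n."""
    writer = csv.writer(handle, lineterminator="\n")
    for examinee, judge, responses in cases:
        writer.writerow([examinee, judge, *responses])


# Hypothetical case: examinee 27 rated by judge 14 on the 5 items.
cases = [(27, 14, [2, 3, 1, 2, 3])]
buffer = io.StringIO()
write_facets_csv(cases, buffer)
print(buffer.getvalue(), end="")
```

To produce the file named in the specification, pass a handle opened with open("datacsv.txt", "w", newline="") instead of the in-memory buffer.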

914. Anchoring Groups in Winstep

DAL August 30th, 2007, 1:29pm: Hi there,
This is probably a basic question, but after searching I have been unable to find the answer.

A Michigan test was given to about 950 students at the beginning of the year and again to 250 of them near the end. I have run the two tests separately in Winsteps, and also together in one control file, and of course the results are slightly different. Is it better to run them separately or together?

If running them together is better, is there any way of anchoring the 1st administration, and just capturing the difference between the two administrations? At the moment I am putting the figures into excel and subtracting one from the other.

Finally, the Michigan test has 100 MC items, but it has listening, grammar, vocabulary and reading sections. Is there any way of isolating these sections and running Winsteps on them without creating 4 new control files?

Thanks for your help!

MikeLinacre: Thank you for your questions, DAL.
1. "Is it better to run them separately or together?"
a. Run them separately and check that all is correct in both analyses.
b. Cross-plot the item difficulties (e.g., in Winsteps, "Plots", "Compare Statistics"). Do you see an approximate identity line?
c. If so, analyze together.
d. If not, are there only a few outlying items on the plot? If so, these are no longer "common items", and should be treated as different items on the two instruments in a combined analysis.
e. If the plotted line has a slope far from an identity line, then you are in a "Fahrenheit-Celsius" situation. Perform the two analyses separately. Then compare or combine the results in the same way you would if the two test administrations had been thermometers with different scaling.

2. "capturing the difference" - your Excel approach is as easy as any, and has the advantage that you know exactly what is happening at each step. If all you need is to compare mean performances, then put a code in the person label, something like "0" for only administered at time 1, "1" for administered at time 1 and will be at time 2, "2" for administered at time 2 and was at time 1. Then use PSUBTOT= and Table 28 to obtain summary statistics for the three sub-samples.

3. "isolating sections" - In Winsteps, put a code in the item label to indicate section, e.g., the first letter of the item label could be L,G,V,R. Then use ISELECT=L (in the Control file or at the "Extra ..." prompt) to select only the listening items for analysis.

DAL: Hi Mike,
Thanks for your lightning quick reply. This is going to keep me busy.

Even if exactly the same test is used, the items could be considered 'different' if taken at a long enough interval apart? Oh well, I will see on the crossplot!


DAL: Have run the analysis as suggested, and they seem to have come out ok.

Using 'plots-compare ' I plotted measures of the first admin against the second admin measures, and in the 'identity' it comes out as two straightish lines, pretty much parallel, with the dotted line in between, which I'll take as a good sign. Especially compared to page 246 of the manual which are crazy by comparison.

One thing though, they only seem to be doing the first 10 MC items. At least only L1 to L10 are listed in the plot and the scatter graphs. How do I get it to look at the full set?

Thanks again

MikeLinacre: DAL, "they only seem to be doing the first 10 MC items" - are you specifying ISELECT=L? In which case only the "L" items will be analyzed. How about NI=? Is that the full number of items on the test?
"two straightish lines" - Oops! One line is probably the one you want, the other is showing some type of differential test performance. See the same problem in a slightly different situation at www.rasch.org/rmt/rmt72b.htm

DAL: Hi Mike, finally got back to this project
After some more rooting around I have found the embarrassingly simple problem. I need to enter an output file (IFILE or PFILE not the control file!!). Sorry for the basic question, but how do I get an IFILE? My question is too basic for the manual because although I can find an explanation of the IFILE I can't work out how it is produced.

MikeLinacre: DAL, IFILE= is an output file of item measures and statistics. PFILE= is the same thing for the persons.
If IFILE=(filename) and/or PFILE=(filename) are included in the Winsteps control file, then these output files are produced automatically immediately after the measures and fit statistics are computed.
Also on the "Output Files" pull-down menu, IFILE= and PFILE= can be clicked to produce those files interactively after estimation has been performed.
These files have the same format as the IAFILE= and PAFILE= input anchor files.

DAL: Thanks!

With the full set of items the comparison of means looks remarkably similar to what I had before: two parallel lines (well, slightly curved into each other) with a dotted line down the middle. The dotted line is the approximate identity line, right? Most of the items are between the two lines, but there are about 15 items out of the 100 outside to varying degrees. It looks similar to the graph in https://www.rasch.org/rmt/rmt72b.htm.

This shows the items are functioning differently for the two groups, and is surely what I want: the students at the end of the year should be performing better than they did at the beginning of the year. I guess the next step is to remove the items that are outside the two parallel lines, put the two data of the remaining items together from the two administrations and analyze them in one winsteps file. Is this right?

Looking forward to your answer,

DAL: And just to let you know what else I've been doing: I managed to get anchoring sorted out. I did a quick comparison of the measures obtained by four separate Rasch analyses: first admin only (967 stds), second admin only (274 stds), second admin with persons anchored from the first admin, and finally, first & second lumped in together.

Results: Because all of the items were the same and most of the people were the same, the person-anchored item measures came out in exactly the same order as the second administration alone, though anchored results were with one exception 2-3% higher. The first admin only and the first+second were in a different order to each other and the anchored analysis (though not radically so), but had a similar range, not surprising given the greater number of students in the first administration.

Hmm. I guess before anchoring I should weed out those items that are outside -2 and 2 t outfit and infit, run the pre-admin without these items, then feed in the new item measures into the anchor file. This would give me more accurate results, right?

I have to admit though, I feel pleased with myself for getting anchoring sussed (baby steps, baby steps!).

MikeLinacre: Dal, you've somewhat left me behind, but it sounds like you know what you are doing. Congratulations!

DAL: Oh thanks!

Okay let me clarify the question I'm attempting to ask:

Exactly the same 100 question MC test is given twice to the same students, 1 year apart. I have done a Rasch analysis of the first administration and used the person measures & the PAFILE command to anchor the students for the second analysis. However, in the first test there were a lot of misfitting and overfitting items (20 with infit z scores 2 or above, and 10 -2 or less).

Should I take these badly misfitting and overfitting items out of the anchor analysis? It seems rather a lot of data to take out...

MikeLinacre: Dal, Infit z's tell you how unlikely it is to observe the response string when the data fit the model, but they don't tell you how much damage the response string is doing to the measures.
The infit mean-squares (and outfit) report on the damage. Usually this damage is large enough to be noticeable only when the mean-square is 2.0 or greater.
If an item mean-square is greater than 2.0, examine the response string to identify the source of the problem. Is it a "bad item" - in which case, drop the item. Or is it misbehavior by some persons in your sample, such as guessing on the later items in the test as time runs out? In which case, you may want to downweight some of the misbehaving sample (PWEIGHT=0) for the purposes of item estimation.
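
For readers who want to see what the mean-squares summarize, here is a Python sketch using the usual textbook definitions for dichotomous data: outfit is the average squared standardized residual, and infit is the sum of squared residuals divided by the sum of the model variances. The function names and data are hypothetical, and Winsteps' own computation may differ in detail.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(difficulty - ability))


def item_mean_squares(responses, abilities, difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    squared_residuals = []
    variances = []
    squared_z = []
    for x, ability in zip(responses, abilities):
        p = rasch_probability(ability, difficulty)
        variance = p * (1.0 - p)
        residual_sq = (x - p) ** 2
        squared_residuals.append(residual_sq)
        variances.append(variance)
        squared_z.append(residual_sq / variance)
    outfit = sum(squared_z) / len(squared_z)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical item of difficulty 0: a low performer succeeds unexpectedly
# and a high performer fails unexpectedly.
infit, outfit = item_mean_squares([1, 1, 0], [-3.0, 0.0, 3.0], 0.0)
print(f"infit={infit:.2f}, outfit={outfit:.2f}")
```

In this example both values are well above 2.0, which is the range where the reply above suggests inspecting the response string before deciding whether the item or some of the persons are the problem.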

DAL: A belated thanks for that answer, it really clears it up for me.

Finally returning to the data, I decided to compare the item facility with the measures and found that the lower the measure the more difficult the item. I had just assumed that higher measure meant the item was easier, not more difficult. I'd been looking at things the wrong way round all along! Doh!

I thought the command below controlled it:
0 = wrong
1 = right

but a quick experiment proved me wrong.

What's the command for making higher measure = easier item? I searched the help but couldn't find it.

MikeLinacre: Dal: this is mysterious! The Winsteps default is definitely "lower total score on the item = more difficult item = more positive item difficulty measure".
If that is not happening for you, and you can't explain it, then please email me your Winsteps control and data file to see what is going on .....

Raschmad: Hi DAL,
Just out of curiosity: how have you set up your data for a combined analysis?
Have you put all the items in columns and persons in rows?
That is, as if the test had 200 items all together?
250 persons have attempted all the 200 items and (950-250=) 700 persons have attempted only 100 items?
So you have 200 columns for items. 101 to 200 being the same as 1 to 100?

This is what I can think of. Is there any other way round it?


915. item pvalue

ning November 28th, 2007, 3:19am: Dear Mike,
I understand that the item p-value for dichotomous items is the proportion correct [0-1]. I have polytomous items with 3 categories. I set PVALUE=YES, and the p-values are >2. What does this mean? Can I get a p-value between 0 and 1 for each polytomous item? For instance, what if I'd like to know whether an item's fit or misfit is significant at 5%? Thanks.

MikeLinacre: Gunny, yes, the item p-value for dichotomous items is the proportion-correct-value, i.e., the average score of the persons on the item. Winsteps similarly computes the average score of the persons on polytomous items as their "p-value".
The item p-value is not the probability-value of a hypothesis test of fit to the model. That is the role of the ZSTD values, each of which is a probability-value expressed as a unit-normal deviate.

916. criteria for fit statistics

ning November 26th, 2007, 5:12pm: Dear Mike,
Based on my understanding, items must fit before residual factor and DIF analysis. However, brief literature review revealed that there are many criteria for fit statistics that one can use. For example, for infit & outfit mnsq:
1. 0.5-1.5
2. <1.3
3. 0.4-1.4
4. <1.7
5. infit: 0.8-1.2; outfit: 0.6-1.4
6. 0.6-1.4

I know you've provided the rule of thumb at many places but I'm still uncertain about this, could you please provide a little more guideline on this or narrow down the options since it's relevant to decision making? Thanks, Mike.

MikeLinacre: Gunny, thank you for asking for clarification. There is an immense literature about Rasch fit statistics, and much depends on your situation.
Fit statistics for high-stakes carefully-constructed multiple-choice tests can be much stricter than fit statistics for clinical observation data of gun-shot victims. So we need fit statistics that are strict enough to give useful measures for our purposes, but not so strict that measurement becomes impractical.
A place to look is https://www.rasch.org/rmt/rmt83b.htm

917. goodness-of-fit and PCA

godislove November 26th, 2007, 1:15pm: Hi Mike,
I understand the use of both PCA and goodness-of-fit (I think!), but one question puzzles me about using both methods. If PCA showed that the assessment tool is unidimensional but two items misfit, how do we explain this, since item fit can also be used to assess the dimensionality of an assessment tool? Can we say that item fit describes the relationship between the item and the model, but not the dimensionality of the tool?

Please advise! thanks!

MikeLinacre: Thank you for these questions, godislove. There are an infinitude of different ways that the data can misfit the Rasch model. We are concerned about two of those ways here:
PCA is intended to detect multidimensionality across items (or persons). This can be too subtle to appear as misfit to individual items. For instance, a test combining reading and math items often shows both types of item as fitting the Rasch model on an item-by-item basis, but PCA shows that there are indeed two types of items.
Item-level misfit investigation is intended to detect violations of the Rasch model on an item by item basis, such as a miskey to a multiple-choice question, or a badly worded survey question. These are not usually part of a multi-dimensional (shared across items) structure.

918. person/item measures and perfect scores

ning November 25th, 2007, 6:10am: Dear Mike,
I run the analysis using "XMLE=YES," hence, out of 2300 respondents, only about 1900 were estimated since about 400 had perfect scores and they were removed from the parameter estimation. However, "PFILE= " give the person measures on all 2300 people.....can I still line the estimated person logit measures and the item logit measures on the same metric? Since perfect scores add little value to the probabilistic model, even though the item measures were based on 1900 respondents, will it be a problem if I line them up without re-estimate the item measures on the full sample? Thanks

MikeLinacre: Gunny, XMLE=YES is experimental - particularly for perfect scores. Thoroughly recommend using the standard Winsteps defaults for making sense of perfect scores. From the perspective of Rasch estimation and fit, perfect scores are irrelevant, so whether there is 1 or 4 perfect scores makes no difference. But perfect scores do make a considerable difference to what the measures are telling us about our sample.

919. Construct validation

Raschmad November 23rd, 2007, 6:35pm: Dear Mike,
I was just wondering how and why item hierarchy from easy to difficult and person hierarchy from more able to less able provide evidence for construct validity.
It seems axiomatic to me that whenever we give a set of items to a group of people some items turn out to be harder and some easier; some persons more able and some less able.
What is so important about this obvious matter?

MikeLinacre: Raschmad, what you say is correct, but it omits the substance of the items. How do we know what the test is measuring? It is by looking at the content of the items. As measures get larger, what does that mean? That is the "construct". The most direct way of determining that is to see what is the substantive difference between easy items and difficult items.
In the early days, at Rasch meetings a speaker would present the results of an analysis and declare what they meant. But someone in the audience would look at the item hierarchy in the handout, and present a different interpretation of the construct - sometimes an opposite one! The reason was that the speaker had forgotten which way the items had been scored: did more ability = higher score, or more disability = higher score? Now we almost always include that information in the published paper.

920. item misfit and instrument

ning November 21st, 2007, 6:52pm: Hi Mike,
I have one item that misfit consistently in all subgroups, but the variance explained in all subgroups are all >88% from residual factor analysis and, the variable maps show that the particular misfit item does not line up with any of the person locations, it's located way at the top of the map and persons are all at the lower part of the map. Given these three information, can I say that since sufficient variance explained by the measure, misfit is due to insufficient information available for that item. At the same time, the instrument is still measuring the same construct or unidimensional? Are there any other statistic should look into to draw such conclusion?
Thanks Mike and Happy Thanksgiving!

MikeLinacre: Gunny, thank you for your question. You have a similar situation to that in the "Liking for Science" data, example0.txt - a few low performers score unexpectedly highly on some very difficult items. This is also what lucky guessing looks like. So there is probably a small cluster of people with an unusual profile.
Here are some thoughts.
1. These may be an interesting diagnostic group on their own - perhaps the "idiot savants".
2. This may be an indication that way-off-target responses should be treated as missing data. They are reflecting irrelevant behavior.
3. This may be an indication that the variable has changed its nature as it reaches its extreme. This is what happens with mathematics. A mathematics professor may be excellent at set theory, but also make simple arithmetic errors computing his own grocery bill.
4. It is unlikely to be a statistical artifact that will disappear with more data (i.e., more statistical information), unless your data is very thin.
Has anyone else any ideas or insights about this?

ning: Hi Mike,
Thanks for your reply. You're right, this won't be a statistical artifact. Nonetheless, I think it's too systematic across the subgroups to attribute this finding to one particular group of people as outliers. I'm looking into this but so far I'm still puzzled by this "one item consistently misfit in all 11 subgroups while variance explained by the measure is greater than 88-90% in all 11 subgroups."

MikeLinacre: Quote: "one item consistently misfit in all 11 subgroups while variance explained by the measure is greater than 88-90% in all 11 subgroups".
Thank you, Gunny. The "variance explained" is largely controlled by the range of item and person measures. If the ranges are wide, then the "variance explained" is going to be large. Misfit to the Rasch model has to be huge (mean-squares > 2) before it has any noticeable effect on variance explained. This is why "rejecting misfitting items" is so often counterproductive. Unless mean-square > 2, rejecting an item is rejecting more "variance explained" than "variance unexplained" and even then, there are enough other contributing factors that the only way to know is to do the analysis with and without the suspect item.

921. Point Estimation

ziauddinb November 19th, 2007, 8:06am: The analysis of point estimates (for example, WLE, MLE or EAP estimates) of students’ abilities can often yield misleading results. Why is this the case and in what ways can they be misleading?

MikeLinacre: Ziauddinb, thank you for your question. Do you mean the analysis of point estimates as though they are perfectly precise? If point estimates are used in regression analysis (without their accompanying standard errors), then the computed regression error terms are too small, so that differences are reported as more significant than they really are. Similarly for other statistical procedures.
An example of point estimates used this way is test scores. Suppose a child scores "15 out of 20". "15" is a point estimate of the child's ability. It is often analyzed as though it is perfectly precise, even though its standard error is probably of the order of 2 score points.

922. A basic question on item difficulty

ning November 13th, 2007, 6:32pm: Hi Mike,
I have a basic question on item difficulty that needs clarification. According to Rasch hypothesis, item measures should be invariant across sub-populations and over time. But items and persons are measured simultaneously each time, hence, measures differ across populations and over time as person abilities change, does this mean Rasch hypothesis do not hold? In real data analysis, item measures do differ each time, Mike, am I looking at a wrong direction? Thanks.

MikeLinacre: Yes, Gunny, "item measures should be invariant", but so also should bricks be when we are building a house. But every brick differs slightly. The crucial question is "Are the bricks similar enough to be considered identical for practical purposes?"
For bricks, the basic shape, size and color are usually enough. For items there are similar considerations. One way to look at invariance is to do a DIF (differential item functioning) analysis. Another is to cross-plot the item difficulties from different time-points.
When we find that item difficulties have changed (due to DIF, item drift, etc.), then we have a challenge. How are we going to compare person performances for different sub-populations or across time? We usually look for a core of items that are invariant for any particular comparison. The other items become "different" items for each of the compared samples.

ning: Thanks, Mike,
I do believe and found that no items across subgroups or over time have the "identical" measures, hence, how do I locate the "core of items that are invariant?" Will DIF and fit statistics be sufficient to determine that? Also, if you do racking and stacking, say racking, you are looking for the item differences over the different time periods, holding the person measures constant, are we taking the "absolute" changes of the item measures between the two time points? In other words, if I have an item measure at time1=2.3 and at time2=1.8, item at time1 show DIF and no DIF at time2, are we taking the difference=0.5 as the change of value or...? Thanks for helping me out on this confusion.

MikeLinacre: Gunny, there are two aspects to "identical". One is "are they so close that the difference doesn't have any practical effect". The other is "are they so close that the difference could be due to chance."

The first difference is quantified by a logit difference, something like 0.3 or 0.5 logits (but this depends on your situation). It is the same problem with building bricks: how big a difference in length matters?

The second difference is quantified by a statistical t-test:
t = (value1 - value2) / sqrt(s.e.(value1)^2 + s.e.(value2)^2)
With building bricks this would be "is the difference within the natural variance of the manufacturing process?"
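
The second comparison can be sketched in Python as follows. The measures 2.3 and 1.8 come from the question above; the standard errors 0.15 and 0.20 are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def difference_t(value1, se1, value2, se2):
    """Difference between two measures divided by its joint standard error."""
    return (value1 - value2) / sqrt(se1 ** 2 + se2 ** 2)


# Item measure at time 1 = 2.3, at time 2 = 1.8 (from the question).
# The standard errors are assumed for illustration.
t = difference_t(2.3, 0.15, 1.8, 0.20)
print(f"difference = {2.3 - 1.8:.1f} logits, t = {t:.2f}")
```

With these assumed standard errors the joint standard error is 0.25 logits, so the 0.5 logit difference is about two standard errors. Whether 0.5 logits matters in practice is the separate, first question.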

923. residual factor analysis

ning November 12th, 2007, 11:38pm: Hi Mike,

I tried to fit the data extracted from 4 different subgroups. One of the items demonstrated misfit on all subgroups except the 4th subgroup ( >1.5 infit mnsq on 3 subgroups compared to 1.13 infit mnsq on 4th subgroup, and >0.8 outfit mnsq on 3 subgroups compared to 1.20 outfit mnsq on the 4th subgroup). However, on the Rasch residual factor analysis, the 4th subgroup has the largest unexplained variance, 18.8%, compared to 5%-9% on the other 3 subgroups. Further, the unexplained variance from the first contrast for the 4th group is 6% while the others are all <5%. At the same time, the eigenvalues of the biggest residual contrast for all 4 subgroups are equivalent, 1.6-1.8.
I found this counterintuitive, I thought greater misfit should be associated with larger unexplained variance. Mike, could you please explain? Thanks

MikeLinacre: Gunny, thank you for your question. It sounds like you've discovered the reason that PCA of residuals is incorporated in Winsteps! Most of the Winsteps fit statistics are based on a single item or person, or a local view of the data. But this means that less conspicuous, but pervasive patterns of misfit could be overlooked. "Unfortunately, some computer programs for fitting the Rasch model do not give any information about these. A choice would be to examine the covariance matrix of the item residuals, not the sizes of the residuals themselves, to see if the items are indeed conditionally uncorrelated, as required by the principle of local independence" (McDonald RP (1985) Factor Analysis and Related Methods. Hillsdale, NJ: Lawrence Erlbaum. p. 212).
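
The idea in the McDonald quotation, looking at how item residuals relate to each other rather than at their sizes, can be sketched in Python for a single pair of dichotomous items. This is a simplified illustration with hypothetical names and data, not the PCA of residuals that Winsteps performs across all items.

```python
from math import exp, sqrt
from statistics import mean


def standardized_residuals(responses, abilities, difficulty):
    """Standardized residuals of one dichotomous item under the Rasch model."""
    residuals = []
    for x, ability in zip(responses, abilities):
        p = 1.0 / (1.0 + exp(difficulty - ability))
        residuals.append((x - p) / sqrt(p * (1.0 - p)))
    return residuals


def correlation(u, v):
    """Pearson correlation between two equally long lists."""
    mean_u, mean_v = mean(u), mean(v)
    cross = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cross / sqrt(ss_u * ss_v)


# Hypothetical data: four persons, two items of equal difficulty.
abilities = [-1.0, 0.0, 1.0, 2.0]
item_1 = standardized_residuals([0, 1, 0, 1], abilities, 0.5)
item_2 = standardized_residuals([1, 0, 1, 0], abilities, 0.5)
print(correlation(item_1, item_2))
```

Under local independence the residual correlations between items should be near zero. Clusters of items with clearly non-zero residual correlations are the kind of pervasive pattern that single-item fit statistics can miss.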

924. examine outlier people

windysnow November 5th, 2007, 3:52pm: How to examine those individual persons who don't fit the model? Can I plot something like ICC in Winsteps? thanks

MikeLinacre: There are several approaches, Windysnow. An individual usually doesn't have enough responses for a meaningful ICC.
If the persons overfit (mean-squares less than 1.0, ZSTD less than 0) there is nothing to see. The persons succeeded on the easy items and failed on the difficult items.
If the persons underfit (mean-squares greater than 1.0, ZSTD more than 0) then the Person Keyform Table 7.2.1 etc. (right-hand side of Output Tables menu) gives a profile of the responses by each person to the items.

windysnow: Thanks, Mike!

Can anyone help with two further questions please?

I found negative point-measure correlations in Table 10.3. What do they suggest about outfit?

After picking out mean square>2 persons, how should I examine them? how to make use of table 7.2.1

MikeLinacre: A place to start is "dichotomous fit statistics" in Winsteps Help, also at
Good books are "Applying the Rasch Model" (Bond & Fox), "Best Test Design" (Wright & Stone), and their "Measurement Essentials", linked from www.rasch.org/memos.htm

925. how to use demographic info

windysnow November 9th, 2007, 12:11am: I have gender/grade info and have put it into Winsteps. How do I tell Winsteps about it? Where can I specify the columns in the data/control file?

MikeLinacre: Windysnow, thank you for your question.
There are several options.
1. Be sure that the demographics are included in the person label
2. Count the column of the demographic code within the person label, e.g., column 6

To select based on demographic, e.g., column 6="M"
either before analysis (in the control file) for estimation, or after analysis ("Specification" pull-down menu) for reporting

To subtotal person measures based on demographics:
and Table 28

Differential item functioning
DIF = $S6W1
and Table 30

926. Discrimination, one more time.

omardelariva November 8th, 2007, 1:10am: Mike:

Your explanations were very helpful, thank you. I got much more information than I expected. I reviewed the references you gave me and checked a problematic set of data. In this case the difficulty range runs from -1.93 to 1.08 logits, with a minimum p-value of 0.13 and a maximum of 0.69. I expected that the items of extreme difficulty would misfit, but I found an item with a difficulty of 0.76 logits and a negative point-biserial correlation. Could you tell me what is happening? I have attached an output of my data. Thank you.

Item Measure Count Score S.E. Infit-MnSq Infit-ZSTD Outfit-MnSq Outfit-ZSTD PtBis Discrim P-value
77 -1.93 11698 8055 0.02 1.02 1.75 1.07 5.19 0.10 0.94 0.69
59 -1.05 11698 5853 0.02 0.91 -9.90 0.88 -9.90 0.29 1.49 0.50
18 -1.02 11698 5782 0.02 1.02 2.82 1.02 2.59 0.14 0.91 0.49
58 -0.97 11698 5662 0.02 0.97 -5.76 0.96 -5.23 0.21 1.17 0.48
151 -0.61 11698 4722 0.02 1.01 2.21 1.02 1.72 0.15 0.94 0.40
50 -0.54 11698 4566 0.02 0.89 -9.90 0.87 -9.90 0.33 1.40 0.39
146 -0.38 11698 4164 0.02 1.01 0.67 1.00 0.20 0.17 0.99 0.36
185 -0.35 11698 4103 0.02 1.00 -0.23 0.99 -0.94 0.18 1.01 0.35
15 -0.27 11698 3905 0.02 0.96 -4.24 0.96 -3.74 0.23 1.09 0.33
69 -0.18 11698 3704 0.02 1.03 3.85 1.04 3.42 0.13 0.92 0.32
184 -0.12 11698 3561 0.02 1.02 2.65 1.04 2.72 0.15 0.94 0.30
52 0.12 11698 3052 0.02 0.96 -3.38 0.96 -2.92 0.24 1.06 0.26
158 0.39 11698 2549 0.02 1.07 5.45 1.14 7.13 0.08 0.89 0.22
16 0.42 11698 2499 0.02 0.94 -4.43 0.93 -4.08 0.27 1.08 0.21
31 0.49 11698 2369 0.02 0.95 -3.23 0.96 -1.82 0.25 1.05 0.20
45 0.54 11698 2290 0.02 0.95 -3.79 0.96 -1.94 0.26 1.06 0.20
166 0.60 11698 2182 0.03 1.05 3.32 1.10 4.50 0.11 0.94 0.19
314 0.76 11698 1944 0.03 1.16 8.96 1.27 9.90 -0.03 0.84 0.17
167 0.97 11698 1648 0.03 0.97 -1.43 0.96 -1.34 0.22 1.02 0.14
188 1.02 11698 1593 0.03 1.01 0.44 1.07 2.38 0.16 0.98 0.14
207 1.03 11698 1576 0.03 1.03 1.63 1.10 3.43 0.13 0.96 0.13
187 1.08 11698 1520 0.03 1.06 2.95 1.16 5.49 0.09 0.94 0.13

MikeLinacre: Let's try this, Omardelariva.

314 0.76 11698 1944 0.03 1.16 8.96 1.27 9.90 -0.03

You can see that this item with a very slightly negative PtBis also has the highest OUTFIT MEAN-SQUARE. This indicates that there are unexpected correct answers (by low performing persons) to this item that are influencing the correlation. If you look at Table 10.5, you will see what they are, and whether they are of concern to you.

927. F36 error

DAL November 4th, 2007, 11:27am: Hi Mike,
Bit of a strange problem this. We have FACETS control files for a speaking test that works fine for 4 of the 5 bands on which students are being judged. However on the fifth one exactly the same file doesn't work for more than two iterations and we get the F36 error message. I've gone through the file line by line and the one that doesn't work is in all respects identical in the data and commands (apart from the placement of ';' of course).

Any idea what is going on?


MikeLinacre: The F36 error is explained in the Facets Help as:
"F36 All data eliminated as extreme. Are there enough replications?
Extreme scores (zero or perfect) correspond to infinite measures and so provide no useful information about the measures of other elements. Therefore, data included in extreme scores are eliminated. All scores were perfect, so all data were eliminated. Perhaps, the problem is simply that not enough data is present."

This can happen when the data have a Guttman pattern. On first analysis, there are some extreme scores for persons, items, judges, tasks, etc. So these are dropped from immediate estimation. Then, in the remaining data, there are some extreme scores. These are dropped. The process continues until no data are left.

Is this happening with your data? If the problem is still a mystery, please email me your Facets specification and data file.

What to do about it? One approach is to add a dummy data record which makes the initial extreme scores non-extreme.

DAL: Thanks Mike!
Hmmm, I don't think so. I think it is something to do with the setup. There is a mistake in the instructions which only affects 'comm'. I'm e-mailing the file to you.

MikeLinacre: Thanks, DAL. Have investigated. Every response to "comm" is in the same extreme category in the data file. Have sent you a more detailed email.

928. Indexes of discrimination

omardelariva November 6th, 2007, 5:14pm: Hello Mike:

In order to construct a bank of items, after their application to examinees, we check many features of their quality. The first filter is difficulty; we discard out-of-range items (we selected an interval between 0.25 and 0.75 p-value and the respective logit values on the Rasch scale). The second filters are the Discrimination Index and Point-Biserial correlation, where we find that many good items have Pt-Bis correlation lower than 0.20. We say that an item is “good” when it is located inside the range of difficulty and its distracters have negative Pt-Bis correlation and negative Discrimination Index. We know that these indicators depend on each sample. For this reason, we would like to use the discrimination (2-PL approximation) computed by WINSTEPS and rescue some of the “good” items with low Pt-Bis correlation. I looked for papers in Rasch Measurement Transactions and in two of them I found the existence of a relationship between infit mean-square and discrimination (2-PL). Then I reviewed my data, compared these two latter indicators, and concluded that I do not get rid items with infit mean square less or equal to 1.00 even if they have a Pt-Bis correlation less than 0.20.
What do you think about my conclusion?

MikeLinacre: Dear Omardelariva - thank you for your question. Your conclusion sounds correct. May I comment more widely? And perhaps other readers also have input ....
"between 0.25 and 0.75 p-value" - if you are using MCQ items, then 0.25 is much too low (too near to guessing). The floor should be more like 0.35. If you want your examinees to have a positive psychological experience, then 0.75 is also too low, the upper limit should be more like 0.80, even 0.85. The statistical information in a response is p*(1-p). So that the statistical information in one p=0.5 item is the same as in 1.3 p=0.75 items and 1.5 p=0.8 items.
"many good items have Pt-Bis correlation lower than 0.20". Your p-value range of 0.25 to 0.75 corresponds to a logit range roughly 1.5 logits below to 1.5 logits above the sample mean. We can compare this with www.rasch.org/rmt/rmt54a.htm - where we see that Pt-bis =0.2 is the maximum possible value at the extremes of your p-value range. So we expect to see good items with a Pt-bis less than 0.20.
"I do not get rid items with infit mean square less or equal to 1.00" - in practice, the usual motivation for getting rid of over-fitting (too predictable, high discrimination) items is to shorten the test length. We may also want to get rid of them because they are simply bad items: www.rasch.org/rmt/rmt72f.htm
"a relationship between infit mean-square and discrimination (2-PL)." - yes, because they are both summarizing the predictability of the responses to the item by persons targeted on the item.

929. Precision of Estimates

ary November 6th, 2007, 6:34am: Correct me if I am wrong. In evaluating test usefulness, I have to ensure the precision of the estimates. For a test to provide enough precision, the logit precision (standard error) of the respondents must fall in the range 2/sqrt(L) < SEM < 3/sqrt(L) for a test of L items.

I ran the 1st round of a test: 41 items, 145 respondents. The result was as expected.
After improvements had been made to the items, the test was distributed again to 567 respondents. Improvements were made again and more items were added: altogether 76 items distributed to 166 respondents. The items were further improved.
This time the test was distributed to 1233 respondents.

What should I do if the respondents' mean standard error is larger or smaller than the narrow range given? What does it mean? Does it mean that this test is unable to measure precisely enough to meet its purpose? Does it mean that I don't have enough well-targeted items for estimating respondents' ability?


MikeLinacre: Ary, thank you for your question. Here's the logic for "2/sqrt(L) < SEM < 3/sqrt(L)"
1. Imagine a test of L perfectly-targeted items. Then the difficulties of the items exactly match the ability of the person. The person's probability of success is .5 every time. So half the observations are 1 and half the observations are 0. The model variance of each observation is 0.5*0.5. So the standard error of the person's ability estimate is 1 / sqrt (L*0.5*0.5) = 2/sqrt(L), the smallest it can possibly be.
2. Suppose the L items are easy, in fact about as easy as they get for a test intended to measure, so that p = 0.9. Then the variance of the observations is L*0.9*0.1. So the standard error of the person's ability estimate is 1 / sqrt(L*0.9*0.1) = approximately 3/sqrt(L).
So the practical range of the standard errors of person measurement, for a dichotomous test designed to measure, is about 2/sqrt(L) to 3/sqrt(L). The value can't be less than 2/sqrt(L), but it can be greater than 3/sqrt(L).
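The two bounds above can be checked numerically. This is a minimal Python sketch, assuming a dichotomous test in which every item has the same success probability for the person; the function name `person_se` is made up for illustration, and L = 41 is simply the first test length mentioned in the question.

```python
import math

def person_se(probabilities):
    """Model standard error (logits) of a person measure on a dichotomous test.

    probabilities: the person's modelled probability of success on each item.
    """
    # Each response contributes p * (1 - p) of statistical information.
    information = sum(p * (1.0 - p) for p in probabilities)
    return 1.0 / math.sqrt(information)

L = 41  # number of items in the first round described in the question

se_targeted = person_se([0.5] * L)  # every item perfectly targeted
se_easy = person_se([0.9] * L)      # every item very easy for this person

print(f"targeted: {se_targeted:.3f}  vs 2/sqrt(L) = {2 / math.sqrt(L):.3f}")
print(f"easy:     {se_easy:.3f}  vs 3/sqrt(L) = {3 / math.sqrt(L):.3f}")
```

With p = 0.9 the exact multiplier is 1/sqrt(0.09), about 3.33, so 3/sqrt(L) is a round figure rather than a hard ceiling.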

ary: Thank you for your prompt reply, Prof.

930. LLTM on Polytomous Data?

jinbee October 29th, 2007, 7:34pm: Hi All,

Does anyone know how to run the LLTM on polytomous data? All my attempts so far have either crashed the program or been unstable.

I'm using ConQuest but have also used Stata for IRT analyses.

Any suggestions are welcome and appreciated!

MikeLinacre: Jinbee, thank you for your question.
It sounds as though you are implementing Gerhard Fischer's Linear Logistic Test Model using ConQuest. But it fails ...
Perhaps a two-step solution would work.
1. Obtain the item difficulties using a standard polytomous Rasch model.
2. Use Stata to decompose the item difficulties into their LLTM components using a regression model with dummy variables corresponding to the LLTM design matrix.
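Step 2 is an ordinary least-squares regression of the item difficulties on the design matrix. The sketch below uses Python rather than Stata, purely as an illustration. The design matrix (an intercept plus two dummy-coded item features) and the difficulties are made-up values, not data from this thread, and `least_squares` is a hypothetical helper that solves the normal equations directly.

```python
def least_squares(design, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(design[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(row[i] * row[j] for row in design) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(design, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry to the diagonal
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]

# Hypothetical LLTM design: columns are intercept, feature 1, feature 2
design = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
]
# Made-up item difficulties (logits) from a first-step Rasch analysis
difficulties = [-1.0, -0.2, 0.2, 1.0, -0.2, 0.2]

effects = least_squares(design, difficulties)
print([round(e, 3) for e in effects])
```

In this toy case the difficulties are an exact linear function of the features, so the regression recovers the component effects without residual error. With real estimates, the residuals indicate how well the LLTM components account for the item difficulties.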

931. Response style and culture

deondb October 24th, 2007, 7:20am: Dear Mike
I wish to compare across cultures/language groups the category probability curves obtained with a rating scale analysis of a 12-item scale with Likert items. The items appear to be DIF free according to the Winsteps method. I propose to do a joint analysis of the groups and to then use the obtained person and item locations as anchors in separate analyses of the cultural groups. In the separate analyses the item thresholds will be freely estimated. I hope that differences in the obtained category probability curves will inform me about differences in the characteristic ways in which the different cultural groups use the categories of the Likert items. Do you think this strategy can work or do you propose a different method?
Deon de Bruin

MikeLinacre: Deon - a challenging project.
If there is a difference in the probability curves, it would be reflected in different ICCs across cultures. So looking at the Winsteps "Graphs" menu, "Non-Uniform DIF" would be instructive. Is there anything interesting?

How about modeling culture style directly?
1. Reduce your sample size to 30,000 or less. Sort the rows so that culture rows clump together.
2. Transpose the data matrix (Winsteps Output File) - so now the persons are the columns
1-5000 A ; culture A
5001 - 10000 ; culture B
4. Analyze.
5. Show graphs by "display by scale group"
6. You will now get the sets of probability curves in the same frame of reference.

deondb: Thank you Mike. Interesting suggestion; I will certainly try it and let you know the results. Just to be certain... Does this mean that, with the transposed data, I will run an analysis with "12 persons" and "1000 items" (grouped into four according to language group)?

MikeLinacre: Exactly!

deondb: Mike
The procedure worked beautifully and gave me exactly what I wanted.
Thank you

932. item weight

ning October 23rd, 2007, 5:22pm: Hi Mike,
I'm having trouble justifying the item weight process. If I have an item that didn't fit before (INFIT MNSQ=1.85), but once I added more weight to this item(IWEIGHT=1.5), it became a perfect fit (INFIT MNSQ=1.01). Could you please help me on the justification on imposing a weight to improve the item fit, or the model fit?

MikeLinacre: Gunny: In general, if you upweight a mis-fitting item, it will be more influential in measure estimation, and so will fit better. But this approach to improving fit contradicts the principles of Rasch measurement.

ning: Hi Mike,
Thank you for your reply. I agree with what you say, however, I'm still wondering what's the main purpose of IWEIGHT function in Winsteps and when do you use it?

MikeLinacre: Gunny:
IWEIGHT= was introduced into Winsteps because examination boards sometimes mandate that "Item 3 is worth twice as much as Item 4" or such like.
A useful application is for pilot items. These can be weighted 0. Then they are calibrated along with all the other items, together with fit statistics, etc. But they do not influence the person measures.

933. what should I do with 0 in this marking system?

Danika October 22nd, 2007, 12:38pm: Hi Mike, I'm using the Many-facet Rasch Model to investigate the Chinese oral proficiency test (intermediate level). The intermediate level includes 3 levels (level 4, level 5, and level 6). When the raters mark the test paper, they use 0,4,5,6 (0 means the examinee hasn't achieved the intermediate level). This marking system is not a continuous rating scale. So when I use the MFRM to analyze the results, what should I do with the 0?
Thank you very much! :)

MikeLinacre: Danika, thank you for your question. There are several options:
1. If "0" means "missing data", then use the "Rating Scale=" specification to recode "0" as "-1" (indicating "missing").
2. If "0,4,5,6" means "3,4,5,6", then use the "Rating Scale=" specification to recode "0" as "3".
3. If "0,4,5,6" means "0,1,2,3,4,5,6" (but 1,2,3 are not observed), then use the "Rating Scale=" specification and specify "Keep".

Danika: Mike, thank you very much for your answer. The case I'm dealing with is number 3 which you talked about. '0' means '0,1,2,3' (which are not observed), and I have to keep 0 for sure. But my problem is this: suppose 2 raters mark the same paper, which is supposed to be at level 4. Rater 1 gave 0, which might stand for 3 but is not observed. Rater 2 marked the paper with 5. Would rater 1 show higher severity than rater 2 in absolute value of logits? Since '0' could stand for any one of 0, 1, 2 or 3, someone suggested that I replace the '0' with '2' when I run the software with the data. It might be better for the data convergence. What do you think about it?
Thanks again! :)

MikeLinacre: Danika - The data in Rasch measurement represent counts of qualitative levels of performance. The bottom level can be given whatever number we choose. For simplicity this is 0, but it can be any other number. Then the next level upwards must be the bottom category + 1. This category need not be observed, but it must exist conceptually.
In your data 0,4,5,6 would mean that you have 3 unobserved levels of performance, 1, 2, 3, which exist conceptually, even if they are not observed. To maintain these unobserved categories in the rating scale, specify "keep".
Since 0 may not be observed for all raters, you need to
either: use a rating scale model which includes all the raters
or: use pivot anchoring: anchor the rating scales for the individual raters relative to a point on the rating scale which all raters will employ, such as the point of equal probability between categories 4 and 5.
With modern software and fast computers, convergence is rarely a decisive consideration.

Danika: Hi Mike, It's amazing that I got your reply so fast. Thanks very much for your help. However, I still have a question about it.
The Chinese proficiency test includes 3 categories: Beginner (levels 1-3), Intermediate (levels 4-6) and Advanced (levels 7-9). There is a different test for each category. So all the grades exist conceptually, but in a given test, if the examinee passes the exam they are given a level on this scale, or else they are just given 0. Here '0' represents failing the test. It could be 1 or 2 or 3. So the rating scale is ready-made. And what do you mean by using a rating scale model which includes all the raters?

MikeLinacre: Danika: Thank you for your explanation.
You will need to structure your dataset a little differently. Each of your original ratings is two scored ratings:
a) Fail or Pass (0 or 1)
b) If pass, what is the pass level. (missing, 4, 5, 6)
So format your data accordingly.
Original observation: 0 Scored observation: 0 and missing
Original observation: 4 Scored observation: 1 and 4
Original observation: 5 Scored observation: 1 and 5
Original observation: 6 Scored observation: 1 and 6

You may prefer:
Original observation: 0 Scored observation: 0 and missing
Original observation: 4 Scored observation: 1 and 3
Original observation: 5 Scored observation: 1 and 4
Original observation: 6 Scored observation: 1 and 5
This scoring keeps the total raw scores the same.
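The two recoding schemes listed above can be written as one small function. This is a minimal Python sketch; the name `score_rating` and the use of `None` for "missing" are assumptions for illustration, not anything Facets requires.

```python
def score_rating(original, keep_raw_total=True):
    """Split one original rating (0, 4, 5 or 6) into (pass_fail, pass_level).

    pass_level is None (missing) when the examinee failed.
    With keep_raw_total=True the pass level is shifted down by 1, so that
    pass_fail + pass_level equals the original rating.
    """
    if original == 0:
        return 0, None
    if original not in (4, 5, 6):
        raise ValueError(f"unexpected rating: {original}")
    level = original - 1 if keep_raw_total else original
    return 1, level

for rating in (0, 4, 5, 6):
    first_scheme = score_rating(rating, keep_raw_total=False)
    second_scheme = score_rating(rating)
    print(rating, first_scheme, second_scheme)
```

The second scheme is the default here because it preserves each examinee's total raw score.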

You will also need to decide whether the raters share the same rating scale structure ("Rating Scale Model") or each have their own understanding of the rating scale structure ("Partial Credit Model").

Danika: Mike, sorry to get confused for a while. Yes, we are using the Rating Scale Model. So if the rater1 score the examinee1 with 0, examinee2 with 4, examinee 3 with 5, the original data should be as follow,

If we structured it with the 1st system you suggested, should it be like the follow?
1,1,0(',' or '+' or 'space' or nothing?)missing

Or if we structured it with the 2nd system you suggested, should it be like the follow?
1,2,3(',' or '+' or 'space' or nothing?)1

Did I read your explanation in a right way?
Thanks! :)

MikeLinacre: Danika: Let me illustrate the 2nd system, because that one gives the same raw scores as the original data:
Facets = 3
Models =
?,?,1,D ; pass-fail is dichotomous
?,?,2,R5 ; pass-level is 3 to 5 (originally 4 to 6)
1, Rater
1, Rater 1
2, Examinees
1, Examinee 1
3, Items
1, Pass-fail ; 0 or 1
2, Pass level-1 ; 3 to 5
1, 1, 1-2, 0, . ; original rating of 0, . means "missing"
1, 2, 1-2, 1, 3 ; original rating of 4
1, 3, 1-2, 1, 4 ; original rating of 5
1, 4, 1-2, 1, 5 ; original rating of 6

934. Rack & Stack

Seanswf October 18th, 2007, 5:24pm: Hi,

The same test was administered to a group of students before and after a training program. 995 students took a pre-test, 1099 took a post test, & I have 718 students which have both pre-test & post-test scores.

I want to report the change in person scores from time 1 to time 2 & the change in item difficulty from time 1 to time 2.

I have chosen to analyze the data in reference to the post-test scores. Here is what I have done:
1) Ran Winsteps on the post-test and created an IFILE and PFILE.
2) Ran Winsteps on the pre-test with the IFILE from step1 as an IAFILE and Created a PFILE (putting the results in the same frame of reference).
3) Conducted a paired-samples T-test in SPSS on the measures of my group of 718 students (measures from both PFILES matched by ID#).

I am interested in "stacking" the data in Winsteps, what is the advantage to doing this?

Is there a way to calculate how much change has occurred per student?

I am considering racking the data as well to see the change in item difficulty.
Do I need to anchor this analysis to keep the item difficulties in the same frame of reference like the person analysis?
If so should I use the post-test IFILE or the post-test PFILE?

MikeLinacre: Thank you, Seanswf.
You write: "I want to report the change in person scores from time 1 to time 2 & the change in item difficulty from time 1 to time 2."
There are several ways to do this. From your description you know the basics of them.
You have already discovered the change in item difficulty from time 1 to time 2. It is the item displacements reported in step 2) of your analysis.
Alternatively you could stack the time 1 and time 2 person records into one long file. Include in each person label a code indicating time-point. Then do a standard unanchored analysis, followed by a DIF report on time-point. This will give you a statistically more exact time 1 to time 2 shift in item difficulty.
For the shift in person ability, you already have the paired samples in your 3). So the change per student is the difference between the two measures for each student.
Alternatively, in your step 2), go to Plots menu, "Compare statistics", and compare the person measures from your step 2) with the PFILE from your step 1). Be sure that the entry numbers for the 2 time-points line up properly. Leave blank lines in the data file where persons did not take the test.
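The per-student change is just the post-test measure minus the pre-test measure for each matched ID, provided both sets of measures are in the same frame of reference (as in the anchored analysis described in the question). The Python sketch below assumes the measures have already been read from the two PFILEs into dictionaries keyed by student ID; the IDs and values are hypothetical.

```python
import math

def person_changes(pre, post):
    """Post minus pre measure (logits) for every student ID present in both."""
    return {pid: post[pid] - pre[pid] for pid in pre.keys() & post.keys()}

def paired_t(changes):
    """Paired-samples t statistic for the mean change (needs 2+ students)."""
    values = list(changes.values())
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean / math.sqrt(variance / n)

# Hypothetical person measures in logits, keyed by student ID
pre = {"S001": -0.50, "S002": 0.20, "S003": 1.10, "S004": -1.30}
post = {"S001": 0.40, "S002": 0.90, "S003": 1.30, "S005": 0.70}

changes = person_changes(pre, post)
for pid in sorted(changes):
    print(pid, round(changes[pid], 2))
print("paired t =", round(paired_t(changes), 3))
```

Students who took only one of the two tests (S004 and S005 here) drop out of the matched comparison, just as only the 718 matched students enter the paired-samples test.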

935. Error variance

Matt October 22nd, 2007, 2:33pm: Good day!

using this formula:

"Real" error variance = model variance * MAX(1.0, INFIT mean-square)

What is the ''MAX'' in this formula?
Why do we use (1.0, INFIT mean-square)?

Thank you

MikeLinacre: Matt, thank you for your question.
Some years ago Jack Stenner (of www.lexile.com) performed a simulation study to evaluate the effect of misfit on measure standard errors. He discovered the impact of misfit is to increase the size of the error variance (as we all expected). More particularly:
Misfit-inflated standard error = No-misfit standard error * square root of the maximum of 1.0 and the INFIT mean-square.
When the INFIT mean-square is less than 1.0, there is overfit of the data to the model. The local standard error does not increase, but the overall measures become slightly stretched. There is no simple adjustment for this. Fortunately it is usually inconsequential.
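As a minimal Python sketch of the adjustment, taking the error-variance form of the formula quoted in the question: the variance is multiplied by MAX(1.0, INFIT mean-square), so on the standard-error scale the multiplier is its square root. The function name `real_se` and the example values are made up for illustration.

```python
import math

def real_se(model_se, infit_mnsq):
    """Inflate a model standard error for misfit.

    The error variance is multiplied by max(1.0, INFIT mean-square);
    MAX simply means "whichever of the two values is larger".
    """
    real_variance = model_se ** 2 * max(1.0, infit_mnsq)
    return math.sqrt(real_variance)

print(real_se(0.30, 1.44))  # underfit: S.E. grows by sqrt(1.44) = 1.2
print(real_se(0.30, 0.80))  # overfit: S.E. is left at the model value
```

Using MAX with 1.0 means that overfit (mean-square below 1.0) never shrinks the standard error below its model value.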

936. Table 1.2 and 1.12

ning October 18th, 2007, 5:39am: Hi Mike,

I'm wondering what's the difference between Table 1.2 and Table 1.12? To me, they are a bit counterintuitive. Comparing the two tables, they both hold the left-hand-side person distribution constant while flipping the right-hand-side item distribution, which ranges from rare to frequent or from frequent to rare. For example, given that most of the person distribution on the left-hand side is located towards the bottom of the continuum, in Table 1.2 the item at the top seems to have inadequate person observations. However, in Table 1.12, the item that used to be at the bottom now seems to have inadequate person observations because it's now at the top.

Could you please explain this to me? Thank you very much.


MikeLinacre: Gunny, thank you for your question.
Tables 1.2 and 1.12 contain exactly the same measurement information. The difference is in how the measures are conceptualized and communicated.
Table 1.2 follows the approach usually used in educational testing: the persons are matched with the items: high ability persons with high difficulty items etc.
Table 1.12 follows the approach usually used in medical rehabilitation: high scoring persons (high ability persons) and high scoring items (easy items) are lined up.
Probably one approach makes much more sense to you (and your audience) than the other, so use that approach.

ning: Thanks, Mike,

Continuing on that topic: say I use Table 1.12. The item listed at the top of the map should be the easiest item to endorse. However, there are no #'s at all next to that item, even though that item should be the easiest to endorse for that particular diseased cohort. What does this mean? The frequency table from the raw data shows that 30% of people endorsed category 1, 59% endorsed category 2 and 12% endorsed category 3.



MikeLinacre: Gunny: in Table 1.12, the idea is that the persons and items are being compared to the "degree of success", not to each other.
If an item and a person happen to be on the same row, then they have both achieved the same overall level of success, e.g., in principle,
If a person and an item are both in the 0.0 logit row (for a standard dichotomous analysis), then
When the person at 0.0 interacts with the item at 0.0 we expect 50% success.
If a person and an item are both in the 1.0 logit row (for a standard dichotomous analysis), then
When the person interacts with a 0.0 item we expect 73% success.
When the item interacts with a 0.0 person we expect 73% success.
But if that person at 1.0 interacts with that item at (-)1.0, then there is a 2-logit difference, so we expect an 88% probability of success.
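The percentages above follow from the dichotomous Rasch model, in which the probability of success depends only on the logit difference between person and item. A minimal Python sketch (the function name is made up for illustration):

```python
import math

def success_probability(ability, difficulty):
    """Dichotomous Rasch model: P(success) from the logit difference."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(round(success_probability(0.0, 0.0), 2))   # 0-logit difference
print(round(success_probability(1.0, 0.0), 2))   # 1-logit difference
print(round(success_probability(1.0, -1.0), 2))  # 2-logit difference
```

These reproduce the 50%, 73% and 88% figures for differences of 0, 1 and 2 logits.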

937. Infit Computation

Raschmad October 14th, 2007, 8:17pm: It's pretty straightforward how the outfit mean-square is computed. However, I have a problem understanding how the infit mean-square is computed. I read that the residuals are standardized. How?
What is done with the residuals to compute infit mean-squares?
It's obvious from the computation of the outfit mean-square that this statistic is sensitive to lucky guesses and careless misses.
What does the infit mean-square show and why is it more important than the outfit mean-square?

MikeLinacre: Raschmad, the computation of the infit mean-squares is similar to the outfit mean-squares:
outfit mean square = sum( (observation-expectation)^2 / model variance ) / count
infit mean-square = sum( (observation-expectation)^2 ) / sum( model variance )
Outfit mean-squares are more influenced by unexpected outlying observations (lucky guess, careless mistakes) than the infit mean-square. So large outfit-mean-square values are usually caused by a few aberrant observations.
Infit mean-squares are more influenced by unexpected response patterns. These are a greater threat to the validity of the measures, but are often difficult to diagnose.
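The two formulas can be sketched in Python for one dichotomous item. The abilities, difficulty and responses below are made-up values for illustration, and the function name is hypothetical. A residual is standardized by dividing it by the square root of its model variance, so the outfit mean-square is the average squared standardized residual.

```python
import math

def fit_mean_squares(observations, abilities, difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    squared_residuals, variances = [], []
    for x, b in zip(observations, abilities):
        p = 1.0 / (1.0 + math.exp(-(b - difficulty)))  # expectation
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))                # model variance
    # Outfit: mean of squared standardized residuals (unweighted)
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(variances)
    # Infit: squared residuals summed, then divided by summed variance
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical person measures
expected_pattern = [0, 0, 1, 1, 1]       # low persons fail, high succeed
lucky_guess = [1, 0, 1, 1, 1]            # lowest person unexpectedly succeeds

print(fit_mean_squares(expected_pattern, abilities, 0.0))
print(fit_mean_squares(lucky_guess, abilities, 0.0))
```

In the second pattern a single unexpected success by the lowest person raises the outfit mean-square more than the infit mean-square, because the infit calculation gives that far-off-target response less weight.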

938. Run-time Error

connert October 5th, 2007, 3:13pm: I have Winsteps 3.64.1. When I try to use some SPSS files to create a Winsteps file I get an error message: Run-time error '380' Invalid Property Value. This happens with data files from the General Social Survey that were no problem with earlier version of Winsteps. It does not happen with a data file from a survey I conducted. Any suggestions?

MikeLinacre: Thanks for reporting this bug, Connert.

Please email me an SPSS .sav file that is causing problems, and I should be able to remedy the problem speedily.

Also are you using the default SPSS interface uploaded with Winsteps or the SPSS interface installed with your own version of SPSS? The easiest way to find out is to have Winsteps write an output file in SPSS format. Then right-click on the file name and open that file with NotePad. What version of SPSS does it say?

MikeLinacre: Thank you, connert, for reporting the bug and emailing your SPSS .sav file.
The problem was the large number of SPSS variables in the file.
A modification has been made to the Winsteps-SPSS interface and will be included in the next Winsteps update. Anyone who needs the larger-variable-capacity sooner, please email me.

939. Rasch for continuous rating

ning October 5th, 2007, 6:02pm: Hi Mike,

I'm interested in running Rasch on one item that extends from 0 (the worst) to 100 (the best) but I'm not sure how to set up the control file properly. I know you did Rasch analysis on VAS but you had 12 different ratings for each patient (Rasch analysis of visual analog scale measurement before and after treatment of patellofemoral pain syndrome in women). If my patients only have one VAS score, could you please help me on how to run one item in Rasch with 101 categories?


connert: Hi Gunny,

I'm not Mike but here is an example of one of my projects which involved rating newspaper stories. There were coders in this example but that can just be removed. This is for Facets.


Excel File

;coders articles items traits
1 1 1a 1
1 2 1a 4
1 3 1a 2
1 4 1a 2
1 5 1a 3
1 6 1a 2
1 7 1a 4
1 8 1a 1
1 9 1a 2
1 10 1a 2
1 11 1a 1


Control File

Title = Favorability Toward Asylum Seekers
Facets = 3 ; put number of facets here
Inter-rater = 1
Positive = 1 ; list the positively oriented facets here
Noncentered= 1 ; put the one (usually) floating facet here
; Vertical = ; put the control for the "rulers" in Table 6 here, if not default
Arrange = mN, N ; put the order for the measure Table 7 here
Model=?,?,?,R5 ; put the model statement for your Facets here.



2, Stories
3, Traits


; enter in format:
; element number for facet one, element number for facet two, , , , observation

ning: Hi Tom,

Thank you very much for your help. My excel data file would look something like below:

ID Item Score
1 1 4
2 1 10
3 1 90
4 1 89
5 1 89
6 1 19
7 1 8

When I tried to run in Winsteps, it kept asking me about RFILE, I'm wondering if you can help me out on this particular set of data?


MikeLinacre: Gunny, Winsteps suggests you look at the "RFILE=" output because there appears to be a problem in the data. The RFILE= shows your data as Winsteps sees it.

From your description, the message appears because there is only one observation per patient. This is enough information for an ordinal analysis, but not enough for a linear analysis. Rasch analysis usually requires more than one observation per subject so that the linear structure beneath the ordinal observations can be constructed.

VAS data is conventionally analyzed with assumed, rather than constructed, linearity. The analyst assumes that the VAS numbers are linear measures and proceeds accordingly. To verify or challenge this assumption requires more than one VAS observation per patient.

940. Quest-Polytomous Items, how?

Anna September 25th, 2007, 4:05am: Dear All

First of all, I would like to say hello to all members. I am a new member and am glad to join you.
I have some trouble running the Quest software for polytomous items. My data are on a scale from 1 to 6, and there are about 70 items. According to the Quest manual, I can run this with the partial credit model, but I don't know why it won't run. Would you please give me syntax suitable for analyzing this type of data with Quest?

Thanks in advance.

MikeLinacre: Welcome, Anna!
In Quest, use the "credit" option of the "estimate" command.
estimate | credit
but this is the default. See Sample 3 in the Quest manual.

Anna: OK, thanks so much for your suggestion. After some trial and error, I can finally run polytomous items with Quest.

941. Equating redux

biancarosa September 30th, 2007, 6:13pm: I have a question about the order in which to conduct my equating analyses.

Students from 2 different schools took 2 or more forms of a test. Specifically, children at School 1 took forms A and B, while children at School 2 took forms A, B, C, and D.

I am thinking I will do the following.
First, conduct an analysis of each form for each school separately.
Second, conduct an analysis of the forms taken simultaneously for each school separately.
Third, conduct an analysis of all four forms across schools.

The demographic composition and overall achievement of children in the two schools are quite different, which is why I am treating them as two different samples.

Is this correct, or am I making things unnecessarily complicated?

MikeLinacre: Biancarosa, your procedure sounds wise.
There are so many things that can go wrong (misprinted forms, incorrect scoring keys, kids given the wrong form ....) that it is always best to analyze the separate pieces and make sure the results make sense. Cross-plot the difficulties of common items or common persons, to make sure they are equivalent before putting the datasets together. The final goal is usually a combined analysis, but sometimes this is not possible.
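
The cross-plot check of common items can be sketched numerically. The following is a minimal illustration, assuming item difficulties (in logits) have already been exported from two separate analyses into dictionaries keyed by item name. The function name and the 0.5-logit flagging threshold are hypothetical choices for this sketch, not Winsteps settings.

```python
from statistics import mean

def compare_common_items(diff_a, diff_b, flag_logits=0.5):
    """Compare difficulties of items that appear in both analyses.

    diff_a, diff_b: dicts mapping item name -> difficulty in logits.
    Returns the mean difference (A minus B) over the common items and a
    list of items whose own difference departs from that mean by more
    than flag_logits (an illustrative threshold).
    """
    common = sorted(set(diff_a) & set(diff_b))
    if not common:
        raise ValueError("the two analyses share no items")
    shift = mean(diff_a[i] - diff_b[i] for i in common)
    flagged = [i for i in common
               if abs((diff_a[i] - diff_b[i]) - shift) > flag_logits]
    return shift, flagged

# Hypothetical difficulties from two separate analyses
school_1 = {"A1": 0.0, "A2": 1.0, "A3": -1.0}
school_2 = {"A1": 0.25, "A2": 1.25, "A3": 0.5}
shift, flagged = compare_common_items(school_1, school_2)
```

Items returned in `flagged` are the ones that would fall far from the trend line in the cross-plot and deserve a closer look before the datasets are combined.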

biancarosa: Thanks much!

942. Equating

biancarosa September 23rd, 2007, 6:18pm: Equating Question

Hi all,
I have an equating situation that is pretty complicated (at least to my novice eyes) and am somewhat stymied about how to proceed with the analysis. :-/

I am working with data that was collected at 3 time points, but I was only asked to help build the equating design at the 3rd time point. That combined with a limited sample size made for a creative but complicated design.

The test itself is a timed reading fluency measure where middle school children read one or more passages aloud for one minute. Each passage is in essence its own test form because they are designed to be used interchangeably.

For each passage a child reads, it yields 4 potential raw scores: number of words read (wpm), number of words read correctly (wcpm), number of words read incorrectly (epm), and percentage of words read incorrectly out of all words read (pcte). Obviously these overlap and all 4 cannot be used in a single analysis.

One question I have is about which to use. The literature on fluency would suggest I use wcpm and pcte, but my thinking is that I will use wcpm, epm, and pcte. Pcte could be thought of as redundant with the other scores (especially epm) except that it captures reading efficiency regardless of how many words a child reads. Some preliminary tries at this have shown that these scores do seem to measure different things; it’s easier to get a certain percentage incorrect than it is to get a raw number of words incorrect, although this depends on which cut scores I use. But I’m interested to know whether others think pcte is too redundant with epm.

Another issue is cut scores. I do not have a large enough sample to use the raw scores. So far I have been choosing cut scores informed by the literature, and the model appears to get saturated at about 4 cut scores (5 steps of development). However, the literature is clearer about some cut scores (pcte) than others (wcpm). Is experimenting with different cut scores starting up a slippery slope? Or is there some principled way to do this?

Now for the real problem: the design itself and how to proceed through the equating of 8 different ‘parallel’ forms. The passages are all ‘equated’ based on readability, but as one might imagine the topic of a passage and its genre (narrative vs. expository) can make quite a difference in how kids do (as large as 30wcpm). Hence, the desire to equate.

The equating design is complicated because data were gathered contemporaneously at two different schools with researchers answering different research questions as part of a larger project. Sadly, they did not confer with each other about passage selection. So here’s the design as it evolved.

At time 1, all kids took passage A and passage B. A subsample also took passage C and passage D (but these were an easier readability than all of the other passages and are not considered a high priority for the equating).
Pattern 1a: AB (n = 300)
Pattern 1b: ABCD (n = 200)
Lower case a stands for School a, and b for School b.

At time 2, one group of kids took passage E and passage F, while another group took passage E and passage G.
Pattern 2a: EG (n = 260)
Pattern 2b: EF (n = 200)

Based on my design, at time 3, all kids took passage H. Those who had not taken F took F and those who had not taken G took G (this satisfied the needs of those collecting the data who wanted to use two unseen passages). Kids also read a third and sometimes a fourth passage to allow for linking to the previous administrations, in order to try to create a direct link of at least 100 kids between all of the passage pairs at the same readability level (more or less ignoring passages C and D). I could not create a direct link between all of the passages at the last time point because of a limited sample size and also because I wanted to ensure that at least some of the links occurred at both schools. Here’s what I ended up with (gods forgive me!) with about 100 kids taking each pattern of passages.
Pattern 3a1: HFGB (n = 90)
Pattern 3a2: HFEA (n = 90)
Pattern 3ac: HFEB (n = 90)
Pattern 3b1: HGA (n = 100)
Pattern 3b2: HGE (n = 100)

So, my question is in what order should I proceed with the equating? Do I equate the passages given at any one time point and then proceed to work across time points? And am I right in assuming I should first analyze each passage at each time point separately? Finally, what is the optimal way to construct the datasets? Right now I have a separate dataset for each time point for each school.

Sorry for the uber-long question, but as I warned you, it’s an uber-complicated situation (at least it appears that way to me!). :o
Thanks much,

MikeLinacre: Gina ... thank you for your question. It will take some thought to answer. Please make suggestions, anyone!

biancarosa: Yes, definitely not a straightforward situation. So far I have only toyed with approaches because I'm not convinced of the right path. But I did try equating the four passages from the first time point (for the one school that gave all four) using 3 'items' per passage. I tried one cut score, two, three, and up to four. The cut scores line up as I would expect them (being more accurate is harder than being less accurate, etc.) and seem to relate across passages in a sensible way (expository passages are harder than narrative ones, more obscure topics are harder than less obscure ones). So I am pretty optimistic.

I look forward to any thoughts anyone has on any of the issues I raised.

MikeLinacre: Biancarosa, please pick out your most urgent question and the information directly related to it. Post that as a question, and let us respond to that. Then you can move on to the next question. That will make it much easier for us to answer.

biancarosa: Okay, will do, thanks!

943. item separation

godislove September 28th, 2007, 9:50am: Hi Mike,
I have a question about item separation which I have no idea how to solve.
We have done 2 studies. In the first study, we had 210 assessments from 75 persons, and the person and item separations were 3.79 and 11.59 respectively. In the second study, we had 96 assessments from 96 participants, and the person and item separations were 5.03 and 7.43 respectively. I am happy with the bigger person separation because this means that the second sample included a wider range of person abilities. I don't understand why the item separation became smaller. Can you explain this? The 1st study was analysed with Facets and the 2nd study was analysed with Winsteps. Does the software make a difference?
Please advise.

MikeLinacre: Thank you for your question. Most of your "Separation" numbers are huge, g., so their actual size doesn't have any practical meaning. It is like measuring our heights to the nearest millimeter. We can do it easily, but all we need to know is "this measuring instrument is precise enough." That is what yours is.
Item separation is determined by:
1. spread of the items
2. size of the person sample
3. targeting of the person sample on the items.
You analyzed with Winsteps and Facets. If the measurement models are the same (2 facets, etc.), Winsteps and Facets produce the same numbers, but if the measurement models are different (e.g., Winsteps with 2 facets and Facets with 3 facets) then the reported measures will differ, and so will the separations.
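
To see why separations of this size are all "precise enough", it can help to convert them to the corresponding reliabilities. A minimal sketch, using the standard Rasch relationship reliability = G^2 / (1 + G^2) for separation G; the function names are illustrative only.

```python
import math

def separation_to_reliability(separation):
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)

def reliability_to_separation(reliability):
    """Separation implied by a reliability R: sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1.0 - reliability))

# Separations quoted in the question above
for g in (3.79, 11.59, 5.03, 7.43):
    print(g, round(separation_to_reliability(g), 3))
```

All four separations correspond to reliabilities above 0.93, so the drop in item separation from 11.59 to 7.43 changes the reliability only in the third decimal place.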

944. Correspondence of long and short form

deondb September 22nd, 2007, 5:10pm: Hi
I hope you can help. I wish to examine the correspondence of person measures obtained with a long and a short form of a questionnaire. I used Winsteps (Compare Statistics on the Plots menu) to create scatterplots of the person measures obtained with the two forms. Winsteps generates two plots with 95% confidence bands: one for the empirical line of best fit and the other for the identity line. Which of these two plots tell me more about the correspondence of the person measures obtained with the long and short forms? Do they tell different stories?


MikeLinacre: First off, we expect the two sets of person measures to approximate a statistical straight line. Let's hope this happened.
Then the question becomes: are both forms measuring in the same unit size (line parallel to the identity line) or are they measuring in different size units (like Fahrenheit and Celsius) which would be a line not parallel to the identity line?
When the empirical line statistically matches a line parallel to the identity line, then the forms have equal test discrimination. If not, then one form has higher test discrimination than the other. We would expect the longer form to also be more discriminating if its questions are more detailed. But if the short form is a selection of items from the long form, then we expect the test discrimination to be the same.
We always expect the long form to produce smaller person-measure standard errors (and so higher "test reliability") than the short form.

945. What am I missing?

SusanCheuvront September 18th, 2007, 11:17pm: Hi Mike,

Can you tell me what I'm missing here? FACETS won't run with the data as is. Do I need somewhere in the data lines a "1-60" so it knows to expect 60 items? We have a 60-item test where each item is scored 0 for incorrect and 1 for correct. Each test is rated twice, once each by 2 raters.

Facets = 3
Models = ?, ?, ?, R
Inter-rater = 2
Labels =
1, Examinee
2, Rater
1, Jack
2, Jane
3, Jill
4, John
3, Items
data =
1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10 Examinee 1, Rater 1, Item 1-60, Rating = 10



MikeLinacre: Susan, only one line missing. You do need to tell Facets the items are 1-60.
Easiest to add this specification line somewhere near the front of your specification file:

dvalues = 3, 1-60 ; The data are: facet 3, elements 1-60

946. Sample size again

ning September 13th, 2007, 4:10pm: Hi Mike,
Regarding the sample size, if I have about 3000 individuals, will that exaggerate the misfit on some of the items? Do you think it's appropriate to randomly draw a smaller sample of about 300-400 for a more reasonable fit to the Rasch model?

MikeLinacre: Gunny: interpretation of mean-square statistics is relatively robust against sample-size.
But with increasing sample size, tests of the hypothesis "the data fit the Rasch model in the way the Rasch model predicts" become increasingly sensitive.
The plots at www.winsteps.com/winman/diagnosingmisfit.htm show the relationship between mean-squares, sample size, and results of this hypothesis test (expressed as standard normal deviates).
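
The sensitivity of the hypothesis test to sample size can be illustrated with a rough calculation. This sketch assumes the Wilson-Hilferty cube-root transformation for standardizing a mean-square, and it further assumes, as a simplification, that the model standard deviation of the mean-square is sqrt(2/N) for N observations. The function name is hypothetical, and the values are only indicative of the trend, not a reproduction of any program's output.

```python
import math

def approx_zstd(mean_square, sample_size):
    """Approximate standardized fit statistic for a mean-square.

    Assumption: the mean-square behaves like chi-square / d.f. with
    d.f. equal to the sample size, so its model S.D. is sqrt(2 / N).
    """
    q = math.sqrt(2.0 / sample_size)
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

# The same mean-square of 1.2 at two sample sizes
z_300 = approx_zstd(1.2, 300)
z_3000 = approx_zstd(1.2, 3000)
```

Under these assumptions the same mean-square of 1.2 gives a much larger standardized value with 3000 persons than with 300, which is the pattern described above: the mean-square is stable, but the significance test becomes more sensitive.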

947. Should I reverse code?

ning September 11th, 2007, 5:12am: Hi Mike,

If the person ability in my data set is measuring a person's health, and the original items have categories such as: 1=no problems, 2=some problems, 3=extreme problems. Hence, lower scores would indicate better or higher health status. Should I reverse code the categories so that the person ability and item parameter are in the same direction? Further, if I'm collecting my own data using the same instrument, If I reverse the categories order first before I go out and collect, would that make my collected data comparable with the existing data that's reverse coded subsequently?



MikeLinacre: Gunny, thank you for your question.
We usually orient the latent variable so that "higher person measures = more of what we are looking for". So, in education, we are usually looking for "more ability", but in your area of healthcare, you may be looking for "more disability". Recode the items, if necessary, according to what you want.
One frequently-encountered situation is for judges or raters: do we want to measure "high rater measure = more lenient" or "higher rater measure = more severe"? It doesn't matter which we choose, so long as we are clear about it to ourselves and our audience.
If the latent variable ends up pointing in the wrong direction, it is usually easy to make the necessary arithmetic adjustments to the person measures and item difficulties.
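
Reverse coding itself is simple arithmetic. A minimal sketch for the 1-3 categories in the question, where the function name and defaults are illustrative:

```python
def reverse_code(response, lowest=1, highest=3):
    """Reverse a rating so that the highest category becomes the lowest."""
    if not lowest <= response <= highest:
        raise ValueError(f"response {response} outside {lowest}-{highest}")
    return lowest + highest - response

original = [1, 2, 3, 3, 1]           # 1 = no problems ... 3 = extreme problems
recoded = [reverse_code(r) for r in original]   # 3 = no problems ... 1 = extreme problems
```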

connert: Is there a specification in Winsteps equivalent to the positive= or negative= specification in Facets?

MikeLinacre: Connert: Good idea! But it's not there yet ...
Winsteps follows the Rasch tradition in which persons are "able" and items are "difficult". There is no option for persons are "able" and items are "easy". There should be ....

ning: Thank you, Mike.

This is helpful.

Have a great day!


948. About development of CAT

kong September 7th, 2007, 5:08am: Dear all,
I would like to seek your help.
I'm developing a CAT for cognitive assessment. I have already obtained 65 item difficulties, and I am following the article in
to calculate the test taker's cognitive ability.
My first question: in the article, Prof. Wright said to set the difference between the better estimate of the measure M' and the ability measure M to 0.01. Does this mean at the 1% level? If so, could I set it at the 5% level, i.e. |M'-M|<0.05?

My second question: if the whole item bank has been used (65 items) and there is still a difference between M' and M, how should I report the person's ability measure?

My third question: my assessment does not yet have a cut-off point, so how could I use the SE to determine the stopping criteria?

At this moment, my stopping criteria are:
1. |M'-M|<0.05
2. use all the question of the item bank

sorry for asking 3 questions at a time and thanks in advance for helping

MikeLinacre: Thank you for your questions, Kong. CAT is an exciting area of testing.
https://www.rasch.org/rmt/rmt102t.htm is to do with estimating measures from known item difficulties.
1. M is the current logit estimate. M' is the updated logit estimate from applying the Newton-Raphson estimation equation to the current set of item difficulties. M'-M is the logit difference between the two estimates (not a percentage or a p-value). The estimation equation must converge for the current set of item difficulties. It usually converges in 4 or 5 iterations. With modern super-fast computers, you can run the procedure at rmt102t.htm to convergence after each item is administered to obtain a new value of M ready for the next item administration.
2. Let's say M(1) is the converged ability estimate according to rmt102t.htm after one item. M(2) after 2 items, etc. Then at the end of the test administration there is always a converged final estimate M(65). This will differ from M(64) by about 0.02 logits in a typical application.
3. Associated with each measure estimate M(n), there is a standard error SE(n). If you are making pass-fail decisions, you can stop the CAT administration as soon as the pass-fail decision is statistically certain.
a) Let L be the logit pass-fail ability level. If |M(n)-L|>3*SE(n), then you have more than 99% confidence in your pass-fail decision for this examinee. You may decide there is no point in continuing the CAT test.
b) You may not need high measurement precision for your purposes, so if SE(n)<0.2 logits, this may be good enough for what you are doing. This is typical in screening tests or routing tests, which only need an approximate measure of ability.
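
The estimation and stopping logic in points 1-3 can be sketched as follows for dichotomous items. This is an illustration only, not the rmt102t.htm procedure itself: the function names, the starting value, and the 1-logit cap on each Newton-Raphson step are assumptions made for the sketch.

```python
import math

def estimate_ability(difficulties, responses, tol=0.01, max_iter=100):
    """Newton-Raphson estimate of a Rasch ability from known difficulties.

    difficulties: item difficulties (logits) of the items administered.
    responses: matching 0/1 scored responses.
    Returns (measure, standard_error) in logits.
    """
    n = len(responses)
    score = sum(responses)
    if score == 0 or score == n:
        raise ValueError("extreme score: no finite estimate exists")
    m = sum(difficulties) / n + math.log(score / (n - score))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - m)) for d in difficulties]
        variance = sum(p * (1.0 - p) for p in probs)
        step = (score - sum(probs)) / variance
        step = max(-1.0, min(1.0, step))   # assumed cap to keep steps stable
        m += step
        if abs(step) < tol:                # |M' - M| in logits
            break
    probs = [1.0 / (1.0 + math.exp(d - m)) for d in difficulties]
    se = 1.0 / math.sqrt(sum(p * (1.0 - p) for p in probs))
    return m, se

def should_stop(measure, se, cut_point=None, se_target=0.2):
    """Stop when the pass-fail decision is clear or precision is sufficient."""
    if cut_point is not None and abs(measure - cut_point) > 3.0 * se:
        return True
    return se < se_target
```

After each administered item, `estimate_ability` would be re-run on all responses so far, and `should_stop` consulted before choosing the next item. Extreme scores (all right or all wrong) have no finite estimate, so a real CAT needs a separate rule for the first few items.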

kong: Dear Professor,
thank you for your information.
I adjusted the stopping criteria to a) |M'-M|<0.02 and b) the whole item bank has been used. But in a CAT, if a person passes an item he is given a more difficult one. So a person with high ability may go through the whole item bank quickly without M converging, leaving a large difference between M' and M. Which one should we take as the person's ability?
Also, according to rmt102t.html, the score is the accumulated probability of the person getting the correct answer, am I right? If so, does it give any information about the person taking the CAT?

My procedure for developing the CAT is: first, construct 65 questions covering attention, orientation, memory, executive function, etc. Then I give all 65 items to 30 stroke patients and collect their performance. Then I use the Winsteps software to determine the difficulty of each item, and then use the rmt102t.html formula to construct the CAT.

Any comments on that procedure?
I would also like to ask: will the person measure from Winsteps be the same as the one from the rmt102t.html formula?

thank you very much

MikeLinacre: Kong, www.rasch.org/rmt/rmt102t.htm estimates person abilities from a known set of item difficulties. You can do this after each item is administered. This will produce the same set of person abilities as Winsteps with anchored items.
For the CAT administration process, https://www.rasch.org/rmt/rmt64cat.htm suggests a practical procedure.

949. Rater severity and person scores

connert September 9th, 2007, 2:43pm: Am I correct that if you do an analysis with Facets that involves items, persons, and raters (for example, students write short essays on several topics that are scored by writing experts), that the measures given to persons takes into account variations in how strict the raters are?

MikeLinacre: Yes, you are correct, Connert. However you do need a judging plan with sufficient crossing of experts for the adjustment to be possible. This plan does not need to be elaborate. See www.rasch.org/rn3.htm

950. Rasch output interpretation

ning September 8th, 2007, 9:59pm: Hi all,

I'm wondering if anyone can help me out on interpreting some of the Rasch output. I have one item out of five with the worst misfit statistics. However, on the ICC plot, this item doesn't seem to be troublesome. Rather, another item with good fit statistics has a different slope from the other items (it cuts across them). I don't understand why this is the case.



MikeLinacre: Ning, a suggestion:
Make the intervals on the x-axis narrower for reporting the empirical ICCs. The misfitting observations should definitely begin to emerge. It is often informative to gradually change from very narrow intervals to very wide intervals. This usually shows what interval is most meaningful for reporting the empirical ICC.

951. task difficulty

godislove September 3rd, 2007, 7:28am: Hi Mike,
Thank you for all your help in this website. You are wonderful!
I am designing a new study which I am thinking to use DIF. Our assessment tool is a tool that measures prosthetic control when performing different tasks. Now the tasks are not standardised and this means that the tasks are chosen by the patients. A child might choose free play and the rater observes the prosthetic control of the child when playing. The question is whether the type of task has an influence on the ability to control the prosthetic hand? It is reasonable to think that, for example, it is more spontaneous and easier to control the prosthetic grip in some tasks than in others.
The purpose of this study is to determine if the items are functioning in a similar way independent of in which activity the assessment is performed.

The method is to pick 3 hand functioned tasks over a group of 30 patients. I remember you said each group needs 60 patients in order to run a reliable DIF? My question is can I use Rasch analysis to analyse the influence of task difficulty on prosthetic control?

Thanks for your help in advance,

MikeLinacre: You are at the minimal end of things, but this is often the case with clinical samples where everyone available is included. Your sample is small, but they should give you a reasonable indication of what is going on. But don't be surprised if a later sample produces noticeably different findings.

952. Floor and ceiling effects

arne August 28th, 2007, 12:55pm: Hello all.

I wonder if anyone can help me with the following problem:

We are planning to study two small instruments for personality disorders (7-8 items, scored yes/no). The instruments are intended for clinical use.

We need to translate the instruments, but will only with difficulty (read: probably not) be able to recruit a clinical sample large enough to establish equivalence between the original and translated versions (approx. 300 subjects; 150 women/men). We therefore plan to use a non-clinical sample.

In clinical samples the prevalence of the disorder is around 0.50, in 'normal' samples around 0.10.

We have done some (crude) estimations of the distribution of summed scores, and find that the proportion getting a sum of zero is quite high (for the sake of argument: assume > .30 - .40).

(The "WinGen" package by Han, K. T., & Hambleton :
www.umass.edu/remp/software/wingen/, will possibly help us better estimate how different proportions of 'floor' influence the parameters).

Although there are different methods for scaling perfect or zero scores (Hambleton 1985), we have not been successful in locating studies where the problems of "floor" and "ceiling" have been analysed.

Are there some "rules of thumb" ?

Could anyone point us to literature re this question ?

Best regards

MikeLinacre: Arne, thank you for asking us for advice. It appears that your interest is in the clinical use of your instrument, not its use on "normals". Accordingly those scoring zero (normals) are irrelevant to your study. So why not omit them? Surely you would not have asked them to respond if your clinical sample had been large enough.

This also suggests that you use an estimation method that is robust against the sample-distribution (such as CMLE, JMLE or PMLE, but not MMLE). This is because the distribution of your norming sample is far from the clinical sample distribution that users of your instrument will encounter.

arne: Dear Mike, dear all.

Many thanks for your kind answer, and suggestions.

We are un-sure about the proportion of zero scores.
Preliminary studies of sensitivity/specificity suggests a cut-off score of three or more for positive diagnosis (validated against another instrument).

Again, we don't know how the not-diseased will respond below this level. As the literature suggests a net sample of approx. 300, omitting a large proportion of scores will affect the size of our sample, as we also plan to check for gender bias. (Luckily, we have not collected the data yet :) )

When we have established equivalency between instruments, we'll probably 'verify' the scale against a clinical sample.

We will ponder your suggestions re. estimation methods.

Again, thanks.

Best regards

MikeLinacre: Arne, perhaps the literature lacks clarity here. "As the literature suggests a net sample of approx. 300, Omitting a large proportion of scores will affect the size of our sample." - Surely the literature means "a net relevant sample of approx. 300". Those with almost extreme scores are rarely part of the relevant sample, particularly in a gender bias study.
How about simulating data like that you expect to collect to see if gender bias would be detectable? That would give you a stronger indication of the sample size you need.
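
A simulation of this kind can be sketched in a few lines. The sketch below assumes a dichotomous Rasch model, normally distributed abilities, and gender bias modelled as one item being harder by a fixed number of logits for one group. All names, difficulties, and the size of the bias are hypothetical values for illustration.

```python
import math
import random

def simulate_dif_data(n_per_group, difficulties, dif_item, dif_logits, seed=1):
    """Simulate 0/1 Rasch responses for a reference and a focal group.

    Item number dif_item (0-based) is dif_logits harder for the focal
    group only. Returns a list of (group, responses) tuples.
    """
    rng = random.Random(seed)
    data = []
    for group in ("reference", "focal"):
        for _ in range(n_per_group):
            ability = rng.gauss(0.0, 1.0)   # assumed ability distribution
            row = []
            for i, d in enumerate(difficulties):
                if group == "focal" and i == dif_item:
                    d += dif_logits         # the simulated bias
                p = 1.0 / (1.0 + math.exp(d - ability))
                row.append(1 if rng.random() < p else 0)
            data.append((group, row))
    return data

def item_p_value(data, group, item):
    """Proportion of one group endorsing one item."""
    rows = [r for g, r in data if g == group]
    return sum(r[item] for r in rows) / len(rows)

# 150 per group and 7 items, as in the planned study; other values are made up
data = simulate_dif_data(150, [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0],
                         dif_item=2, dif_logits=0.5)
```

The simulated responses could then be analyzed in the same way as the real data, repeating with different seeds, sample sizes, and bias sizes to see how often a bias of a given size is detected.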

953. Interpretation of the reliability index

SusanCheuvront August 23rd, 2007, 6:05pm: What is the correct interpretation of the reliability index produced by Facets in the output of rater characteristics? I read somewhere that it is not an index of agreement among raters and that higher values indicate substantial differences between raters. Is this true? And what does this mean?

In my most recent analysis involving 6 raters, it produced a reliability index of .87. What's the best interpretation of that?



MikeLinacre: Conventional reliability indexes (like Cronbach Alpha) report "reliably different". Inter-rater reliability coefficients report "reliably the same". The reliability indexes reported by Facets are all "reliably different". So 0.87 means that your raters are reliably different in leniency/severity.
Unfortunately there is not a generally-agreed inter-rater agreement index, but Facets supplies the information necessary for computing most of them.
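
One of the simplest agreement indices is the percentage of exact agreement: how often two raters give the same rating to the same examinee on the same item. A minimal sketch, assuming the ratings are held in a dictionary keyed by (examinee, item). This is an illustration of the idea, not Facets's own computation, and the data are made up.

```python
from itertools import combinations

def observed_agreement_percent(ratings):
    """Percent of rater pairs giving identical ratings.

    ratings: dict mapping (examinee, item) -> {rater: rating}.
    Every pair of raters who rated the same examinee on the same item
    counts as one agreement opportunity.
    """
    agree = 0
    total = 0
    for by_rater in ratings.values():
        for a, b in combinations(sorted(by_rater), 2):
            total += 1
            if by_rater[a] == by_rater[b]:
                agree += 1
    if total == 0:
        raise ValueError("no pair of raters rated the same examinee and item")
    return 100.0 * agree / total

# Hypothetical 0/1 scores from two raters
ratings = {
    (1, 1): {"Steve": 1, "Jan": 1},
    (1, 2): {"Steve": 0, "Jan": 1},
    (2, 1): {"Steve": 0, "Jan": 0},
    (2, 2): {"Steve": 1, "Jan": 1},
}
percent = observed_agreement_percent(ratings)
```

Unlike a kappa computed from a crosstab, this count does not require the two raters to have used the same set of categories.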

SusanCheuvront: What kind of inter-rater agreement index do you recommend? I've tried Kappa, but SPSS gets moody and won't calculate it unless the crosstab tables produced from the data are symmetric, which doesn't always happen. What could I take from Facets to compute an index of interrater reliability?

MikeLinacre: SusanCheuvront, thank you for your email. The first challenge is to define what type of inter-rater reliability you want. Since you considered Cohen's Kappa, it seems "reliable raters agree on choosing the same category".
You are using Facets, so specifying
Inter-rater = facet number
produces a column "Observed agreement percent". With a matching summary statistic below the Table.
"Expected agreement percent" is the value when the raters are acting like independent experts. The closer the observed value is to 100%, the more the raters are acting like "rating machines". Examination boards usually train their raters to act like rating machines.

SusanCheuvront: Thanks, Mike. A % agreement index would be perfect for my purposes. So what would my input file look like?

Facets = 2
Model = ? ? R
Inter-rater = 2
Labels =
1, Examinee
2, Rater
1 Steve
2 Jan

And, I have one additional question re data entry. If I want to include the items as a facet, how do I label them? There are 60 items on each test and they're scored 0, 1 by 2 independent raters.

3, Items


MikeLinacre: With the items, Susan:

Facets = 3
Model = ?, ?, ?, R
Inter-rater = 2
Labels =
1, Examinee
1-nnn ; for nnn examinees
2, Rater
1 Steve
2 Jan
3, Items
data =
1,1,1,2 ; person 1, rater 1, item 1, rating = 2

954. polytomous analysis

mdeitchl August 8th, 2007, 12:57pm: Hi - I am working with the polytomous Rasch model. Each item in my scale has the same 4 categories. I would like to run my analysis so the steps between thresholds of the categories are not made to be equivalent for any item - as well as for across the items. Do I need to change some default setting to allow for this type of analysis? Thanks Megan

MikeLinacre: Megan, thank you for your question.
It sounds as though you want to use the "Partial Credit" model, where each item is conceptualized to have a unique rating scale structure.
In Winsteps, this is done with ISGROUPS=0
In Facets, this is done with "#" in the Models= specification.
In RUMM, click on the highest "Full Model" on the list that is allowed.

955. Person subscale scores

Seanswf August 5th, 2007, 2:21am: Hi Mike,
I have a test with items organized according to learning objective. I used the ISUBTOTAL command in Winsteps to produce the average measures for each learning objective. Now I want to obtain the average learning objective measures for different subgroups of people (for example by classroom) and compare them to the previously obtained total group learning objective measures. What is the best method to obtain the separate person subgroup scores so that they are able to be compared to the total group scores.

MikeLinacre: Sean, not sure what it is you want exactly. Does PSUBTOTAL= do it? Or perhaps PSELECT= ? Can you tell us in more detail what you want for each person subgroup?

Seanswf: Sure Mike
I have items categorized by learning objective and can obtain ISUBTOTAL measures.

I have persons categorized by territory (place where they took the exam) and want to obtain ISUBTOTAL measures for each group which I can compare to the ISUBTOTAL report for the total group.

When I run the analysis with all persons then use PSELECT to isolate one territory the ISUBTOTAL report is the same as with all persons. I would like to know how each territory ISUBTOTAL measures differ from the total group ISUBTOTAL measures.

Does that help?

MikeLinacre: Seanswf, thank you for the further information. You "want to obtain ISUBTOTAL measures for each group". This can be done by
1. Doing separate analyses for each sub-sample of persons and items. But these will need to be equated/linked in some way to make the measures comparable.
2. Doing a "Differential Group Functioning" analysis, Winsteps Table 33. In this you specify the person group indicator (in the person label) and the item group indicator (in the item label) and the Table tells you how much better or worse the local performance is than the overall performance.

Seanswf: For your #1 could I accomplish this by conducting an analysis of the total group & create a PFILE, then use that PFILE as a PAFILE in separate analyses of each sub-group?

#2 - for table 33 Winsteps will plot the "relative measure" but not the "local measure" I think I want the local measure because I would like to calculate the probability of correctly responding to an item for the subgroup at the cut-score and compare that to the probability of correctly responding to an item for the total group at the cut-score.

MikeLinacre: Seanswf,
#1: PAFILE= certainly works. Perhaps that will give you what you are looking for.
#2: the measures are "relative" because Winsteps is computing interaction terms. You can add these to either the persons or the items as an estimate of "main effect + interaction".
You want "calculate the probability of correctly responding to an item for the subgroup at the cut-score". This could be based on a standard DIF analysis: item x person subgroup, will give you the local item difficulty.

Seanswf: Thanks again Mike. Based on your advice I was able to create the report I was looking for, using the local measures to calculate the probability of a correct response at the cut-score and computing the difference from the observed P-value. I wanted to show instructors how well subgroups were achieving the criterion value at the item level and at the learning objective level (group of related items). For the latter I will report the average P-values, expected probability & difference of the items measuring each objective.

Persons       Item one   P-Value   Expected Probability            Difference
              measure              at cut-score (1.03 logits)
Total group    -.99       .88       .88                             0.00
Subgroup1     -1.15       .85       .90                             -.05
Subgroup2     -1.95       .95       .95                             0.00

Do you see any problems with this approach?

MikeLinacre: No problem, Seanswf - provided the instructors think this way!
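
The expected probabilities in Seanswf's table follow from the dichotomous Rasch model evaluated at the cut-score. A minimal sketch of that calculation, where the function name is illustrative and the item measures are the ones quoted in the table:

```python
import math

def probability_correct(ability, difficulty):
    """Dichotomous Rasch probability of success, both values in logits."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

cut_score = 1.03
local_measures = {"Total group": -0.99, "Subgroup1": -1.15, "Subgroup2": -1.95}
expected = {group: round(probability_correct(cut_score, d), 2)
            for group, d in local_measures.items()}
```

The Difference column is then the observed P-value minus this expected probability, for example .85 - .90 = -.05 for Subgroup1.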

Seanswf: A very valid point, Mike. I will have to put my sales skills to work. Wish me luck! Thanks again for your guidance and wisdom.

956. Equivalence of test versions

helenh July 28th, 2007, 10:10pm: HI
Could anyone advise on how to analyse equivalence of randomly generated versions of a test in Winsteps?
For example, if we've got 16 sub-sections, 1-4 questions each generated from an item bank (not built with IRT in mind), how do we tell Winsteps to link all the versions generated?
hope this makes sense.

many thanks

MikeLinacre: Helen, is this 16 datasets each with different, but overlapping, subsets of items? If so, MFORMS= is the answer. Include in each person label a code for which dataset that person belongs to, and then you can do a dataset x item DIF analysis to check for stability of item difficulties across datasets. OK?

helenh: Dear Mike
In my case, I potentially have thousands of forms which are not labelled, i.e. for each of the 15 categories, out of a 250 item bank the system randomly selects 1/2/3 items to generate a 50 item test. Of course, there are forms with overlapping items (or items that test the same part of the syllabus=a category).

Would MFORMS= deal with that scenario? The problem is that I'm being asked to evaluate a test which hasn't been pre-tested, and so far I've used Winsteps for fixed form tests with a clear map, so I'm a bit lost...

Thanks a lot for your help so far.

MikeLinacre: Helenh, your design sounds similar to a computer-adaptive test. Your data collection mechanism needs to store the responses in the form: "person id", "item id", "scored response". You could then analyze these data directly with the Facets program. Or your programming folks could write a simple computer program to convert these data into a rectangular data matrix. This is what we did for Winsteps Example 5, but there we used the original responses because there was only one scoring key.
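
The conversion from "person id", "item id", "scored response" records into a rectangular matrix can be sketched as below. The function name is hypothetical, and the use of "." for items a person did not see is an assumption for this sketch; any code that is not a valid response would serve the same purpose.

```python
def to_rectangular(records, missing="."):
    """Convert (person_id, item_id, scored_response) records to rows.

    Returns the sorted item ids (the column order) and a list of
    (person_id, response_string) rows, one character per item.
    Items not administered to a person are filled with `missing`.
    """
    persons = sorted({p for p, _, _ in records})
    items = sorted({i for _, i, _ in records})
    lookup = {(p, i): r for p, i, r in records}
    rows = []
    for p in persons:
        row = "".join(str(lookup.get((p, i), missing)) for i in items)
        rows.append((p, row))
    return items, rows

# Hypothetical records from randomly generated forms
records = [("P1", "I1", 1), ("P1", "I3", 0), ("P2", "I2", 1)]
items, rows = to_rectangular(records)
```

Each row then has one column per item in the bank, so persons who took different randomly generated forms are linked through whatever items they happen to share.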

957. Item point-biserial correlation

flowann July 13th, 2007, 7:32am: Good morning Mike,
in your guidelines for manuscripts I found the note about point-biserial correlations being greater than 0.30.
To my understanding there is a difference between the point-biserial correlation and the point-measure correlation, the latter is depicted in WINSTEPS item polarity.
Where do I get information about the point-biserial correlations of my data?
Thanks in advance!

MikeLinacre: Flowann, point-measure and point-biserial correlations are usually almost the same for complete data. When there is missing data, the point-measure correlation keeps its meaning, but the point-biserial correlation can be misleading. This is why the point-measure correlation is the default in Winsteps. If you prefer the point-biserial correlation, specify PTBIS=Yes.
For complete data, the point-measure correlation and the point-biserial correlation have the same interpretation.
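
To make the distinction concrete, here is a small sketch under stated assumptions: both statistics are computed as Pearson correlations of the scored responses to one item, the point-biserial against each person's total raw score and the point-measure against each person's Rasch measure. The data are invented for illustration, and whether the item itself is included in the raw total is a convention this sketch does not settle:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical complete data for five persons (illustration only).
item_scores = [0, 0, 1, 1, 1]            # scored responses to one item
raw_totals = [3, 5, 6, 8, 9]             # each person's total raw score
measures = [-1.2, -0.3, 0.1, 1.0, 1.6]   # each person's Rasch measure (logits)

point_biserial = pearson(item_scores, raw_totals)
point_measure = pearson(item_scores, measures)
print(round(point_biserial, 2), round(point_measure, 2))
```

With missing data, raw totals are no longer comparable across persons while measures still are, which is the reason given above for preferring the point-measure correlation.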

958. Equating

Raschmad July 12th, 2007, 8:33am: Hi everyone,
For equating 2 test forms, the difference between the mean difficulty of the anchor items from the 2 analyses (shift constant) should be added to the person measures on the harder test. I have done a couple of equating projects with Winsteps. I just put all the data from the two tests with common items into one control file and ran the analysis and got person and item measures, assuming equating is done. In fact, I enquired about the issue and experts said that’s the way.

I was just wondering: does Winsteps recognize an equating file and automatically compute and add the shift constant to all person measures estimated on the basis of the harder test?

Wright & Stone (1979: 117) demonstrate that item measures after equating by this procedure are not very much different from the item measure that we get by doing a combined analysis of all persons and items. They make this comparison “to assess the adequacy of this common item equating” and call the combined analysis the “reference scale”. So if this is the case then why should we go through the equating procedure?
Mike: could you please explain?

MikeLinacre: Raschmad, as usual you raise important questions.
When all the data is included in one analysis, this is called "concurrent" or "one-step" equating. When Wright & Stone wrote in 1979, this wasn't possible because the software of the time couldn't handle missing data. Now we can choose to do separate analyses and equate them with shift constants, or to do a combined analysis. The results should be statistically the same, except for the location of the local origin. It is usually much easier to do a concurrent analysis (Winsteps command MFORMS= can assist), but there are situations in which separate analyses make better sense. Google "concurrent equating problems" to see some webpages that talk about this.

959. Principal Component Analysis - Eigenvalue units

mve July 6th, 2007, 12:44pm: I am unable to obtain Table 23.0. When I request Table 23.0 I get Tables 23.1 to 23.26. Therefore, I can not obtain Eigenvalue units. Anyone having similar probs? Can I still do PCA without this table?
Many thanks Marta

MikeLinacre: Marta, thank you for your question. Table 23.0 is a summary of the subsequent sub-Tables of Table 23. Are you using a recent version of Winsteps? The current version, Winsteps 3.63.2, definitely produces Table 23.0. Your sub-table numbers suggest Winsteps 3.57.0 or earlier.

mve: Yes, our version is 3.35. Is there much benefit upgrading our version? I am new with Rasch and not sure to what extent it is needed... I would appreciate your views.
Many thanks again, Marta

MikeLinacre: Marta, a major feature probably not in your version of Winsteps is the interface to plot directly with Excel. Please see https://www.winsteps.com/a/WinstepsFeaturesA4.pdf for a summary of current features. www.winsteps.com/wingood.htm has a history of changes to Winsteps.

960. on ideal value of a and b

za_ashraf July 10th, 2007, 7:04am:

Could anybody suggest the ideal value (or range) for a and b in a two-parameter logistic model, so that an item is considered good?

MikeLinacre: Za ashraf: in 2-PL IRT, "a" is the item discrimination. Higher is generally thought to be better, so there is probably no upper limit. At the low end, something like 0.5 may be a lower limit for practical use.
"b" is the item difficulty. Since the sample mean ability is usually set at 0, values further away than 5 are probably so far off-target that almost everyone is succeeding or failing on those items.

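A short sketch of why difficulties beyond about 5 logits are off-target, assuming the usual 2-PL logistic form without the 1.7 scaling constant. The function name `p_2pl` and the chosen parameter values are illustrative:

```python
import math

def p_2pl(theta, a, b):
    """2-PL probability of success: a = discrimination, b = difficulty (logits)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A person at the sample mean ability (theta = 0) facing items of
# very low, on-target, and very high difficulty.
for b in (-5.0, 0.0, 5.0):
    print(b, round(p_2pl(theta=0.0, a=1.0, b=b), 3))
```

For an average person, an item 5 logits away is passed or failed more than 99% of the time, so it tells us very little about that person.
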
961. MnSq and ZSTD

garmt June 18th, 2007, 12:55pm: If I understand correctly, the ZSTD as reported in the Winsteps output is obtained from the MNSQ by means of the Wilson-Hilferty transformation:

ZSTD = ( MNSQ^(1/3) - (1 - 2/(9n)) ) / sqrt(2/(9n))

where n is the number of degrees of freedom. Hence, ZSTD should be a monotonically increasing function of MNSQ, right?

Yet when I look at the values given by Winsteps, I sometimes see larger values of z corresponding to smaller mean squares. Here's an excerpt from my output:

| 101 31 32 4.01 1.03|1.13 .4|6.43 2.3|
| 18 4 32 -2.39 .56|1.39 1.1|3.15 2.1|
| 65 5 32 -2.10 .52|1.13 .5|2.81 2.1|
| 178 26 32 1.85 .49|1.29 1.1|2.74 2.4|
| 113 28 32 2.41 .57|1.08 .3|2.14 1.4|
| 29 27 32 2.11 .53|1.26 .9|1.99 1.4|
| 236 8 32 -1.41 .45|1.01 .1|1.98 1.9|
| 127 21 32 .83 .42|1.30 1.5|1.91 2.5|
| 60 27 32 2.11 .53|1.18 .7|1.88 1.3|

For example, an outfit MNSQ of 3.15 gives a ZSTD of 2.1, and an MNSQ of 2.74 gives a ZSTD of 2.4. Yet all persons have taken the same number of items, so the number of degrees of freedom must be equal, and therefore I'd expect the value of ZSTD to decrease with MNSQ.

Could somebody explain what's going on here? Am I completely missing the point?

MikeLinacre: Garmt, thank you for your question. You are correct. For a given number of degrees of freedom, the mean-square and its standardized form are monotonic. But the number of degrees of freedom is changing because they are estimated from the modeled distribution of the observations, not from counting the observations. See www.rasch.org/rmt/rmt34e.htm
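
The point can be illustrated with the Wilson-Hilferty formula quoted in the question. This is only a sketch: the degrees of freedom used below (3, 5 and 10) are hypothetical values chosen to show the effect, not the values Winsteps estimated for these persons:

```python
import math

def zstd(mnsq, dof):
    """Wilson-Hilferty standardization of a mean-square with `dof` degrees of freedom."""
    v = 2.0 / (9.0 * dof)
    return (mnsq ** (1.0 / 3.0) - (1.0 - v)) / math.sqrt(v)

# Same degrees of freedom: the larger mean-square gives the larger ZSTD.
print(zstd(3.15, 10) > zstd(2.74, 10))

# Different degrees of freedom: the ordering can reverse.
print(round(zstd(3.15, 3), 2), round(zstd(2.74, 5), 2))
```

So a smaller mean-square can carry a larger ZSTD whenever its estimated degrees of freedom are larger.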

garmt: Thanks, Mike, that explains a lot. I've gone through all the formulae, and it all makes sense, but there's one detail I don't quite understand: why is the NDF estimated from the variance of the infit rather than outfit mean square? I know that the unweighted mean square, being a proper chi^2, is chi^2-distributed, but I don't see a priori why this should be the same for the weighted infit mean square.

Also, does Winsteps produce an output table that contains the estimated NDFs (or q^2)? I'm trying to verify the calculations by hand, and I get the correct infit and outfit mean squares, but not the right zstd. I must be goofing up with the q^2, and a check halfway through the calculations would come in handy :)

MikeLinacre: Garmt, the d.f. are estimated separately for each mean-square reported, infit or outfit. And you are correct, the Infit mean-square is only roughly chi-square distributed. But the rough approximation has worked well now for around 40 years, and we are always on the alert for better fit statistics.

Khalid: Which one is most relevant to decide if an item is misfit, outfit MNSQ or ZSTD?
thanks in advance

MikeLinacre: Khalid, thank you for your question, which is fundamental to good measurement. In situations like this, it is always useful to think of the equivalent situation in physical measurement. ZSTD reports how certain we are that the measurement is wrong - but not how far wrong it is. MNSQ reports how far wrong the measurement appears to be - but not how certain we are that the measurement is wrong.

In physical measurement, we are usually more concerned about the size of any possible discrepancy ("measure twice, cut once") than about how certain we are that there is a discrepancy ("I'm sure I measured it right!"). If size of discrepancy is more important than certainty of discrepancy, then the MNSQ is more crucial than the ZSTD. But in much of statistics, only the certainty of the discrepancy is considered ("hypothesis testing"). The size of the discrepancy is usually ignored.

Khalid: Hi Mike Linacre, thank you for your answer. My aim is to create equivalent logits for students' scores over time and school level, so I'm in the process of calibrating my measure (5 tests over two years with anchoring items).
My concern is how to interpret a high value of outfit ZSTD with a normal value of outfit MNSQ (example: ZSTD = 2.5 and MNSQ = 0.87).
thank you in advance.

Khalid: I’ve just read that ZSTD is overly sensitive to misfit when the sample size is over 300; my data sample is over 45000, and I don’t know which rule I can apply to decide whether an item misfits or not.
Could a sampling strategy over my data be effective?

here one of my outputs:

TABLE 14.1 SECONDAiry 1 year1 ZOU272WS.TXT Jul 6 16:23 2007
INPUT: 875 Students 12 Items MEASURED: 875 ETUDIANTS 12 Items 2 CATS 3.63.0
ETUDIANT: REAL SEP.: 1.19 REL.: .59 ... Items: REAL SEP.: 13.28 REL.: .99


| 1 89 827 2.02 .12 | .98 -.2| .75 -1.4| .38| 90.1 89.6| ITEM1 |
| 2 215 827 .69 .09| .99 -.1|1.13 1.4| .45| 79.1 78.7| ITEM2 |
| 3 305 827 .02 .08|1.05 1.3|1.00 .1| .48| 71.0 73.5| ITEM3 |
| 4 115 827 1.67 .11| .88 -1.8| .66 -2.4| .45| 88.4 87.1| ITEM4 |
| 5 692 827 -2.81 .10|1.02 .3| .96 -.2| .53| 84.2 84.7| ITEM5 |
| 6 304 827 .03 .08|1.08 2.0|1.02 .3| .46| 70.1 73.5| ITEM6 |
| 7 212 827 .71 .09| .98 -.4| .92 -.9| .47| 78.0 79.0| ITEM7 |
| 8 313 827 -.03 .08| .90 -2.9| .88 -1.9| .55| 77.3 73.3| ITEM8 |
| 9 542 827 -1.55 .08|1.04 .9|1.04 .6| .53| 73.2 74.7| ITEM9 |
| 10 395 827 -.57 .08|1.10 3.0|1.18 3.1| .46| 67.7 71.6| ITEM10|
| 11 324 827 -.11 .08|1.03 .7|1.08 1.3| .49| 71.6 73.0| ITEM11|
| 12 318 827 -.07 .08| .95 -1.4| .88 -1.9| .53| 75.5 73.2| ITEM12|
| MEAN 318.7 827.0 .00 .09|1.00 .1| .96 -.2| | 77.2 77.7| |
| S.D. 161.2 .0 1.24 .01| .06 1.6| .15 1.6| | 6.9 6.0| |

For items 4 and 10, should ZSTD or MNSQ be the rule to decide whether they misfit or not?

MikeLinacre: Khalid. You can infer the findings of a sub-sampling strategy by looking at the nomogram plot at www.winsteps.com/winman/diagnosingmisfit.htm. Identify the mean-square line closest to your mean-square. You can then predict the ZSTD (y-axis) for any given sample-size (x-axis) by following that mean-square line.
Item mean-squares less than 1.0 have little influence on measure distortion, so your concern is with item mean-squares considerably above 1.0. Are these due to bad items, or due to misbehavior by the sample, e.g., random guessing?

Khalid: Thanks Mike, for item 10 that is the case (a bad item), but not for the others.
Could the reason be the large sample I use (875 students)?

MikeLinacre: Khalid, all the mean-squares in your Table are below the most strict upper-limit of 1.2 used for high-stakes multiple-choice tests: https://www.rasch.org/rmt/rmt83b.htm - it is your large sample-size that is making the hypothesis test so sensitive.

Khalid: Thanks a lot Mike, I did the sampling strategy and it worked: the misfitting items I found were exactly the most incomprehensible items.
thanks again

962. logit scores in Winsteps

dudumaimon July 4th, 2007, 9:10am: I am analyzing delinquency scores with 15 items using a sample of 8,000 respondents. I was wondering if Winsteps has a function that shows the logit IRT scores for each respondent. If yes, what is this function, and how can I incorporate these IRT scores in SPSS, SAS or STATA files for further regression analysis? Thanks.

MikeLinacre: Dudumaimon, in Winsteps "logit scores" are called "measures" to distinguish them from "raw scores" which are called "scores". On the Output Files menu, select "PFILE=". You can write the individual respondent statistics to text files, Excel files or SPSS files.

963. Analyzing Complex Samples with Rasch

mdeitchl June 21st, 2007, 5:24pm: I have some data I would like to analyze by applying the Rasch model and using Winsteps software. The individuals in the sample were selected by a complex sampling scheme (cluster sampling). Does Winsteps offer a function for adjusting for the correlation of individuals within a cluster due to complex sampling schemes? Any suggestions on how to proceed? thanks. Megan

MikeLinacre: Megan, intriguing question! Winsteps usually has no problem with missing data, and also allows for weighting of cases, so it supports designs like census weighting.
Your data has person-dependence. This is not unusual. For instance, in medical samples, people with the same disease are dependent. So an early step in the analysis is to discover the amount of dependence and where in the sample it is located. Winsteps Table 24 - person residual analysis - indicates what is the biggest secondary dimension in the data. If this is very large, then it suggests splitting the sample in two for separate analyses. In a medical analysis, we discovered that "burn" victims are considerably different from "disease" victims. However, the relative number of "burn" victims was so small that clinicians opted for one measurement system rather than two. The "burn" victims somewhat misfit the overall measurement system, but the clinicians reported that all the measures were good enough for their purposes.

mdeitchl: Thanks, Mike, for referring me to table 24.99. I am mostly interested in item-specific (not person-specific) analysis to assess the validity of the scale I am using. Will the cluster sample design affect the standard errors for assessing item fit and severity ordering of the scale questions? Is this important to adjust for? I am also wondering, if I run table 24.99 - and there is person correlation indicated - should I remove those dependent persons listed when I assess the validity of the scale? Thanks for your help! Megan

MikeLinacre: Megan, thank you for your questions.
For item fit, the more observations of an item, the more sensitive the fit tests.
Dependent persons: these will slightly lower the item standard errors but have little other influence on your findings.
Cluster designs rarely affect the item hierarchy. If in doubt, code the cluster number into each person label, and do a Winsteps Table 30 DIF analysis of person cluster vs. item.

964. Equating, anchoring

omardelariva January 18th, 2007, 4:05pm: Hello everybody:

I wanted to compare two methods of equating: the first uses anchored items with no common group; the second is the method described in Best Test Design (BTD; Wright & Stone, 1979). I used the KCT data as the first form of the exam (18 items and 34 persons), then modified it to obtain a second dataset (15 items and 25 persons). I had many questions after I analyzed the estimates. As you can see in Table 1, if I decided to use the BTD method, the final estimate for, say, R1 would be -4.93, but if I used the anchored test, the final measure would be -6.59. The differences between the Form I and Form II measures of the anchored items were reported as displacements.
How should I construct an item bank: using the anchored measures, or the anchored measures plus displacements (if I decide to employ the anchored-item method)?
Should I test whether the displacements are statistically significant?

Anchor item   Form I Measure   Form II Measure   Final Measure (BTD method)
R1                -6.59            -1.72             -4.93
R4                -4.40            -1.72             -3.83
R9                -3.38            -2.98             -3.95
R13                1.95             2.42              1.41
R14                3.37             2.62              2.22
R15                4.80             3.07              3.16
R2                -6.59            -1.72             -4.93

Can somebody help me?
Thanks in advance.

MikeLinacre: Omardelariva, thank you for your participation.
Always do two separate analyses first, and then cross-plot the two sets of item measures. Look at the plot. This plot usually tells you immediately what type of equating procedure will be most successful.
In general, if one analysis is clearly of higher quality than the other, then anchor to the measures of the higher quality analysis. If they have the same quality, then do a BTD analysis. But there are many other considerations. For instance, is the equating "best fit" line at 45 degrees (simple equating) or does it have a different slope (Celsius-Fahrenheit equating). For some of these concerns, see www.rasch.org/rmt/rmt163g.htm
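
For the simple (45-degree) case, the common-item arithmetic can be sketched with the seven anchor items posted in the question. The procedure assumed here is: shift constant = mean Form I anchor measure minus mean Form II anchor measure, and final measure = average of the Form I measure and the shifted Form II measure. That this is the calculation behind the posted "Final Measure" column is an inference; it does reproduce that column to within rounding:

```python
# Anchor item measures copied from the table in the question (logits).
form1 = {"R1": -6.59, "R4": -4.40, "R9": -3.38, "R13": 1.95,
         "R14": 3.37, "R15": 4.80, "R2": -6.59}
form2 = {"R1": -1.72, "R4": -1.72, "R9": -2.98, "R13": 2.42,
         "R14": 2.62, "R15": 3.07, "R2": -1.72}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Shift constant: how far Form II must move to sit on the Form I scale.
shift = mean(form1.values()) - mean(form2.values())

# Final measure: average of the Form I value and the shifted Form II value.
final = {item: (form1[item] + form2[item] + shift) / 2 for item in form1}

print("shift:", round(shift, 2))
for item, value in final.items():
    print(item, round(value, 2))
```

If the cross-plot shows a best-fit slope noticeably different from 1, a single shift constant like this is not enough and a slope adjustment is also needed.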

Khalid: Hi, my question is about vertical equating (anchoring) for a math test,
The data collected is the responses of students at different levels (secondary 1 to secondary 5) and for two years, for each year there’s a different version (version A: Year 1, version B: year 2).
Level   Year 1    Year 2    Link between years
SEC1    TEST A1   TEST B1   5 anchored items: A1-B1
SEC2    TEST A2   TEST B2   5 anchored items: A2-B2
SEC3    TEST A3   TEST B3   5 anchored items: A3-B3
SEC4    TEST A4   TEST B4   5 anchored items: A4-B4
SEC5    TEST A5   TEST B5   5 anchored items: A5-B5
Version A: 4 anchored items for each pair of adjacent levels: 1-2 | 2-3 | 3-4 | 4-5
Version B: 4 anchored items for each pair of adjacent levels: 1-2 | 2-3 | 3-4 | 4-5

So I have 5 linking items for each level between year 1 and 2, and 4 linking items each year between every level of the secondary school.

How can I equate my tests with Winsteps to produce thetas for students that are equivalent over years and over levels?

Thank you

MikeLinacre: Khalid, thank you for your question. If you have each year and version in a separate rectangular dataset, then you can use the Winsteps MFORMS= instruction to combine them into one analysis.

Khalid: thank you for your answer,

965. Preparation of text files for performing DIF

flowann June 24th, 2007, 4:08pm: Hi Mike and everybody,

I am trying to edit my WordPad data file to add some demographic data in order to perform the DIF analysis.
The codes of the rating scales (items) are 1 to 5; the codes of the demographics are 0 to 9. I assume that I cannot use the same codes, is this right?
Must I use letters, as for gender in the Knox Cube Test?
For instance, one demographic variable has 9 different subcodes.

Thanks for help! :-/

MikeLinacre: Flowann, demographic codes for identifying DIF groups are part of the person label. They can be any letters or numbers that you choose, but they must line up in the same column or columns. For instance, if they are 0-9 in column 6 of the person label, then, in Winsteps: DIF = 6 or DIF = $S6W1
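
To show what "column 6 of the person label, 1 column wide" selects, here is a small sketch. The person labels and the helper name `dif_group` are made up for illustration; the code only mimics the column selection and is not how Winsteps itself is implemented:

```python
def dif_group(person_label, start=6, width=1):
    """Return the characters at 1-based column `start`, `width` columns wide."""
    return person_label[start - 1 : start - 1 + width]

# Hypothetical person labels: id in columns 1-4, demographic code in column 6.
labels = ["A001 3 F", "A002 7 M", "A003 3 F"]

groups = {}
for label in labels:
    groups.setdefault(dif_group(label), []).append(label)

for code in sorted(groups):
    print(code, groups[code])
```

Because the group is identified by position within the label, the demographic codes can reuse the digits 0-9 even though the item responses also use digits.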

flowann: Mike, thank you for guidance!

966. model underfit / overfit

flowann June 23rd, 2007, 2:16pm: Hi everyone,
referring to the rule-of-thumb that 5% of persons and items (in a two-facets Rasch model) are accepted to over- and underfit the model, I would like to know if this means over- and underfit or over- or underfit?
Are there any references about this topic?

MikeLinacre: Flowann, thank you for your question. In Rasch analysis (and statistics in general), under-fit and over-fit are asymmetric. Underfit contradicts the model. This can be a big problem because the resulting measures can be substantively meaningless. Overfit indicates the model is over-parameterized (i.e., the parameters aren't independent), but this does not change the substantive meaning of the measures.
A better rule-of-thumb is "investigate underfit before overfit". Often eliminating the underfit also eliminates the overfit, because the overall fit of the data to the model changes when the underfit is removed.
There is a huge amount of literature on fit to the Rasch model. Look for papers by Richard M. Smith who has made a specialty of investigating this topic.

flowann: Hello Mike,
Thank you for answering!
The bigger threat is the under-fit, so I will mainly focus on it .
Thanks for the reference!

968. DIF

mikail June 21st, 2007, 2:46am: How can the DIF plot from figure 30 in Winsteps be read and interpreted? In my scenario, the DIF contrast, t and probability indicated that three out of ten items function differently across gender. However, visual inspection of the plot showed that items without significant DIF were also widely dispersed (the gap between the two lines was very wide), and these were even more numerous than the significant items. Why?
How can this plot be interpreted, and which of the plots is the most meaningful (t-value, relative measure, or local measure)?

MikeLinacre: Mikail: statistical significance is indicated by t-values. However, you need to decide what hypothesis you are testing. The t-values shown in the Winsteps plots may not be the ones you want. Probably the pairwise significance tests reported by Winsteps, but not plotted, are what you want. They will be plotted in a future upgrade to Winsteps.
If item measures are far apart, but not reported as statistically significantly different, then the standard errors of the item measures are large. This is because t-value = measure difference / joint standard error.
Usually we are not concerned about DIF unless both the measure difference is large and the t-value is significant. See the ETS Table at https://www.rasch.org/rmt/rmt203e.htm
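
A worked sketch of that t-value, assuming the joint standard error is the square root of the sum of the two squared standard errors. The measures and standard errors below are invented to show how the same difference can be significant or not:

```python
import math

def dif_t(measure_1, se_1, measure_2, se_2):
    """t-value for the difference between two group-specific item measures."""
    joint_se = math.sqrt(se_1 ** 2 + se_2 ** 2)
    return (measure_1 - measure_2) / joint_se

# Same 0.8-logit difference, imprecise estimates (small groups).
print(round(dif_t(1.0, 0.5, 0.2, 0.5), 2))

# Same 0.8-logit difference, precise estimates (large groups).
print(round(dif_t(1.0, 0.1, 0.2, 0.1), 2))
```

A wide gap between the plotted lines therefore does not by itself imply a significant t-value; the standard errors of the two measures matter just as much.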

969. When should I use Msq or Zstd

omardelariva June 21st, 2007, 12:06am: Hello everybody:
Chi-squared statistics are affected by sample size. Usually, our sample sizes are greater than 60,000 cases. What statistics should we use in order to test item fit? I mean, what sample sizes are adequate for using Msq and Zstd?

Thank you.

MikeLinacre: Omardelariva: your sample size is so large that it is like looking at your data through a microscope. All misfit will look statistically significant. https://www.rasch.org/rmt/rmt63a.htm has a suggestion. Also https://www.rasch.org/rmt/rmt171n.htm.

970. Rack and Stack analysis

VAYANES June 18th, 2007, 4:00pm: Dear colleagues,
I am interested in comparing Time 1 to Time 2 data about individual perceptions but I am afraid I have a lot of doubts concerning Rack and Stack.

I have 20 individuals and I asked them about 25 items in 2000 and then about the same items in 2003. I wonder which measures to use and how to obtain them to draw the plots in Wright (2003). I thought of three alternatives after reading some examples in the literature:
1. To run the Winsteps program on two independent samples, one of 20 individuals x 25 items for each year, and then plot the measures obtained for individuals and items in Tables 13 and 17 of each analysis. In this case, we have one model and one continuum for each year.

2. To run the Winsteps program on one entire sample, putting all the information on the same continuum to get comparable measures. I would run the program twice: once with 20 individuals x 50 items (rack) and once with 40 individuals x 25 items (stack). The measures to plot would be those from each analysis.

3. To run Winsteps on the 2000 sample, with 20 individuals x 25 items, and then anchor some measures to run the analysis of 2003.

Would it be possible to do the rack and stack analysis as in alternative 1 or 2? Would it be possible to compare the measures in case 1 or in case 2?

Thank you very much for any information and help.

MikeLinacre: Vayanes, these are important questions in making comparisons across time. The answer depends on what is the focus of your research.
1. If you are tracking the persons, then you need to measure them in the same frame of reference at time 1 and time 2. Usually one of the two times is more decisive. In healthcare, it is time 1 because that is when treatment decisions are made. In education, it is usually time 2 because that is when success/failure decisions are made. Analyze the data from the decisive time point to obtain the person measures. Anchor the item difficulties, then analyze the other time point to obtain the comparable set of person measures.
2. If you are tracking the differential impact of the intervention on the items, for instance, which items are learned and which aren't. Then rack the data.
3. If you are tracking the differential impact of the intervention on the persons, for instance which persons benefited and which didn't. Then stack the data.
4. If you are tracking how the instrument changes its functioning (Differential Test Functioning) between time 1 and time 2, then perform separate analyses.
Of course, you can perform more than one of these to answer different questions ....

VAYANES: Dear prof.Linacre,
Sincerely, thank you very much for your quick answer. Now I understand the rack and stack analysis a bit more. However, I am afraid I still do not feel absolutely confident about doing it, so I have to ask you one more question, if you don't mind.

Related to your answer and the focus of my research, I have to rack (option 2 in your answer) and stack (option 3 in your answer) the data. In this case, do I have to follow the same process as in option 1 with the anchoring of measures, or would it be possible to run Winsteps with one entire sample to get the same frame of reference for each analysis? For the stack, I will enter each individual twice (one record for each year) and each item once, while for the rack, I will enter each item twice (one for each year) and each individual once.

Thank you very much.

MikeLinacre: Vayanes, yes, the "rack" and "stack" are each one analysis.

In "stack" every person has two person records, one for time 1 and one for time 2.
Each item has one column which includes both person rows.

In "rack" every item has two item columns, one for time 1 and one for time 2.
Each person has one row which includes both item columns.
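
The two layouts can be sketched with a toy dataset of 3 persons and 2 items at each time point (invented values; the real study would have 20 persons and 25 items):

```python
# Hypothetical scored responses: 3 persons x 2 items at each time point.
time1 = [[1, 0], [0, 0], [1, 1]]
time2 = [[1, 1], [1, 0], [1, 1]]

# Stack: each person appears twice (rows double), each item appears once.
stacked = time1 + time2

# Rack: each item appears twice (columns double), each person appears once.
racked = [row1 + row2 for row1, row2 in zip(time1, time2)]

print(len(stacked), len(stacked[0]))   # rows, columns of the stacked matrix
print(len(racked), len(racked[0]))     # rows, columns of the racked matrix
```

With the study's dimensions, the stacked matrix would be 40 person records by 25 items and the racked matrix 20 persons by 50 item columns.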

VAYANES: Dear prof.Linacre:
Thank you a lot for your answer and, even more, for your QUICK answer. I now understand the rack and stack analysis much better. Thank you very much for your patience.
Best wishes,
Vanessa Yanes-Estévez
Universidad de La Laguna

971. Negative items

mikail June 14th, 2007, 6:39am: Hello everyone...
I hope anyone can help to solve this problem.
I was recently analyzing a language anxiety scale, and found that some items were worded negatively. Do these items have any impact on the overall reliability of the scale? How can we reverse the items (in practice)? How can these items be interpreted? Could you please suggest any valuable references?

MikeLinacre: Mikail, negative items are often used in questionnaires to try to prevent response sets (respondents agreeing to everything). They can cause problems because some respondents won't notice the negative wording and will respond as though the items are positive. Most test analysis software includes methods for reversing the scoring of negative items.
There's a research note at www.rasch.org/rmt/rmt112h.htm
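
As a minimal sketch of what reversing the scoring means, assuming a 1-5 rating scale: each response x on a negatively worded item is replaced by (lowest category + highest category - x). The item positions and responses below are hypothetical:

```python
def reverse_score(response, low=1, high=5):
    """Reverse a rating so that the top category becomes the bottom one."""
    return low + high - response

negative_items = {2, 4}        # hypothetical 0-based positions of negative items
responses = [5, 4, 1, 3, 2]    # one respondent's ratings on five items

rescored = [
    reverse_score(r) if i in negative_items else r
    for i, r in enumerate(responses)
]
print(rescored)
```

After rescoring, a high rating means the same thing (here, more anxiety) on every item, so the items can be analyzed together.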

972. Test difficulty

Narziss June 13th, 2007, 1:27pm: Hi,
I am analyzing a test of the mathematical abilities of students. In the manual I couldn't find a topic about the difficulty of a test. I think there must be some kind of statistic or index that tells me if the test was too easy or too difficult for the students. Can somebody tell me if there is a statistic like that, and what the formula looks like?

MikeLinacre: Narziss, there are two immediate ways to think about this numerically:
1. The difference between the average person ability measure and the average item difficulty measure.
2. If your data are complete, then the average score by a person on the test.
But averages can be misleading, so take a look at the sample distribution maps to verify that averages are good summaries.
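
Both summaries are simple to compute once the measures are exported. The numbers below are invented, and the variable name `targeting` is just a label used in this sketch:

```python
def mean(values):
    return sum(values) / len(values)

person_measures = [-0.5, 0.2, 0.9, 1.4]      # hypothetical abilities (logits)
item_difficulties = [-1.0, -0.2, 0.3, 0.9]   # hypothetical difficulties (logits)

# 1. Average person ability minus average item difficulty.
#    Positive: the test was easy for this sample; negative: it was hard.
targeting = mean(person_measures) - mean(item_difficulties)

# 2. For complete data, the average raw score on the test.
raw_scores = [2, 3, 3, 4]                    # hypothetical scores out of 4 items
average_score = mean(raw_scores)

print(round(targeting, 2), average_score)
```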

973. Factor Analysis

mikail June 11th, 2007, 5:43am: Hello everyone...
I would like someone to kindly explain how to use Winsteps to determine the item factors. I have inspected the table and figure of the factor analysis but I have not yet understood them. I know how the factors can be determined through SPSS, and the underlying theory. I would like to know whether the concepts are the same.

Total variance in observations = 48.3 100.0% 100.0%
Variance explained by measures = 15.3 31.7% 31.6%
Unexplained variance (total) = 33.0 68.3% 68.4%
Unexpl var explained by 1st factor = 6.5 13.5%

Could you please explain briefly what the above information means? Some items in the table are positive and others are negative; what does that mean?

I also need good references.


MikeLinacre: Mikail, thank you for your question. Have you looked in the Winsteps Help? Also look at www.winsteps.com/winman/principalcomponents.htm
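
As a reading aid for the posted table, this sketch checks one thing only: each percentage in the first percentage column is the value in the first numeric column divided by the total (48.3). The dictionary keys are shortened labels for the table rows:

```python
# First numeric column of the posted table.
total = 48.3
parts = {
    "explained by measures": 15.3,
    "unexplained (total)": 33.0,
    "unexplained, 1st factor": 6.5,
}

for label, value in parts.items():
    print(label, round(100 * value / total, 1), "%")

# The explained and unexplained parts add up to the total.
print(round(parts["explained by measures"] + parts["unexplained (total)"], 1))
```

So the "13.5%" for the first factor is a share of the total variance, not a share of the unexplained variance.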

974. Scaling

Raschmad May 22nd, 2007, 6:33pm: Dear all,
Someone has posted the following to a mailing list. It’s interesting and quite challenging for me. I didn’t feel confident enough to post my reply. I’d like to know what you think. Here’s my reply:
The poster is wrong. You cannot just assume that “A scaled score of 0 then would be equal to a z-score of -3.0, a scaled score of 15 would be a z-score of 0, and a scaled score of 30 would be a z-score of +3.0”. Some sort of equating or translation is necessary to bring your scores onto ETS’s framework.
You should include some of their items in your test with known item calibrations, i.e., the calibrations of the items in their banks. And then anchor these items to their fixed parameters. Then you can transform your scale to the scale of iBT if you have their mean and standard deviation (15 and 5 apparently).
Or if equating is not important and you are just interested in similarity of scales the famous formula is this:
Bt = SDt * (Bp - Mp) / SDp + Mt
Where Bt is ability on the TOEFL scale.
SDt is the standard deviation on the TOEFL.
Bp is ability on the practice test.
Mp is mean on the practice test.

Here is the original post:

I’m wondering if anyone would be kind enough to point out whatever errors there may be in my logic, and nudge me in the right direction, should I be off track.

I’m working with a well-known test-prep company. We have created several practice TOEFL iBT forms, and want to get all our scores on the same scale that ETS uses.

For both the Reading and Listening sections, ETS uses a scaled score that ranges from 0 to 30. Here’s where I’m wondering about my logic ...

Can the 0-30 scale be thought of as a z-score scale, ranging from -3.0 to +3.0 with a mean of 0, and a standard deviation of 1? A scaled score of 0 then would be equal to a z-score of -3.0, a scaled score of 15 would be a z-score of 0, and a scaled score of 30 would be a z-score of +3.0.

ETS SCALED (Mean = 15)

0 5 10 15 20 25 30
-3.0 -2.0 -1.0 0.0 +1.0 +2.0 +3.0

z-SCORE (Mean = 0.0)

This implies that each 5-point gain on the ETS score represents a z-score gain of 1 standard deviation.

(My ASCII drawing might blow up when I send this, but I hope you can see what I'm getting at.)

We know from some earlier work that our tests are very similar to a scored form of an ETS online practice test in terms of means and standard deviations. (We had students take our forms and the ETS test.) If we convert students’ raw test scores on our tests to z-scores, will putting them on the 0-30 scale be a good guesstimate of iBT scores?

(Example: Mean score on our Form A = 17. This is a z-score of 0. This then becomes 15 on the 0-30 scale.)

MikeLinacre: Raschmad, you correctly note: "Some sort of equating ...", but the poster appears to have done this (at least informally): "very similar ... means and standard deviations." But he can't assume the mean and S.D. of his sample are (15,5) on the ETS scale. If he can compute the mean and S.D. of his sample on his own test and on the ETS scale, then he can equate the two instruments (assuming the tests have concurrent validity).

Donald_Van_Metre: Hello,

I was the original poster of the enquiry at hand -- I posted it on the L-TEST list, and Raschmad posted it here.

I apologize in advance for my ignorance. I'm obviously in a little over my head ...

If I understand you Mike, you're saying this scheme will really only work if the means and SDs of our tests are 15/5, like the ETS scale. Otherwise, I have to do some equating. And I guess there's the rub.

Let's say for sake of illustration that one form of our test has a mean of 16, and SD of 6.

I was under the (apparently false) impression that the mean of 16 would be a z-score of 0, and be equivalent to the ETS mean of 15 (z-score 0). Further, that on our form, a score of 22 (16 + 6) would be a z-score of 1.0, and equivalent to an ETS scaled score of 20 (15 + 5 = z-score 1.0).

I guess I don't fully grok what's going on with z-scores. I thought that they gave a relative position with regards to the mean. If I'm one SD above the mean on test 1, can I not figure that value on test 2, and call them equivalent?

Sorry if I'm being dense!

You allude to an equating procedure. How would I go about doing that? Can you point me to some references?

Raschmad wrote:
Or if equating is not important and you are just interested in similarity of scales the famous formula is this:
Bt = SDt * ((Bp - Mp) / SDp) + Mt
Where Bt is ability on the TOEFL scale.
SDt is the standard deviation on the TOEFL.
Bp is ability on the practice test.
Mp is mean on the practice test.

For my hypothetical example of mean 16/SD 6 for the practice test and 15/5 for the iBT, I seem to get

Bt= [5(16-6/6) + 15] = 23

Is that a closer approximation? How would I gauge that?

At the end of the day, we're trying to provide students a "reasonable approximation" of their likely iBT score. We're not trying to predict with total accuracy. (That would be sheer folly!) Of course the more accurate we are the better, but we are not trying to say that X on our test guarantees you'll get Y on the iBT.

Would you recommend going with the z-score method, Raschmad's formula, or something else?

I truly appreciate any feedback or advice you might be able to provide, and thank you for the time you've already taken with this.

Best regards,


MikeLinacre: Donald, welcome!
ETS tell us they used z-scores from their sample. But it no longer matters how ETS constructed their TOEFL scale as long as the TOEFL scale is approximately linear. Then the TOEFL scale is like temperature Celsius. Measures on our test are like temperature Fahrenheit. So, to compare temperatures, we must equate them. This equating can be done by measuring some people with both tests, and estimating equating intercepts and slopes. Then:
equating slope = TOEFL S.D. for our sample (not theirs)/our test S.D.
equating intercept = TOEFL mean for our sample - our mean * equating slope
TOEFL measure from our test = our measure*equating slope + equating intercept
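
A minimal Python sketch of the three equating formulas above, assuming a group of students has taken both tests. The function names and the paired scores are hypothetical and invented for illustration; they are not data from this thread.

```python
from statistics import mean, stdev


def equating_constants(our_scores, toefl_scores):
    """Slope and intercept that map our test onto the TOEFL scale.

    Both lists must hold scores for the SAME people, in the same order.
    """
    slope = stdev(toefl_scores) / stdev(our_scores)
    intercept = mean(toefl_scores) - mean(our_scores) * slope
    return slope, intercept


def to_toefl(our_measure, slope, intercept):
    """Convert one score on our test to an approximate TOEFL-scale score."""
    return our_measure * slope + intercept


# Hypothetical paired scores (assumption: five students took both tests).
our = [10, 13, 16, 19, 22]
toefl = [12, 14, 17, 19, 23]

slope, intercept = equating_constants(our, toefl)
print(round(slope, 3), round(intercept, 3))
print(round(to_toefl(16, slope, intercept), 2))
```

By construction, the sample mean on our test maps onto the sample mean on the TOEFL scale, and a difference of one S.D. on our test maps onto one S.D. on the TOEFL scale for the same sample.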

Raschmad: Stuart Luppescu (1996) reports on "virtual equating", i.e., equating without any items and persons in common.


Raschmad: Dear Mike,
The formulae you have written are rather vague for me.
1- Isn’t intercept the difference between the two means? Why have you multiplied it by slope?
2- And shouldn’t the adjusted ability measure (in TOEFL frame of reference) be:
measure on our test- TOEFL mean for our sample divided by the slope+ intercept?
Why have you multiplied by slope instead of dividing?
Bt= Bp-Mt/Slope+ intercept
Thanks in advance

MikeLinacre: Raschmad, you are right. It depends on which variable is on the x-axis. So formulate the equating function to match the choice of axes that you prefer.

975. When to use DIF

Monica May 25th, 2007, 1:15am: Hi Mike,

I have test data from 297 participants in Grades 3 - 6: 68 Grade 3, 93 Grade 4, 102 Grade 5 and 34 Grade 6 respondents. They were all administered the same pilot test. I have also chosen to use a 2-phase analysis procedure according to Ludlow and Leary. First phase - answers after the last attempt are coded as not reached and item difficulties are calculated; second phase - missing and not reached data are all coded as incorrect. Using item difficulties from the first phase, person abilities are calculated.
Keeping my method of analysis in mind, when is it appropriate to:
(a) run the analysis as grade sub-samples ie Grade 3 only, Grade 4 only etc
(b) analyse the entire data set. Then run a DIF by grade.
(c) do I compare item difficulties produced from Phase 1.
I am also looking at dividing the 2 tests into (1) Grade 3,4, and (2) grade 5 and 6.
Should I then be combining the data according to that grouping and look at the item difficulties.



MikeLinacre: Thank you, Monica. You ask "when is it appropriate ... ?". Are you asking "When do reviewers expect ..." or "When are the measures more meaningful ...."?
To discover what reviewers expect, look at similar studies that have been published in your field.
Measures are usually more meaningful when one set of measures are estimated (your phase 1) and then these estimates become the basis for all subsequent studies (called "anchoring" or "fixing"). All results are then directly comparable because one measuring system has been applied everywhere. You can produce reports for each Grade etc. that can be directly compared to the reports for the other Grades.
For your DIF study, definitely omit "not reached" responses, and probably omit missing responses. You want the estimates of item difficulty for the different groups to be based on real knowledge and ignorance, not on our inferences from their failure to respond.

976. Person separation index - a question

helencourt May 22nd, 2007, 12:16pm: Dear Mike
I was wondering if I could ask your opinion again?! I have been involved in writing a paper, and one of the reviewers has highlighted an issue regarding separation index. I wondered if I could check with you that my understanding of person separation index is correct?....

My understanding is that "The person separation index expresses the reliability of the scale to discriminate between people of different abilities. It is defined as the ratio of the adjusted person standard deviation to the standard error of the measurement (i.e. the variance not accounted for by the Rasch model), measured in standard error units. Recommendation is that the separation ratio should exceed 2"

Does person separation index differ across the scale? (which the reviewer implied)...I thought that it was simply one number?

Thanks again for all your help Mike

MikeLinacre: Helen, "separation index" is a reformulation of the familiar "reliability" index, R,
Sep. = sqrt(R / (1-R)), so there is only one per person sample. It is an "across scale" average.
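
The relationship between reliability and separation can be checked with a few lines of Python. This is only a sketch of the formula quoted above; the function names are illustrative.

```python
import math


def separation(reliability):
    """Separation index from a reliability R, with 0 <= R < 1."""
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(sep):
    """Inverse of the formula above: R = Sep^2 / (1 + Sep^2)."""
    return sep ** 2 / (1.0 + sep ** 2)


for r in (0.5, 0.8, 0.9):
    print(r, round(separation(r), 2))
```

A separation of 2, the guideline mentioned in the question, corresponds to a reliability of 0.8.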

helencourt: Dear Mike

Thank you so much for clearing that up.

Thanks again!

977. item fit order table

godislove May 22nd, 2007, 12:56pm: Hi Mike,
I am checking the item fit order table (Winsteps) and two items have both outfit and infit over 1.5. Then in Table 10.3, I can see how persons responded in each response category of each item, like %, outfit etc. I wonder why there is only outfit MnSq in this table but no infit MnSq? If an item has 3 big outfits in 3 categories, does this contribute to the total item outfit and infit or just outfit? Please advise. thanks.

MikeLinacre: Table 10.3 is the option/distractor Table, godislove. It stratifies fit by response category. My experiments indicated that INFIT is uninformative in this context, because each fit statistic is reporting only the observations reported in the category. INFIT would be useful if Table 10.3 also reported for each category the fit of the observations that aren't in the category, but, according to the model, should be.
Every observation contributes to both infit and outfit. But the weighting of the observations differs. On-target observations contribute less to outfit than to infit.
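
To make the difference in weighting concrete, here is a minimal Python sketch of the usual textbook infit and outfit mean-square formulas for dichotomous responses. It is an illustration only, not the Winsteps code, and the response data are invented.

```python
def fit_mean_squares(responses, probabilities):
    """Return (infit, outfit) mean-squares for dichotomous responses.

    responses: observed 0/1 values.
    probabilities: model probability of observing a 1 for each response.
    """
    sq_residuals = [(x - p) ** 2 for x, p in zip(responses, probabilities)]
    variances = [p * (1.0 - p) for p in probabilities]
    # Outfit: plain average of the squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_residuals, variances)) / len(responses)
    # Infit: squared residuals weighted by model variance (information),
    # so on-target observations (p near 0.5) carry the most weight.
    infit = sum(sq_residuals) / sum(variances)
    return infit, outfit


# Four on-target observations (p = 0.5) and one surprising
# off-target success (p = 0.05). Hypothetical values.
infit, outfit = fit_mean_squares([1, 0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5, 0.05])
print(round(infit, 2), round(outfit, 2))
```

In this example the single off-target surprise inflates outfit much more than infit, because outfit weights every observation equally while infit down-weights off-target observations.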

godislove: Hi Mike,
thank you for your reply!
I think I need your help to clear up my knowledge about infit and outfit.
if an item has acceptable infit but big outfit, this is likely due to lucky guess or careless mistake.
If an item has both big infit and outfit, then this item is likely to be confusing to the raters (the raters assess the patients with this item).
Am I right?

2 items in my 30-item instrument have this problem. I did a misfit diagnosis by checking Table 10.3. The infit is what makes me think the definitions of both items are somehow confusing. That is my evidence for my misfit diagnosis, but this table only shows outfit. Does it mean that I have no way to find out the reason for an item being misfit, i.e. either careless raters or wrongly defined items?


MikeLinacre: Thoughtful question, godislove.
To see the functioning of the item, in detail, across the range of person measures, look at the empirical ICC on the Graphs plot. You may also want to investigate item x patient-group DIF.
If we are talking about rating scales for assessing patients, then "lucky guess" or "careless mistake" don't seem to be relevant. In my analyses of patient assessments, nearly all misfit was due to the patients and little was due to the raters. It was difficult to construct items that were stable across patient diagnostic groups. Some items weren't stable (i.e., had big infit and outfit), but they were necessary for clinical reasons. Since any measurement is better than no measurement, we had to accept items that had mean-squares around 2.0 or even higher, provided that they worked well enough for most patients.

978. sample size & power calculations

mve May 16th, 2007, 4:09pm: I am trying to work out the sample size & power calculations needed to use Rasch analysis on 3 validated questionnaires that currently use a Likert scale. The questionnaires have 14, 26 and 10 items with 5 response options.
I have read that Rasch models are able to produce robust estimates for sample sizes of 100. Can anyone direct me to any reference where I can get further advice?
Many thanks

MikeLinacre: Certainly, Marta. A starting point is www.rasch.org/rmt/rmt74m.htm.

979. Help!  Can anyone clear this up??!!

helencourt May 11th, 2007, 11:40am: Hi everyone

I need some help! I am writing up my section on the Rasch model for my thesis. As I have been thinking about the Rasch model equation (for the dichotomous model) I have confused myself somewhat! Could anyone clear this up for me?:

An example: if person ability(Bn) is 2 and item difficulty(Di) is 1, then, according to the rasch model, the probability of a correct response is 0.73. however, if person ability is 4 and item difficulty is 2, the probability of a correct response is 0.88.

In BOTH cases the person ability was TWICE the item difficulty, so WHY is the probability of getting a correct answer DIFFERENT for each example?? I thought that the rasch model allowed measurement on a linear scale??!!

Can anyone help me and provide a simple answer?!



MikeLinacre: Helencourt, thank you for your question. Rasch measures are linear like degrees-Celsius on a thermometer, or height measured from the floor. The Rasch measures are in "log-odds units", which are a direct transformation of probabilities of success for dichotomous items. So we can verify your arithmetic:
If P is the probability of success, then the odds of success are (P/(1-P)) and the log-odds of success are log(P/(1-P)). The Rasch dichotomous model specifies that:
Ability - Difficulty = log(P/(1-P))
so that P = exponential(Ability-Difficulty) / ( 1 + exponential(Ability-Difficulty))
Your first example, P = exp(2-1)/(1 + exp(2-1)) = 2.7 / (1+2.7) = 0.73
Your second example, P = exp(4-2)/(1 + exp(4-2)) = 7.4 / (1+7.4) = 0.88
Does this help answer your question, or .....
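
The arithmetic above can be reproduced with a short Python sketch. The function name is illustrative; the formula is the dichotomous Rasch model as stated in this post.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


print(round(rasch_probability(2, 1), 2))  # 0.73
print(round(rasch_probability(4, 2), 2))  # 0.88
print(round(rasch_probability(4, 3), 2))  # 0.73: same difference, same probability
```

The probability depends only on the difference between ability and difficulty, not on their ratio.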

helencourt: Dear Mike
Thank you so much for your reply. I really think I am not getting a simple concept here...I am still a wee bit confused. I understand that the Rasch measures are in log odd units. But what I still don't understand, re the example, is why the probability of a correct response is different for each case EVEN THOUGH the person ability is twice the item difficulty in both cases. If a person has double the ability than is required by the item, then surely, if this is a ruler, then the probability of a correct response will be the same, whatever the initial values of person ability and item difficulty. Sorry to be asking more questions! Helen

MikeLinacre: Perhaps this is the answer: on linear, interval, additive, measurement scales, one unit difference between person and item means the same amount of performance no matter where on the scale it is. ability 2 - difficulty 1 = 1; ability 4 - difficulty 3 = 1, etc.
But in Georg Rasch's (1960) original multiplicative form of his model, performance was based on ability/difficulty. So
P = (ability/difficulty) / (1 + (ability/difficulty))
so, for your example P = (2/1) / (1+ (2/1)) = 2/3 = 0.67,
and the same for 4/2
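
A small Python sketch of the multiplicative form, under the assumption that the multiplicative ability and difficulty are the exponentials of the logit measures. Function names are illustrative.

```python
import math


def multiplicative_probability(ability, difficulty):
    """Multiplicative form: both parameters are positive ratio-scale values."""
    ratio = ability / difficulty
    return ratio / (1.0 + ratio)


def logit_probability(b, d):
    """Additive (logit) form of the same model."""
    return math.exp(b - d) / (1.0 + math.exp(b - d))


print(round(multiplicative_probability(2, 1), 2))  # 0.67
print(round(multiplicative_probability(4, 2), 2))  # 0.67

# The two forms agree when multiplicative values are exp(logit values).
b, d = 2.0, 1.0
same = math.isclose(
    logit_probability(b, d),
    multiplicative_probability(math.exp(b), math.exp(d)),
)
print(same)  # True
```

In the multiplicative form equal ratios give equal probabilities; in the logit form equal differences do.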

helencourt: Dear Mike

Thank you again for your help. I am clear that the scale is additive, so one unit difference between person ability and item difficulty is the same wherever you are on the scale. I can see how the person ability and item difficulty are on linear scales (via log transformation from odds of success) - but, going back to my original example, where the person ability was TWICE the item difficulty in both situations, the fact that the probability of a correct response is different for both, seems to suggest that there is something fundamentally flawed with the Rasch model? i.e. that the relationship between person ability and item difficulty is not linear. If this is correct, and the relationship is not linear, then how is this operating like a ruler? Hope that makes sense! Helen

MikeLinacre: Measurement is often far from obvious! Physical science has been struggling with it for thousands of years (since ancient Babylon), and we social scientists have only recently started taking it seriously.
Let's move your example to the heights of dogs. One dog (called "difficulty") is 1 foot high and another dog (called "ability") is 2 feet high. Then one dog is 1 foot higher than the other. Both dogs grow to double their height. Now one dog is 2 feet high, and the other is 4 feet high. The difference is 2 feet. Studies indicate that if a dog is 1 foot taller than another dog, it has a .73 probability of winning a dog-fight. But if a dog is 2 feet taller than another dog, it has a .88 probability of winning a dog-fight.
This situation parallels ability and difficulty in logits. OK?

helencourt: Dear Mike

Thank you so much for your patience - I think I understand at last! Thank you for your time, this has helped me immensely. Best regards, helen

980. PCAR

Raschmad April 27th, 2007, 8:48am: HI folks,
I administered the same test to two different samples and ran PCAR with Winsteps.
The variance explained by measures for one sample is 80% and for the other 70%. They are very close to their modeled values though.
1. Can one claim that the test is more unidimensional for the group with larger variance explained by the measures?
In another test the variance explained by measures is only 45% and close to its modeled value.
2. Does it mean that the test is narrow or what?
3. What does it mean when the cluster of contrasting items change for the two samples?

MikeLinacre: Raschmad, the "variance explained by the measures" depends on the person and item dispersion, not on the dimensionality. We can see that because the observed and modeled values are close together. This means that if the data exactly fit the Rasch model (i.e., were unidimensional), the variance explained would be almost the reported values.

Multidimensionality is indicated when the contrasts in the PCAR are big enough to indicate a secondary dimension. Usually this needs a contrast size of at least 2.0, and a noticeable ratio of first contrast variance to measure-explained variance.

When the cluster of contrasting items changes, it probably means that the contrast has a noticeable random component, so the first contrast is likely to be an accident of the data, not a secondary dimension.

Raschmad: Hi Mike,
"What does it mean when the cluster of contrasting items change for the two samples?"
How about if the change in the samples is systematic?
For instance, males vs females, low-ability vs high-ability, natives vs immigrants?
can the change in the clustering items be related to instability of the test construct for different samples?
Thanks a lot

MikeLinacre: Raschmad: Yes, we expect to see changes in the clusters for substantively different groups. For instance, on a language test, native speakers vs. second-language speakers. Native speakers will have relatively more ability on the spoken vs. the written items. So we expect to see a change in the spoken vs. written clustering of items on the PCAR plot.

981. equating tests with different # response options

maggi_mack May 5th, 2007, 7:11pm: Hi all - A colleague wants to equate two versions of the same short psychological screening measure with the only difference between scales being the number of response options (2 vs. 4). Both scales were given to the same group of people. The end goal is to develop an "algorithm" to convert scores between the two scale formats. This seems like a situation for common person equating to me. WINSTEPS had a nice description of this in its documentation. I'm wondering if I am missing anything conceptually? I was reading a couple of papers on the web that said you had to dichotomize the polytomous response version of the test as the first step. This doesn't make sense to me. Thanks!

MikeLinacre: Maggi_mack, yes, common-person equating would be the simplest way to go. This is probably a Fahrenheit-Celsius situation, because a change in rating scale almost always means a change in test discrimination.
Dichotomizing rating scales was the recommended technique 20+ years ago before there was polytomous software. But there is no need to dichotomize now, and many reasons not to ...

982. I need some reference about s.d. difficulty

omardelariva May 4th, 2007, 4:32pm: Mike Linacre:

Your reply was very helpful (at topic Same bank...), but you wrote that I need to make a kind of correction related to the S.D. of sample ability and the S.D. of test difficulty. Could you give me some references so that I can consult them?

Thank you

MikeLinacre: Omardelariva. No correction for S.D. is needed for your score table. My validation of your score table would be more accurate if I knew the S.D. of your person sample and the S.D. of your item sample. My validation would be yet more accurate if I knew the measures of each of your persons and the difficulties of each of your items. If I knew all these measures, I would expect my score table to exactly match your score table, because then we would be doing exactly the same computation. OK?

983. Same bank, different length test and even s. size

omardelariva May 3rd, 2007, 10:56pm: Good evening:

I made four different tests. I obtained the items from the same bank, but the length of each test was different (nine, eight, ten and fourteen items). They were then administered to different populations of dissimilar sizes (393, 366, 514, 514 persons). Now I have constructed a matrix of expected scores, but does it make sense? Could the differences between sample sizes and test lengths alter my conclusions?

Sample size:       393     366     514     514
Length   Test     2004    2005    2006    2007   year
     9   2004     2.77    3.15    4.75    3.68
     8   2005     2.16    2.51    4.03    3.01
    11   2006     2.58    3.05    5.23    3.75
    14   2007     4.74    5.46    8.36    6.46

MikeLinacre: Omardelariva, thank you for this little puzzle.
The sample sizes don't matter for this, but the sample means and S.D.s do.
Let's make the extreme assumption that the person samples have S.D.s = 0, and also the samples of items have S.D.=0, then we can compare the scores directly as binomial trials: logit = log (score / (length-score)).
The means of rows will give a logit test difficulty and the means of the columns will give a sample test difficulty. We can then use these to see where there might be unexpected values in the data array. Here are my results.
They indicate that your values fit well, even with my extreme 0.0 S.D. assumption. So your computation is probably correct.

Persons: your reported values
Length   Test     2004    2005    2006    2007   year
     9   2004     2.77    3.15    4.75    3.68
     8   2005     2.16    2.51    4.03    3.01
    11   2006     2.58    3.05    5.23    3.75
    14   2007     4.74    5.46    8.36    6.46

Persons: my logit values using reported values and 0.0 S.D.s
Length   Test     2004    2005    2006    2007   Logit mean
     9   2004    -0.81   -0.62    0.11   -0.37        -0.42
     8   2005    -0.99   -0.78    0.02   -0.51        -0.57
    11   2006    -1.18   -0.96   -0.10   -0.66        -0.72
    14   2007    -0.67   -0.45    0.39   -0.15        -0.22
Logit mean       -0.91   -0.70    0.11   -0.42        -0.48

Score expected values from logit means
Length   Test     2004    2005    2006    2007
     9   2004     2.69    3.11    4.87    3.70
     8   2005     2.15    2.51    4.04    3.01
    11   2006     2.63    3.08    5.13    3.74
    14   2007     4.80    5.49    8.28    6.45

Score differences between reported and expected
Length   Test     2004    2005    2006    2007
     9   2004     0.08    0.04   -0.12   -0.02
     8   2005     0.01    0.00   -0.01    0.00
    11   2006    -0.05   -0.03    0.10    0.01
    14   2007    -0.06   -0.03    0.08    0.01
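
The validation above can be reproduced approximately with the following Python sketch. It assumes, as in the post, zero S.D. for persons and items, and it assumes that the expected logit for each cell is row mean + column mean - grand mean; that additive step is an inference from the reported tables, not something stated explicitly in the post.

```python
import math

lengths = [9, 8, 11, 14]
reported = [
    [2.77, 3.15, 4.75, 3.68],
    [2.16, 2.51, 4.03, 3.01],
    [2.58, 3.05, 5.23, 3.75],
    [4.74, 5.46, 8.36, 6.46],
]

# Step 1: score -> logit, treating each score as a binomial outcome.
logits = [
    [math.log(s / (n - s)) for s in row]
    for n, row in zip(lengths, reported)
]

# Step 2: row (test) means, column (sample) means and the grand mean.
row_means = [sum(row) / len(row) for row in logits]
col_means = [sum(col) / len(col) for col in zip(*logits)]
grand_mean = sum(row_means) / len(row_means)

# Step 3: additive expectation in logits, converted back to expected scores.
expected = [
    [n / (1.0 + math.exp(-(r + c - grand_mean))) for c in col_means]
    for n, r in zip(lengths, row_means)
]

# Step 4: reported minus expected scores.
residuals = [
    [s - e for s, e in zip(srow, erow)]
    for srow, erow in zip(reported, expected)
]

for row in residuals:
    print(" ".join(f"{x:6.2f}" for x in row))
```

Small residuals indicate that the reported score matrix is close to what a simple additive logit structure predicts.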

984. Information about PAIR algorith...

Andreich April 28th, 2007, 2:59pm: Hello. I have a few questions; maybe you can help me. The situation is as follows...

As far as I know, RUMM2020 software uses the PAIR algorithm. Does anybody know where I can get full information about this procedure? I would be very grateful for any materials or URL links.

P.S. Sorry for my bad english.

MikeLinacre: PAIR is described in "Rating Scale Analysis" (Wright & Masters) www.rasch.org/rsa.htm
Also in "Rasch Models for Measurement" (Andrich, Sage)
and "Pairwise Parameter Estimation in Rasch Models", Zwinderman, Aeilko H.
Applied Psychological Measurement, vol. 19, no. 4, pp. 369-375, December 1995

Andreich: Thank you for the reply! One more question: do you perhaps know of any open source program code based on this algorithm?
I can calculate the initial Di, but I cannot understand how to elaborate it.

MikeLinacre: Don't know of any open source. Basically, the overall pairwise category probabilities of 0-1 vs. 1-0 are made to coincide with the overall pairwise category frequencies of observing 0-1 vs. 1-0. The Newton-Raphson iteration to improve the estimates is straight-forward. See the published sources.
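
As a rough illustration of the pairwise idea for dichotomous items, here is a simplified, non-iterative Python sketch that averages pairwise log-odds. It is not the PAIR implementation in RUMM2020 and it does not include the Newton-Raphson refinement mentioned above; the function name and the count matrix are hypothetical. It also assumes every pairwise count is greater than zero.

```python
import math


def pairwise_difficulties(counts):
    """Estimate centered item difficulties from pairwise counts.

    counts[i][j] = number of persons who succeeded on item i and failed item j.
    Under the Rasch model, log(counts[j][i] / counts[i][j]) estimates D_i - D_j,
    so averaging over all items gives D_i relative to the mean item difficulty.
    """
    n_items = len(counts)
    estimates = []
    for i in range(n_items):
        total = 0.0
        for j in range(n_items):
            if i != j:
                total += math.log(counts[j][i] / counts[i][j])
        estimates.append(total / n_items)
    return estimates


# Hypothetical counts generated from known difficulties, so the
# estimates can be compared with the values that produced them.
true_d = [-1.0, 0.0, 1.0]
counts = [
    [1000 * math.exp(-di) / (math.exp(-di) + math.exp(-dj)) for dj in true_d]
    for di in true_d
]
print([round(d, 2) for d in pairwise_difficulties(counts)])
```

With real data the counts are noisy and some may be zero, which is where the iterative estimation described in the published sources comes in.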

985. all raters exhibits central tendency, why?

joycezhou April 25th, 2007, 3:29pm: thanks for your attention
i assign each rater a separate rating scale and try to investigate extremism/central tendency. the results are rather difficult to interpret. all of the raters exhibit central tendency. Would anybody provide a reasonable explanation for this?
any comments on this question will be greatly appreciated.

MikeLinacre: Please tell us more about your rating scale and the rating situation. For instance, if your rating scale is "strongly disagree, disagree, neutral, agree, strongly agree", then perhaps your respondents don't have strong viewpoints about your prompts. Is this "central tendency" or "bland questions"?

joycezhou: the rating adopted Jacob's seven-point scale. ratings were given on five domains, which were anchored at 0 in the analysis.
is there any possibility that this situation is due to inappropriate sampling?
my sincere thanks for your kindness.

MikeLinacre: Joycezhou, is this your rating scale?
grade 1 - excellent
grade 2 - very good
grade 3 - good
grade 4 - satisfactory
grade 5 - unsatisfactory
grade 6 - poor
grade 7 - very poor
If so, what do you predict to see for your process?
If it is for reinspection, then we expect to see 6 and 7 very rarely. And, if the process is cost-dominated, we expect to see 1 very rarely and 2 rarely.
Consequently, we would only expect to see many ratings of 3 and 4, fewer of 5, and even fewer of 1,2,6,7
This could look like "central tendency", but it is not.

986. Cross-rating Study Design with FACETS

joe April 24th, 2007, 3:58pm: At my school, we have a standardized English writing assignment, with class teachers as the raters. We have about 15 teachers teaching a total of ~51 classes with a total of ~3000 students. Each student chooses 1 of 2 possible writing prompts to respond to. We have been using a single-rater grading system but we're now hoping to combine some level of cross-rating with FACETS to provide adjusted scores which are hopefully more fair (i.e., adjusted for differences in rater leniency/severity and for differences in task difficulty). We also hope to use the cross-ratings to identify how consistent individual raters are, so that we can develop a training program to improve rater consistency (reliability).

Next week, we're about to begin a pilot study of all this. Our plan now is to RANDOMLY choose 10% of the papers graded from each class and have them cross-rated by a different teacher. The problem is we haven't found any studies attempting exactly what we're attempting. Our questions are very basic and include:

1) Is random sampling the best way to sample here?
2) Is 10% likely to be enough? Is there a way to design the study (sampling different percentages of papers from different teachers) that would allow us to find the minimum number of papers we need to have cross-rated in order to make our adjustments?
3) Is there any other obvious thing that I haven't mentioned that we should know about?
4) Is anyone aware of any good references which could help us out?

I'll be taking the coming www.statistics.com course on FACETS, where I'll hopefully get a better idea of how to handle all the data we'll generate. Now, we're basically focused on simple design questions...

Any and all feedback would be greatly appreciated! Thanks!


MikeLinacre: Joe: Random sampling is great if you don't have any better information. But you probably do! You already have the papers graded once. So stratify the papers by that grade and random sample within strata. But you probably also have gender and ethnicity and other characteristics of the students. So also stratify by those.
You could probably construct a sampling plan so that every important combination of performance and demographics is examined.
Percent isn't important, but absolute number is. Aim for 10 essays in each important cell of your sampling plan, 30 would be better, but that is probably too many.
The big problem with your design is the "1 prompt out of 2". How are you going to compare performance on them? Suppose they are "Prompt 1. My day at the zoo" and "Prompt 2. Einstein's theory of relativity". Then we would expect the fun guys to choose Prompt 1 and the nerds to choose Prompt 2. Your situation is probably not so polarized as this example, but you need to check it.
We ran into the problem of unequivalent essay prompts on a widely-administered standardized test. The data had been analyzed as though the prompts were equivalent, resulting in findings so obviously incorrect that even the politicians noticed.
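
The stratified sampling plan described in this reply (random sampling within each cell, aiming for about 10 essays per cell) can be sketched in Python as follows. The record fields, stratum definitions and data are hypothetical; a real plan would use the school's own grade and demographic variables.

```python
import random
from collections import defaultdict


def stratified_sample(papers, strata_keys, per_cell=10, seed=0):
    """Randomly pick up to per_cell papers from every stratum (cell).

    papers: list of dicts; strata_keys: the dict keys that define a cell.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for paper in papers:
        cells[tuple(paper[k] for k in strata_keys)].append(paper)
    sample = []
    for cell in sorted(cells):
        members = cells[cell]
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample


# Hypothetical records: 200 papers, 4 first-rating grades x 2 prompts.
papers = [
    {"id": i, "grade": "ABCD"[i % 4], "prompt": 1 + (i // 4) % 2}
    for i in range(200)
]
chosen = stratified_sample(papers, ["grade", "prompt"], per_cell=10)
print(len(chosen))
```

Including the prompt as a stratum keeps both prompts represented in every grade band, which helps when checking whether the two prompts are equivalent.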

987. observed category measures

godislove April 23rd, 2007, 9:18pm: Hi Mike,
Can you explain what 'observed category average measures' mean? I pressed the category average and one item has 0-2-1-3 instead of 0-1-2-3. what has happened? Have the raters used the scale wrong? but in what way?

MikeLinacre: We identify who were the people observed in each category, and then average their measures. So, 0-2-1-3 means that the average measure of the people rated "2" is less than the average measure of the people rated "1". This is surprising, and contradicts our idea that the average measure of people rated "2" should be higher than the average of those rated "1".
So either some generally-high-performers were rated "1", or some generally-low-performers were rated "2", or both. Perhaps if you look at the unexpected responses in Tables 10.4, 10.5, 10.6 you will see something interesting.
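
A minimal Python sketch of how observed category average measures could be computed for one item, and how a 0-2-1-3 pattern shows up. The ratings and person measures are invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def category_average_measures(ratings, measures):
    """Average person measure (logits) of the people observed in each category."""
    by_category = defaultdict(list)
    for rating, measure in zip(ratings, measures):
        by_category[rating].append(measure)
    return {cat: sum(vals) / len(vals) for cat, vals in sorted(by_category.items())}


def is_ordered(averages):
    """True when the average measures rise with the category number."""
    values = [averages[cat] for cat in sorted(averages)]
    return all(a < b for a, b in zip(values, values[1:]))


# Hypothetical ratings on one item and the rated persons' measures in logits.
ratings = [0, 0, 1, 1, 1, 2, 2, 3, 3]
measures = [-2.0, -1.5, 0.5, 1.0, 1.5, -0.5, 0.0, 2.0, 2.5]

averages = category_average_measures(ratings, measures)
print(averages, is_ordered(averages))
```

Here the people rated "2" have a lower average measure than the people rated "1", so the categories sort as 0-2-1-3 by average measure.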

988. Variable map or item map

godislove April 23rd, 2007, 12:11pm: Hi Mike,
which map should I use to present the data? item map or variable map? All the items on the item map line up as a straight line but in the variable map, three items are located next to each other.
please advise.

MikeLinacre: It sounds as though your items have difficulty measures that are close together.
So, if you want to emphasize that the items have approximately the same difficulty then it is the map with them on the same line, but if you want to emphasize that there is an item hierarchy, then you want a map with the items on different lines.

989. non-center facet and the central tendency issue

joycezhou April 23rd, 2007, 6:00am: thank you for commenting on my questions:
i have three questions
the first one is concerned with the non-centered facet. i notice that traditionally we set the "examinee" facet as the non-centered facet. however, there are some studies adopting the "rater" facet instead. what is the difference between them?
the second one is, in the FACETS manual, p. 166, it says "model each rater to have a personal understanding of the rating scale. those with very low central probabilities exhibit extremism. those with high central probabilities exhibit central tendency" and "anchor all items at the same difficulty, usually 0. raters who best fit this situation are most likely to be exhibiting halo effect". would anyone please specify these criteria? in what ways can we judge that a rater exhibits low central probabilities or halo effect?
the third one is: i am very interested in using FACETS to investigate the 3 common rater effects: severity, central tendency, and halo effect. is there any previous research that i can refer to?
thanks again for reading and answering my questions!

MikeLinacre: Thank you, Joycezhou.
1. The choice of non-centered facet makes no difference to fit statistics. It merely adds or subtracts from all the measures. It is the decision where to put the zero-point on your measuring scale. Choose the zero-point to be the one that is most informative for what you need to communicate. For instance, when measuring the height of mountains it is better for communication to measure them from sea-level than from the center of the Earth.
2. May I suggest you set up a facets specification file with anchored measures. Then construct response strings corresponding to the behavior you want to investigate. You will see what happens.
3. Please see the reference list in Facets Help or "google" for what you need. www.winsteps.com/facetman/references.htm

joycezhou: many thanks for your advice.

990. reasons for being a misfit item

godislove April 22nd, 2007, 6:57pm: Hi Mike, I am doing different checks on the possible reasons for being a misfit item.
first, I check the point-biserial correlation. it is positive.
Second, I check the category probability curve, it is in order.
Third, I took out the misfit persons, the item is still misfit after removal of misfit persons.

Can you suggest something more I can do in Winsteps?
I read a paper that used Bigsteps to plot the number of category thresholds at each level of ability against ability. Does Winsteps have this function? Can this plot tell me if the item is misfit?

MikeLinacre: Godislove, you could look at the unexpected responses in Table 10.5 and Table 10.6. Or look at all the responses in Table 11.
If you have a small data set, then the Scalogram can be informative: Table 22.

Winsteps has all the features of Bigsteps and many more. Can you provide a reference to the paper so I can identify what the authors did?

godislove: Thank you for your reply. I really appreciate it.
the paper I mentioned is by M Itzkovich et al. - Rasch analysis of SCIM II.

MikeLinacre: Godislove, thank you for posting the paper, which appears to be at http://home.uchicago.edu/~tripolsk/SCIM%20Rasch%20August%202001.pdf.

The statistics in their Table 1 are the standard ones in Winsteps and Facets Tables, apart from "emergent categories" and "disordered steps". It appears that "emergent categories" is the count of modal categories when the items are modeled using a "partial credit" model. "disordered steps" is the count of non-modal categories. Figure 1 are the model category probability curves. Figure 2 shows measures distributions. Figure 3 reports the Rasch-Andrich thresholds in an unusual way. Instead of these plots, I recommend reporting Rasch-Thurstone thresholds on a standard item map. The difference is that Rasch-Andrich thresholds refer to adjacent categories, but Rasch-Thurstone thresholds refer to performance along the whole rating scale - which is what readers would think that Figure 3 is reporting.
The most challenging aspect of Rasch measurement is communicating our findings!

991. dimension map

godislove April 16th, 2007, 6:38am: Hi Mike,
I have run my data in Winsteps to evaluate an instrument that measures prosthetic control: 97 subjects on a 30-item instrument with a 4-point scale. Every subject performed a bimanual task chosen by the subject. Hence, if the subjects chose something easier, they might get a higher score.

The PCA is as follows:
                                              -- Empirical --    Modeled
Total variance in observations     =  2942.4  100.0%              100.0%
Variance explained by measures     =  2914.4   99.0%               98.8%
Unexplained variance (total)       =    28.0    1.0%  100.0%        1.2%
Unexplned variance in 1st contrast =     3.1     .1%   11.1%
Unexplned variance in 2nd contrast =     2.8     .1%   10.1%
Unexplned variance in 3rd contrast =     2.7     .1%    9.6%
Unexplned variance in 4th contrast =     2.4     .1%    8.6%
Unexplned variance in 5th contrast =     2.3     .1%    8.1%

Which figure should I look at when it comes to unidimensionality? Should I look at the variance explained by measures or the unexplained variance in the 1st contrast?

The expected variance is 98.8% and the empirical is 99%. Does this mean the dimension is even better than the model expected? I think something is wrong with it.

Another concern I have is about the influence of other variables. With classical statistics, the researcher has to control the other factors or variables so that the latent variable can be measured. Do you think that if we let the subject pick a task he likes, then there is no way to measure the latent variable, i.e., prosthetic control?

Please advise. Thanks.

MikeLinacre: Thank you for your questions, Godislove.

Rasch unidimensionality can be viewed from several perspectives. The first is "variance explained by the measures". This is 99%. Only 1% of the variance is the unexplained variance used by the Rasch model to construct the measurement framework. I've never seen such a high percent explained by the measures. My suspicion is that there is something artificial about the data. Is there considerable dependence among the items on the instrument, such that they are producing data with an almost Guttman pattern?

Yes, the observed Rasch variance explained 99% is higher than that predicted by the model 98.8%, another indication that there is overfit in the data. Guttman-pattern data produces overfit.

You asked "if we let the subject pick a task he likes". A crucial question is "What do you want to measure?" When the subjects choose their tasks, the assumption is that they will choose tasks that they perform well, and so you will get a measure of optimal subject performance. If you assign tasks, you will get a measure of average subject performance (or worse).

In any case, you will need to model for task difficulty, so restrict subject choice to a few well-defined tasks, which you can then control for. And, if possible, have some or all subjects do two tasks.
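
The percentages in the PCA table quoted in this thread follow directly from the variance figures. A minimal Python sketch of that arithmetic, using the numbers from the post (the variable and function names are illustrative, not Winsteps output fields):

```python
# Variance figures (eigenvalue units) copied from the PCA table in the post.
total = 2942.4
explained = 2914.4
unexplained = 28.0
first_contrast = 3.1


def pct(part, whole):
    """Percentage of `whole` accounted for by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


explained_pct = pct(explained, total)                     # share explained by measures
unexplained_pct = pct(unexplained, total)                 # share left unexplained
contrast_of_unexplained = pct(first_contrast, unexplained)  # 1st contrast within unexplained
contrast_of_total = pct(first_contrast, total)            # 1st contrast within total

print(explained_pct, unexplained_pct, contrast_of_unexplained, contrast_of_total)
```

The first-contrast figure is reported against both the total variance and the unexplained variance, which is why the table shows two percentages on the contrast rows.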

godislove: Thank you for your precious advice. What you said was actually what I have been wondering.
I have read many articles, and only the one you suggested (during the Rasch course) used the variance explained by the measures to check dimensionality. That is also what I learned during the course. But then why do so many papers use only the fit of the items (0.5-1.5 MnSq), concluding that the scale is unidimensional if all the items are within this range? Are they wrong? There are 3 misfitting items in my 99% unidimensional scale. How can this happen? If misfit exists, how can I have 99%?
Should I do the same as the others, i.e., just take out the misfitting items, rerun the analysis, and conclude that all the items now fit, so unidimensionality is demonstrated?
Please advise.

MikeLinacre: There are some important conceptual issues here, Godislove.
First, Rasch requires randomness in the data in order to construct a linear measurement framework. In your data there is very little randomness. This is usually a sign that the overall situation is highly constrained. (But this is bringing to the analysis experience from analyzing other datasets.) The statistical procedure itself works with the data you have and produces the results you see. If the inferred constraints operate to restrict off-dimensional behavior (as apparently in your case), the data will become even more strongly unidimensional.
Second, Infit and Outfit statistics are parameter-level statistics. They can detect whether individual items and persons have unexpected response patterns (relative to the other response patterns in the data), but they are relatively insensitive to response-pattern discrepancies across many items or persons. It is these wider discrepancies that are usually thought of as multi-dimensionality. These are what the PCAR detects.
Third, Infit and Outfit are relative fit-statistics. Their average mean-squares are usually close to 1.0. So a range like "0.5 to 1.5" means "from a Rasch perspective, the response-level data string fits this parameter estimate about as well as the other response-level data strings fit their parameter estimates."
Fourth, a symptom of over-constraint on the data is that the logit distances between parameter estimates become large. Usually, as the data approach perfect predictability (i.e., a Guttman pattern), the logit values of the parameters increase from their usual range of less than 10 logits to ranges of 40 or more logits.
Does any of this help you?

godislove: Hi Mike,
I am still confused about Mnsq and PCA.
Can I put it like this? The mean-square is used to indicate whether individual items have unexpected response patterns (relative to the other response patterns in the data). For example, items with MnSq >1.5 and a corresponding ZSTD >2.0 are considered 'misfitting'. This suggests a deviation from unidimensionality. MnSq, however, is relatively insensitive to response-pattern discrepancies across many items or persons.
But I found the following statement on the Winsteps website:
Poor fit does not mean that the Rasch measures (parameter estimates) aren't linear. Misfit means that the estimates, though effectively linear, provide a distorted picture of the data.

So, MnSq >1.5 suggests a deviation from unidimensionality, but the items are still linear? How does that work?

If the PCA is fine but some items misfit, can I still say that the instrument is unidimensional?

MikeLinacre: Godislove, the Rasch measures are forced to be linear! The fit analysis is a report of how well the data accord with those linear measures. So a MnSq >1.5 suggests a deviation from unidimensionality in the data, not in the measures. So the unidimensional, linear measures present a distorted picture of the data.

Large MnSq indicates a large local deviation from unidimensionality, for instance, random guessing. But this is not what is usually meant by "multidimensionality". A dimension is something shared by several items. MnSq are poor indicators of this. PCA is better.
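
To make the "local" nature of mean-squares concrete, here is a minimal Python sketch of the usual infit and outfit mean-square calculations for dichotomous data. It assumes the model probabilities are already known; the function name, the probabilities, and the two response strings are hypothetical illustrations, not taken from the thread.

```python
def fit_mean_squares(responses, probabilities):
    """Return (infit, outfit) mean-squares for dichotomous responses.

    responses: 0/1 observations for one item (or one person).
    probabilities: model probabilities of success for those observations.
    """
    squared_residuals = []
    variances = []
    for x, p in zip(responses, probabilities):
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))  # binomial variance of each response
    # Outfit: unweighted mean of the squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(variances)
    # Infit: information-weighted, so far-off-target responses count for less.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical success probabilities for five responses, easiest to hardest.
probabilities = [0.9, 0.7, 0.5, 0.3, 0.1]
expected_pattern = [1, 1, 1, 0, 0]    # successes where success is likely
surprising_pattern = [0, 1, 1, 0, 1]  # failure on the easiest, success on the hardest

infit_ok, outfit_ok = fit_mean_squares(expected_pattern, probabilities)
infit_bad, outfit_bad = fit_mean_squares(surprising_pattern, probabilities)
print(infit_ok, outfit_ok, infit_bad, outfit_bad)
```

Each statistic summarizes one response string at a time, which is why it flags local surprises such as lucky guesses but says little about a pattern shared across several items.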

992. Difference between Facets and Winsteps?

rblack7454 February 23rd, 2007, 3:56pm: Hello,

I am a beginner with Rasch modeling and plan on attending a workshop in the near future. The workshop is going to use Winsteps software. I was wondering what the difference is between Winsteps and Facets. Also, is it possible to run Rasch models in SPSS?



MikeLinacre: Ryan, thank you for your questions.
Winsteps analyzes persons and items.
Facets analyzes persons, items and raters and other more complex data sets.

SPSS: it is possible to do Rasch analyses with SPSS but not easy. A place to start would be with log-linear modeling. TenVergert E, Gillespie M, & Kingma J (1993) Testing the assumptions and interpreting the results of the Rasch model using log-linear procedures in SPSS. Behavior Research Methods, Instruments & Computers 25(3) 350-359.

rblack7454: Thank you so much. That was very helpful!

DAL: What other differences are there between Winsteps and Facets? I know that Winsteps can do some graphs, e.g., the developmental pathway, that Facets cannot, but does it have other functions that Facets hasn't got?

MikeLinacre: Thank you for your question, Dal.
Winsteps is optimized for a rectangular data set (i.e., a two-facet design). It has many analytical and reporting capabilities that match that design, including over 30 output table families (most have subtables) and numerous output files. You can see a summary at www.winsteps.com/a/WinstepsFeaturesA4.pdf

Facets can analyze two-facet designs (so producing essentially the same measures, etc., as Winsteps) but also many other designs. Its output is much less sophisticated. You can see a summary at www.winsteps.com/a/FacetsFeaturesA4.pdf

My recommendation is to use Winsteps whenever possible, and only resort to using Facets when Winsteps can't solve the problem. You can see an example of this with the Olympic Pairs Figure-Skating data. It is naturally a 4-facet analysis (pairs x judges x programs x aspects), but a two-facet analysis (judges vs. everything else) neatly focuses the findings on the point of contention, the judge behavior. www.winsteps.com/winman/example15.htm

DAL: Thanks for the answer! I must be in a minority in that I've become acquainted with Rasch through Facets rather than Winsteps.

Interesting example, I'm wondering now if I should have a go at converting the multi-faceted data to two facets in the same way. Would certainly help with the analysis.

993. DIF for a short form

Matt April 19th, 2007, 1:10pm: Good Day Professor!

Are we allowed to eliminate items from a scale based on DIF results (if our goal is to present an invariant, short version of the scale)?
Do you know where to look for references on this matter?

Thank you


Matt: Sir!

I finally found one article where they did eliminate 6 items based on DIF.
(Mackintosh, Earleywine, & Dunn, (2006) Addictive Behavior, 31 1536-1546)

Any others to suggest?

Thank you

994. Pdelete=

Raschmad April 19th, 2007, 9:26am: Mike,
If I want to delete a group of persons from the analysis, such as males, females, immigrants, etc., what should I do? These groups all have codes in the person data lines.
For the 'pdelete=' command you have to specify the entry numbers of the persons. However, in large data files this is cumbersome if these groups are scattered throughout the data file.

MikeLinacre: Raschmad, sounds like you need to use PSELECT=
If F indicates Females in column 3 of the person label, then
PSELECT=??F
selects them.
If your codes are F, M and U for unknown, and you want to select M and F:
PSELECT=??{MF}

995. Zstd and t

godislove April 17th, 2007, 8:47am: Mike, thank you for your reply to my questions. I really appreciate it.
In Winsteps, the bubble chart uses t values, but the misfit table reports ZStd. They are the same thing, right? (You clarified this for me during the course.) Can I write in my paper that they are the same and report them as they come from Winsteps?

Is Zstd a test of the significance of MnSq?

MikeLinacre: The answers are all "yes", Godislove.
MnSq is a chi-square statistic divided by its degrees of freedom. We could express its significance directly as a probability value, but these are awkward to look at, plot, etc., (particularly because we are interested in probabilities along the whole range from zero to 1) so we use the equivalent unit-normal deviates instead. These are in the ZStd column. Unit-normal deviates are Student's t-statistics with infinite d.f.
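
As an illustration of turning a mean-square into a unit-normal deviate, the sketch below uses the Wilson-Hilferty cube-root approximation, a standard way to normalize a chi-square divided by its degrees of freedom. It assumes the degrees of freedom are known; the function name is hypothetical, and the sketch is not claimed to reproduce every detail of the Winsteps ZStd computation.

```python
import math


def mean_square_to_z(mean_square, df):
    """Approximate unit-normal deviate for a chi-square/d.f. mean-square.

    Uses the Wilson-Hilferty cube-root transformation.
    """
    q = math.sqrt(2.0 / df)  # model S.D. of a chi-square divided by its d.f.
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0


# The same mean-square is more significant when it is based on more observations.
for df in (10, 50, 200):
    print(df, round(mean_square_to_z(1.5, df), 2))
```

This shows why a mean-square and its standardized value are reported together: the mean-square gives the size of the misfit, and the standardized value gives its improbability for that amount of data.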

996. Difference between scores

omardelariva April 13th, 2007, 6:07pm: Hello, forum members:

I am trying to explain the difference between scores on two tests drawn from the same item bank, i.e., what proportion of the score difference is produced by student ability and what proportion by test difficulty. I designed two 18-item tests; the first was administered to group 1 and the second to group 2. Test I was the reference, so its item difficulty mean was 0.00; Test II had an item difficulty mean of -0.29. On the other hand, groups 1 and 2 had mean abilities of -0.47 and -0.44, respectively. My first conclusions were that Test II had a lower item difficulty mean and group 2 had a higher ability mean. But I would like to explain this difference in terms of scores.

I estimated the expected score for each group on the two tests; the results are summarized in the following table.

          Test I         Test II
Group 1   7.10           8.30 (n.a.)
Group 2   7.12 (n.a.)    8.50

n.a. = not administered

The table raises this question: if I try to explain that difference of 1.4 correct responses, what proportion of it is produced by student ability and what proportion by test difficulty?

I hope somebody can help me.

Thank you.

MikeLinacre: Omardelariva, good question! Score-to-measure arithmetic is non-linear, but your scores are near the centers of the tests, so the arithmetic is usually almost linear.
On Test I, the performance of the two groups is estimated to be almost the same (7.10, 7.12) [as we would expect with almost identical mean abilities, -0.47, -0.44], but on Test II it is noticeably different (8.30, 8.50). Similarly, with a test difficulty difference of -0.29, we would expect the score difference for a group between Test I and Test II to be around 1.2 (as it is for Group 1), not 1.4 (Group 2).
So, assuming the equating and the expected score calculations are correct, there must be big differences in the variance of the two sets of item difficulties and/or the two samples. If you can give us the score or measure S.D. for each set of items and for each sample, then we can answer your question ....

omardelariva: Mike:

I appreciate your quick response. I marked the observed scores in bold because many people without a background in IRT usually compare results only by looking at mean scores. Next, I submit the WINSTEPS output.

Group 1, Test I

| MEAN 7.3 18.0 -.47 .56 |
| S.D. 2.7 .0 .84 .07 |
| MAX. 18.0 18.0 4.44 1.87 |
| MIN. .0 18.0 -4.63 .52 |
| S.E. OF Alumnos MEAN = .00 |

| MEAN 36136.9 88963.0 .00 .01 1.00 -2.7 1.00 -2.4 |
| S.D. 17456.9 .0 1.00 .00 .06 7.7 .10 8.0 |
| MAX. 74905.0 88963.0 1.21 .01 1.12 9.9 1.22 9.9 |
| MIN. 16448.0 88963.0 -2.35 .01 .92 -9.9 .87 -9.9 |
| REAL RMSE .01 ADJ.SD 1.00 SEPARATION123.45 Reacti RELIABILITY 1.00 |
| S.E. OF Reactivo MEAN = .24 |

Group 2, Test II

| MEAN 8.5 18.0 -.44 .55 .99 .0 1.01 .0 |
| S.D. 3.0 .0 .87 .05 .19 .9 .32 .9 |
| MAX. 17.0 18.0 2.87 1.05 1.91 3.7 5.95 3.7 |
| MIN. 1.0 18.0 -3.45 .52 .51 -3.0 .34 -2.8 |
| S.E. OF Alumnos MEAN = .00 |

| MEAN 8.5 18.0 -.45 .55 |
| S.D. 3.0 .0 .88 .07 |
| MAX. 18.0 18.0 4.14 1.85 |
| MIN. .0 18.0 -4.72 .52 |
| S.E. OF Alumnos MEAN = .00 |

| MEAN 24503.3 51955.0 -.29 .01 .99 -2.2 1.01 -1.5 |
| S.D. 9414.7 .0 .89 .00 .08 8.1 .12 8.4 |
| MAX. 37572.0 51955.0 1.21 .01 1.23 9.9 1.35 9.9 |
| MIN. 9233.0 51955.0 -1.55 .01 .90 -9.9 .88 -9.9 |
| S.E. OF Reactivo MEAN = .22 |

Thank you for your concern.

MikeLinacre: Thank you for this interesting problem.
You have provided all the necessary statistics so that we can simulate the person and item distributions for Group 1, Test I and Group 2, Test II. Then, since Test I is equated to Test II with equating constant -0.29 (and, we assume, equating slope = 1.0), we can apply the sample distribution of Group 1 to Test II and Group 2 to Test I.
These computations involve obtaining expected scores based on the products of the two assumed normal distributions of persons and items for each Group-Test combination. This is awkward to do analytically, but can be approximated with an Excel spreadsheet. I'll work on this later today ....
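
The computation described above can also be sketched in Python rather than a spreadsheet. This is a rough sketch only: it assumes normal person and item distributions, represents each by 50 equally probable quantile points (an arbitrary choice), and takes the means and S.D.s from the summary statistics posted in the thread. The function names are hypothetical, and the results are approximations, not a reproduction of the posted expected scores.

```python
import math
from statistics import NormalDist


def normal_quantiles(mean, sd, n=50):
    """n equally probable points representing a normal distribution."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def expected_test_score(person_mean, person_sd, item_mean, item_sd, n_items=18):
    """Average expected raw score of a normal sample on a test of normal items."""
    persons = normal_quantiles(person_mean, person_sd)
    items = normal_quantiles(item_mean, item_sd)
    total = 0.0
    for b in persons:
        for d in items:
            total += 1.0 / (1.0 + math.exp(d - b))  # Rasch success probability
    return n_items * total / (len(persons) * len(items))


# (mean, S.D.) in logits, read from the posted summary tables.
groups = {"Group 1": (-0.47, 0.84), "Group 2": (-0.44, 0.87)}
tests = {"Test I": (0.00, 1.00), "Test II": (-0.29, 0.89)}

for group, (b_mean, b_sd) in groups.items():
    for test, (d_mean, d_sd) in tests.items():
        score = expected_test_score(b_mean, b_sd, d_mean, d_sd)
        print(group, test, round(score, 2))
```

Comparing the four Group-Test cells shows how much of a score difference comes from changing the test and how much from changing the group.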

997. Taking the average

Rasch April 11th, 2007, 5:04pm: I just wrote an exam comprising 6 stations; 4 of the stations are worth 20% each and 2 are worth 10% each. When I received my results, the weighted average of the stations was not the average I was given. I was told "We utilize Multi-facet Rasch modeling to analyze scores. This statistical procedure generates logit scores which are rescaled to a mean of 500 and a standard deviation of 100 over all examinations. Thus it is not possible to sum or perform standard mathematical procedures to recreate the overall score."

Here are my scores for the stations and the weight of the stations:

A. 590  20%
B. 160  20%
C. 311  20%
D. 354  20%
E. 372  10%
F. 302  10%

I don't understand whether Rasch modeling is used only to obtain the scores for each station or also for the overall average mark. Can someone advise me as to what the average of the scores above should be? Thank you.

MikeLinacre: Rasch(!): Thank you for your question. Let's assume that your reported station scores are all on the same "ruler", i.e., that the station scores have been equated, so that 300 at station A represents the same performance level as 300 at station B. Then we would expect your overall average mark to approximate the weighted average of the numbers you give, even for scores based on Rasch measurement. But this may not be the case if the measurement precision at the different stations is noticeably different. Then we would need to figure the precision into the averaging.
But if "590 at Station A" means "590 out of 600 at the easiest station", and "160 at Station B" means "160 out of 600 at the hardest station", then a performance level of 590 at Station A may be equivalent to a performance level of 160 at Station B. To obtain your overall average performance, we would need to know much more about the station scores.

Rasch: Thank you for your prompt reply. I don't know if the stations are on the same ruler, but I was informed that the minimum performance level for each station is 350 and the overall minimum performance level is also 350. However, the overall average that I was provided was 348.

I hope this extra information helps. Thank you again.

Rasch: I have just found out that the station scores are equated

MikeLinacre: Then your overall score is approximately the weighted average.
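
For reference, the weighted average of the posted station scores can be worked out directly. A minimal Python sketch, using the scores and weights from the post (the dictionary layout is just an illustration):

```python
# Station scores and weights as posted: (score, weight).
stations = {
    "A": (590, 0.20),
    "B": (160, 0.20),
    "C": (311, 0.20),
    "D": (354, 0.20),
    "E": (372, 0.10),
    "F": (302, 0.10),
}

total_weight = sum(weight for _, weight in stations.values())
weighted_average = sum(score * weight for score, weight in stations.values())

print(round(total_weight, 2), round(weighted_average, 1))
```

The weighted average comes to about 350.4, close to the reported overall score of 348 but not identical, which is consistent with the overall score being only approximately the weighted average.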

998. Ratio Vs Interval Scales

Raschmad April 6th, 2007, 8:21am: Dear Mike,
I have a question about interval scales.
Consider 4 students with raw scores of 98, 91, 66 and 28 on a 100-item test.
The odds of success for these students would approximately be: 50, 10, 2 and 0.4 respectively.
This means that each student is 5 times as good as the next one. This is the ratio multiplicative scale.
Since the ratios of these odds of success are constant, namely 5:1 they reflect equal steps on the logarithmic scale. If the log of 50 is 4 (to whatever base) then log of 10 is 3, the log of 2 is 2 and the log of 0.4 is 1. This means that on the interval additive scale they are equally distanced.
I don’t understand this. How can 4 persons each be five times as good as the next and yet be placed at equal intervals?
Suppose there are 4 rods of lengths 50, 10, 2 and 0.4 meters. Each rod is 5 times as long as the next. However, they are not equally distanced.
What am I missing here?
Please clarify this.

MikeLinacre: Raschmad, if you want to express ratios on an equal-interval scale, then you take their logarithms. So, if you want the ratios of your rod lengths to be on an equal-interval scale, then the scale is log(50), log(10), log(2), log(0.4). This was an insight of Gustav Fechner in 1860. He perceived that in some situations "counts" are the additive unit of analysis, but in other situations "log(counts)" are the additive unit of analysis. Here are some similar situations: for some analyses "seconds of time" make most sense, in other applications "log(seconds)". In some applications "dollars of money" make most sense, in other applications "log(dollars)".
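
The arithmetic behind the question can be checked with a short Python sketch. It uses the four raw scores from the post and natural logarithms; nothing else is assumed.

```python
import math

scores = [98, 91, 66, 28]  # raw scores on the 100-item test, from the post
n_items = 100

odds = [s / (n_items - s) for s in scores]  # right / wrong
logits = [math.log(o) for o in odds]        # natural log-odds

# Ratios between neighbouring odds (multiplicative scale) ...
ratios = [odds[i] / odds[i + 1] for i in range(len(odds) - 1)]
# ... become differences between neighbouring log-odds (additive scale).
steps = [logits[i] - logits[i + 1] for i in range(len(logits) - 1)]

print([round(r, 2) for r in ratios])
print([round(s, 2) for s in steps])
print(round(math.log(5), 2))
```

Each ratio is close to 5 and each step is close to log(5), so "five times as good" on the odds scale and "equally spaced" on the log-odds scale are the same fact expressed in two ways.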

Clarissa: This sounds rather like an oversimplification of the Rasch model.
To me it means that if you want person measures on an interval scale, you compute the odds of success (number of items right / number of items wrong), take the natural log, and there you have it.
Then why do we have complicated Rasch software and a sophisticated issue, namely ESTIMATION?

MikeLinacre: Clarissa: log(right/wrong) works perfectly when all items have the same difficulty. Estimation becomes more complicated when the items do not have equal difficulty, and yet more complicated when everyone does not take every item, and even more so when the items have rating scales (polytomies) and are not merely right-wrong (dichotomies). And so on ....

999. Average Measures

DevMBrown April 5th, 2007, 7:53pm: Hi,

I'm trying to come up with an interpretation of the average measure of a group of people.

Here were my initial thoughts: given a predetermined, calibrated scale of dichotomous items, and a group of people with a known average person measure, say B, then we can treat this group as if it were a single person and calculate the expected score for the group on an item of difficulty D by 0*P(x=0) + 1*P(x=1) = exp(B-D)/(1+exp(B-D)), the same expected score as for an individual of ability B. However, after playing around with some data, I found that this only holds if B=D or the group consists only of people with measure B (i.e., 0 standard deviation of the measures). Similarly, two groups with an equal average measure B can have drastically different expected scores on the same calibrated item. So I would say my initial thoughts were wrong.

Does anyone have thoughts on the interpretation of an average measure?

MikeLinacre: Thanks for your question, DevMBrown.

If your data fit the Rasch model for both groups, then each group's total observed score on each item should match (statistically) each group's total expected score on each item. To verify this, you could do a DIF analysis of group vs. item.

The challenge when predicting a group score is that scores are non-linear with measures, because scores have floor and ceiling effects which measures don't have. So "N * score by student of average ability" does not equal "sum of N scores by students of individual abilities".
If your groups are approximately normally distributed, then you can use a PROX-type formula to predict the score of the group:
T = (Mean ability - Item difficulty) / sqrt ( 1 + (SD of group)^2 / 2.89)
Score for group = N * exp(T) / (1+exp(T))
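
A minimal Python sketch of the PROX-type formula above, compared with the sum of individual expected scores for a group spread as a normal distribution. The function names and the example values (100 persons, mean ability 1.0, item difficulty 0.0) are hypothetical choices for illustration.

```python
import math
from statistics import NormalDist


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def prox_group_score(n, mean_ability, sd, item_difficulty):
    """Group score on one item from the PROX-type formula in the post."""
    t = (mean_ability - item_difficulty) / math.sqrt(1.0 + sd ** 2 / 2.89)
    return n * logistic(t)


def summed_individual_scores(n, mean_ability, sd, item_difficulty):
    """Sum of expected scores for n persons spread as a normal distribution."""
    if sd == 0:
        abilities = [mean_ability] * n
    else:
        dist = NormalDist(mean_ability, sd)
        abilities = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    return sum(logistic(b - item_difficulty) for b in abilities)


for sd in (0.0, 1.0, 2.0):
    print(sd,
          round(prox_group_score(100, 1.0, sd, 0.0), 1),
          round(summed_individual_scores(100, 1.0, sd, 0.0), 1))
```

With the mean held fixed, a larger S.D. pulls the group score toward half of N, which is why two groups with the same average measure can have different expected scores.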

DevMBrown: Thanks again, Mike!

Based on the approximation formula that you gave and the fact that two groups with the same average measure should have the same raw score on a given item, we can assume then that two groups that fit the Rasch model and have the same average measure should have the same standard deviation. That is, the standard deviation of a group of measures should be a function of the average measure of that group. Any thoughts on how to write the function SD_of_group(Mean_ability) explicitly?

MikeLinacre: Quote: "groups that fit the Rasch model and have the same average measure should have the same standard deviation."
Yes, if two groups have the same average Rasch ability and the same average raw score (on complete data), then they will have (statistically) the same standard deviation. Rearrange the equation above, to obtain the expected SD of the group, given the mean ability, the item difficulty, and the observed raw score.
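
The rearrangement can be sketched in Python. Solving the PROX-type formula for the S.D. gives SD = sqrt(2.89 * (((Mean ability - Item difficulty) / T)^2 - 1)), where T = log(score / (N - score)). The function name and the example values are hypothetical.

```python
import math


def implied_group_sd(n, raw_score, mean_ability, item_difficulty):
    """Group S.D. implied by the PROX-type formula, given the group's raw score."""
    p = raw_score / n
    t = math.log(p / (1.0 - p))  # T recovered from the observed group score
    gap = mean_ability - item_difficulty
    if t == 0.0 or gap / t < 1.0 - 1e-9:
        # The formula cannot produce this score for any S.D. >= 0.
        raise ValueError("score is not consistent with the formula for this mean")
    return math.sqrt(2.89 * max((gap / t) ** 2 - 1.0, 0.0))


# Example: 100 persons, mean ability 1.0, item difficulty 0.0, group score 69.
print(round(implied_group_sd(100, 69.0, 1.0, 0.0), 2))
```

The rearrangement only has a solution when the observed score lies between N/2 and the score a zero-S.D. group would obtain, so the sketch raises an error otherwise.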

DevMBrown: Hi again, hope you had a good weekend.

My last post was the result of a huge misinterpretation on my part of your first reply. I incorrectly thought you were saying that two groups of equal size that fit the Rasch model and have the same average measure should have the same expected score (without any knowledge of SD). From there I concluded that two groups with the same average measure should have the same SD (from the formula).

To reframe my original concern: given two groups of equal size that fit the Rasch model, have approximately normal measure distributions, and have equal average (person) measures, is it possible for these groups to have different expected raw scores (say, because they have different SDs), or does fitting the Rasch model imply a certain SD, so that the average measure implies a unique expected raw score?

Thanks again, Mike.

MikeLinacre: Quote: "does fitting the Rasch model imply a certain SD"
In principle, the Rasch model makes no distributional assumptions about the person ability distribution or the item difficulty distribution. This is one of the strengths of the model.
Some Rasch estimation methods assume normality of one or both distributions.
Winsteps makes no such assumptions.

1000. What represents a person's ability in the Rasch model?

sOUMEN April 6th, 2007, 12:39pm: Need urgent help in understanding the Rasch output!!!

Let me clarify the problem: we are trying to measure respondents' consumption power based on the durables they hold. For 300 respondents and 12 durables, we have data in the following format.
       Dur1  Dur2  Dur3  Dur4  Dur5  Dur6
Resp1  Yes   Yes   Yes   No    Yes   No
Resp2  No    Yes   No    No    Yes   No
Resp3  Yes   No    Yes   No    Yes   Yes
Resp4  Yes   Yes   Yes   No    Yes   No
Resp5  No    Yes   No    No    Yes   No
Resp6  Yes   Yes   Yes   No    Yes   No


My questions are:
1) I used software called Winsteps. What in the output denotes the purchasing power of respondents? Is it "Score" or "KNOX CU measurement" or something else?
2) We also used SAS, with code available on the internet. What output should I check? Is it the "empirical Bayesian estimate" or something else?

In both the cases the above mentioned output are not giving any different result to the respondent score. I mean that respondents' scores are assigned based on the total number of durables. The score does not discriminate between different holding patterns as long as the number of durables is the same.

Thank you in advance for a quick response!

MikeLinacre: SOUMEN, thank you for your questions. In your data, the respondents are the rows, so their Rasch measures are the row measures.
You write: "In both the cases the above mentioned output are not giving any different result to the respondent score."
In the Rasch model, the marginal (total) raw scores are the sufficient statistics, so for each possible raw score by a respondent there is a corresponding Rasch measure with your "complete" data.
There are important differences between raw scores and Rasch measures.
1. The Rasch measures are linear. The raw scores are non-linear.
2. The Rasch measures are described with fit statistics indicating the extent to which the data conform to Rasch expectations. The raw scores (usually) do not have such fit statistics.
In your analysis, first look at the item measures and their fit statistics. Does the order of the durables make sense? This is the "construct validity" of your survey. Does each durable fit the Rasch model? This indicates whether the durable meets the expectation that the lower-scoring durables are purchased predominantly by the higher-scoring respondents. You may want to omit durables that don't follow this pattern from your study.
If the item measures make sense, look at the person measures. Do they fit the Rasch model? If you have respondent demographics, investigate to see whether there is the same ordering of the durables across all demographic groups. DIF (differential item functioning) analysis is a convenient way to investigate this.
Rasch analysis encourages you to ask questions about your data, and to extract information from it, that are usually much deeper than in the typical raw-score or cross-tabulation analysis.
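
The point about sufficient statistics can be illustrated with a short Python sketch. It estimates a person measure by maximum likelihood from a response string, given item difficulties. The six difficulties are hypothetical values invented for illustration (they are not estimated from the posted data), and the function name is also hypothetical.

```python
import math


def estimate_measure(responses, difficulties, iterations=50):
    """Maximum-likelihood person measure (logits) for complete dichotomous data."""
    raw_score = sum(responses)
    if raw_score == 0 or raw_score == len(responses):
        raise ValueError("extreme scores have no finite estimate")
    theta = 0.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        expected = sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        # Newton-Raphson step: the data enter only through the raw score.
        theta += (raw_score - expected) / information
    return theta


# Hypothetical difficulties (logits) for Dur1..Dur6.
difficulties = [-1.5, -1.0, -0.5, 1.5, -2.0, 1.0]
resp1 = [1, 1, 1, 0, 1, 0]  # Resp1 in the post: 4 durables
resp3 = [1, 0, 1, 0, 1, 1]  # Resp3: also 4 durables, different pattern
resp2 = [0, 1, 0, 0, 1, 0]  # Resp2: 2 durables

print(estimate_measure(resp1, difficulties),
      estimate_measure(resp3, difficulties),
      estimate_measure(resp2, difficulties))
```

Respondents with the same number of durables get the same measure whatever their holding pattern; the pattern shows up in the fit statistics instead.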

1001. start with winsteps

danielcui March 12th, 2007, 6:30am: hi, everyone,

I am new to Rasch analysis and just got Winsteps yesterday. Does anyone know how to get a quick start with the software? What are the main functions I need to look at when interpreting data? Is there any workshop available for Winsteps and Rasch? Many thanks.

MikeLinacre: Danielcui, welcome! You are jumping in at the deep end!
The next in-person Winsteps workshop is in March in Chicago: www.winsteps.com/workshop.htm
The current online Course is just completing. The next is in August.
Are you familiar with "Applying the Rasch Model" by Bond and Fox? This is a good starting point for Rasch analysis. The matching tutorials are at www.winsteps.com/bondfox.htm

Dnbal: Mike
Kewl site...thanks for all your help and am going to go through these tutorials...what a great idea...you guys are really good.

testolog: Hi, dear colleagues!
Who can help me take the first steps in Winsteps?

1002. DIF- how to?

ary April 4th, 2007, 5:27am: Dear Prof.
I am having some trouble getting the DIF contrast. My friends and I have tried the new Winsteps version 3.63.0. When using the older version 3.57, we didn't have any trouble extracting the DIF contrast. We used this instruction: $S4W1.

With the new version, the table produced is kind of weird.

Instead of

1 2
1 2
1 2
1 2

please help me to read this output.

MikeLinacre: Ary, thank you for your post.
looks like Table 30.2 - the comparison of groups with the overall mean.
1 2
1 2
looks like Table 30.1 - the pairwise comparison of groups.
For more details about the Tables, please see the Winsteps Help for Table 30.

ary: Thank you so much for your prompt reply.
Table 30.1 shows
1 2
1 2
1 2
That's the reason that I'm a bit lost. Normally, it would just come out.


ary: Sorry, stated the problem wrongly...
Table 30.1 does not produce the pairwise comparison..
1 2

MikeLinacre: Ary, this is my Table 30.1, is yours different? It shows pair-wise DIF comparisons.

DIF class specification is: DIF=@GENDER
| F -6.59> 2.14 M -7.58> 2.09
| M -7.58> 2.09 F -6.59> 2.14
| F -6.59> 2.14 M -7.58> 2.09
| M -7.58> 2.09 F -6.59> 2.14
| F -6.59> 2.14 M -7.58> 2.09

1003. category function of each term

godislove April 3rd, 2007, 12:05pm: Hi, I would like to know which function in Winsteps shows the graph (category function) of each item. Now when I press category function, I get the graph of Person [MINUS] Item MEASURE for all the items.

MikeLinacre: Helen, thank you for your question. You probably want the "Graphs" menu. Then click on "Display by item". Then click on the yellow button "Prob.+Empirical Cat. Curves"

1004. Minimum sample size

oosta March 23rd, 2007, 7:34pm: I would like some guidance on the minimum number of persons needed to do Rasch analysis. I realize that the answer is probably, "It depends." But maybe you can give me some ballpark figures. Let's assume a 100-item multiple-choice test. Feel free to state your own assumptions as well. What's the minimum sample size under those assumptions?

I have a few related questions as well.

What factors affect the minimum sample size needed?

If I am equating two test forms, do I need many more persons?

If all of the 100 items use the same 7-point rating scale (i.e., Rating model), does that increase the minimum sample size greatly or just by a few persons (compared to a dichotomous test)? Does the rating model add 6 degrees of freedom (compared to the dichotomous model)?

If each of the 100 items uses a different 7-point rating scale (i.e., partial credit model), does that increase the sample size greatly? Does the partial credit model add 600 degrees of freedom (compared to dichotomous items)?

MikeLinacre: Oosta: please see www.rasch.org/rmt/rmt74m.htm for minimum sample size recommendations.
Equating tests forms: depends on your equating design. Usually, equating does not alter sample sizes.
Rating scales: you need at least 10 observations in each rating scale category in the dataset for item calibration stability.
"Add 6 degrees of freedom" - a 7-point rating scale reduces the degrees of freedom in the data by 5 because 5 more parameters are being estimated for a given number of observations. Thus L dichotomous items estimate L-1 parameters (usually). L 3-category items estimate L parameters. L 4-category items estimate L+1 parameters, ..., L 7-category items estimate L+4 parameters.
Partial credit: now it is 10 observations for each category of each item, which is usually much more difficult to achieve. For a 7-point rating scale, we now estimate 6 parameters per item, less 1 for the local origin: L*6-1 parameters. So the d.f. relative to the 7-category rating-scale model are reduced by (L*6-1 - (L+4)) = (L-1)*5
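
The parameter counts above can be tabulated with a short Python sketch. The two function names are hypothetical; the formulas are the ones stated in the reply, generalized to any number of categories.

```python
def rating_scale_parameters(n_items, n_categories):
    """Free parameters when all items share one rating scale.

    (L - 1) item difficulties plus (m - 2) free thresholds.
    """
    return (n_items - 1) + (n_categories - 2)


def partial_credit_parameters(n_items, n_categories):
    """Free parameters when each item has its own rating scale.

    (m - 1) thresholds per item, less 1 for the local origin.
    """
    return n_items * (n_categories - 1) - 1


L = 100
for m in (2, 3, 4, 7):
    print(m, rating_scale_parameters(L, m), partial_credit_parameters(L, m))
```

For 100 items, the 7-category counts are 104 parameters under the rating-scale model and 599 under partial credit, a difference of 495, which is (L-1)*5.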

oosta: Thanks very much!

It looks like there are three things to consider (at least) when determining the sample size requirements.

1. The RMT article you mentioned says about 100 persons for dichotomous items (for +- 0.5 logits accuracy at 95% confidence interval).

2. For Rating scales, 10 responses for each scale category over the entire test. For partial credit, 10 responses per category for each item.

3. Degrees of freedom. For Rating scales, nItems + (nCategories - 3). For Partial Credit: nItems * (nCategories - 1).

I'm not 100% sure about which criteria to use and whether I combine them. Here's what I *think* you are saying. For multiple-choice tests, use criterion 1 (as shown in the RMT article). For rating and partial credit models, you have to consider criteria 2 and 3. Use whichever gives you the larger sample requirement. For example, for a Rating scale with 100 items and 3 categories, the two criteria give me the following sample-size requirements: (2) probably about 70 (assuming a 15%, 70%, 15% frequencies split among the categories), (3) 200. In this example, I assume that my sample-size requirement would be 200 because it's larger than 70.

oosta: Oops. My example should have said a sample size of 100 for the Ratings model. A partial-credit model would be 199.

1005. CAT

Wandall March 22nd, 2007, 8:49pm: I have read the MESA Memo 69 with the subtitle "A Methodology Whose Time Has Come". This is true - at least in Denmark. I'm a civil servant in the Danish Ministry of Education and I have been participating in the development of a large-scale Danish national CAT programme. We are launching the system on 1 May this year.

I think that the Memo was very enlightening and I would like to ask

A. If anyone knows of other/newer articles on the net on the subject (same style - not too technical) and/or
B. If anyone knows of other places in the world where CAT is used for systematic evaluation of pupils in primary/lower-secondary school.


MikeLinacre: Wandall, glad to hear of your interest. The place to look is the Northwest Evaluation Association www.nwea.org

1006. Code in the tutorials

Dnbal March 22nd, 2007, 7:11pm: In Ch 2 tutorial the code after the & end is

a 37+2 ; 12 item labels, one per line
b 56+4 ; Please use item text, where possible!
c 1+4 ; so I've added some dummy item text
d 27.3+34.09
e 4 1/4 + 2 1/8
f 2/4 + 1/4
g 4 1/2 + 2 5/8
h 86+28
i 6+3
j $509.74+93.25
k 2391+547+1210
l 7+8

I have not been able to see the logic in it. Can someone explain?

I have got it to work but would like to understand what it is doing....

MikeLinacre: Denny, the lines after &End are descriptions of the items on the test. One description per line. This is an addition test with 12 items. Question "a" on the test is: "What is 37+2 ?". Question "b" on the test is "What is 56+4 ?".
There will be as many lines here as there are items on your test. For each item you need a brief description, called a "label". These will make the output of your analysis much more meaningful.

Dnbal: Mike
I knew labels usually followed the &END. I thought they meant something like when you do tables and have 0011 etc., meaning no tables 1 and 2, but tables 3 and 4. But then I should have realized they were the questions. Duh
Thanks so much


1007. determine significant difference

godislove March 22nd, 2007, 9:19am: Mike, if I compare 2 sets of data from the same group of patients after a treatment, how can I use Rasch to determine whether there is any significant difference in improvement after the treatment? Can you suggest any papers I can read? Thanks.

MikeLinacre: There's a short note at https://www.rasch.org/rmt/rmt183p.htm "When does a Gap between Measures Matter?" This may help you, godislove.

1008. Linear Raw Scores

Raschmad March 16th, 2007, 8:30am: Hi all,
I was always under the impression that the reason why raw scores are not linear is that some items are easy and some items are difficult. Therefore, one more right answer does not add an equal increment along the ability continuum, i.e., the increment in a measure as the result of getting one more right answer is not equal for all the items.

However, I read some time ago the reason why raw scores are not linear is that in classical test theory we try to have all the items with pvalue of 0.50, that is, equally difficult items. If we make the test with items which are evenly spaced in difficulty, i.e., if we include items with pvalues of 0.20 and 0.80 we approximate linearity with raw scores.
This is totally different from my understanding of linearity.
Could you please clarify it?

MikeLinacre: Raschmad, if you draw the raw score-to-measure ogive for almost any test, you will see that the central 80% or so of raw scores are approximately linear. The plot for the Knox Cube Test is shown at www.winsteps.com/winman/table20.htm. It approximates linearity.
We can construct a test in which all linear person measures have equal spacing by carefully selecting item difficulties. The item difficulties are more spread out at the center of the difficulty range and closer together at the ends. But it is not a practical test design.

Raschmad: Dear Mike,
I’m aware of the increasing ogival exchange between raw scores and measures, especially at the extremes.
However, if we have a set of items all with pvalues of .50, then each item theoretically elicits the same amount of ability from the examinees. Two persons with raw scores of 10 and 15 are 5 times the ‘ability of an item’ apart, and so are two persons with raw scores of 40 and 45, because we have uniform units, namely, the ‘ability of an item’. It’s like having two baskets of apples when all single apples are of the same weight. Taking apples from one basket and putting them in the other always linearly increases the weight and value of the basket, i.e. the weight of an apple times.
The major problem with raw scores is that right answers are not all the same size. But when items are all equally difficult, then they are all the same size. For me, this means linearity.
Having items of differing difficulty levels, as I understand, is the antithesis of linearity. I think this, in fact, defeats all attempts towards constructing linear measures with raw scores.
Could you please clarify this.

MikeLinacre: Yes, Raschmad, you are presenting the position expressed by E.L. Thorndike in the first modern psychometric textbook, published in 1904. I discuss his logic at https://www.rasch.org/rmt/rmt143g.htm - "The Rasch Model from E.L. Thorndike's (1904) Criteria".
If you can agree that linear measures of person ability can be constructed from tests consisting of sets of items of the same difficulty, then linear measures of item difficulty can be constructed from samples consisting of persons of equal ability.
We can then express the two relationships algebraically, combine them, and produce linear measures of persons and items from tests and samples of persons and items of different measures.

1009. On Item Discrimination parameter

za_ashraf March 5th, 2007, 9:31am: I am an Indian student doing research related to IRT. I would be very grateful if you could provide a mathematical definition of the item discrimination parameter a in a two/three-parameter logistic model. How does this differ from the reliability coefficient?
Thanking you


MikeLinacre: Z.A. Ashraf, in IRT the item discrimination parameter is the slope of the item characteristic curve. There is one discrimination parameter for each item.
Reliability is the statistical reproducibility of the relative placement of scores or estimates of the person sample on the latent variable, so Reliability = "true" variance / observed variance. There is one reliability for each administration of a test to a sample.
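
As a hedged illustration of "discrimination is the slope of the item characteristic curve", here is a minimal Python sketch of the usual 2-PL and 3-PL logistic forms. It assumes the plain logistic metric (no 1.7 scaling constant); the symbols a (discrimination), b (difficulty) and c (lower asymptote) follow common IRT usage and the function names are hypothetical.

```python
import math


def icc_2pl(theta: float, a: float, b: float) -> float:
    """2-PL probability of success: 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3-PL probability: lower asymptote c plus a rescaled 2-PL curve."""
    return c + (1.0 - c) * icc_2pl(theta, a, b)


def slope_2pl(theta: float, a: float, b: float) -> float:
    """Derivative of the 2-PL curve with respect to theta: a * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a * p * (1.0 - p)


# At theta = b the probability is 0.5, so the slope there is a / 4.
print(slope_2pl(theta=0.5, a=1.2, b=0.5))
```

In this form the Rasch dichotomous model is the special case a = 1 for every item, which is why it has no separate discrimination parameter.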

za_ashraf: Thanks for prompt reply.
In the case of estimation of item parameters, the discussion starts by mentioning an examinee with ability level theta, and when drawing the ICC, the ability values run from -3 to +3. Can you mention some books describing the basic statistical theory related to these issues?

MikeLinacre: Z.A. Ashraf, please see http://en.wikipedia.org/wiki/Item_response_theory

za_ashraf: Sir,
While I tried to formulate an estimation method for the two-parameter logit model using the EM algorithm, the values did not converge. The programme is written in C++. Could you suggest any method for solving this problem?

z. a. ashraf

MikeLinacre: You are correct. The 2-PL IRT values don't converge. You must:
1. Constrain the sample to a known distribution, e.g., the unit normal distribution.
2. Constrain the discrimination parameters to a known range, e.g., 0 to 2.
3. Only iterate a few times.

These drawbacks to IRT estimation are well documented in the literature, e.g., Stocking, M. L. (1989). Empirical estimation errors in item response theory as a function of test properties. (Research Report RR-89-5). Princeton: ETS.

1011. category stats and OUTFIT

Isabel March 9th, 2007, 11:01am: Hello there
I was wondering why you only consider OUTFIT mean-squares when examining a rating scale? The category statistics of facets give you only this statistic. Whereas with items and subjects I get both infit and outfit, and I understand that in the first instance I would look at infit. I have been unable to find a reference for this and/or figure out why this might be myself (being only a Rasch beginner...!) :-/
Many thanks!

MikeLinacre: Isabel, thank you for this perceptive question. There are many fit indicators that could be shown in the Winsteps output, so I display the ones that I've found to be the most informative. For categories, it is the OUTFIT statistic. In my experiments, the INFIT statistic did not provide enough useful extra fit information to merit inclusion. OUTFIT mean-square is a conventional chi-squared divided by its degrees of freedom, so it is usually easier to explain to a statistically-minded audience than is INFIT. But you can compute the INFIT mean-square (and many other fit indicators) using the information in the XFILE=
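
For readers who want to see the difference between the two statistics, here is a minimal Python sketch of the standard mean-square definitions. It assumes you already have, for each observation, the observed value, its model expectation and its model variance (quantities of the kind Mike says can be obtained from the XFILE= output); the function name and the example values are hypothetical.

```python
def fit_mean_squares(observations, expectations, variances):
    """Return (infit, outfit) mean-squares for one person, item or category."""
    squared_residuals = [(x - e) ** 2 for x, e in zip(observations, expectations)]
    # OUTFIT: unweighted average of the squared standardized residuals,
    # i.e. a conventional chi-squared divided by its degrees of freedom.
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(observations)
    # INFIT: information-weighted, the summed squared residuals divided by
    # the summed model variances.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical dichotomous example: two successes with model probabilities
# 0.5 and 0.9; for a dichotomy the model variance is p * (1 - p).
probabilities = [0.5, 0.9]
variances = [p * (1 - p) for p in probabilities]
infit, outfit = fit_mean_squares([1, 1], probabilities, variances)
print(round(infit, 3), round(outfit, 3))
```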

1012. Difficult of a test

omardelariva March 8th, 2007, 1:52am: Good evening, everybody:

I wanted to compare the difficulty of two tests. First I carried out an equating of the two tests, so that both tests were on the same scale. Then I thought about the definitions of item difficulty, differential item functioning and expected score, and combined them. My final conclusion was: if I have a test of n items, then the difficulty of the test is the point on the ability scale where the test characteristic curve equals n/2; in addition, if I wish to measure globally how different two tests are, I could do so by comparing the areas under their characteristic curves.

My question: I am crazy?... No, I meant, How wrong am I?

MikeLinacre: Thought-provoking questions, Omardelariva. Yes, one indicator of the relative difficulty of two tests is the difference between the abilities expected to score 50% on each test. We would expect this would be approximately the same as the difference between the average item difficulties for each test. But the "ability for a 50% score" is probably easier to communicate to a general audience.
The area under the test characteristic curve for a test of dichotomous items would be A = sum(log(1+exp(B-Di))), which looks like an information function. This could be worth further investigation .... You may have the seeds of a useful Paper here.
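
A minimal Python sketch of both ideas, assuming dichotomous Rasch items with known difficulties on a common scale. The function names are hypothetical, and bisection is just one simple way to invert the test characteristic curve.

```python
import math


def expected_score(ability, difficulties):
    """Test characteristic curve: sum of Rasch success probabilities."""
    return sum(1.0 / (1.0 + math.exp(d - ability)) for d in difficulties)


def ability_for_half_score(difficulties, lo=-10.0, hi=10.0, tol=1e-8):
    """Ability at which the expected score is 50% of the items (bisection)."""
    target = len(difficulties) / 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < target:
            lo = mid  # expected score too low: the ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def tcc_area(ability, difficulties):
    """Area under the curve up to `ability`: sum(log(1 + exp(B - Di)))."""
    return sum(math.log1p(math.exp(ability - d)) for d in difficulties)


test_a = [-1.0, 0.0, 1.0]  # illustrative item difficulties
test_b = [0.0, 1.0, 2.0]   # the same items shifted up by 1 logit
print(ability_for_half_score(test_b) - ability_for_half_score(test_a))
```

For these two illustrative tests the difference in "ability for a 50% score" equals the difference in mean item difficulty, as the reply above expects in the approximate general case.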

1013. Item order

janescott March 7th, 2007, 1:12am: Hi there,

I am developing a multi-dimensional scale measuring people's level of engrossment with a film. I am hoping people can advise me with regards to the ordering of items, as I can't seem to find a relevant article.

Within each dimension (let's consider feelings for example), I am exploring several different feelings someone might have to a film, building difficulty/intensity by using 3 items to explore each theme.

So for example, I have items such as....

I felt good
I felt happy
I felt elated

which measure happy feelings

and items such as......

I felt apprehensive
I felt scared
I felt terrified

to measure fear

My question relates to how these should be ordered. Do I keep them in their triplets with ascending levels of endorsability / difficulty? This is good because it makes the person think, "yes, I felt happy when I watched the film", but, realising they are being pushed to the next level, they might more thoughtfully consider, "but did this film make me feel elated?"

Or do I mix the items up so "I felt good" and "I felt apprehensive" (ie. 2 easily endorsed items) are answered first, and then the harder items (eg. elated, terrified) are answered last

Or do I mix the items up completely randomly so I might have scared followed by good followed by something else - hence mixing up both the feelings and the level of difficulty

Any advice you could give me would be most appreciated as I need this resolved really quite soon!

Many thanks,


Jane Scott

PhD Candidate
School of Marketing
University of New South Wales
e: jane@student.unsw.edu.au
w: www.marketing.unsw.edu.au

MikeLinacre: Jane, think of your questionnaire as a conversation with a customer. The conversation would start with lighter, easier questions and then probe more deeply. It would also tend to move from topic to topic. One way to see how this works is to try out your questions in an interview of a friend. Ask your friend to pretend to be a suitable well-known person. You will soon see what makes the conversation flow best and elicits the most meaningful answers.

1014. Random test generation from item bank

Seanswf February 28th, 2007, 10:27pm: Hello,

I am analyzing a computer delivered test of 10 items with winsteps. Each item presented to the student is randomly chosen from an item sampling group (ISG). ISGs contain items of similar content. The total item bank contains 27 items. My problem is that no two students receive the same set of items.

Here is a sample:
1= correct
0= incorrect
9= not presented


Is Rasch measurement an appropriate method of analysis?
How can person & item measures be interpreted if no one receives the same test?
Any guidance on how to deal with data is greatly appreciated.

Thanks in advance.

MikeLinacre: Seanswf, thank you for asking. The analysis of your data is straightforward.

In Winsteps, your control file would be:
Title="Random Test"
Item1=1 ; column of first item response
NI = 27 ; 27 item-response columns
CODES=01 ; valid response codes are 0,1. "9" are treated as not-administered
NAME1=1 ; change this to be the column of your person information
(your data go here)

And here is a little piece of the output

| ENTRY  SCORE  COUNT  MEASURE   S.E. | INFIT MNSQ  ZSTD | OUTFIT MNSQ  ZSTD |
|     1      4     10     -.55    .72 |        .64  -1.2 |         .57  -1.1 |
|     2      3      7     -.38    .85 |        .52  -1.5 |         .46  -1.4 |
|     3      3      8     -.98    .80 |       1.40   1.2 |        1.46   1.1 |
|     4      6     10      .35    .70 |       1.17    .7 |        1.10    .4 |
|     5      4      9      .00    .70 |       1.07    .4 |        1.04    .3 |
|     6      7     10      .89    .73 |       1.53   1.5 |        2.20   2.1 |
|     7      7      9     1.53    .87 |        .68   -.7 |         .48   -.6 |
|     8      6      9      .53    .75 |        .72   -.9 |         .64   -.8 |
|     9      5      9      .26    .70 |        .95   -.2 |         .95   -.2 |
|    10      5      8      .26    .76 |        .71  -1.3 |         .65  -1.0 |
|    11      8     10     1.92    .84 |       1.24    .7 |        2.06   1.3 |
|  MEAN    5.3    9.0      .35    .77 |        .97   -.1 |        1.06    .0 |
|  S.D.    1.6    1.0      .82    .06 |        .32   1.0 |         .58   1.1 |

Seanswf: Thanks Mike. I ran the data and, as you can see, the reliability is low. Any thoughts on how to improve it?

|        SCORE  COUNT  MEASURE   S.E. | INFIT MNSQ  ZSTD | OUTFIT MNSQ  ZSTD |
|  MEAN    6.6   10.0      .79    .78 |       1.00    .1 |        1.02    .1 |
|  S.D.    1.9     .0     1.01    .13 |        .20    .7 |         .48    .8 |
|  MAX.    9.0   10.0     2.71   1.12 |       1.64   2.7 |        6.37   2.7 |
|  MIN.    1.0   10.0    -2.61    .64 |        .56  -2.5 |         .29  -2.4 |
| S.E. OF PERSON MEAN = .03                                                  |

MikeLinacre: Seanswf, "test" reliability is directly connected to the number of responses each person makes. You can use the Spearman-Brown Prophecy formula to predict what the reliability would be if each person responded to more items. For instance, double the number of responses per person, and the reliability increases from 0.33 to 2*0.33 / (1+0.33) = 0.5.
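
The Spearman-Brown arithmetic above can be sketched in Python as follows. The function names are hypothetical; 0.33 is the reliability Mike quotes, and the calculation assumes the added items behave like the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when the test is `length_factor` times as long."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability: float, target: float) -> float:
    """How many times longer the test must be to reach `target` reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))


print(round(spearman_brown(0.33, 2), 2))         # doubling the responses
print(round(length_factor_needed(0.33, 0.8), 1)) # factor for a target of 0.8
```

Under these assumptions, reaching a reliability of 0.8 from 0.33 would take roughly 8 times as many responses per person.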

Seanswf: Thanks Mike that is helpful.

1015. DPF and specific objectivity

z99.9 February 22nd, 2007, 5:31pm: Hello to all.
I would really appreciate it if someone could give me the answer to a tough (I think) question. I have been looking for an answer for quite a while and I can't find a way out.

Does specific objectivity also hold for the contrast values of an interaction?
In other words, in a 3-facet model, when comparing the contrast value of subject 1 with the contrast value of subject 2, am I allowed to say that this comparison is independent of the subjects' ability?

MikeLinacre: z99.9, thank you for your questions.
By definition, differential person functioning is a violation of specific objectivity, because it states that the person is functioning at different levels within the same analysis. We have to know which part of the analysis we are talking about to identify what performance level the person is exhibiting.
In the usual 3-facet model, such as subjects x items x raters, the subject measures are specifically objective, because they are as statistically independent as possible of which items and raters the subject encountered.

z99.9: Dear Mike,
I really appreciate your prompt response.
I apologise, but I'm not sure of my understanding.
My model is subjects x items x conditions of response (two levels). One condition is much easier than the other.
If I compare the contrast value of one subject with the contrast value of another subject, am I allowed to say that this comparison is independent of the subjects' OVERALL ability?
Thank you for your suggestions.

MikeLinacre: z99.9, thank you for asking for clarification.
If you have performed this analysis in a standard Rasch manner, the contrasts are independent of the subjects' overall abilities.

1016. Inter-rater reliability and kappa

Tiina_Lautamo February 16th, 2007, 7:20am: I have data in which 12 raters have rated children's play skills.
All raters rated 6 cases together, and each rater rated 6 cases independently (72). There is only one misfitting rater.
My main question is: should I also report the Rasch kappa?
Exact agreements = 50.7% and expected = 42.8%, from which I have computed a kappa value of 0.138.
How should we interpret the value? What are the acceptable values?

MikeLinacre: Tiina, thank you for asking.
Your results appear to accord with those usually reported for raters who are behaving like independent experts. Your raters are agreeing slightly more (51%) than the Rasch model predicts (43%). There is probably a psychological reason for this slight overagreement, perhaps the human social proclivity to be agreeable!
We are concerned if their ratings match less than the model predicts because this can mean there is some unwanted source of variance in the raters (such as a misunderstanding of the rating procedure).
We may want much higher agreement (90+%) if the intent is to have the raters behave like rating-machines.
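
The kappa value reported in this thread can be reproduced with a short Python sketch. It assumes the usual kappa form with the Rasch-expected agreement in place of chance agreement; the function name is hypothetical.

```python
def rasch_kappa(observed_pct: float, expected_pct: float) -> float:
    """(Observed% - Expected%) / (100 - Expected%)."""
    return (observed_pct - expected_pct) / (100.0 - expected_pct)


# Values from the question: 50.7% exact agreement, 42.8% expected by the model.
print(round(rasch_kappa(50.7, 42.8), 3))
```

On this form, a value near 0 means the raters agree about as often as the Rasch model predicts, a positive value means more agreement than predicted, and a negative value means less.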

1017. Dimensionality and PCAR

Raschmad February 14th, 2007, 10:11am: Hi all,
I performed PCAR on a set of data under 2 conditions, once when the locally dependent items were deemed as independent (100 items) and once as they were bundled into polytomies (4 items). The results as far as unidimensionality is concerned are somewhat contradictory. In the polytomous analysis the variance explained by measures is considerably larger (I’m not sure if this can be interpreted as a sign of being more unidimensional). In the dichotomous analysis the “modeled variance explained by measures and modeled unexplained variance” are different from the empirical by 3%, while in the polytomous analysis they are different by only 0.7%. This evidence indicates that when local dependence is taken care of, the data is more unidimensional.
However, the more important statistic, i.e., “unexplained variance in 1st contrast” is much smaller in the dichotomous analysis.
How can one put these 2 pieces of information together? Which data set is more unidimensional?

STANDARDIZED RESIDUAL variance (in Eigenvalue units) for polytomous items

                                             Empirical         Modeled
Total variance in observations     =   28.7    100.0%          100.0%
Variance explained by measures     =   24.7     86.1%           86.8%
Unexplained variance (total)       =    4.0     13.9%  100.0%   13.2%
Unexplned variance in 1st contrast =    1.7      6.0%   42.9%
Unexplned variance in 2nd contrast =    1.1      4.0%   28.6%
Unexplned variance in 3rd contrast =    1.1      3.9%   28.0%
Unexplned variance in 4th contrast =     .0       .0%     .3%
Unexplned variance in 5th contrast =     .0       .0%     .2%

STANDARDIZED RESIDUAL variance (in Eigenvalue units) for dichotomous items

                                             Empirical         Modeled
Total variance in observations     =  520.9    100.0%          100.0%
Variance explained by measures     =  420.9     80.8%           83.5%
Unexplained variance (total)       =  100.0     19.2%  100.0%   16.5%
Unexplned variance in 1st contrast =    5.0      1.0%    5.0%
Unexplned variance in 2nd contrast =    4.2       .8%    4.2%
Unexplned variance in 3rd contrast =    4.1       .8%    4.1%
Unexplned variance in 4th contrast =    3.6       .7%    3.6%
Unexplned variance in 5th contrast =    3.4       .7%    3.4%

MikeLinacre: Raschmad, thank you for performing this analysis. It suggests several experiments.

A. Cross-plotting the person measures between the two analyses.
We expect a loss of randomness with the polytomies, so the data should have become more Guttman-like. Consequently the polytomous person measures should be less central (more spread out). The slope of the line will give an indication of the loss of randomness.

B. The difference between "empirical" and "modeled" for the polytomies suggests that estimation convergence may not have been reached. The polytomies have 26 categories each, so estimation convergence will be slow. Please be sure you have GROUPS=0 and are running plenty of iterations. Tighten up the convergence criteria if necessary.

C. Another indication of the impact of the polytomies on the person measures:
1. Take the 25 items of one of your polytomies. Analyze them by themselves. Save Table 20, the score Table or the SCOREFILE=. This shows the measures corresponding to every score on the 25 items.

2. Take your polytomous analysis. Run it with GROUPS=0, so each polytomy is estimated with its own scale structure. Look at Table 3.2. "Score to measure AT CAT". This shows the measure corresponding to every score on the polytomy.

3. Compare the two sets of measures. They indicate:
a) the effect of the loss of randomness due to combining the dichotomies (which we investigated in A).
b) the distortion of the measures on the target polytomy caused by the other polytomies.

D. Yet more fun! Make up a dataset of the 25 items in the polytomy and the 26 category polytomy. Weight the polytomy at 0:
26 0
Do the combined analysis. Look at the fit statistics for the polytomy. Since it is a summary of the 25 dichotomies it should considerably overfit. Its infit mean-square will indicate how much of the randomness in the dichotomies has been lost.
For instance: INFIT MNSQ = 0.3 means that the polytomy retains only 30% of the average randomness in the dichotomies.

Please let us know what you discover. This would be a thought-provoking research note for Rasch Measurement Transactions, and, enhanced with more simulated and empirical datasets, a great Journal paper for a methodology Journal.

Raschmad: Dear Mike,
I did the analyses that you suggested with great interest and enthusiasm.
Here we go:
A. Person measures from the polytomous analysis are more central (S.D. = 0.71) and person measures from the dichotomous analysis are more spread out (S.D. = 1.26).
I cross-plotted the person measures from the 2 analyses. The slope of the empirical line is 0.56.
B. Actually, it is for dichotomies that there is difference between “empirical” and “modeled”. For polytomies they are almost the same as you can see in the PCAR outputs above.
The estimation converged after 48 iterations for the polytomous analysis and after 7 iterations for the dichotomous analysis (WINSTEPS’ default convergence criteria).

C3. I compared raw scores to measure from the 2 analyses. The measures from the dichotomous analysis are more spread out.

Score   Measure (dichotomous analysis)   Measure (polytomous analysis)
    2                            -3.51                           -2.69
   12                            -0.20                           -0.22
   24                             4.47                            1.99

The interesting thing is that the polytomous analysis favours the low-ability students. The mid-ability students get almost the same measures regardless of the analysis and high-ability students are favoured by the dichotomous analysis. Does this have any psychometric reasons?

D. I did the analysis. Indeed the polytomy which is a summary of the 25 dichotomies grossly overfits.

INFIT MNSQ = .03, ZSTD = -9.9; OUTFIT MNSQ = .06, ZSTD = -9.9

From your reply I got the impression that the dichotomous analysis is more reliable (dictionary meaning of reliability not its psychometric meaning). However, I thought the opposite should be true because we are taking care of the local dependence among items by performing the polytomous analysis.
Thanks a lot

MikeLinacre: Raschmad, thank you for sharing the details of your analysis. This would make an interesting Research Note for Rasch Measurement Transactions.
The Winsteps default convergence criteria may not be tight enough for your long polytomous rating scale. These are difficult to estimate because of the dependency between the Rasch-Andrich threshold locations.
For dichotomies vs. polytomies, it is a matter of "competing goods" as Aristotle would say. We want to eliminate excessive local dependence, but we also need to maintain enough randomness to construct a measurement framework.
We could eliminate all local dependence from any set of L items by replacing them with a polytomy with L+1 categories. But we would also eliminate the randomness within the L items that supports the estimation of distances between those items.
In your dichotomous vs. polytomous analysis, there has been a change in the "length of the logit". By analogy, the dichotomous analysis is in Fahrenheit, but the polytomous analysis is in Celsius. Cross-plot the person measures to see what is really going on. www.rasch.org/rmt/rmt32b.htm
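
One hedged way to quantify the change in the "length of the logit" is a line through the two means whose slope is the ratio of the two person-measure standard deviations. This is a simple choice when both sets of measures carry error, and it is an assumption here rather than a prescribed method. A minimal Python sketch, with hypothetical function names:

```python
import statistics


def empirical_slope(x_measures, y_measures):
    """Slope of a line through the means: S.D.(y) / S.D.(x)."""
    return statistics.stdev(y_measures) / statistics.stdev(x_measures)


def rescale(measure, x_mean, x_sd, y_mean, y_sd):
    """Express a measure from frame x in the units and origin of frame y."""
    return y_mean + (measure - x_mean) * y_sd / x_sd


# S.D.s reported in the thread: 1.26 (dichotomous) and 0.71 (polytomous).
print(round(0.71 / 1.26, 2))
```

With the reported S.D.s the ratio is about 0.56, which agrees with the slope of the empirical line reported for the cross-plot.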

1018. Item Reliability and DIF

Matt February 13th, 2007, 3:12pm: Sir
I am a French-Canadian student and I am working on a scale validation for my thesis.
I conducted analyses on my data and obtained a reliability of .96 for subjects and .99 for items.

How can I interpret reliability for items?

I also intend to conduct DIF analysis. Do you have any syntax example that I can use to build my own?

Thank you very much.

Regard, Mathieu

MikeLinacre: Matt, high item reliability means that the item difficulties are statistically reproducible (which is what we want). So your sample size is large enough.
If your item reliability is low, then you probably need a larger sample.

DIF analysis: what software? The exact procedure varies.

Matt: Thank you, Sir, for your first answer.

I am using WINSTEPS for my DIF analysis.

I have one more question, do you have any criteria for the subject and items separation index?

Once again, thank you. It is very appreciated that you take this time for us.


MikeLinacre: For the items, you need a big enough sample that the reliability is well into the .9...

For the sample, it depends on the area of application. The longer the test, the higher the reliability. If you plan to make pass-fail decisions, you need a reliability of at least 0.8. If you need to make more differentiations in ability level, then at least 0.9.

Winsteps has DIF analysis built-in. See Table 30.
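
Since the question asked about the separation index while the criteria above are stated as reliabilities, a small Python sketch of the standard conversion may help. It assumes the usual relationship, separation = sqrt(reliability / (1 - reliability)); the function names are hypothetical.

```python
import math


def separation_from_reliability(reliability: float) -> float:
    """Separation index G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation: float) -> float:
    """Reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


for r in (0.8, 0.9):
    print(r, round(separation_from_reliability(r), 2))
```

So a reliability of 0.8 corresponds to a separation of about 2, and 0.9 to about 3.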

1019. Data entry

SusanCheuvront February 12th, 2007, 4:29pm: Hello,

I just got FACETS a few days ago, and am trying to figure out how to use it. I use Winsteps quite a bit, so I'm assuming this isn't much different. I have a few questions on data entry. The examples in the program seem to have way more information than I need and it's all a bit confusing. I have a simple 2-facet model: examinees and raters. Each examinee is rated by two raters. As far as data entry, can I do it this way:

001, 5, 3 Examinee #1 is rated by rater #5 and scores a 3.
001, 12, 2 Examinee #1 is rated by rater #12 and scores a 2.



MikeLinacre: Yes, that will work, Susan.
So your Facets specification file will include:
1, Examinee
2, Rater
001, 5, 3 ; Examinee #1 is rated by rater #5 and scores a 3.
001, 12, 2 ; Examinee #1 is rated by rater #12 and scores a 2.

1020. Exporting to Excel

DAL February 12th, 2007, 6:00am: Hi there, I'm sure this is a very basic question, but my method of exporting Facets output files to excel is time-consuming and messy. I've searched the facets help file but there is nothing on this topic (or I'm not searching for the right keyword). There must be a more efficient way!

At present I use the txt import wizard - fixed width - then muck around with lines etc, but it never comes out nicely.

What does everybody else do?

MikeLinacre: DAL, thank you for your question.
Writing a Facets scorefile to a temporary file with tab-separation will probably work better. Then paste into Excel. If it doesn't format correctly, then "text-to-columns" will probably do the job.
In the next update to Facets, I'll add an export to Excel option to the Scorefile dialog box.

DAL: Thanks! I'll give it a go.

DAL: Hi Mike,
I'm having problems making my files tsv. Whether I paste from a temporary file or from the screen it automatically makes it space delimited, and there isn't a 'save as' option with 'tab delimited' as far as I can see. What's the trick?

MikeLinacre: DAL, not sure what version of Facets you are using. Is it the early freeware version?
In the current version, the "Output Files" menu, "Score file", gives you many output options.
See https://www.winsteps.com/facetman/scorefilescreen.htm

1021. Evaluate new items in CAT studies

Inga February 3rd, 2007, 5:02am: Hi~ I am participating in a computerized-adaptive testing project. We have given subjects our existing item banks along with additional (newly developed) items. I wonder whether there are any statistical methods to judge whether those additional items fit into the existing item bank (besides factor analysis, fit statistics, theoretical basis)? Also, how does one decide which items meet the criteria for anchor items (besides stable item parameter calibrations in a cross-plot)? Thanks a lot!


MikeLinacre: Inga: Looks like you are doing just about everything for your pilot CAT items. Do their empirical measures approximately match their intended ones? Are the distractors functioning as intended?

For the anchor items (pre-existing items in the item bank): Have their difficulties drifted? Do they still fit in a reasonable way?
Also: "Is their content still relevant?" - but this is usually detectable through drift ...

Anyone else have suggestions?

Raschmad: George Ingebo (1997) states that the accuracy of linking can be best evaluated by a 3rd test. Suppose you want to bring test A to the framework of test B. You have estimated that a shift constant of say, 0.50 logit is required. Now use a 3rd test C. Here instead of directly going from A to B, you go from A to C and then from C to B.
Estimate the shift constant for adjusting A to C, suppose it’s 0.80 logit. Then compute the shift constant for adjusting C to B, say it’s -0.30 logit. The algebraic sum of these shift constants should approximately be the same, as the shift constant when you directly adjust A to B (0.80-0.30=0.50).
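
The third-test check described above amounts to comparing the direct shift constant with the sum of the two indirect ones. A minimal Python sketch using the illustrative values from this post (the function name is hypothetical):

```python
def link_closure_error(direct_a_to_b: float, a_to_c: float, c_to_b: float) -> float:
    """Indirect shift (A to C plus C to B) minus the direct A-to-B shift, in logits."""
    return (a_to_c + c_to_b) - direct_a_to_b


# Direct shift 0.50; indirect route 0.80 and then -0.30.
error = link_closure_error(0.50, 0.80, -0.30)
print(abs(error) < 1e-9)  # the two routes agree in this illustration
```

With real data the closure error will not be exactly zero; how large an error is tolerable depends on the standard errors of the three links.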
The other issue is the characteristics of the anchor items for adding new items to the bank. WINSTEPS manual says they should be spread along the difficulty continuum.
“Anchor values may not exactly accord with the current data. To the extent that they don't, the fit statistics may be misleading. Anchor values that are too central for the current data tend to make the data appear to fit too well. Anchor values that are too extreme for the current data tend to make the data appear noisy”.

Wright & Stone (1979: 96) give fit statistics to evaluate the quality of the link.

Mike is of course the right person to comment on these.

Good luck!

zhongen_xi: I have designed another method for linking items and for checking whether the items chosen are good for linking on the performance dimension. Assume that all the items in your bank have their respective facility values, or pass rates. Take the inverse of the facility value as the item hardness. Administer a group of items from your bank, together with some additional newly developed items, to a well-chosen group of test takers. Recalculate the item hardness for each item, old and new. Calculate the mean item hardness of all the old items for both the original test-taker group and the new test-taker group. Find the ratio between the two. This ratio is the linking factor for the newly developed items. To check the fitness of the old items, calculate the difference in item hardness of each old item between the two groups of test takers, and calculate the standard deviation of the differences. This standard deviation can be used as a criterion for checking the goodness of the old items for anchoring purposes. It's apparent how it can be used.
Zhong'en Xi

1023. Checking link quality

joe January 20th, 2007, 10:52am: Dear all,

We linked 2 exams and tried to check the quality of the link using the Chi-square formula below, found in Grant Henning's A Guide to Language Testing (pp. 133, 186):

Chi-square = SUM[(diA - diB - G)^2] * (N/12) * (K/(K-1))

Where, G = the translation constant (mean difference between the link items)
diA = diff estimate for given link item from test A
diB = diff estimate for given link item from test B
N = number of persons in linking sample
K = number of items in the link

For our link, K = 10, N = 3414

The problem is that due to the large N size, the result is always very high - over 100 in this case. Is this an appropriate formula? More generally, are there other ways to check the quality of a link?

Would it be better to merge the 2 tests into a single matrix and use DIF to see if the linked items performed the same way? Or, just use the fit statistics to check the merged matrix?

Thanks in advance for any help on this!
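
For readers who want to try the statistic quoted in this post, here is a minimal Python sketch. It assumes the squared term is inside the sum (since G is the mean difference, the sum of the unsquared terms would always be zero), and the item difficulties used are made-up values, not joe's data. It also shows why the result grows with sample size: for fixed item difficulties the statistic is directly proportional to N.

```python
from statistics import mean

def link_chi_square(d_a, d_b, n_persons):
    """SUM[(diA - diB - G)^2] * (N/12) * (K/(K-1)), with G the mean difference."""
    k = len(d_a)
    diffs = [a - b for a, b in zip(d_a, d_b)]
    g = mean(diffs)                          # translation constant G
    ss = sum((d - g) ** 2 for d in diffs)    # scatter of link items around G
    return ss * (n_persons / 12) * (k / (k - 1))

# Hypothetical difficulty estimates (logits) for five link items.
d_a = [-1.2, -0.6, 0.0, 0.5, 1.3]
d_b = [-1.0, -0.7, 0.1, 0.6, 1.1]

small = link_chi_square(d_a, d_b, 300)
large = link_chi_square(d_a, d_b, 3414)
print(small, large, large / small)   # the ratio equals 3414 / 300
```

The same scatter of link items therefore gives a statistic more than eleven times larger with 3414 persons than with 300, regardless of whether the scatter matters in practice.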


MikeLinacre: Joe, thank you for your question. You have shown a global test of the equating constant based on common items. Global tests always have practical difficulties. Imagine that you have a link of 20 items: 19 link exactly and 1 item is considerably different. Then the global test will report that the link is OK, but inspection of the results will show that the 20th item must be dropped from the linking set of common items. On the other hand, imagine that the 20 items have a small random scatter about the equating constant, too small to have any practical consequences. With a big enough sample, the global fit test will fail.

So, do not use global fit tests. Use pair-wise item-level t-tests to screen the items. But before that, produce a scatterplot of the two sets of item difficulties. Do the points fall on a statistical straight line? If so, the equating will be successful. If they don't, then you need to do further investigation into the differences between the two versions of the test and the two samples of examinees.
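
One way to carry out the item-level screening is sketched below in Python. The details are assumptions, not part of the advice above: each item's difference is compared with the mean difference, the joint standard error is taken as the square root of the sum of the two squared item standard errors, and |t| > 2 is used as a rough flagging threshold. The difficulties and standard errors are hypothetical.

```python
from math import sqrt
from statistics import mean

def screen_link_items(d_a, se_a, d_b, se_b, threshold=2.0):
    """Return (item index, t, flagged) for each common item.
    t = (diA - diB - G) / sqrt(seA^2 + seB^2), with G the mean difference."""
    diffs = [a - b for a, b in zip(d_a, d_b)]
    g = mean(diffs)
    results = []
    for i, diff in enumerate(diffs):
        joint_se = sqrt(se_a[i] ** 2 + se_b[i] ** 2)
        t = (diff - g) / joint_se
        results.append((i, t, abs(t) > threshold))
    return results

# Hypothetical link: four items shift by the same amount, the fifth does not.
d_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
d_b = [-1.2, -0.7, -0.2, 0.3, 2.0]
se_a = [0.1] * 5
se_b = [0.1] * 5

results = screen_link_items(d_a, se_a, d_b, se_b)
flagged = [i for i, t, is_flagged in results if is_flagged]
for i, t, is_flagged in results:
    print(i, round(t, 2), "DROP?" if is_flagged else "ok")
```

Because a badly behaving item also pulls the mean difference toward itself, it may be worth recomputing the mean after dropping flagged items and screening again. Plotting d_a against d_b gives the scatterplot check recommended above.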

joe: As always, thanks a lot, Mike.
