Old Rasch Forum - Rasch on the Run: 2010


202. Item person map

indranibhaduri October 20th, 2010, 3:14pm: I am interested in a detailed description of the item-person map. I know that the items and the persons are mapped on the same scale: on one side we have the items and on the other side the persons. I have generated an item-person map. How many persons does each 'X' symbol represent? Also, in the map there are some 'X's representing people but no corresponding items, so without responses to those items, how were the persons mapped? I hope I am able to communicate what I am looking for. Would you please give the details?

MikeLinacre: Thank you for these questions, Dr. Bhaduri.

Each "X" is one person or one item.
If you have many people or items they are indicated by "#". Under the map it says something like: EACH '#' IS 4.

The position of each person (or item) "X" is located based on all the responses by the person (or to the item). For instance, if a person achieves 50% on the test, then that person's X will be in the middle of the item distribution. If an item has a p-value (success rate) of 50%, then that item's X will be in the middle of the person distribution.

SK: My sample has 150 participants. The item-person map says each '#' is 2, but when I count the #s and multiply by 2, I don't get 150. Can you please explain? Thank you.

Mike.Linacre: SK, please show us your item map so that we can see.

5 . +
4 . +
3 # T+T
-2 . T+ GO TO GO TO
-3 +T GO ON
EACH "#" IS 3. EACH "." IS 1 TO 2

SK: Sir, will the dot always represent 1 to 2? There are dots on my map but no description for it. Here is part of it.

.## | 6-1
-1 .# S+S
# |
| 1-2
.# | 7-2
.# |
# T|
-2 +
# |T 8-1
| 5-1
# |
-3 +
# |
-4 +
EACH '#' IS 2.

Mike.Linacre: SK, each dot is 1.

Sorry, will show this in the next version of Winsteps.

SK: Thank you sir.

216. Strength and weaknesses of Psychometric

ary December 7th, 2010, 12:18am: Hello and a very good day to all. I was just wondering if anyone could recommend a few good critical articles related to the issue mentioned above, especially those that scrutinize the application of psychometrics in behavioral science.

Thank you

MikeLinacre: Ary, what do you want?

a) Critiques of different psychometric methodologies, such as IRT, SEM, CTT ...
b) Critiques of the application of psychometric findings to practice, such as the use of IQ tests and SAT tests for decision-making.

ary: Thank you Prof Mike for your prompt reply...
I'm looking more at the issues related to the second one: misunderstanding, misuse, misinterpretation, critiques, and the reasons why some individuals are against or distrust the application of psychometrics in measuring performance, achievement and failure.
Papers that really scrutinize and mistrust the application of psychometrics :)
If you don't mind sharing, I would like some of your advice :)
From your experience, Prof, how do we go about toning down our explanations of all the important technical information, numbers and graphs that we are so used to, so that people who are not so inclined to numbers are more ready to listen and more open?
Thank you

MikeLinacre: Ary, it sounds like you are interested in the debate between "quantitative" and "qualitative" assessments. http://wilderdom.com/research/QualitativeVersusQuantitativeResearch.html

http://gking.harvard.edu/files/good.pdf answers criticisms.

https://www.rasch.org/rmt/rmt183d.htm presents a schematic look.

ary: Thanks Prof Mike for your prompt reply. By the way, Prof, as I was browsing, I saw that someone has quoted you and given the idea that very low person reliability is acceptable in instrument development and that the problem is not with the instrument. "Reliability is not associated with the instrument, it is a characteristic of the test scores or person measures for the sample you are testing".
I really hope that I have not misunderstood the whole concept. Thank you prof.

MikeLinacre: There are two aspects to test reliability, Ary.

1. The precision with which the instrument measures = the standard errors of the ability estimates (or raw scores). This is dominated by the test length = number of items. It is associated with the instrument.

2. The standard deviation of the sample being tested. This is independent of the instrument.
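
The two aspects can be combined in the usual separation-reliability form, reliability = (observed variance - error variance) / observed variance. The following is a minimal sketch under that assumption; the measures and standard errors are hypothetical, and the function name is illustrative rather than anything from Winsteps.

```python
from statistics import mean, pvariance

def person_reliability(measures, standard_errors):
    """Reliability = "true" variance / observed variance of the person measures."""
    observed_var = pvariance(measures)                   # aspect 2: sample spread
    error_var = mean(se ** 2 for se in standard_errors)  # aspect 1: instrument precision
    true_var = max(observed_var - error_var, 0.0)        # cannot be negative
    return true_var / observed_var

# Hypothetical person measures (logits); every person has S.E. = 0.5
measures = [-1.5, -0.5, 0.0, 0.5, 1.5]
wide_sample = person_reliability(measures, [0.5] * 5)

# Same instrument (same S.E.s), but a sample with one third of the spread
narrow_sample = person_reliability([m / 3 for m in measures], [0.5] * 5)

print(wide_sample, narrow_sample)
```

With the same standard errors, the narrower sample yields a much lower reliability, which is why a low value does not by itself indicate a poor instrument.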

There are situations in which low test reliability is acceptable, but it sounds like I have been quoted partially and out of context. So, that quote of me is misleading. Where does it originate? I would like to correct it.

Ben Wright suggests that a sample-independent reliability can be calculated. https://www.rasch.org/rmt/rmt144k.htm

rsiegert: Hi Mike

I read a letter to the editor that you wrote recently in a European rehabilitation journal. In it you referred to two different approaches to Rasch analysis one typified by RUMM and the other by WINSTEPS. It was very interesting to read you describe what I had observed in the literature but not been able to verbalise so succinctly. Thanks for that enlightenment.

But my point is that you referred to a tradition of quality control in industry/manufacturing when describing the "WINSTEPS" school of thought. I worked with a statistician colleague a while back who also used to talk about this model as an alternative to a strictly statistical-significance-testing way of working. I just wondered if you could please provide one or two key references or works on this alternative mode, so that I could find out more about this school of thought.

Many thanks

Richard Siegert 8)

MikeLinacre: Richard, the father-figures of quality-control statistics are W.A. Shewhart and W.E. Deming. Their practical work centers on control charts and "tolerances", which are not so interesting to us. Their philosophical work focuses on improving the quality of industrial products by rethinking the ways in which defective products inform the manufacturing process. Deming's book "Out of the Crisis" contains numerous examples. Search www.amazon.com for "W. Edwards Deming" to see lots of exciting titles.

The father of statistics, Ronald Fisher, also addressed this issue somewhat, but it did not relate much to his primary area of statistical application, fields of potatoes!

ary: Dear Prof Mike, Thank you again for the explanation.
Regarding the quotation, my sincere apologies: the mistake is on my side. The message is more one of unclear conceptualization and expression of the concept.
I am totally in agreement with Richard; your letter helped me to understand and explain things better.
I have faced a few situations where people question the difference in the outcomes from these 2 programmes. I knew that 1 is more statistically strict/rigorous than the other, but does that make the other less meaningful? Now I can explain better. Thank you.

MikeLinacre: Ary, RUMM is more statistically rigorous than Winsteps, but Winsteps is more practically useful. This situation often occurs when academia encounters the business world. You may be interested in "Quality by Design: Taguchi and Rasch", https://www.rasch.org/rmt/rmt72j.htm

ary: Thank you again Prof. May you have a Beautiful and a bless day ;D

jerrysmith: In my opinion there is no weakness in psychometric testing itself, but if it is done by an inexperienced person it may give wrong results. Otherwise psychometric testing is a very useful recruitment tool. It helps to improve the human resources of any organization.

Emil_Lundell: Excuse me, jerrysmith, are you talking about a specific theory, model or method or are you merely referring to quantitative study of individual variation?

Emil_Lundell: Weaknesses of psychometrics as often applied in social sciences

Thomas J. Scheff (2011) writes that there is an ongoing fraud in attitude/behavior research. He criticizes the misuse of statistics and notes that some research has not shown any progress during the last 50 years. He also notes that many studies are not based on "clearly defined concepts" and that quantitative researchers blame the participants for inconsistency between attitude and behavior instead of questioning their own measurement instruments.

Duncan devoted a chapter to criticizing psychometrics. He stated that psychometrics has produced a correlational psychology of inexact constructs. He seems to conclude that true score theory is applicable in physical science but not in social science. He also criticized the use of factor analysis and reliability indices of internal consistency (1984, pp. 200-218).

Schumacker & Linacre (1996) summarize the critique of factor analysis: factor analysis should only be used when factors are uncorrelated or when the researcher can't make sense of the data.

Duncan's reasoning about "staticism" (1984, pp. 226-227) suggests that researchers who believe that factor analysis is a measurement model, that reliability indices show what is successful measurement, and that statistics are a complete or sufficient basis for scientific methodology are incompetent.

The combination of these sources leads, more or less, to a "decapitation" of how many researchers apply psychometrics in social sciences. The mistrust of psychometrics is not independent of the methods researchers use.


Duncan, O. D. (1984). Notes on social measurement: Historical and critical. New York: Russell Sage Foundation.

Scheff, T. J. (2011). The catastrophe of scientism in social/behavioral science. Contemporary Sociology: A Journal of Reviews, 40(3), 264-268. Available: http://csx.sagepub.com/content/40/3/264

Schumacker, R. E., & Linacre, J. M. (1996). Factor analysis and Rasch analysis. Rasch Measurement Transactions, 9(4), 470. Available: https://www.rasch.org/rmt/rmt94k.htm

293. Ordinal to interval?

wlsherica May 16th, 2010, 1:58pm: Dear Mike:

I am still confused about the process of transforming an ordinal scale into an interval scale when we use Rasch analysis. Could you give me some hints about this?

Thank you very much.


MikeLinacre: Wisherica, the raw observations we make "naturally" are ordinal. "This is hotter than that". "This is longer than that." Then we have to apply a technique to convert the ordinal observations into interval measurements.

For length, it appears that the ancient Babylonians were the first to master "intervalling". They used a "60" based system which we still see in 60 minutes in the hour and 360 degrees in a circle.

For heat, it took from 1600 to 1800 to refine the "intervalling" process as "temperature".

For social-science observations, Georg Rasch discovered the "intervalling" process, known to us as the "Rasch model". His insight was that we base the "intervalling" not directly on the ordinal data itself, but indirectly on the ordinal data through its probability of being observed. So the Rasch model is a "probabilistic model".

A simple example of this process at work is "Log-odds in Sherwood Forest" https://www.rasch.org/rmt/rmt53d.htm

Does this give you some hints, or are you looking for something else, wisherica?

wlsherica: Thank you so much, Mike.

It's really helpful.

I'll read the paper you recommended and try to do item calibration by hand from BEST TEST DESIGN.

MikeLinacre: Wisherica, you may also want to look at Mark Moulton's Excel spreadsheet:

wlsherica: Thank you very much for your help!!!!
Now I have some ideas about the process of item and person estimates!

wlsherica: Dear Mike:

The excel demo file was really helpful!

I had a question about person and item measures in the file you provided.

Why did we get different final measures of person and item measures in "P2. Final Results" sheet and the bottom of "P1. Rasch Algorithm" sheet?

For example, the final ability of PERSON A in P2. was -2.91, but ability of PERSON A in P1. was -2.84. Why were they unequal?

Many thanks!

MikeLinacre: Wisherica ... In Excel, please press your F9 key to recalculate the spreadsheets. The values will then agree.

wlsherica: !! Wow... amazing... @o@

Thank you for your help, Mike.
Now I catch the point!!!!!

Leo: Realizing that I am late to this thread, I would still like to respond. I am new to the issue of test scaling, but not to the issue of scaling more generally. There seems to exist tremendous confusion around the interpretation of the parameters of the Rasch model. What I don't think helps a lot is reference to stories that are wrong to a pretty disturbing extent, as for example the story of Robin Hood on the www.rasch.org website.

Example: Because Robin hits the oak 10/12 and Will only 6/12, it is concluded that for each 10 hits by Robin there are 6 hits by Will, on ALL kinds of trees. This can be true, so let's go with this assumption. It means that the likelihood of hitting a tree is proportional to the likelihood of hitting the oak, say k*10/12 for Robin and k*6/12 for Will, so that the k cancels out by dividing.

Under this assumption, however, we cannot say that Will is three times as likely as Robin to MISS all kinds of trees (which is what the story would like us to believe). The ratio of the probabilities of missing is (1-k*6/12)/(1-k*10/12), which equals 3 only when k=1, i.e., for the oak. For any other k, the ratio is different.

The story refers to this as a "paradox", but, as shown above, it is not one. More disturbing, however, is that it is used as a starting point to make a claim for interpreting Rasch parameters on an interval scale. Confusing? I'd like to think so.

Does the website www.rasch.org have any authority?

Mike.Linacre: Thank you for your insights, Leo.

I'm the author of "Log-Odds in Sherwood Forest". The intent of that research note is to emphasize that "counts or proportions of successes" and "counts or proportions of failures" cannot be stable indicators of ability or difficulty. "Odds of success" (or "odds of failure" or "log-odds of success" or "log-odds of failure") can be stable indicators of ability or difficulty.

Log-odds have the additional advantage that they can be additive.
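
A small numerical sketch of the stability claim, assuming the dichotomous Rasch model. The 10/12 and 6/12 success rates on the oak come from the thread; the harder targets (1 and 2 logits more difficult than the oak) are hypothetical.

```python
import math

def logit(p):
    """Log-odds of success."""
    return math.log(p / (1 - p))

def success_probability(ability, difficulty):
    """Dichotomous Rasch model: P(success) for ability and difficulty in logits."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# Abilities in logits, taking the oak as difficulty 0
robin = logit(10 / 12)
will = logit(6 / 12)

success_ratios = []   # Robin's success rate / Will's success rate
log_odds_gaps = []    # Robin's log-odds minus Will's log-odds
for difficulty in (0.0, 1.0, 2.0):   # oak, then two hypothetical harder targets
    p_robin = success_probability(robin, difficulty)
    p_will = success_probability(will, difficulty)
    success_ratios.append(p_robin / p_will)
    log_odds_gaps.append(logit(p_robin) - logit(p_will))

print(success_ratios)   # changes from target to target
print(log_odds_gaps)    # stays at log(5) for every target
```

Under the model, the ratio of success proportions drifts as the target gets harder, while the log-odds difference between the two archers is the same for every target.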

That research note expresses in nontechnical terms the mathematics of Georg Rasch in his "Probabilistic Models for some Intelligence and Attainment Tests" (Copenhagen, 1960 & Chicago, 1980).

Leo, the development of Rasch methodology has been greatly assisted by robust criticisms. Kenneth Royal, The Editor of "Rasch Measurement Transactions" will undoubtedly welcome a vigorous critique of "Sherwood Forest". Email a Word document to Editor -/at\- rasch dot org

471. Test Information Function

arthur1426 November 28th, 2010, 7:43pm: To Whom It May Concern:

Within the winsteps.com site, it was reported that, "In practice, the values of the information function are largely ignored, and only its shape is considered." However, Baker (2001) reports: The maximum value of the amount of test information is approximately 4.2, which yields a standard error of estimate of .49...Since this test has only ten items, the general level of the test information function is at a modest value, and the precision reflects this.

I guess my question would be: is there a standard by which to judge the amount of information that a test provides? I understand that, in evaluating the TIF, one is concerned with WHERE along the latent trait one finds the most information. But surely, if you have two TIFs with approximately the same shape, and one has a value of 10 and the other a value of 4, the latter test is less precise?

Any input on this matter would be greatly appreciated.


MikeLinacre: Thank you for your question, Robert.

At any point on the latent variable, the standard error of a person estimate (on the complete set of items) is the inverse square-root of the Test Information Function. So,

TIF = 4, person measure S.E. = 0.50 - this is equivalent to a typical test of 25 dichotomous items
TIF = 10, person measure S.E. = 0.32 - this is equivalent to a typical test of 63 dichotomous items
TIF = 100, person measure S.E. = 0.10 - this is equivalent to a typical test of 625 dichotomous items, so is rarely observed in practice!
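
The arithmetic behind these three lines can be sketched as follows. The average of 0.16 information units per dichotomous item is an assumption inferred from the figures above (TIF = 4 for 25 items); it is not stated in the post.

```python
import math

def standard_error(tif):
    """S.E. of a person measure = 1 / sqrt(test information) at that location."""
    return 1 / math.sqrt(tif)

def equivalent_items(tif, info_per_item=0.16):
    """Approximate number of typical dichotomous items giving this information.

    info_per_item=0.16 is an assumed average, inferred from 4 / 25 above.
    """
    return tif / info_per_item

for tif in (4, 10, 100):
    print(tif, round(standard_error(tif), 2), round(equivalent_items(tif), 1))
```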

The total amount of information provided by a test (the area under the TIF) is determined by the number of items and the number of ordered categories within each item. So, more items -> more information, and polytomies have more information than dichotomies.

The desired shape of the TIF depends on our intentions. If we want a wide test with uniform precision (= uniform S.E.), then we want a wide flat TIF. If we want a test that is precise (= lower S.E.) at cut-points, then we want the TIF to peak at those cut-points.

Baker (2001) reports a TIF of 4.2 (S.E. = 0.49) with a 10 item test. If this was a 10-item dichotomous test, then a typical maximum TIF value would be 2.0. So my guess is that Baker's 10 items are polytomies, perhaps 5-category Likert items.

ppires85: Hello,

I also have a question about the TIF, this time about its shape. I have data on a small sample of 350 subjects and I got a weird TIF that reaches a maximum value of 6.50 but gets very narrow at TIF=3.0. It's weird because I even got "U"-shaped curves, so I fear I'm having fit problems, maybe because of sample size and quality. I know these curves won't be perfectly bell-shaped, but what I got was somewhat grotesque.



MikeLinacre: Thank you for your question, Pedro. The Test Information Function (TIF) is independent of the fit of the data to the model (any model). It is dominated by the distribution of the item difficulties on the latent variable. For instance, if we have 5 very easy items and 5 very difficult items, then the TIF will be bimodal with one peak above the easy items and one peak above the difficult items.
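
A minimal sketch of the bimodal example, assuming dichotomous Rasch items. The difficulties of -3 and +3 logits are hypothetical stand-ins for "very easy" and "very difficult".

```python
import math

def item_information(theta, difficulty):
    """Information of one dichotomous Rasch item at ability theta: p * (1 - p)."""
    p = 1 / (1 + math.exp(-(theta - difficulty)))
    return p * (1 - p)

def total_information(theta, difficulties):
    """Test information = sum of the item informations at theta."""
    return sum(item_information(theta, d) for d in difficulties)

# Hypothetical test: 5 very easy items and 5 very difficult items
difficulties = [-3.0] * 5 + [3.0] * 5

tif_at_easy = total_information(-3.0, difficulties)
tif_at_middle = total_information(0.0, difficulties)
tif_at_hard = total_information(3.0, difficulties)

print(tif_at_easy, tif_at_middle, tif_at_hard)
```

The function dips in the middle and peaks over each cluster of items, regardless of how well any data fit the model, because only the item difficulties enter the calculation.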

ppires85: Thank you professor! So if I want more information about the people with ability located at 0 logits, should I aim for items of average difficulty? Another question: where can I obtain the TIF's numerical values, such as the exact amount of information in my test and items? I said 6.5 simply from eyeballing the graph.

Another point is that I'm doing some analysis on a kind of screening test. It's a test designed to assess child development, so we expect the items to be very easy and the responses to have small variability. What I noticed, though, is that the Item Information Functions all look equal, with no change from item to item. Could it be that, since we expect the subjects to answer most questions correctly, we are aiming at basically the same ability score and thus have little or no variability in item information either? To tell the truth, all Item Information Functions are equal for each dimension I tested here. Where can I also obtain the IIFs' specific data?

Thank you again professor!

Mike.Linacre: PPires, the information functions are based on the parameter values. They are independent of the number of observations or the targeting of the items on the sample.

If you know the IRT model, and the parameter values, then you can compute the item information functions. Sum the item information functions to make the test information function. See http://echo.edres.org:8080/irt/baker/chapter6.pdf

550. item difficulty calculation

kulin August 25th, 2010, 3:36pm: Kulin, this is a great project!

How are the two sets of items, and the two sets of students, related ?

If the two sets of items share some items, then one analysis of all the data is best (using common items).

If the two sets of persons share some persons (and the studies were conducted reasonably close together) then one analysis of all the data is best (using common persons).

If the two sets of items are different, and the two sets of persons are different, then you will probably need to use a "Fahrenheit-Celsius" virtual equating of the two data sets:

1. Do two separate Rasch analyses.
2. Identify 5 or more pairs of items (one in each study) which are approximately equivalent in content.
3. Cross-plot the Rasch difficulties of the pairs of items.
4. Use the best-fit line through the pairs to convert the Rasch difficulties for the items in study 2 into Rasch item difficulties for study 1.
5. Do the regression analysis using study 1 item difficulties for all the items.
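
Steps 3 and 4 can be sketched as below. This is only an illustration: ordinary least squares is assumed as the "best-fit line" (other choices, such as a line through the means with slope equal to the ratio of the standard deviations, are also used), and all item difficulties are hypothetical.

```python
def best_fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical difficulties (logits) of 5 pairs of content-equivalent items
study2_pairs = [-1.0, -0.5, 0.0, 0.5, 1.0]    # x-axis: study 2 scale
study1_pairs = [-0.6, -0.05, 0.5, 1.05, 1.6]  # y-axis: study 1 scale

slope, intercept = best_fit_line(study2_pairs, study1_pairs)

def to_study1_scale(study2_measure):
    """Convert any study 2 measure (item or person) to the study 1 scale."""
    return slope * study2_measure + intercept

# The conversion applies to every study 2 item, not only the paired ones
converted = [to_study1_scale(d) for d in (-2.0, 0.25, 1.75)]
print(slope, intercept, converted)
```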

OK :-)

kulin: Dear Mike,

Your answer is a great help for me - THANK YOU.

The overall characteristics of the students sample were similar in both cases (representative samples of eighth grade students from my country). But there were no students who participated in both studies :'(...

The assessment frameworks were similar in both studies (but not identical). But completely different items were used :'(...

The difference in Rasch item difficulty (ONE ANALYSIS-SEPARATE ANALYSES) is about 10%...

It seems like I will finally have to perform the test equating procedure.

Thank you for encouraging me - it is my first big project...

MikeLinacre: This is an awkward situation, kulin.

Usually we like to think that the two samples are randomly selected from the same population, so that the samples will have equal means and variances. Unfortunately this does not appear to be true in your project.

kulin: Dear Mike,

Both studies used a stratified, multistage, cluster-sampling technique and the students samples were representative (samples came from the population of eighth grade students in both cases and from the same country).

I carried out the test equating procedure. The slope value is 1.1, the y-intercept=0.515 and the x-intercept=0.468.

Using the best fit line I could re-calculate only the item difficulties for the study-B items that have an analogue in the study A item sample (I know the x-value in these cases).

In the approximation that the slope value is near 1 I can simply re-calculate the item difficulties for all study B-items. I should re-calculate item difficulties for ALL studyB-items, shouldn't I???

I'm sorry for bothering you - I hope that this is my last question...

MikeLinacre: Yes, Kulin. Rescale all the Study B person abilities and item difficulties into Study A values, then all the measures from Study A and Study B are directly comparable.

Juan: Dear Mike

I have a similar project, academic readiness, where I would like to test the validity and reliability of a Likert-type survey taken by two separate cohorts (2010 and 2011). In each case I got the item measures of each of the 22 constructs, which I subsequently compared visually with a graph in Excel. The item measures compared very well with each other. Is this comparison legitimate?


MikeLinacre: That is a valid comparison, Juan-Claude. Seeing is believing! But if you want to make the comparison more formal, then correlate the two sets of item difficulties.
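
A minimal sketch of the more formal comparison, using a plain Pearson correlation. The item measures for the two cohorts are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of measures."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical item measures (logits) for the same items in two cohorts
cohort_2010 = [-1.2, -0.4, 0.1, 0.6, 1.3]
cohort_2011 = [-1.0, -0.5, 0.2, 0.5, 1.4]

r = pearson(cohort_2010, cohort_2011)
print(round(r, 3))
```

A correlation near 1 indicates that the item hierarchy was reproduced across the two cohorts.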

Juan: Thank you for the advice Mike.

I'm currently developing "risk profiles" based on each factor's item measures, correlated with academic achievement and retention. We want to identify students in need of support at various levels.

As a side note I am registering for the advanced Rasch course starting in July.

MikeLinacre: Wonderful, Juan-Claude. Another reason to look forward to July :-)

570. Weighting or Deleting Items

uve December 13th, 2010, 7:07pm: Mike,

Students will take our tests, and afterwards we'll notice errors on the answer key. We spot these quickly by looking at the correlations. I could delete the item in my analysis, but those students did answer the question. This becomes even more confusing when I begin anchoring items. I can't decide whether I should delete or weight certain items as 0. Or does it make much difference?

MikeLinacre: Deleting an item or weighting it at zero are the same for the person estimates, Uve. Responses to those items are not included. Deleted items have no difficulty estimates or fit statistics. Items weighted 0 do have those statistics.

uve: Since I am more focused on the items and anchoring, it sounds like weighting would be better.

oosta: Uve:

I don't quite understand why you would keep the item if it's a bad item and you are not scoring it anyway.

571. Constructing a Better Test Version

uve December 27th, 2010, 8:13pm: Dear Mike,

I have 4 questions:

1. I think I understand what the Test Information Function is, but I'm not sure I know what to do with it. So what exactly is this used for typically?
2. Same with item information.
3. Is the peak of the item information on the x axis the same value as the item measure?
4. I know this next question is very open ended, so forgive me, but based on the item maps, if I find that a test is poorly targeted and I have a large item bank with logit values based on previous test versions or perhaps field tested in some manner from which I can choose to create a new version, can Winsteps guide me in creating this new version which will target better?


MikeLinacre: Uve, test information is not usually important for Rasch model decision-making. It is important for 2-PL and 3-PL models. In the Rasch model, the item information for every dichotomous item is the same, and it peaks at the item difficulty. So we can infer the test information by looking at the distribution of the items. The test information is higher where there are more items.

For a dichotomous item, the item information is p*(1-p) where p is the probability of success.
Its peak at p=0.5 is 0.25. It is above 0.2 from p=0.3 to 0.7. This tells us that, for practical purposes, any item within 1 logit of a person (p=0.73) has effectively the same item information as a perfectly targeted item. This is useful information for item selection. We can randomize the selection of an item within 1 logit of the optimum and not lose much testing efficiency, but gain considerably in test security, item exposure, and avoidance of CAT "item tracking" (giving the same stream of items to many examinees).
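
The figures in this paragraph can be checked with a few lines, assuming the dichotomous Rasch model. The function names are illustrative only.

```python
import math

def information_from_p(p):
    """Item information of a dichotomous item at success probability p."""
    return p * (1 - p)

def p_success(offset):
    """Rasch success probability when the person is `offset` logits above the item."""
    return 1 / (1 + math.exp(-offset))

peak = information_from_p(0.5)
info_at_p30 = information_from_p(0.3)
info_at_p70 = information_from_p(0.7)

p_one_logit = p_success(1.0)                    # person 1 logit above the item
info_one_logit = information_from_p(p_one_logit)
relative_efficiency = info_one_logit / peak     # compared with perfect targeting

print(peak, info_at_p30, info_at_p70)
print(round(p_one_logit, 2), info_one_logit, relative_efficiency)
```

An item 1 logit off target still carries most of the information of a perfectly targeted one, which is what makes randomized selection within that range affordable.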

In general, the most statistically efficient test targets the items on the persons = success rate of 50%. The most psychologically efficient test has a success rate of around 80%, so the items are targeted about 1.4 logits below the persons. (Psychological efficiency = minimizes response misbehavior, and persons have a good feeling after completing the test: "The test was challenging, but I did my best.")
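
The "about 1.4 logits" figure is the log-odds of an 80% success rate. A short sketch, again assuming dichotomous Rasch items:

```python
import math

def offset_for_success_rate(p):
    """Logits by which a person must exceed the item difficulty to succeed with probability p."""
    return math.log(p / (1 - p))

offset_50 = offset_for_success_rate(0.5)   # statistically most efficient targeting
offset_80 = offset_for_success_rate(0.8)   # "psychologically efficient" targeting

# Item information at each targeting, p * (1 - p)
info_50 = 0.5 * (1 - 0.5)
info_80 = 0.8 * (1 - 0.8)

print(offset_50, round(offset_80, 2), info_80 / info_50)
```

Targeting for 80% success therefore trades away roughly a third of the per-item information compared with 50% targeting.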

uve: Mike,

Maybe I am mistaken, but I seem to recall that Winsteps can simulate a 2 or 3PL model. If so, are there syntax commands for this?

I seem to also recall that one advantage of the 2 and 3PL models is that they can provide probabilities for guessing on an individual item based on the ability level of the person. Is this correct?


MikeLinacre: Uve, Winsteps can estimate these values: 2-PL discrimination, 3-PL discrimination and guessing, 4-PL discrimination, guessing and carelessness. But Winsteps does not use those values when estimating the person measures (thetas) as 2-PL etc. do.

There is rarely enough relevant data to estimate the lower asymptote, c, in a real 3-PL analysis, so its value is often set at c = 1/(number of options). When the 3-PL item parameters have been discovered, the probability of a correct answer (which includes lucky guessing) for any individual on any item can also be estimated.
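
For illustration, the standard 3-PL response function with the lower asymptote fixed at c = 1/(number of options) can be sketched as follows. The item parameters are hypothetical, and the optional scaling constant D = 1.7 that some parameterizations include is omitted here.

```python
import math

def three_pl(theta, a, b, c):
    """3-PL probability of a correct answer.

    theta: person ability, a: discrimination, b: difficulty,
    c: lower asymptote (probability of a lucky guess).
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical 4-option multiple-choice item
c = 1 / 4
a, b = 1.0, 0.0

p_low = three_pl(-20.0, a, b, c)   # very low ability: approaches c
p_mid = three_pl(0.0, a, b, c)     # ability equal to difficulty
p_high = three_pl(20.0, a, b, c)   # very high ability: approaches 1

print(p_low, p_mid, p_high)
```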

572. Contrast

uve December 27th, 2010, 8:27pm: Dear Mike,

I have been reviewing your PDFs from the last Further Topics course now that I have more time. I was hoping you could clarify a statement you made regarding contrasts on the "further3cc" slide 128, where you stated you were wrong in your initial speculation about where the contrast was in your example. I found it difficult to understand what tipped you off that it was revealed by the 3rd contrast.

Is it the loading values? In order for a contrast to be valid do they need to be close in value but in positive/negative loadings? For example: .51 and -.51. I realize we should also have the item labels so we can make further decisions, but basing this primarily on loading, is there a general range that is acceptable? For example: would .51 and -.41 be acceptable? Again, I'm not even sure if loading had anything to do with your decision or not.

As always, thanks for your help.


MikeLinacre: Uve, my speculation was that the contrast (dimensionality) was between "Information" and "Interest" items. So I looked down Table 23 until I found a plot which split the Information and Interest items. The size of the item loadings did not matter, provided that they were opposite. Table 23.13 showed the split I had expected to see in Table 23.3.

We are interested in the amount of variance explained, not in the size of the loadings. Loadings are correlations with a hypothetical variable. If we were interested in the size of the loadings, we would do the usual common-factor analysis tricks (communalities, rotations, obliquenesses). We are interested in identifying the groupings of the items which explain the most residual variance. The loadings assist us with this, but only for stratifying the items.

573. winsteps stopped working

Jane_Jin December 20th, 2010, 5:57am: Hello,

I ran Winsteps with a control file, and it always stopped at the step "processing Table 30.4". A window popped up saying "Winsteps.exe has stopped working. A problem caused the program to stop working correctly. Please close the program." I couldn't figure out what's wrong with it. Can anyone help me out with this? Thank you!

MikeLinacre: Jane, yes, I will do my best to help you ... :-)

What version of Winsteps are you using?

Please email your Winsteps control and data file to me, so that I can replicate the problem ...

Jane_Jin: Thank you Mike, I've sent the email with data and control files. ;)

MikeLinacre: Thank you, Jane. Problem solved. It was a bug in Winsteps 3.66. The current version, Winsteps 3.70, runs fine.

574. Cheating

uve December 9th, 2010, 4:44pm: Dear Mike,

It has come to my attention that perhaps many of our students were given copies of one of our biology tests in advance as a study guide. The number of students who scored proficient seems abnormally high. I'm feeling a bit overwhelmed attempting to see if there is a great deal of misfit in the data because we had over 1,100 students take the exam. I've looked at tables 6.1-6.6, but there doesn't seem to be anything there that stands out. Winsteps identified about 34 persons that had unexpected correct response strings but that doesn't explain the huge increase in proficiency. It could be that the teachers have suddenly done an outstanding job and we're just seeing the fruits of their labor. I would love for that to be true so I'm just trying to make sure I can't disprove this. I know there is no "possible cheating" table in Winsteps--this is a process, but with so many scores I'm having a hard time knowing where to start. I would greatly appreciate any advice you have on this matter.


MikeLinacre: Uve, a technique like https://www.rasch.org/rmt/rmt61d.htm has proved insightful. Those who were given copies would tend to fall in a different quadrant from those who did not.

For each pair of students, try plotting: "average of the two raw scores" against "% of questions they both answered correctly".

uve: As always, a million thanks. But as always, answers often create more questions. I understand your link in theory, but I can't quite wrap my head around how I would match each possible pair of over 1,100 students on 50 questions to each other to see if one or more pairs share an outlier level of common response strings given their ability levels. Is this possible using the scatterplot function in Winsteps? Or should I use a different approach?

MikeLinacre: Uve, the response-string pairing is not a job for Winsteps. But perhaps it should be!

We load the data file into an array in memory, and then do a double-loop to match every response string with every other response string. This produces two numbers:
(1) the average of the two counts of correct responses (one count for each response string) = x-axis coordinate
(2) the count of correct responses that both response strings share = y-axis coordinate

This produces 1100 * 1099 / 2 (x,y) co-ordinates. These we can plot.

This is a straight-forward task in most computer languages or with an Excel VBA macro.
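
The double loop described above can be sketched in Python. This is a minimal illustration, not a Winsteps feature: the function name `pair_coordinates` and the tiny `demo` dataset are made up here, and responses are assumed to be already scored 0/1 with one list per student.

```python
from itertools import combinations

def pair_coordinates(responses):
    """Return (i, j, x, y) for every pair of response strings.

    x = average of the two raw scores (counts of correct responses)
    y = number of items that both students answered correctly
    """
    scores = [sum(r) for r in responses]
    coords = []
    for i, j in combinations(range(len(responses)), 2):
        x = (scores[i] + scores[j]) / 2
        y = sum(a & b for a, b in zip(responses[i], responses[j]))
        coords.append((i, j, x, y))
    return coords

# Made-up scored data: 3 students, 5 items (1 = correct, 0 = incorrect)
demo = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
]
coords = pair_coordinates(demo)
for i, j, x, y in coords:
    print(i, j, x, y)
```

With 1,100 students the loop yields 1100 * 1099 / 2 = 604,450 points. The (x, y) columns can be written to a file and plotted in any charting tool; pairs with an unusually high y for their x are the ones to inspect.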

uve: Mike,

I would highly recommend adding the feature if possible. In the U.S. the issue of teacher effectiveness is becoming a hot topic. More and more assessment data are being used in part to measure this. Many teachers feel the pressure and unfortunately some of them resort to unethical tactics, like "encouraging" students to mark certain answers, or as in our case, providing test booklets ahead of time as study guides for the very same test. Cheating is going to become a much bigger problem and we are even seeing it on the state assessments. I think any tools we can have at our disposal to help us in this type of investigation would be extremely helpful. Thanks again.

MikeLinacre: Thanks for the encouragement, Uve. Yes, this type of response-comparison will be a useful addition to fit analysis.

575. Deciding that items are equivalent

connert December 10th, 2010, 3:39am: So if you have a questionnaire with several items that have multiple responses and you want item difficulty to be the same for all items, how do you test for that? I have considered ANOVA but am not sure that is the most appropriate or even a valid test.

oosta: What do you mean by multiple responses? Can you provide an example of an item?

Why do you want item difficulty to be the same for all items?

MikeLinacre: Connert, are you asking for a hypothesis test that items have the same difficulty (after adjusting for measurement error)?
If so, this is a "fixed effects" chi-square test.
See the box in https://www.rasch.org/rmt/rmt62b.htm
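
One common form of such a test is the inverse-variance-weighted homogeneity chi-square, sketched below. This is an assumption about the general shape of the test, not a transcription of the linked box; the function name and the example measures and standard errors are invented for illustration.

```python
def homogeneity_chi_square(measures, standard_errors):
    """Test the hypothesis that all measures share one common value.

    Each measure is weighted by 1 / SE^2. Under the hypothesis, the
    statistic is approximately chi-square with (number of items - 1) d.f.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    common = sum(w * d for w, d in zip(weights, measures)) / sum(weights)
    chi_sq = sum(w * (d - common) ** 2 for w, d in zip(weights, measures))
    return chi_sq, len(measures) - 1, common

# Invented item difficulties (logits) and standard errors for 5 items
difficulties = [0.2, -0.1, 0.3, 0.0, -0.4]
errors = [0.2, 0.2, 0.2, 0.2, 0.2]
chi_sq, df, common = homogeneity_chi_square(difficulties, errors)
print(round(chi_sq, 2), df)
```

The statistic is compared with the chi-square critical value for the stated degrees of freedom (9.49 at the 0.05 level for 4 d.f.), so these invented items would not be shown to differ in difficulty.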

connert: Thanks for the response Mike. My situation is that I have 5 vignettes that are different in some details but otherwise the same. For each vignette the respondents give a polytomous response. I would like to merge the data and treat the 5 vignettes as 1.

MikeLinacre: Is this what you are intending?

There are 5 vignettes, each with a rating scale from 1 to 5.

A. We merge the vignettes to produce a super-vignette with a rating scale from 5 to 25.
The rating scale for the super-vignette is unlikely to function optimally.


B. We average the vignettes to produce a mean vignette with a rating scale from 1 to 5.
The rating scale will probably function well, but much measurement information will be lost, so the test reliability will diminish considerably.

576. Speeded tests

oosta December 9th, 2010, 9:37pm: I am analyzing the response data for a speeded test. It is a multiple-choice math test. There are 65 items. Almost everyone (N=160) reached the midway point, but the number of people reaching each item after that goes down. Only 10% finish the test.

I assume that, when Winsteps estimates the item difficulty parameter, each response has equal influence in the calculations. I hope that is correct. In my speeded test, I don't want the items near the end of the test, which few people answer, to have the same influence as the items at the beginning which almost everyone answers.

Do you have any words of wisdom for analyzing speeded tests?

MikeLinacre: Speeded tests are a challenge to the analyst, Oosta.
The first question to answer is: "What is the test intended to measure?" Often the test constructors can only answer vaguely: "arithmetic ability", but not specifically enough to answer "Then why did you choose a speeded test instead of an unspeeded test?"

If they really want a speeded test, then unreached items are definitely wrong.

But if the speeded aspect is incidental, then we want to code two different types of skipped items.
1) Intermediate skipped items are wrong - scored 0.
2) Unreached items are not administered - scored neither right nor wrong
An approach is two different codes: "S" for intermediate skips. "U" for unreached items. Then use the different scoring rules.

Do this for investigating fit, item difficulty etc.
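
The "S"/"U" recoding can be sketched as follows. This is a hedged illustration done outside Winsteps: it assumes a blank marks a missing response, that "S" and "U" are not themselves answer options, and the function names and answer key are hypothetical.

```python
def code_missing(responses, missing=" "):
    """Recode missing responses: 'S' = intermediate skip, 'U' = unreached.

    Every missing response after the last answered item counts as unreached.
    """
    coded = list(responses)
    last_answered = -1
    for idx, r in enumerate(coded):
        if r != missing:
            last_answered = idx
    for idx, r in enumerate(coded):
        if r == missing:
            coded[idx] = "S" if idx < last_answered else "U"
    return "".join(coded)

def score(coded, key):
    """Score a recoded string: correct = 1, incorrect or 'S' = 0,
    'U' = None (treated as not administered)."""
    scored = []
    for r, k in zip(coded, key):
        if r == "U":
            scored.append(None)
        elif r == "S":
            scored.append(0)
        else:
            scored.append(1 if r == k else 0)
    return scored

recoded = code_missing("AB C D   ")
scored = score(recoded, "ABCCDDABC")  # hypothetical answer key
print(recoded, scored)
```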

If you want something more elaborate, you could code:
correct answers: 2
incorrect and "S": 0
"U": 1

oosta: Thanks!

In this case, it was not meant to be speeded. This was a pilot test, and the administration time was simply too short. I was not involved in the administration time decision; I would have set the administration time so there was time for almost everyone to answer all items.

Because this administration was a pilot test, the scores do not count. Therefore, we are interested only in getting accurate item stats. So, your first approach works best in my situation (skipped=0, unreached=not administered).

MikeLinacre: Yes, that looks correct to me. You may have some problem with guessing towards the end of the response strings. If so, please use CUTLO= to screen out lucky guesses.

oosta: There is no penalty for incorrect answers. So, I wonder if I should use CUTHI as well--at least for analyzing the operational test data. Some of the hi-ability people (as well as everyone else) might randomly guess on the last few items if they are almost out of time. I know I would guess. If I have 30 seconds left, and I have 10 more items to answer, then I am going to quickly answer those items without even reading them.

Alternatively, I suppose I could, perhaps using Winsteps Table 6.4 (Most Misfitting Response Strings) and resorting the columns so they are in item entry order, roughly determine at which item an examinee probably started randomly guessing. Then I could change the random responses (at the end of the test) to missing in the raw data.

MikeLinacre: Yes, oosta. CUTHI= is intended to remove careless mistakes (incorrect responses to very easy items), so speeded responses on the last few items can generate these.

Data editing is a slow process. https://www.rasch.org/rmt/rmt61e.htm describes the process performed by one analyst.

577. Concurrent Equating

Raschmad November 21st, 2010, 11:50am: Dear Mike,
Why wasn't it possible to do concurrent, one-step equating 40 years ago?
Was it because of the computers or estimation methods that were not robust to missing data?


MikeLinacre: Good question, Raschmad.

Classical Test Theory is based on complete, rectangular datasets. So combining two overlapping datasets into one analysis requires the not-administered data points to be imputed. This was attempted, but it is difficult and contentious. There is no agreement about which method of imputation is "correct".

Rasch methodology was able to accept missing data starting with the "pairwise" estimation method (now used in RUMM). This method was suggested by Georg Rasch, and pioneered by Bruce Choppin in the 1970s. But the first Rasch (or IRT, I believe) program that could routinely process missing data was Microscale, which Ben Wright conceptualized about 1983 (and for which I wrote the software).

Surprisingly, the main impediment was not computers or estimation methods. It was the assumption by theoreticians and practitioners that datasets should be rectangular and "complete". Most Rasch theoretical literature continues to be written from that perspective. So, for instance, we see Rasch papers containing concepts expressed in terms of matrix arithmetic (on complete rectangular datasets) that cannot be applied when there is missing data.

A subsequent development, with the "Facets" software in 1986, was allowing data designs in which the data are rectangular but some cells have more than one observation in them, or where the data are not rectangular.

578. book on computational algorithms

marten November 9th, 2010, 7:49am: Hi

I wish to write my own estimation programs in Java, perhaps using UCON or JMLE; and then perhaps later add some DIF and fit features.

What book is the best entry in the area of estimation algorithms? Would it still be Wright and Masters Rating Scale Analysis?

At this stage, I'm not particularly interested in substantive meaning, just the estimation algorithms.



MikeLinacre: Yes, Marten. Wright and Masters "Rating Scale Analysis" is the best for Rasch polytomous estimation.

For dichotomies, JMLE is straight-forward -
if your data are a complete rectangle = https://www.rasch.org/rmt/rmt91k.htm
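
For orientation, here is a minimal JMLE sketch for complete dichotomous data. It is an illustration under stated assumptions, not the algorithm as written in the linked note or as implemented in Winsteps: extreme (zero or perfect) scores must be removed beforehand, item difficulties are centered at zero, Newton-Raphson steps are capped at 1 logit, and no bias correction is applied. The function name and the dataset are invented.

```python
import math

def jmle(data, max_iter=2000, tol=1e-8):
    """Joint maximum likelihood estimates for a complete 0/1 matrix
    (rows = persons, columns = items). Returns (abilities, difficulties)."""
    n_persons, n_items = len(data), len(data[0])
    r = [sum(row) for row in data]                                # person raw scores
    s = [sum(row[i] for row in data) for i in range(n_items)]    # item raw scores
    if any(x in (0, n_items) for x in r) or any(x in (0, n_persons) for x in s):
        raise ValueError("remove persons and items with extreme scores first")

    b = [math.log(x / (n_items - x)) for x in r]      # starting abilities
    d = [math.log((n_persons - x) / x) for x in s]    # starting difficulties
    mean_d = sum(d) / n_items
    d = [x - mean_d for x in d]

    def prob(ability, difficulty):
        return 1.0 / (1.0 + math.exp(difficulty - ability))

    def step(observed, expected, variance):
        return max(-1.0, min(1.0, (observed - expected) / variance))

    for _ in range(max_iter):
        new_b = []
        for n in range(n_persons):
            p = [prob(b[n], d[i]) for i in range(n_items)]
            new_b.append(b[n] + step(r[n], sum(p), sum(q * (1 - q) for q in p)))
        new_d = []
        for i in range(n_items):
            p = [prob(new_b[n], d[i]) for n in range(n_persons)]
            new_d.append(d[i] - step(s[i], sum(p), sum(q * (1 - q) for q in p)))
        mean_d = sum(new_d) / n_items
        new_d = [x - mean_d for x in new_d]           # keep mean difficulty at 0
        change = max(abs(x - y) for x, y in zip(b + d, new_b + new_d))
        b, d = new_b, new_d
        if change < tol:
            break
    return b, d

# Invented data: 6 persons by 4 items, no extreme scores
demo = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]
abilities, difficulties = jmle(demo)
print([round(x, 2) for x in abilities])
print([round(x, 2) for x in difficulties])
```

At convergence each person's expected score matches the observed raw score, and likewise for each item. Handling missing data would mean summing only over the observed responses in each loop.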

marten: Thanks Mike

I will therefore read the Green, Orange and Beige books; they didn't give your book a very colorful cover :)

I am creating a simple self contained little item banking tool in which I want to incorporate a simple estimation algorithm; there will be some missing data.

In regard to your courses, which are appealing, what proportion would you say is focused on general Rasch techniques and principles and how much on Winsteps? That is, how much use will the courses be if I don't use Winsteps on completion?



MikeLinacre: Marten, if you can decode BASIC then the estimation part of www.rasch.org/memo40.htm will be helpful.

My Online Courses are aimed heavily at the practical application of Winsteps and Facets to perform Rasch analysis. David Andrich's Courses are the same thing for RUMM2030.

For general Rasch techniques and principles, the large amount of documentation at www.rasch.org/memos.htm is a great resource

marten: Thanks Mike

That should keep me busy for a while.



580. Item difficulties, "weighting", and Rasch scores

pjiman1 October 25th, 2010, 6:53am: I am working with a student who is trying hard to understand Rasch. One of her questions is let's say there is a 6 item scale, with 5 items that are all easy, and 1 item that is difficult. Person A gets 3 of the 5 of the easy items correct, and the 1 difficult item correct. Person B gets 5 out of the 5 easy items correct, but did not get the difficult item correct.

She feels that in Rasch, Person A's person measure would be higher than Person B's only because Person A happened to get the difficult item correct while Person B did not get the difficult item correct. In her view, this means that the difficult item is "weighted," meaning that if a person gets the difficult item correct, the person's person measure is pulled upward. In her view, this scenario implies that the Rasch person measures give the more difficult items more "weight" than the easy items. However, if a person by chance happens to get the more difficult item correct, the score is pulled upward, which may not be an accurate reflection of the person's true score. She is concerned that Rasch "weighs" difficult items more favorably than easy items when computing the person measure.

Her original question is pasted below:
(1) How do item difficulties contribute to a person's ability estimate in Rasch analysis? I thought ability estimates were simply the logarithmic transformation of a raw score (so 90/100 becomes the log of 90 to 10), but this doesn't account for the difficulty of the items an individual endorsed.

(2) If item difficulties DO contribute to person ability estimates in Rasch, is it correct to say that more difficult items are "weighted" more (e.g., that more difficult items contribute more to the ability estimate than easier items)?

For instance, say there is one rare psychological symptom of depression that tends to occur in patients with higher levels of other depressive symptoms. However, conceptually, this rare symptom is not considered the "most important" marker of depression... it's more of an occasional aberration. Why should this symptom have more "weight" in the person's Rasch ability estimate just because it is rare?

My response to this is first, I think she is mixing classical test theory definitions with the Rasch model. The term "weighting" is not the same in classical test theory as it is in the Rasch model. I think she is interpreting this "weighting" issue as similar to a "bias" issue, where scores are biased upward because a person happens to endorse more of the difficult items than the easy ones. So my first response is that she needs to put aside classical test theory definitions when learning about Rasch. I do concede that it is much easier said than done.

Second, raw scores under classical test theory should not be considered as having the same properties as person and item measures. A person measure is best thought of as a location along the variable's continuum. That is, a person measure of a 5 under the Rasch model should not be thought of as getting 5 items correct. Rather it should be thought of as a marker along the variable's difficulty, with a 5 indicating a higher ability than a 4. Rasch person's scores should be thought of as locations along a line, rather than as direct computations of the number of items that a person got correct under the raw score framework.

Third, in a sense, I would agree with the student that yes, if a person happens by chance to get 1 of the 5 difficult items correct, that person's score under the Rasch model would be higher than someone who got none of the difficult items correct. That is why we have fit statistics to gauge whether the person's response pattern is expected or unexpected. So we should not just take these Rasch person measures at face value; we also have to examine the fit statistics to see if the observed response pattern fits expectations.

Fourth, her interpretation works under her example. With only 1 difficult item, there is a greater chance that the Rasch model would indicate that the person's person measure is high if the person happened to get that item correct. Her description is actually an indication that the instrument is faulty; there are not enough difficult items to give a sufficient estimate of a person's ability in the upper range. Actually what needs to happen, is the instrument should include an additional 4 difficult items so that the number of items and their difficulty levels are spread throughout the continuum so that a person's score in the upper range is not based just on the chance that the person gets the 1 difficult item correct. More difficult items means better estimations of the person measure, and less chance that scores are 'weighted' or 'biased' upward just because a person answered correctly to just one difficult item.

Are there other explanations I can give her to clarify her concern? I am trying to find a way to approach her concern from the point of view of classical test theory but she continues to maintain that difficult items get more "weight" than easy items in the Rasch model.

Thanks for wading through this.

MikeLinacre: OK, pjiman1. This may be easier than it seems.

In Classical Test Theory (CTT), every different way of scoring 5 on the same 6 dichotomous items is reported with the same score, "5". It is the same in Rasch measurement: every way of scoring 5 on the same 6 dichotomous items is reported with the same Rasch measure, xxx logits.

Suppose someone succeeds on the hardest item, but fails on an easy item. Is the credit for succeeding on a hard item greater than, the same as, or less than the debit for failing on the easy item? Think of a driving test, a test for nuclear-reactor operators, a test for brain surgeons, .... In general, we don't know whether the credit is greater than the debit or not, and nor do the statistics, so all ways of getting the same score on the same items "count" the same.

But, in Rasch measurement, the fit statistics can be very different. So when we want to investigate the pattern of responses, we look at the fit statistics.

https://www.rasch.org/rmt/rmt32e.htm is a mathematical derivation of the Rasch model from the requirement that the raw-score be the "sufficient statistic", regardless of the pattern of responses.

pjiman1: Thanks Mike for your explanation. I think we are getting close and your explanation helped us understand our question better.

In your explanation, when you say "the same 6 items", are we saying that these 6 items all have the same difficulty? If so, then we agree with you that if Ann gets 5 out of 6 items correct and Bill gets 5 out of 6 items correct, even though their response patterns differ, they will get the same Rasch person score.

But if the items are of different difficulty, consider the scenario below:

Item (1 = easiest)   Ann       Bill
1                    Correct   Correct
2                    Wrong     Correct
3                    Correct   Correct
4                    Correct   Correct
5                    Correct   Correct
6                    Correct   Wrong

In the scenario above, will Ann's and Bill's Rasch person measures be different because the items have different difficulty estimates?

My student thinks no, that the person measures will be the same for both Ann and Bill because each person got 5 out of 6 items correct, regardless of the difficulty of the items.

I say yes, the person measures will be different, that Ann's person measure will be higher than Bill's because Ann got the most difficult item correct and Bill did not; however, keep in mind that Ann's person measure will also have a higher misfit statistic.

My example is that if Ann clears a 1 ft hurdle, then fails to clear the 2 ft hurdle because she accidentally tripped, but then clears the 3 ft, 4ft, 5ft and 6ft hurdle, Ann is still the better high jumper because Bill cleared all hurdles except the 6ft hurdle. However, Ann’s score is odd and gets a higher misfit statistic than Bill. So we really should consider both the person measure and the misfit statistic. Rasch provides both statistics so we consider not just the actual score, but how well the score fits the Rasch model.

much appreciated, as always.

MikeLinacre: In your example, the Rasch person measures are the same, Pjiman1. The credit for jumping the higher jump is balanced by the debit for failing on the lower jump!! But the misfit statistics are very different.

Same raw score made any way on the same set of items => same Rasch measure.
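
A small numerical check of this point, offered as a sketch: the six item difficulties below are invented (equally spaced logits), and the fit statistic is a plain unweighted mean-square, which may differ in detail from what any particular program reports. The response strings are Ann's and Bill's from the example above.

```python
import math

def probabilities(ability, difficulties):
    """Rasch probability of success on each item."""
    return [1.0 / (1.0 + math.exp(d - ability)) for d in difficulties]

def person_measure(responses, difficulties, tol=1e-10):
    """Maximum-likelihood ability for known item difficulties.

    The responses enter the estimation equation only through their sum.
    """
    raw = sum(responses)
    n_items = len(difficulties)
    if not 0 < raw < n_items:
        raise ValueError("extreme scores have no finite estimate")
    ability = math.log(raw / (n_items - raw))
    for _ in range(100):
        p = probabilities(ability, difficulties)
        step = (raw - sum(p)) / sum(q * (1 - q) for q in p)
        ability += step
        if abs(step) < tol:
            break
    return ability

def outfit_mean_square(responses, ability, difficulties):
    """Unweighted mean of squared standardized residuals."""
    p = probabilities(ability, difficulties)
    return sum((x - q) ** 2 / (q * (1 - q)) for x, q in zip(responses, p)) / len(p)

difficulties = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]   # invented, easiest to hardest
ann = [1, 0, 1, 1, 1, 1]
bill = [1, 1, 1, 1, 1, 0]

ann_measure = person_measure(ann, difficulties)
bill_measure = person_measure(bill, difficulties)
ann_fit = outfit_mean_square(ann, ann_measure, difficulties)
bill_fit = outfit_mean_square(bill, bill_measure, difficulties)
print(round(ann_measure, 2), round(bill_measure, 2))
print(round(ann_fit, 2), round(bill_fit, 2))
```

Both response strings give the same measure because the estimation equation sets the expected score equal to the raw score; only the fit statistic distinguishes the two patterns.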

581. Implementing PROX for rating scales

Renselange October 27th, 2010, 9:43pm: I am trying to write my own 3 facet parameter estimation program, for fun (watch out Facets!). Right now I'm wrestling with "PROX for polytomous data" in RMT where it says "Mn .. summarize(s) the distribution of relevant logit difficulties {Di+Fk} encountered by person n"

I can see where the Di come from, but what about the Fk? That is, if the Fk are defined as usual (loc of equal prob of cats k and k-1), then do I just take F0 = 0 and add as is indicated there?

Using the convention that Sum(Fk) = 0, this seems counterintuitive since a rating of 0 would typically yield a higher contribution to the person measure than the next higher one. Or, are different assumptions made here, such that, say, F0 = 0 < F1 < F2 < ..... (which would follow from the defs of Fk in the RMT article). Or, is there something else here that escapes me?

MikeLinacre: Rense, PROX for rating scales is awkward. There is no assumption that the thresholds are ordered.

Sum(Fk)=0 is used to define the relationship between the item difficulty and the rating-scale thresholds.

F0 is common to every term, so it cancels out algebraically. It is usually set F0=0, but I find that F0=-40 (or lower) is convenient for estimation. A low value for F0 reduces the chance of floating-point overflow for long rating scales.

With modern computing speed and power, PROX is no longer advantageous. Jump straight into your choice of more computationally-precise algorithm!
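
The cancellation of F0 can be checked numerically. The sketch below assumes a rating-scale parameterization in which the term for category k is exp of the sum over j = 0..k of (B - D - Fj), so that F0 appears in every term; the function name and threshold values are illustrative only.

```python
import math

def category_probabilities(ability, difficulty, thresholds):
    """Category probabilities for thresholds [F0, F1, ..., Fm].

    The largest exponent is subtracted before exponentiating, a standard
    guard against floating-point overflow that leaves the ratios unchanged.
    """
    exponents = []
    running = 0.0
    for f in thresholds:
        running += ability - difficulty - f
        exponents.append(running)
    top = max(exponents)
    weights = [math.exp(v - top) for v in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# F1..F3 sum to zero; only F0 differs between the two calls
p_zero = category_probabilities(0.5, -0.2, [0.0, -1.0, 0.2, 0.8])
p_low = category_probabilities(0.5, -0.2, [-40.0, -1.0, 0.2, 0.8])
print([round(p, 4) for p in p_zero])
print([round(p, 4) for p in p_low])
```

The two sets of probabilities agree, so the choice of F0 is a matter of numerical convenience.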

582. Valid Comparisons When Anchoring

uve October 15th, 2010, 6:32pm: Dear Mike,

I have recently begun the process of anchoring our new tests, version 2, to version 1 tests last year using Winsteps. The group taking these tests is different each year, i.e. looking at the performance of algebra each year. All is going well. I have several questions:

How do I anchor future versions? For example, next year do I anchor version 3 to version 1 or version 2? If version 3 only has common items with version 2, then it seems we have a break in the ability to compare performance across the years. So must I always make sure future test versions have something in common with version 1? If not, how do I ensure I can make valid comparisons across the years?

After anchoring version 2 to version 1, if my colleagues want to know if version 2 is better targeted, do I provide the person-item maps for version 2 with or without the anchoring?

As always, thanks so much for your help.


MikeLinacre: Uve, the first step is always to analyze the new data without anchoring. This verifies that everything is correct with the new data.

Let's imagine we have 3 years.
Year 1. This is a free analysis. We reported it.
Year 2. We want to report this in the Year 1 frame-of-reference. So we anchor it as much as possible to Year 1. We report the anchored analysis. All Year 1 and 2 items now have an anchoring difficulty in the Year 1 frame-of-reference.
Year 3. We want to report this in the Years 1+2 frame-of-reference. So we anchor it as much as possible to Year 1 and anchored Year 2. We report the anchored analysis. All Year 1 and 2 and 3 items now have an anchoring difficulty in the Year 1 frame-of-reference.

When anchoring common items in Year 3 (some may come from Year 1 and some from Year 2) we use the Year-1+2-frame-of-reference anchor values. But then we look at the anchor displacements. If any of these are noticeably large, they suggest that those "common" items are no longer "common", but have changed to become new items in Year 3. The changes can be due to a change in the Curriculum, or in society, or in the way in which the item is presented. For instance, one common item became much more difficult because it was split between two pages when it was printed in the new test booklet.


uve: Mike,

The part of this process that's throwing me off a bit is the anchor file I use for the Year 3 analysis. I hope I'm understanding you correctly, but it sounds like the anchor file I use with Year 3 is going to have the item measures that were produced from the item measure output from the Year 2 to Year 1 anchoring. Let me provide you an example that's grounded in some real data. Year 1 of an English test had 50 items, 20 of which were used in Year 2. I anchored these, plotted them, and discovered only 10 were within the 95% CI boundaries. These ten were then used for the anchoring and the anchored item measures output file was produced. Now let's say in Year 3 only 8 of the original ten will be used on the test. I plot these and find only 5 that can be used. I'm worried that I'll run out of items within a few years. Would it be possible to bring in an item that was used in Year 1, not Year 2 but was used in Year 3? Also, in the Year 3 anchor file, do I use the item measures as reported, or do I have to adjust them given the displacement values. Thanks again.


MikeLinacre: Uve, after year 2 is anchored to year 1, we will have item difficulties in the same frame-of-reference for all year 1 items and all year 2 items. Some of these items will be year 1+ 2 common items (and so will have the same anchor values for both year 1 and year 2).
In year 3, we will include some items from year 1, and some from year 2, and some items may be in the final set of common items for year 1 and year 2.
Do a free analysis of year 3.
Cross-plot the year 3 difficulties against the year 1 and year 2 item anchor values. For some year 3 items, there may be different year 1 and year 2 anchor values. Cross-plot those year 3 items against both the year 1 and year 2 anchor values.
Choose the subset of items which form the best anchor set. This will include: some year 1 items, some year 2 items, and some year 1 + year 2 items.
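
The cross-plot screening can be approximated numerically. The following is a rough sketch, not a Winsteps procedure: it assumes each item has a measure and standard error from the free year 3 analysis and from the anchor file, removes the mean difference between the two sets, and keeps items whose remaining difference lies within a 95% band. The function name and all values are invented.

```python
import math

def stable_anchor_items(free, anchors, z=1.96):
    """Pick candidate anchor items.

    free, anchors: {item: (measure, standard_error)}
    Returns (items kept, mean shift between the two frames of reference).
    """
    common = sorted(set(free) & set(anchors))
    shift = sum(free[i][0] - anchors[i][0] for i in common) / len(common)
    keep = []
    for item in common:
        difference = free[item][0] - anchors[item][0] - shift
        joint_se = math.hypot(free[item][1], anchors[item][1])
        if abs(difference) <= z * joint_se:
            keep.append(item)
    return keep, shift

# Invented measures (logits) and standard errors
year3_free = {"A": (0.10, 0.15), "B": (-0.50, 0.15), "C": (1.20, 0.15), "D": (0.90, 0.15)}
anchor_values = {"A": (0.30, 0.15), "B": (-0.30, 0.15), "C": (1.40, 0.15), "D": (0.10, 0.15)}
keep, shift = stable_anchor_items(year3_free, anchor_values)
print(keep, round(shift, 2))
```

Item D has drifted and is dropped. Because a drifting item also distorts the mean shift, in practice the screening would be repeated after removing it, and the plot itself should still be inspected.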

uve: Got it, thanks!

583. TFILE=

RaschNovice October 14th, 2010, 2:41pm: I'm trying to produce output for the following subtables


in my control file. But I'm not getting table 2.6 at all, and I'm getting way too much output for 23.0...getting output for every contrast.

How can I just get the output that I want?

The Novice :)

MikeLinacre: Sorry, RaschNovice, TFILE= is only partially implemented in Winsteps. For most entries, only the Table number is used (not the sub-table).

For Table 2.6, this is a program malfunction, which I will repair. Please request only Table 2.

RaschNovice: Hi Mike,

I have 42 scales to produce, so I was really happy to find this feature. My intention was eventually to produce a .bat file that could just process and summarize all my work automatically as a nice little booklet.

Would really like to see this get into future versions of Winsteps.

The Novice

MikeLinacre: RaschNovice - am working on this ...
TABLE.SUBTABLE (for all tables and sub-tables)

RaschNovice: Yay! Thanks, Mike, for continuing to upgrade such an awesome program! :)

584. Simulation

Raschmad October 12th, 2010, 2:42pm: Dear Mike,
I used Winsteps simulation to check if the unexplained variance in PCA of residuals is large enough to form a dimension. I simulated 10 datafiles and named them mysim.txt. Then I repeated the analysis with my real data and in the extra specifications I entered mydata=mysim.txt. Is this procedure correct?
The analysis gave only one set of outputs which were different from my real data outputs.
I did not get 10 different outputs to check the variation in the 10 simulations.
Should I run the analysis 10 times using each of the mysim.txt files?

MikeLinacre: Thank you for your question, Anthony.

For multiple simulations and analyses, please use Windows "Batch" (background processing) mode:

Performing multiple simulations in Batch mode. Here is an example from Winsteps Help:

1. Use NotePad to create a text file called "Simulate.bat"

2. In this file:

REM - produce the generating values: this example uses example0.txt:

START /WAIT c:\winsteps\WINSTEPS BATCH=YES example0.txt example0.out.txt PFILE=pf.txt IFILE=if.txt SFILE=sf.txt

REM - initialize the loop counter

set /a test=1

:loop

REM - simulate a dataset - use anchor values to speed up processing (or use SINUMBER= to avoid this step)

START /WAIT c:\winsteps\WINSTEPS BATCH=YES example0.txt example0%test%.out.txt PAFILE=pf.txt IAFILE=if.txt SAFILE=sf.txt SIFILE=SIFILE%test%.txt SISEED=0

REM - estimate from the simulated dataset

START /WAIT c:\winsteps\WINSTEPS BATCH=YES example0.txt data=SIFILE%test%.txt SIFILE%test%.out.txt pfile=%test%pf.txt ifile=%test%if.txt sfile=%test%sf.txt

REM - do 100 times

set /a test=%test%+1

if not "%test%"=="101" goto loop

3. Save "Simulate.bat", then double-click on it to launch it.

4. The simulated files and their estimates are numbered 1 to 100.

585. interpretation of dimensionality

mats October 8th, 2010, 6:05pm: Hi,

I am analyzing ratings from a sample of 231 patients with bipolar disorder. There are two subscales in the scale, one for depression and one for mania type symptoms.

The "raw variance explained by measures" was 60.5% for the depression subscale and 50.4% for the mania subscale. The eigenvalues for the first principal component were 2.0 and 1.8 respectively.

How should this be interpreted in terms of noise (random variation) and unidimensionality in the subscales? Are there any rules of thumb for variance explained by measures? For my paper, are there any references?


MikeLinacre: Mats, you don't tell us how many items are in the scale, but let's assume there are at least 10 items. If so, your scale is effectively unidimensional.

The "raw variance explained by measures" is directly connected to the variance of the person measures and the item difficulties: https://www.rasch.org/rmt/rmt201a.htm Figure 4.

The first principal component has an eigenvalue of 2.0. This is somewhat larger than the expected value: https://www.rasch.org/rmt/rmt233f.htm - but only the strength of 2 items - probably not large enough to call a secondary "dimension", but perhaps a secondary strand (like "addition" and "subtraction" on an arithmetic test). We assume that when you look at the items which load at the opposite ends of that component, it is "depression" vs. "mania".

mats: Thanks Mike.

It is actually two scales used together, with 9 items in each. I analyze them separately.

I'm not really sure how to interpret the results for variance explained by measures. How should a low or high percentage be interpreted? Are there any rules of thumb?

MikeLinacre: Mats, since "variance explained by the measures" is related to "variance of person measures + variance of item measures", it is easier to interpret the two measure variances.

Sometimes we want high item measure (difficulty) variance, e.g., for a test intended to measure a wide-range of performance. Sometimes we want low item measure variance, e.g., for a test intended to represent performance on a narrow range of tasks, such as arithmetic skills for a supermarket cashier.

Sometimes we expect high person ability variance, e.g., at admission to a training program. Sometimes we expect low person ability variance, e.g., at discharge from a training program.

Mats, how about analyzing your two sub-scales together? Then the dimensionality analysis would be much more interesting :-)

586. Export Table 29 Data to Excel

dmulford October 5th, 2010, 8:27pm: Hello!
I am trying to do some item analysis on a chemistry placement test that we have here at Emory. What I am most interested in is the data from Table 29 and would like to export it to Excel. Essentially, I would like to create an Empirical Option Curve (as seen in the graphs menu) for each of my questions, but I want to see it as a plot against "Person Measure" instead of "Measure Relative to Item Difficulty".

Any help would be appreciated!


MikeLinacre: There are several approaches, Doug.

1. Perhaps on the Graphs screen, clicking on "Absolute x-axis" will do what you want.

2. Table 29 (Empirical ICC etc.) can be plotted from the XFILE=, which shows details for every response. The XFILE= can be output directly to Excel, and you can do it selectively, one item at a time.
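
If the response-level data are exported, an empirical option curve for one item can be tabulated as sketched below. This sketch does not parse the XFILE= layout: it assumes the person measure and the chosen option have already been extracted as pairs for a single item, and the function name, interval width and records are invented.

```python
import math
from collections import Counter, defaultdict

def empirical_option_curve(records, width=1.0):
    """Proportion of persons choosing each option within each measure interval.

    records: iterable of (person_measure, option_chosen) for ONE item
    Returns {interval_midpoint: {option: proportion}}.
    """
    counts = defaultdict(Counter)
    for measure, option in records:
        counts[math.floor(measure / width)][option] += 1
    curve = {}
    for index in sorted(counts):
        total = sum(counts[index].values())
        midpoint = (index + 0.5) * width
        curve[midpoint] = {opt: n / total for opt, n in sorted(counts[index].items())}
    return curve

# Invented responses to one multiple-choice item
records = [(-1.2, "A"), (-0.8, "B"), (-0.3, "B"), (0.4, "C"),
           (0.6, "C"), (0.9, "B"), (1.5, "C")]
curve = empirical_option_curve(records, width=1.0)
for midpoint, proportions in curve.items():
    print(midpoint, proportions)
```

Running this once per item gives one table per item with person measure on the x-axis, which can then be written to CSV and charted in Excel.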

dmulford: Hello!
Thanks for the help. The absolute scale didn't quite do what I wanted. The XFILE= seems to put the data out, though it doesn't have the frequency of response data that Table 29.1 has in it. I can set up some histograms in Excel, but with a 45 item test with 4-5 options on each one I am looking at having to generate over 200 histograms each time I want to analyze the data. I know that Winsteps is already doing this as it is generating Table 29.1. I am hoping to find a way to extract that data. Any suggestions?


MikeLinacre: Doug, perhaps using the Graphs "absolute" scale, and "copy data to clipboard" gets you the data you want. Then paste ctrl+v to Excel.

Choose a suitable empirical interval on the Graphs screen for your needs.

BTW, the free "R" statistics is good for producing multiple graphs to disk. http://www.stat.auckland.ac.nz/~paul/RGraphics/chapter1.pdf - Winsteps can output the XFILE= in R format.

dmulford: Thanks for your help! I think I am getting to where I want. In the graphs screen, the Option Probability is plotted against the "Measure" (using an absolute scale). I think I am just a bit unclear on what Winsteps is calculating here and am unsure what the "Measure" is. I understand the Person Measure (and in fact the shape of the curves is the same for me if I manually calculate these graphs relative to Person Measure, though the scale is different (-4 to 4)). Can someone please help me to understand what the "Measure" x-axis is on the Empirical Option Curves? I am analyzing a multiple choice test.

Thanks so much!


MikeLinacre: Doug: The "measure" is the location on the latent variable. Every person has an "ability" measure. Every item has a "difficulty" measure. Ability measures and difficulty measures are all locations on the same latent variable. Here is a great picture of this in action:

dmulford: Great! Thanks!


587. scientific definition of "ability" and "item diffi

msamsa September 18th, 2010, 6:44am: What is the scientific definition of "ability" and "item difficulty" according to Rasch model?

MikeLinacre: Thank you for your question, Msamsa.

"Person ability" and "item difficulty" are the names we use to identify points on the latent variable.

In most applications of Rasch models, the persons are the objects of measurement and the items are the agents of measurement.
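
As a minimal sketch of what "points on the same latent variable" means in practice: in the dichotomous Rasch model, the probability of success depends only on the difference between the person's ability and the item's difficulty, both in logits. The function name and the example values below are illustrative assumptions, not output from any analysis in this thread.

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of success, with both inputs in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Illustrative values only (assumed for this sketch).
p_equal = rasch_probability(0.0, 0.0)   # person located exactly at the item
p_above = rasch_probability(1.0, 0.0)   # person 1 logit above the item
print(round(p_equal, 2), round(p_above, 2))
```

A person located at the same point as an item has a 50% chance of success on it; moving the person up the variable raises that chance.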

msamsa: Thank you, Dr. Mike Linacre, for replying.

588. Large sample non-parametric test for nominal data

catriona.hippman September 19th, 2010, 6:24pm: Hello,

I would really appreciate help with deciding what an appropriate alternative is to a chi square test for a large sample size. In a sample size of ~300, I want to compare frequency of four possible responses between two groups (nominal, not ordinal, data) to determine whether the pattern of frequency of response differs between the two groups. I have read that using a chi square test in a large sample size increases the chance of committing a type I error, and so am seeking an alternative analysis strategy. I know I can also lower the alpha level to offset the chance of committing a type I error, but I am unsure by how much it should be lowered?

Any help would be fantastic!
Thank you,

MikeLinacre: Thank you for your question, Catriona.

300 is a medium-size sample, Catriona. Are you sure that you have a large-sample problem? With modern computers, large samples are often millions.

catriona.hippman: Hi Mike,

Thanks for your reply! I am actually not sure that I have a large-sample problem. I submitted a paper in which I used a chi square test to analyze a sample of ~300 and was informed by a reviewer that "the chi-square is definitely not the appropriate statistical test to use with a sample exceeding 300". I have been puzzled by this response and have been seeking a solution to address the reviewer's concerns... suggestions?

Thanks again,

MikeLinacre: This is puzzling, Catriona. It sounds like you have a 2x4 contingency table, a sample of 300, and the reviewer tells you a chi-square test is not appropriate! It sounds ideal to me. The usual problem is when the sample-size is so small that we get zero cells.

Sorry, I have no idea which alternative test the reviewer would think is appropriate. Keep the chi-square in your paper, and in your response to the reviewer, please ask for specific guidance. And then please tell us :-)
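
For readers who want to see the calculation, here is a minimal sketch of the Pearson chi-square statistic for a 2x4 contingency table like the one described. The counts are hypothetical (they are not Catriona's data), and the function name is an assumption made for this illustration. The only outside value used is the standard critical value 7.815 for 3 degrees of freedom at alpha = .05.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts: 2 groups x 4 nominal response options, N = 300.
counts = [[40, 35, 45, 30],
          [30, 45, 35, 40]]
stat, df = chi_square_statistic(counts)
CRITICAL_05_DF3 = 7.815  # chi-square critical value for df = 3 at alpha = .05
print(f"chi-square = {stat:.3f}, df = {df}, significant at .05: {stat > CRITICAL_05_DF3}")
```

With about 300 observations spread over 8 cells, every expected count here is well above the small-cell range where the chi-square approximation becomes doubtful.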

catriona.hippman: Thanks again Mike! I will let you know the outcome :)

589. MCMLM in Winsteps?

RaschNovice September 16th, 2010, 5:07pm: Can Winsteps calculate the mixed-coefficients multinomial logit model (MCMLM)? This is mentioned in Bond and Fox (p.259), and appears to be implemented in Conquest.

My understanding is that this would be the most appropriate model to analyze hierarchical personality structures like the Five Factor Model. Yes?

MikeLinacre: RaschNovice, MCMLM is not implemented in Winsteps. The findings reported by higher-order models, such as MCMLM, are usually much more difficult to interpret (and to explain to others) than those of unidimensional Rasch models. So my advice would be to start with a piece-wise unidimensional analysis of your data and then build up, if necessary, to more complex models.

Think back over the past 100+ years of scientific advance. Compare the progress of the physical sciences and the social sciences. The physical sciences try to explain their data, one piece at a time, with simple relationships. Social sciences try to explain their data, all in one analysis, with complex relationships. Which approach has been more successful?

RaschNovice: Okay, suppose then I have six subscales for Neuroticism, as proposed by the Five Factor Model. I want to pull all the subscales into the same frame of reference. How do I do that in Winsteps?

MikeLinacre: Thank you for this question, RaschNovice.

First, let us confirm that there really are six subscales :-)

1. Analyze all the data together in Winsteps.
Identify each item with a subscale code as the first letter of its item label.

2. Verify that all the item correlations are positive (Diagnosis Menu A.).
Reverse code any items with negative correlations:
IREFER = FFFFFFFRRRFFFFRRRR ; F=forward, R=reversed for each item
IVALUEF=12345 ; forward coded items
IVALUER=54321 ; reverse coded items

3. Re-analyze the data, if necessary.
Output Table 23 - dimensionality.
This will report:
How much of the variance is shared across all the items
The first "contrast" (Table 23.2) should identify the two most conspicuous sub-scales at the top and bottom of the plot.
Do those subscales agree with the Five Factor Model?
The second "contrast" plot should identify the next one or two most conspicuous sub-scales at the top or bottom of the plot.
Do those subscales agree with the Five Factor Model?

4. We often discover that some subscales do not exist as separate entities in the data.

5. We can construct a profile for each person across subscales using DPF (Winsteps Table 31) and the subscale code in the item label.
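
The reverse-coding in step 2 can be sketched outside Winsteps as follows. This is only an illustration of the recoding logic (12345 stays as-is for forward items and becomes 54321 for reversed items); the function name and the response string are hypothetical, and the F/R string is the one shown in step 2.

```python
def recode(responses, direction, low=1, high=5):
    """Reverse-score responses whose direction flag is 'R'; leave 'F' items unchanged."""
    return [low + high - r if flag == "R" else r
            for r, flag in zip(responses, direction)]

direction = "FFFFFFFRRRFFFFRRRR"  # F/R flags, one per item, as in step 2
# Hypothetical responses from one person to the 18 items (1-5 rating scale).
responses = [1, 2, 3, 4, 5, 1, 2, 1, 2, 5, 3, 3, 3, 3, 5, 4, 1, 2]
recoded = recode(responses, direction)
print(recoded)
```

After recoding, a higher number means more of the trait on every item, which is what makes positive item correlations the expected pattern.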

RaschNovice: Hi Mike,

In the Five Factor Model, the subscales of each factor are correlated maybe .40 to .70. They share their variance with each other, and with the factor...it's a hierarchical arrangement.

A principal components analysis of the item residuals with the Rasch dimension removed, however, is looking for orthogonal dimensions in the data. Well, there should be no orthogonal dimensions in the data, because the subscales are correlated. They were created that way. They exist as correlated entities.

Running the analysis you suggested shows that 23.3% of the variance is accounted for by the items, while 5.8% is accounted for by the first contrast.

Looking at the items that load highly on the 1st contrast shows that 4 of the top 8 are Depressivity.

| 1 1 | .50 | .11 1.22 1.22 |A 19 DEP_F_041 Sometimes I feel com |
| 1 1 | .45 | .02 .87 .87 |B 24 DEP_F_221 Too often, when thin |
| 1 1 | .44 | -.49 1.03 1.02 |C 43 VUL_F_086 When I'm under a gre |
| 1 1 | .43 | -.56 1.16 1.15 |D 27 SLF_F_076 At times I have been |
| 1 1 | .39 | .09 .81 .82 |E 23 DEP_F_191 Sometimes things loo |
| 1 1 | .37 | -1.21 1.05 1.01 |F 20 DEP_F_101 I have sometimes exp

Looking at the items that load highly on the 2nd contrast...

| 2 1 | .52 | 1.06 .86 .85 |q 42 VUL_R_056 I feel I am capable |
| 2 1 | .51 | .71 .79 .77 |t 46 VUL_R_176 I can handle myself |
| 2 1 | .38 | .64 1.09 1.08 |H 41 VUL_F_026 I often feel helples |
| 2 1 | .36 | .61 .97 .96 |O 22 DEP_F_161 I have a low opinion |
| 2 1 | .36 | .71 .94 .94 |o 48 VUL_R_236 I'm pretty stable em |
| 2 1 | .33 | .78 .68 .67 |m 47 VUL_R_206 When everything seem |
| 2 1 | .30 | .02 .87 .87 |B 24 DEP_F_221 Too often, when thin |
| 2 1 | .24 | .11 1.22 1.22 |A 19 DEP_F_041 Sometimes I feel com |

Looking at the items that load highly on the 3rd contrast...

| 3 1 | .68 | .35 1.35 1.35 |u 13 ANG_F_066 I am known as hot-bl |
| 3 1 | .54 | .40 1.40 1.44 |l 11 ANG_R_156 It takes a lot to ge |
| 3 1 | .47 | .29 1.09 1.10 |W 12 ANG_F_006 I often get angry at |
| 3 1 | .41 | .46 1.04 1.06 |e 9 ANG_R_036 I'm an even-tempered |
| 3 1 | .37 | .66 .92 .93 |X 14 ANG_F_126 I often get disguste |
| 3 1 | .29 | -.24 .89 .89 |V 16 ANG_F_216 Even minor annoyance |
| 3 2 | .16 | -.07 .90 .91 |j 10 ANG_R_096 I am not considered

Contrasting the positively and negatively loaded item content from each contrast doesn't really add up---the negative loadings are a hodge-podge, so I did not post them---but as you can see, each of the first three contrasts is dominated by items from a particular subscale.

So in spite of what I just argued above, there is evidence that the subscales appear as separate dimensions.

MikeLinacre: Yes, RaschNovice, a correlation of R = .4 to .7, indicates that the correlated variables share R^2 = 16% to 49% of their variance in common.

So at least half of their variance is not in common, but along orthogonal dimension(s).
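
The arithmetic behind this point is short enough to check directly. This sketch simply squares the two correlations quoted above; the function name is an assumption made for the illustration.

```python
def shared_variance(r):
    """Proportion of variance two variables have in common, given their correlation r."""
    return r ** 2

for r in (0.4, 0.7):
    common = shared_variance(r)
    print(f"r = {r}: shared = {common:.0%}, not shared = {1 - common:.0%}")
```

Even at the top of the quoted range (r = .7), slightly more than half of the variance is not shared.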

We expect to see each subscale, in turn, contrast with the other subscales. But your first two contrasts are a mixture of VUL and DEP contrasted with the other items.

The third contrast ANG is what we expect to see.

Please Rasch analyze only the VUL and DEP items.
Then run Table 23. If VUL and DEP do not separate into opposite ends of the 1st contrast, then this would challenge the definitions of the two subscales.

RaschNovice: Mike,

I guess it doesn't bother me if a mixture of the VUL and DEP items dominates the first contrast. The reason is: Psychologists are frequently much more specific in their use of constructs than what is conclusively supported by the data. The data is self report, so we expect that its validity will be reduced by any number of issues, including lack of insight on the part of patients. One might even argue that the reason patients see psychologists is because self report is inaccurate.

For example, an interesting study was recently done on "cognitive complexity." The study basically grouped subjects into high and low cognitive complexity groups. A lexical study (get ratings on 500 to 700 personality traits, then factor analyze) was done for each group. Seven factors were judged best for the high cognitive complexity group, whereas three were judged best for the low cognitive complexity group. As you can see, then, the dimensionality of the data is in part dependent on the ability of the subjects to make fine distinctions. Some people are just working with a coarser set of categories than others.

I'm comfortable with the idea that the Five Factor Model (FFM) consists of higher order factors and lower order traits. The whole field of psychology consists of these kinds of hierarchical constructions. This is the accepted standard in the field: Broad constructs consist of narrower lower order constructs. My purpose is to pull the FFM subscales into a common frame of reference, not to deconstruct the FFM using Rasch analysis. A paper critiquing the FFM through a Rasch analysis of its higher-order dimensions will have no impact on established assessment practices. Really. No one will pay attention. Pragmatically, it's a non-starter.

Perhaps at some point in the future I can go back and refine the item content so that separate subscales are somehow supported by a Rasch analysis. For now, however, I need some way of using Winsteps to pull these correlated subscales into a single frame of reference.

Or, let me come at this from another direction. Say I have a list of 20 personality trait adjectives related to Neuroticism. From these, I develop a 10-adjective scale, which defines an item hierarchy related to Neuroticism. For each of these 10 adjectives, I develop 8 self report items. Now I have 10 subscales of 8 items each. But the scales don't exist "at the same level." They originally defined a hierarchy of Neuroticism, and as scales they still should. Except that, as scales, they should be more reliable measures.

As you can see, now that I've taken the next step in developing my assessment framework, I seem to have lost my hierarchy. There does not seem to be a convenient way of pulling my subscales together that reflects their original order in the Neuroticism item hierarchy. For one thing, my scales now consist of items that, from the perspective of the higher order dimension, violate local independence: The items should be similar in that they tap the Rasch dimension, and otherwise as different as possible. Instead, my items now tap both the Rasch dimension AND the facet of the original item hierarchy for which they were constructed.

So, the desire to have greater reliability of measurement at the level of individual items in the item hierarchy, which caused me to create scales for each trait adjective, has in fact led to disaster. What is the solution, if not MCMLM?

That said, I would much rather continue using Winsteps than learn and buy Conquest, so just assuming that the subscales are okay, where would you go next using Winsteps?

As always, Mike, thanks for your consideration of these complex issues.

The Novice

MikeLinacre: Thank you for these further insights into your work, "The Novice".

When encountering these types of problems, my strategy is to imagine "What would physical scientists do in this same situation?" They rarely attempt complex statistical modeling. Instead they try to build up their theories one piece at a time. Within each piece there will be many correlated measurements. But that does not worry them because all those many numbers will be summarized into one or two numbers to take to the next level.

This piece-wise approach is the measurement philosophy operationalized in Winsteps. Winsteps constructs additive measures on unidimensional latent variables from qualitative observations. These observations are modeled to share the same latent variable (so they are dependent), but also to differ stochastically (so they are locally independent).

The Winsteps measures then become input variables in further statistical analysis of the relationships between the measures, such as structural-equation modeling (SEM).

590. Location of experimental item

deondb September 20th, 2010, 8:34pm: Dear Mike
I wish to estimate the item location parameter of an experimental item, yet I do not want to include the item in estimating person locations. How can this best be done in Winsteps?
Deon de Bruin

MikeLinacre: No problem, Deon.

Give that item a weight of 0. It will have a difficulty and fit statistics, but will not influence the person estimates.

IWEIGHT=*
(experimental item entry number) 0
*

deondb: Thanks Mike. This is clear advice.

591. using make-shift abilities

phloide May 21st, 2010, 5:11pm: I have a scenario, and I need to know if what I am doing with the IRT model is completely flawed, or if it is reasonable/doable:

I am doing a highly experimental study in which I have some erratic variables that I need to create a measurement for, and I have determined that the most reliable way to do this is to anchor the person’s ability (PAFILE?). The problem is: the only known measurement I have of each person is their GPA. Is there a way to use the GPA as the “ability” for each person so that I can evaluate the items?


MikeLinacre: Yes, phloide. You are using a productive experimental technique.
Please see https://www.rasch.org/rmt/rmt142n.htm

phloide: Thanks for the input! I have run my tests, and I need a little help interpreting the items I am looking at. Everything on the items (to my limited understanding) looks good EXCEPT for the PTMEA, which are running kind of low (which I expected, because these items were not developed to measure what I am using them for), so I wanted to see if you could help me interpret one item so I know what I am seeing:
example item ->
Measure: -2.6
Infit: 1.09
PTMEA: .18
Exact OBS:92.3
Matched EXP: 91.3

The measures for the items were really low (-3), which, again, was expected, because Winsteps was measuring retention for a cohort that already had very high retention (so the items were expected to appear "Easy").

My reading:
Measure: -2.6 -> this item will have a high retention rate because it is "easy"
Infit: 1.09 -> this item doesn't have a lot of "noise" in the interpretation
PTMEA: .18 ->this item is measuring something other than retention
Exact OBS:92.3 / Matched EXP: 91.3 -> this item generally returns the response we predict

I was just curious of why it says it is predictable and not a lot of noise, but is apparently measuring something other than what I am trying to predict...are these completely "bad" items with no interpretable value?

MikeLinacre: Thank you for your question, phloide.

The point-measure correlation is 0.18.
What is the expected value according to the Winsteps output?

According to these numbers, the observed and the expected values of the correlations will be almost the same. If so, there is no evidence that this item is "bad" only that it is easy.

The maximum possible values of the point-measure correlation for an N(0,1) distribution are shown at https://www.rasch.org/rmt/rmt54a.htm. Notice that the correlations are very low for outlying measures.

phloide: Ok, I have run my tests, and I ran it for several subgroups. What I am trying to figure out is how to read it now. Item_1 represents the "ability of those who took Item_1/measure of Item 1". "Whole" represents the ability of everyone in the group/subset.

           Whole   Item_1   Item_2
All        -3      -3.12    -2.67
Subset_A   -3.1    -3.6     -2.98
Subset_B   -3.1    -3.25    -3.02

any assistance would be great

Thanks for all your input. I have some other things I did that I will write about later (I tried some centering techniques that were interesting)

MikeLinacre: Thank you for your post, phloide.

           Whole   Item_1   Item_2
All        -3      -3.12    -2.67
Subset_A   -3.1    -3.6     -2.98
Subset_B   -3.1    -3.25    -3.02

We see on Item 1 that Subset A are about 0.5 logits less able than the whole sample, and about 0.35 logits less able than Subset B.

These numbers do not tell us whether this difference in ability is statistically significant or substantively important. But 0.5 logits is a large amount, especially if the Subsets are large. In many educational situations, 1 year's growth = 1 logit.

phloide: Thanks! this is helping so much!
one more question (before I ask another question later...)

I have tried an experiment where I used ability levels generated two different ways when running tests on subsets. When evaluating subsets, the person abilities I measure/anchor the individuals on are:
test1: the abilities that the individuals obtained when calculating the abilities for the population, and
test 2: the recalculated ability of the individual within the realm of the subset.

most of the results were similar (if Item 1 would increase in the uncentered test for Subset_A, it would also increase for the centered test for Subset_A).

Whole Item_2
Subset_A -3.1 -2.99

Whole Item_2
Subset_A -2.6 -2.55

I admit I stole the idea from HLM... but I didn't know if you had seen any research or had any thoughts about this as far as IRT... The only one that shows the most logit changes between Centered and Uncentered is the cohort of Grade Point Averages Under 3.00 in regard to their joining or not joining a Fraternal org (Centered: .22 logits difference between the ability level of the whole subset of students with GPAs under 2.00 who tried for a fraternal organization and the ability level of the whole subset of students with GPAs under 2.00, Uncentered: .01 from the same subset). I would read it to mean that Fraternal organizations may have a particular impact on students under 3.00 that may be hidden/unseen while observing the original population the subset is derived from.

MikeLinacre: phloide, thank you for reporting your finding.
Yes, there is an interaction between subset-membership and ability-level.
We can detect this in Winsteps using DGF (Table 33) or in Facets using "dummy" facets with bias/interaction (B) terms in the Models= specification.

phloide: I learned something, and didn't know if there was a paper on it somewhere (I could use it... I'm working on my D)... When persons are anchored, the value returned by Winsteps for items isn't the item difficulty, but a synthetic person ability. For example, I anchored 5000 people to examine an item with a "correct response" of .93. The value returned was -3.1. I figured out through experimentation (which I have documented) that the -3.1 was not the .50 item difficulty, but the measurement in which my sample would have a .93 chance of answering the item correctly. I back-calculated the item B=((LN((1-.93)/.93)/(-1.7))+(-3.1)) to derive the actual item difficulty (-1.54).

MikeLinacre: Thank you for your comment, Phloide.

At what values did you anchor your 5000 people?

You wrote, "the -3.1 was ... the [item difficulty] measurement in which my sample would have a .93 chance of answering the item correctly"

Comment: yes, that is the correct item difficulty measurement (in logits) for this anchored analysis.

Your computation has the divisor 1.7. This suggests that you want to report the difficulty in probits, not in logits. For a probit computation in Winsteps, please set:
USCALE = 1/1.7 = 0.59
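
To make the role of the 1.7 divisor concrete, here is a small sketch that evaluates the back-calculation exactly as written in the post, and shows the same log-odds distance in logits and after rescaling by 1/1.7. The variable names are assumptions for the illustration; the inputs (.93 and -3.1) are the rounded figures quoted in the thread.

```python
import math

def logit(p):
    """Log-odds of a probability, in logits."""
    return math.log(p / (1.0 - p))

p_correct = 0.93          # success rate quoted in the post
reported_measure = -3.1   # item value quoted in the post (rounded)

# The poster's formula: B = ln((1 - p) / p) / (-1.7) + reported measure
back_calculated = math.log((1 - p_correct) / p_correct) / (-1.7) + reported_measure

logit_offset = logit(p_correct)           # distance for a .93 success rate, in logits
rescaled_offset = logit_offset / 1.7      # the same distance after dividing by 1.7
uscale = 1 / 1.7                          # the multiplier quoted for USCALE=
print(round(back_calculated, 2), round(logit_offset, 2),
      round(rescaled_offset, 2), round(uscale, 2))
```

With the rounded inputs quoted in the post, the formula gives about -1.58 rather than the -1.54 reported; the gap presumably comes from rounding in the quoted values.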

phloide: Here's my dilemma: if I don't set the USCALE at .59, my model works perfectly (using the 1.7 divisor). If I do set the USCALE at .59, I cannot get any numbers to line up. The values at which I anchored the 5000 people were standardized scores based on their GPA. I am trying to do an item analysis where each item needs to be evaluated independently, but all persons need to have a predefined ability.

here's exactly what I am doing
Took all students and standardized their GPA (Z scores) and anchored their ability
evaluated the item.

If Item had a correct answer ratio of .93, I did the B=.....1.7+returned winstep variable)

My average GPA was 3.25 and retention .93. If I did a lookup table using the returned variable B=..., my Z[0] =.93=3.25 GPA

could it be because my abilities are already based on a normalized scale that I don't need to use USCALE?

MikeLinacre: Thank you for your question, Phloide.

Please do whatever works! The purpose of measuring is to be useful. :-)

592. Winsteps Table 20 vs. Person Ability

uve July 29th, 2010, 11:25pm: Your math is correct, Uve, if all items have the same difficulty. Then:
Ability measure for person N = average item difficulty + Ln (count of correct answers / count of incorrect answers)

But items usually have a range of difficulty. A useful approximation for this situation is:

Ability measure for person N = average item difficulty + Ln (count of correct answers / count of incorrect answers) * multiplier


multiplier = square-root ( 1 + (variance of difficulty of items encountered) / 2.9 )

Working backwards from your numbers, the S.D. of your item difficulties approximates 0.67 logits. Is this close, Uve?

uve: Hello Mike!

As always, I greatly appreciate your help. Yes, the mean S.D. of the item difficulties of the test in question was 0.67 logits. So would I simply square this (.67 * .67) to get the variance portion of your multiplier equation? Or are you saying that the mean S.D. represents the variance needed and I just use .67?


MikeLinacre: Glad the approximation works for you, Uve. Yes, variance = .67^2.

That formula (known as PROX) is based on the assumption that the item difficulties are normally distributed. It seems that yours are close to that distribution. :-)

uve: Mike,

All worked out great:

0 + Ln(60/40) * [SQRT(1+.67^2/2.9)] = .44

I assume in most cases the average item difficulty will likely be zero if all went well with the estimation process. If convergence did not reach 0 after the requested number of iterations, then that would replace the zero in the equation, if I'm correct.

Where does the 2.9 come from? Just curious.

MikeLinacre: Uve, 0 is the average difficulty of the items. See Table 3.1

2.9 = 1.7^2

For 1.7, see https://www.rasch.org/rmt/rmt112m.htm

PROX approximates the normal distribution (of the items) with an equivalent logistic distribution. See https://www.rasch.org/rmt/rmt83g.htm
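
The PROX approximation discussed in this thread can be sketched in a few lines. The function names are assumptions made for this illustration; the formula and the 60/40, S.D. = .67 example are the ones given above.

```python
import math

def prox_multiplier(item_sd):
    """PROX expansion factor: sqrt(1 + item variance / 2.9), where 2.9 is about 1.7 squared."""
    return math.sqrt(1.0 + item_sd ** 2 / 2.9)

def prox_ability(correct, incorrect, item_sd, mean_difficulty=0.0):
    """Approximate ability in logits from counts of correct and incorrect answers."""
    return mean_difficulty + math.log(correct / incorrect) * prox_multiplier(item_sd)

# Uve's worked example: 60 correct, 40 incorrect, item S.D. = .67, mean difficulty = 0.
print(round(prox_ability(60, 40, 0.67), 2))
```

When all items have the same difficulty (item S.D. = 0), the multiplier is 1 and the formula reduces to the simple log-odds version at the start of the thread.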

uve: Mike,

All worked well when I created these calculations for the extreme scores, but now I have another issue. When I correlate our local assessments in logits with our state tests, the correlations can be considerably lower than when I correlate with our raw scores. One test of 1300 had 26 students scoring 0 out of 42 questions. The item measure standard deviation of that test was 1.73. This means when I convert 0 scores to .3, the resultant logit is -7.03. However a score of 1 jumps to -5.29. A score of 2 only jumps to -4.27. I'm worried that these 0 and maximum scores converted to logits are creating an outlier influence that is suppressing my regression model. In SPSS, the Pearson R for the raw scores was .57, but with the logits it was .47. That's a considerable difference. When I delete the extreme scores and correlate the logits, it's .59.

What should I do?
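
The jump at the extreme score can be reproduced with the same PROX approximation. This is a sketch under the assumptions stated in the post (42 items, item S.D. = 1.73, mean item difficulty 0, and a zero score replaced by .3); the function name is hypothetical.

```python
import math

def prox_ability_from_score(score, n_items, item_sd, mean_difficulty=0.0):
    """PROX ability estimate in logits from a raw score on n_items items."""
    multiplier = math.sqrt(1.0 + item_sd ** 2 / 2.9)
    return mean_difficulty + math.log(score / (n_items - score)) * multiplier

n_items, item_sd = 42, 1.73
estimates = {score: prox_ability_from_score(score, n_items, item_sd)
             for score in (0.3, 1, 2, 3)}
for score, measure in estimates.items():
    print(f"score {score}: {measure:.2f} logits")
```

The gap between the adjusted zero score and a score of 1 is much wider than the gap between scores of 1 and 2, which is why a cluster of zero scorers can behave like outliers in a correlation.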

MikeLinacre: Uve, we need to know more about the numbers reported by the state tests. Are they on a non-linear raw-score-based metric, or on a linear latent-variable metric?

Also, you could adjust your computation to allow for the much larger S.E.s of extreme scores. You may want to use information weighting: (Rasch measure)/(Rasch S.E.^2), or effect-size weighting: (Rasch measure)/(Rasch S.E.).

uve: The state converts the raw scores to logits (though they call them thetas) and then are converted to a more readable scale that goes from 150 - 600. We call these the scale scores. Oddly enough, they calibrate their items to .67 rather than .50.

Most of the item std dev's for our local tests were in the .75 range. The example I gave was an unusual exception.

When you mention the two options for adjustment, I assume you mean that after the Rasch person measure has been calculated, Ln(correct/incorrect)*sqrt(1+item stdev^2/2.9), which is what I did, you then take that measure and either divide it by the S.E.^2 or just the S.E. Is that correct?

MikeLinacre: Yes, that is correct, Uve.

uve: If I understand you correctly, then I have a practical problem. I have 105 tests to correlate with our state tests. Each test can vary from 20 to 100 questions with the average being around 50 items. This means for each test I would have to write 20 to 100 conditional formulas in SPSS directing it to divide the final person logit measure by the S.E.

Clarification: when you say S.E., I assume you are referring to the item S.E. reported in Table 20 for a given score. Or are you referring to the person measure S.E. reported for a given person?

If mean item S.E. could be used, then this would be more practical as I would only have to direct SPSS to divide all person measures by the same mean item S.E. reported for that test. This would only be 105 additional steps, as opposed to an average of 50 times 105 steps, which I simply do not have time to do.

MikeLinacre: Thank you for your questions, Uve.

For complete data (no missing responses), the standard errors for an individual are effectively the same as the standard errors in Table 20.

For incomplete data (such as computer-adaptive tests), every individual's ability estimate has a different standard error.

If you are using the PROX formula, then the S.E. = multiplier * sqrt ( max score / ( correct score * incorrect score))

uve: Mike,

I can't thank you enough for your patience with me during this whole process. The students who score 0 typically only answer the first 3 or 4 questions, all of which are wrong. So there is much missing data with them.

Let me make sure I completely understand you up to this point using an example:

A student scores 40 on a 50 item test with a mean item measure std dev of .75. The equation would then be:

[Ln(40/10)*sqrt(1+.75^2/2.9)] /S.E.

where S.E. = [sqrt(1+.75^2/2.9)] * [sqrt(50/(40*10))]

If this is correct, I will give it a try. If these new person measures are not better correlated to the state scale scores than using our raw score, I think the best practical decision at this point would be to delete the 0 scores only from the SPSS regression, not Winsteps.
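
Uve's worked example can be checked with a short sketch that combines the PROX measure, the PROX standard error, and the two weightings mentioned earlier in the thread. The function and variable names are assumptions for this illustration, and the mean item difficulty is taken to be 0.

```python
import math

def prox_measure_and_se(correct, n_items, item_sd, mean_difficulty=0.0):
    """PROX ability estimate and its standard error, both in logits."""
    incorrect = n_items - correct
    multiplier = math.sqrt(1.0 + item_sd ** 2 / 2.9)
    measure = mean_difficulty + math.log(correct / incorrect) * multiplier
    se = multiplier * math.sqrt(n_items / (correct * incorrect))
    return measure, se

# 40 correct on a 50-item test with item S.D. = .75, as in the example above.
measure, se = prox_measure_and_se(40, 50, 0.75)
effect_size_weighted = measure / se        # (Rasch measure) / (Rasch S.E.)
information_weighted = measure / se ** 2   # (Rasch measure) / (Rasch S.E.^2)
print(round(measure, 2), round(se, 2),
      round(effect_size_weighted, 2), round(information_weighted, 2))
```

Because the formula needs only the raw score, the test length, and the item S.D., it can be applied as one computed column per test rather than as a separate conditional rule for every possible score.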

MikeLinacre: Good to read more information about this situation, Uve.

The "0" scorers are far from behaving in accordance with their abilities, so please do delete them. Inferences based on their ability estimates will be misleading.

593. Statistical test for improved fit for 2PL?

oosta September 14th, 2010, 9:51pm: The 3PL and 2PL advocates state that those models are usually better than Rasch because they usually fit the data better. However, these models have more parameters to estimate and, therefore, capitalize more on chance when they estimate the parameters using the specific sample of response data. Has research been done to determine whether the increase in fit is statistically significant--that is, greater than what you would expect by chance? If so, is the fit for the 2PL and 3PL models usually statistically significant?

I am ignoring the philosophical reasons for choosing Rasch over 2PL/3PL.

MikeLinacre: Oosta, the usual methods for comparing nested models are chi-square tests and AIC-type tests.

A complication is that 2PL/3PL assume parameter distributions. Rasch does not. So, if the true parameter distributions match 2PL/3PL, then their fit should be better. But if the true distributions do not match 2PL/3PL, then 2PL/3PL force the estimated parameters to fit the distributions, distorting the estimates and the fit statistics. The effect is that 2PL/3PL can be reported with better fit than Rasch, even though the parameter estimates are inaccurate!

Typical 2-PL assumptions are: the sample is normally distributed, and the discrimination parameters are log-normally distributed.
3-PL adds the assumption: lower asymptote = 1/(number of options), when the data are too thin to estimate the lower asymptote from the data (the usual situation).

In simulation studies, the results have been ambivalent. See, for instance, "surprising to observe the one parameter model performing better than the three parameter model . . . since there is no theoretical reason to expect such a result" (in Hambleton, pp. 208-209).

Applications of Item Response Theory ed. by R.K. Hambleton, Vancouver, B.C.: Educational Research Institute of British Columbia, 1983
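
As a rough sketch of the chi-square and AIC-type comparisons of nested models mentioned above: the log-likelihoods and parameter counts below are hypothetical numbers invented for the illustration (they do not come from any real analysis), the parameter counting is deliberately simplified to item parameters only, and the function names are assumptions. The only outside value used is the standard critical value 31.41 for 20 degrees of freedom at alpha = .05.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: smaller is better."""
    return 2 * n_parameters - 2 * log_likelihood

def likelihood_ratio(loglik_reduced, k_reduced, loglik_full, k_full):
    """Likelihood-ratio chi-square and its degrees of freedom for nested models."""
    return 2 * (loglik_full - loglik_reduced), k_full - k_reduced

# Hypothetical 20-item test: the Rasch model estimates 20 item difficulties,
# and the 2PL model adds 20 item discriminations.
rasch_ll, rasch_k = -5230.0, 20
twopl_ll, twopl_k = -5215.0, 40

lr_stat, lr_df = likelihood_ratio(rasch_ll, rasch_k, twopl_ll, twopl_k)
CRITICAL_05_DF20 = 31.41  # chi-square critical value for df = 20 at alpha = .05
print(f"LR chi-square = {lr_stat:.1f} on {lr_df} df, "
      f"significant at .05: {lr_stat > CRITICAL_05_DF20}")
print(f"AIC Rasch = {aic(rasch_ll, rasch_k):.0f}, AIC 2PL = {aic(twopl_ll, twopl_k):.0f}")
```

In this made-up case the 2PL log-likelihood is higher, but not by enough to justify its 20 extra parameters under either criterion. Such comparisons inherit the caveat above: they are only as trustworthy as the distributional assumptions behind the estimates.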

594. About the naming of the model

dachengruoque September 8th, 2010, 8:03am: In the FACETS software there is a folder named example, which contains an example file named subsets.txt. It seems to be a rating of something, but I cannot see the rating scale. Why is the rating-scale definition missing from this example file? Secondly, I want to know whether I have to tell the model specification which facet is the rater or the respondent by putting a B beside either # or ?. Thanks a lot, Dr Linacre!

MikeLinacre: Here are the answers to your questions, Dachengruoque.

1. Subsets.txt contains this specification:
Model = ?,?,?,R9
R9 is a rating-scale with a highest valid category number equal to, or less than, 9

2. "which facet is rater or respondent"
If you want to specify that a facet is the rater then
Inter-rater = facet number
There is no specification for "respondent"

If you want to specify bias or interaction analysis between two facets, then use two Bs:
Model = ?,?B,?B,R9
for the computation of interactions between the elements of facets 2 and 3.

dachengruoque: Thanks a lot, Dr Linacre!
For question 2, if the data responses are from a survey and I would like to investigate the behavior of the respondents, could I specify Inter-rater = the facet that represents the respondents?

MikeLinacre: Dachengruoque, inter-rater= produces the numbers shown at

These "agreement" statistics are not intended for respondents, but you can report them if you want to :-)

dachengruoque: Thanks a lot, Dr Linacre for your illuminating pointer!

dachengruoque: Obs % = Observed % of exact agreements between raters on ratings under identical conditions.

Exp % = Expected % of exact agreements between raters on ratings under identical conditions, based on Rasch measures.
If Obs % ≈ Exp % then the raters may be behaving like independent experts.
If Obs % » Exp % then the raters may be behaving like "rating machines".
Dr Linacre, in the case that Obs % » Exp %, how much difference would mean that the raters may be behaving like rating machines? 10% or 5%? Is there a rule of thumb or threshold? Thanks a lot!

MikeLinacre: Dachengruoque, when test administrators want the raters to act like machines, the administrators are hoping for 90%+ observed agreement between raters under identical conditions.
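
The observed side of this statistic is simple to illustrate. The sketch below computes the percentage of exact agreements between two raters who rated the same performances; the ratings are hypothetical and the function name is an assumption. It does not compute the Exp %, which depends on the estimated Rasch measures.

```python
def observed_agreement_percent(ratings_a, ratings_b):
    """Percentage of paired ratings on which two raters gave exactly the same category."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same set of performances")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)

# Hypothetical ratings of the same 10 performances by two raters.
rater_1 = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]
rater_2 = [3, 4, 2, 4, 3, 4, 4, 2, 3, 5]
obs_percent = observed_agreement_percent(rater_1, rater_2)
print(f"Obs % = {obs_percent:.0f}")
```

These two hypothetical raters agree exactly on 9 of 10 performances, which is right at the 90% level that administrators hoping for machine-like raters look for.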

dachengruoque: Thanks a lot, Dr Linacre!

595. Can I arrange the elements by infit t?

dachengruoque September 8th, 2010, 8:29am: Can I arrange the elements in either ascending or descending order of infit t or MnSquare in the output of FACETS 3.58? Thanks a lot, Dr Linacre!

MikeLinacre: Dachengruoque,
the Arrange= specification has some sort options.
For other sort orders, use the "Output files" menu, "Score file", to send the output to Excel.
In Excel you can do any sorting that you like :-)

dachengruoque: I don't know if I had the right place. I read the FACETS 3.58 user manual on page 76. I found that there is some specification for Arrange=, for example F/f for ascending/descending order of infit MnSquare, but not Infit t. Yes, probably Excel is the better choice to sort whatever I like. Thanks a lot, Dr Linacre, for your kind and patient answers all the time!

MikeLinacre: Yes, dachengruoque, there are many different sort orders. Arrange= has only a few of the sort orders.

dachengruoque: Thanks a lot, Dr Linacre!

596. Easier Test or Better Teaching?

uve September 8th, 2010, 1:08am: Dear Mike,

In analyzing two English theme tests for our 4th graders, the mean percent correct was 62% for Theme1 and 77% for Theme5. In Winsteps the mean person measure of Theme1 was .72 and the stdev was 1.07. For Theme5: 1.70 and 1.39. Mean item measure and stdev for these tests were: 0, .92 and 0, .85

Proficiency cut points for both tests are set at 78% correct. I want to argue that the two tests can't have the same cut points because one is clearly different from the other. I want to argue that the second test is easier, but the staff may say that both tests are intrinsically the same in difficulty, it's just that the subject matter covered on Theme5 was taught better and the students scored better accordingly. They may have a point. The same group of students took both tests.

Since the same students took both tests, I would say that the difference in mean ability is a factor of the test, not of better teaching. Perhaps there’s not enough information here to say one way or the other, but I’d really like to hear your thoughts on the subject.

MikeLinacre: This is a puzzle, Uve.

Better teaching makes the test easier, and also the students more able.

For instance, an arithmetic test:
the better teacher says: 2 x 2 is the same as 2 times 2.
the worse teacher forgets to tell the students about "x".

So, on the test is the question: 3 x 6=
For the students of the better teacher, the item is easy, and the students are better able to do multiplication.
For the students of the worse teacher, the item is very hard, and the students are less able to do multiplication.

In this example, we would attribute the real difference to the teaching, and the changes in item difficulty and student ability are reflections of the teaching.

uve: Good points.

So if I assume the ability increased on the second test due to good teaching, I should keep the cut points for both tests the same. If I assume the items on the second test were just easier, then I would need to adjust the cut points for proficiency higher.

Is there any other evidence I should gather to help guide me in either direction?

MikeLinacre: It is confusing for everyone if the item difficulties are not the same in equivalent situations, Uve. If the item difficulties change, then the definition of the construct has changed. We don't know what we are measuring!

Usually we choose a "standard" situation and use its item difficulties everywhere. Often the standard situation is the pre-test or the first test whose results are published. This defines the situation to which future test results will be compared. We can quantify changes in item difficulty at later times as displacements or DIF, but adjustment is not made for these in the person ability estimates.

597. Dimensionality and Variance explained by items

RaschNovice September 8th, 2010, 7:04am: In the knox cube test example, the variance explained by the items is 47.8 percent. The modeled variance explained was 35.5 percent.

The first contrast is 2.7 eigenvalue units, or 5.6%.

The difference between the variance actually explained, and the modeled variance explained begs the question: Which to compare to the variance of the first contrast?

The argument to use the actual variance explained is: Hey, that's what the data say...

The argument to use the modeled variance is: In the Rasch paradigm, we're constructing rulers, so I need to know how good my ruler is relative to some confounding influence.

Which should we use, Mike, and why?

MikeLinacre: RaschNovice, neither!

The "variance explained" is dominated by the spread of the items and the persons - see https://www.rasch.org/rmt/rmt201a.htm Figure 4.

If you want to make a comparison, it is with the expected size of the unexplained variance, see https://www.rasch.org/rmt/rmt233f.htm

For most practical purposes, an eigenvalue of 2 (the strength of 2 items) is an indication that there is a noticeable secondary dimension (factor, strand, component, etc.) in the data.

So your 2.7 eigenvalue suggests that there is a content difference between the items at the opposite ends of the first contrast (opposite loadings on the 1st principal component in the residuals).

598. Winsteps Table 20 vs. Person Ability Revisited

uve September 2nd, 2010, 5:08pm: Dear Mike,

Several weeks ago you corrected my mistake regarding how to approximate ability levels by providing the multiplier. I am correlating state tests with our own and am converting the raw scores into logits to run the regression analyses. When I ran one of these tests (50 questions) through Winsteps, the measure standard deviation was .92. Using the equation you provided then comparing those to four scores in Table 20, here is what I got:

1)Table 20, raw score 41, measure 1.76: my measure calculation, 1.72
2)Table 20, raw score 31, measure .56: my measure, .58
3)Table 20, raw score 18, measure -.69: my measure -.65
4)Table 20, raw score 10, measure -1.61: my measure -1.58

Example 1 calculation 0 + Ln(41/9)*sqrt(1+.92^2/2.9) = 1.72

Though these aren’t exact, I wanted to make sure this was just an approximation function. It appears the function is closer to Table 20 where the test takes on a more linear behavior, and deviates more at the extremes.

Also, how do I deal with scores of 0 and 100%? I can't divide by zero, so SPSS skips this and treats it as missing data. For 0% it produces a logit of 0, which is not helpful either. I believe this throws my regression coefficients off somewhat, but I guess that depends on how many students got a zero on the test and how many got 100%. For example, according to the SPSS regression coefficients, 1.84 logits corresponds to a score of 41. All individual scores of 41 in SPSS read 1.72. Very confusing. In case it is helpful, here are the regression coefficients in the equation output of SPSS where I'm trying to find out what advanced on the state tests (402) corresponds to on our tests:

402 = 41*1.84 + 326. If I use the Winsteps data then: 398 = 41*1.76 + 326, which puts me off by 4 state scale score points. So to make the correction it would have to be: 402 = 43*1.76 + 326. So I have to decide whether it is 41 or 43. This can make a big difference in determining advanced on our tests. I hope you can understand the dilemma I'm in.

Should I delete these 0 and 100% scores first before converting to logits and running the regression? SPSS ignores missing data in the analyses unless I state otherwise.

As always, many thanks for your time!

MikeLinacre: Uve, yes, "0 + Ln(41/9)*sqrt(1+.92^2/2.9)" is the PROX (normal approximation) estimate of the maximum-likelihood estimate. The closer everything is to a normal distribution, the better the approximation.

Extreme scores: usually a score correction of 0.3 score-point is applied, so ln(0/50) => ln(0.3 / 49.7) and ln(50/0) => ln(49.7/0.3).
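
As a rough illustration, the PROX formula and the 0.3 score-point correction described above can be written as a small Python function. This is only a sketch: the name `prox_measure` and its parameter names are made up for this example, and the standard deviation is simply whatever value is plugged into the formula (.92 in Uve's example).

```python
import math


def prox_measure(raw, n_items, mean=0.0, sd=0.0, correction=0.3):
    """PROX (normal approximation) measure for a raw score on n_items dichotomous items.

    mean and sd are the mean and standard deviation (in logits) used in the
    formula above; correction is the score-point adjustment for extreme scores.
    """
    if raw <= 0:
        raw = correction                 # zero score: ln(0.3 / (n - 0.3))
    elif raw >= n_items:
        raw = n_items - correction       # perfect score: ln((n - 0.3) / 0.3)
    expansion = math.sqrt(1 + sd ** 2 / 2.9)
    return mean + math.log(raw / (n_items - raw)) * expansion


# Uve's Example 1: raw score 41 out of 50, mean 0, standard deviation .92
print(round(prox_measure(41, 50, 0.0, 0.92), 2))   # 1.72
```

With the correction in place, scores of 0 and 50 produce finite logits, so they no longer have to be dropped as missing data before the regression.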

uve: Thanks Mike. This makes sense and worked out well for me. Another thing then:

I've converted the individual student raw scores of our tests to logits using the equation and applying the .3 adjustment for extreme scores, then ran a linear regression in SPSS of our local test logit scores to the state scale scores those same students received. The Pearson R is .75 and higher for most tests so I know I've got great relationships between our tests and the state. Using the regression line coefficients, I then plug in the logit scores reported in Table 20 and get a scale score conversion of our raw scores into state scale scores. All looks great, but here's the problem.

This data is based on all the tests that were taken in the school year 2009-10 and the state exam given at the end of the school year in May. But I've been notified that these same local tests have been modified for the 2010-11 year. So when a student receives a raw score this year, the converted scale score may not be accurate any longer.

I know I can anchor items in Winsteps and get a new adjusted Table 20. Now here's my question: once I do this, do I just use the same original SPSS regression line coefficients and plug those in using the new Table 20 to get the new adjusted scale scores?

If so, this seems to suggest that we are saying, "Had the students this year taken last year's test, these would have been their scores and corresponding scale scores. Therefore the scale scores for this year's new students taking the new version are comparable using the regression coefficients developed from last year's test."

As always, thanks so much for your help!

MikeLinacre: Uve, it looks like you need to equate the 2010 test with the 2009 test.
Do they share some items? Then you can do "common item" equating.
If you have strong reasons for believing that some person samples are statistically equivalent, then you can do "random equivalent samples" equating.
Otherwise "virtual equating", where you align approximately equivalent items on the item map: https://www.rasch.org/rmt/rmt193a.htm

After the equating has been done, logits on the 2010 test can be converted into logits on the 2009 test, and the 2009 scores estimated.

uve: Yes, the tests share many questions in common. I already know which items to anchor, so that won't be a problem. Once that is done, I assume I'll go to table 20, get the logits for each raw score from 0 - 50, then apply the SPSS regression coefficients (calculated from spring's state test scale scores and the 2009 version) to these new logits to get the scale scores for 2010. Does that sound valid?

MikeLinacre: Yes, that sounds like a good procedure, Uve.

Equating 2010 to 2009 will convert 2010 logits to 2009 logits. Then the regression will convert 2009 logits to 2009 raw scores.

The SPSS regression needs to be logit-linear (or linear if extreme scores are not a concern). Usually the values given under Winsteps Table 20.1 are good enough for a linear conversion between raw scores and logit measures.

uve: Thanks!

599. Export of vertical ruler and Facet information

dachengruoque September 7th, 2010, 2:09pm: In FACETS, I find it pretty difficult to export a neat and dandy vertical ruler to Word 2007, either by copy and paste or by other means. But the vertical rulers in published papers are pretty neat. How can I do that? Is there a knack for exporting it attractively without sweat? Thanks a lot! Following the vertical ruler, if there is a facet tailed by a letter A, does it stand for anything else besides a label? What else could it represent? Thanks a lot!

MikeLinacre: Thank you for your questions, Dachengruoque.

In Facets, please use Vertical= and Yardstick= to format the vertical ruler.
www.winsteps.com/facetman/vertical.htm and www.winsteps.com/facetman/yardstick.htm

Also please see ASCII= www.winsteps.com/facetman/ascii.htm for some suggestions about printable vertical rulers.

dachengruoque: Thanks a lot, Dr Linacre!

600. Strict axioms for Rasch analysis of likert data?

dachengruoque September 5th, 2010, 2:21am: What strict axioms testing should researchers do before the likert scale data could be used in Rasch analysis? Thanks a lot!

MikeLinacre: The axioms are the same as for a meaningful raw-score analysis of Likert data, Dachengruoque.

The responses are scored so that:
higher person score = more of what we are looking for, relative to lower person score
higher rating scale category = more of what we are looking for, relative to a lower rating scale category
If these hold, then a Rasch analysis will be meaningful.

We don't know, in advance, whether they do hold, so we do a Rasch analysis, and then look at its results. After some adjustment of the data, the Rasch analysis usually is meaningful.

dachengruoque: Yes, I mean that there should actually be some requirements for doing Rasch analysis if researchers would like to apply it to their data. For example, before an analysis of variance, one of the statistical requirements is that the data should be normally distributed and that the variances of the groups are homogeneous. But what exact statistical procedures should we carry out to make sure that, say, some Likert scale response data are suitable for Rasch analysis? Or where should researchers find the relevant information for ascertaining the appropriateness or meaningfulness of doing Rasch analysis with WINSTEPS or FACETS, which you kindly authored and updated? Thanks a lot, Dr Linacre!

MikeLinacre: ANOVA is a good example, Dachengruoque. Usually we don't know whether its assumptions have been met until after we have performed the analysis. And, of course, they are never met exactly.

It is the same with Rasch analysis. Each Rasch analysis produces measures, but also fit statistics and other indicators. We examine the fit statistics, measure hierarchy and other indicators to verify that the Rasch analysis has produced usable measures.

Theoreticians have proposed mathematical axioms for Rasch measurement, but these are impractical. The chaos inherent in real data is too great.

We are like road-builders. We want to build usable roads, not perfect roads. In the Arctic winter, a usable road is made of ice. Usually we would say, "a road that must be maintained every day, and that only lasts a few months, is a bad road", but we quickly realize that any road is better than no road.

Rasch analysis is like that. Any Rasch measures are better than no Rasch measures!

dachengruoque: Thanks a lot, Dr Linacre, for your explanation!

601. Is Journal of Applied Measurement Online ?

dachengruoque September 5th, 2010, 1:27am: Some expert listed a journal named Journal of Applied Measurement. Is it accessible online or indexed full-text in some database? Thanks a lot!

MikeLinacre: Dachengruoque, please see www.jampress.org

dachengruoque: Thanks a lot, Dr Linacre! If it were accessible online through some database like Wiley or ScienceDirect, that would be very good. Thanks a lot, anyway!

602. For Profit Companies using Rasch

AmyN September 1st, 2010, 6:30pm: Does anyone know of any for-profit/ commercial organizations that are currently using Rasch modeling with healthcare data? I am particularly interested in small analytic companies who build models to predict health events/ complications using Rasch or IRT for the general or working population (under age 65).

Thanks for any help you can provide!

MikeLinacre: AmyN, please also ask this in the Rasch Listserv: linked from www.rasch.org/rmt/index.htm

603. Bootstrapped DIF

OlivierMairesse August 31st, 2010, 12:31pm: Hello fellow Rasch enthusiasts.

What follows is a procedure undertaken to perform bootstrapped samples of DIF measures, but can also be used as a template for any other output file needed for your purposes.

The idea is to run WINSTEPS from a batch file n times, resampling person raw scores at every iteration, and to compute DIF estimates for the different person classes.
The procedure, however, spawns n files, which then have to be read and from which the data must be extracted. Extracting full tables and collapsing them together into one file can be done using PowerGREP. After that, a macro is written to order the data into a workable data matrix. The second and third procedures can probably be done easily if you are familiar with Regular Expressions (RegEx), but unfortunately, I am not (and will probably never be).

So this is what Mike, Jan from (http://it.toolbox.com/) and myself came up with. It's rather lengthy, but it works.



"Dear Mike,

This is the whole procedure to extract bootstrapped DIF measures. Thanks to you, Johan and a guy named Jan on a VBA forum.

1) Write this code in Notepad and save it as a .bat file to perform n bootstrap samples of raw scores by permuting persons. In this case we use 10 samples.

REM - produce the generating values: this example uses EXAMPLE.txt:
START /WAIT c:\winsteps\WINSTEPS BATCH=YES EXAMPLE.txt EXAMPLE.out.txt PFILE=pf.txt IFILE=if.txt SFILE=sf.txt

REM - initialize the loop counter
set /a test=1

REM - start of the loop body
:loop

REM - simulate a dataset - use anchor values to speed up processing (or use SINUMBER= to avoid this step)
START /WAIT c:\winsteps\WINSTEPS BATCH=YES Effectiveness_Final.txt Effectiveness_Final%test%.out.txt PAFILE=pf.txt IAFILE=if.txt SAFILE=sf.txt SIFILE=SIFILE%test%.txt SISEED=0 SICOMPLETE=No SIMEASURE=No SIEXTREME=No

REM - estimate from the simulated dataset
START /WAIT c:\winsteps\WINSTEPS BATCH=YES Effectiveness_Final.txt data=SIFILE%test%.txt SIFILE%test%.out.txt DIF=@STAKEHOLD pfile=%test%pf.txt ifile=%test%if.txt sfile=%test%sf.txt tfile=* 30 *

REM - do 10 times
set /a test=%test%+1
if not "%test%"=="11" goto loop

Results of every bootstrap sample will be found in SIFILE1.out.txt to SIFILE10.out.txt.

Create a folder and copy these files inside that folder.

2) Try or buy POWERGREP (http://www.powergrep.com/)

In powerGrep:

Search and include files from folder

fill in these commands:

Action type: Collect data
Search type: regular expression
-> Group results for all files

Search: \A([^\r\n]*+\r\n){9}(([^\r\n]*+\r\n){0,41})
Collect: \2

File sectioning: do not section; count line number
target file creation: save results into a single file
delimiter: line break
3) Your file should look like a series of tables 30.1.
Insert this text in Excel, and be sure to delimit the columns correctly to get your data (DIF measures).
Delete all redundant lines and columns to get your data as such:

Class 1 -.51 Class 2 1.52
Class 1 -.51 Class 3 .33
Class 1 -.26 Class 2 -.31
Class 1 -.26 Class 3 -.90
4) In my example I have 3 classes and 20 items so this creates lines of 60 DIF measures.

-> this is the macro I used

Sub extractDIF()
Dim i As Long
Range("E1").Select '---activate cell to the right of last cell with data
While Len(ActiveCell.Offset(0, -4)) > 0 '---While there are lines with data to the left of the active cell
i = i + 1
ActiveCell = i '---Row number for sort to be performed after macro execution
ActiveCell.Offset(0, 1) = ActiveCell.Offset(0, -3)
ActiveCell.Offset(0, 2) = ActiveCell.Offset(0, -1)
ActiveCell.Offset(0, 3) = ActiveCell.Offset(1, -1)

ActiveCell.Offset(0, 4) = ActiveCell.Offset(2, -3)
ActiveCell.Offset(0, 5) = ActiveCell.Offset(2, -1)
ActiveCell.Offset(0, 6) = ActiveCell.Offset(3, -1)

ActiveCell.Offset(0, 7) = ActiveCell.Offset(4, -3)
ActiveCell.Offset(0, 8) = ActiveCell.Offset(4, -1)
ActiveCell.Offset(0, 9) = ActiveCell.Offset(5, -1)

ActiveCell.Offset(0, 10) = ActiveCell.Offset(6, -3)
ActiveCell.Offset(0, 11) = ActiveCell.Offset(6, -1)
ActiveCell.Offset(0, 12) = ActiveCell.Offset(7, -1)

ActiveCell.Offset(0, 13) = ActiveCell.Offset(8, -3)
ActiveCell.Offset(0, 14) = ActiveCell.Offset(8, -1)
ActiveCell.Offset(0, 15) = ActiveCell.Offset(9, -1)

ActiveCell.Offset(0, 16) = ActiveCell.Offset(10, -3)
ActiveCell.Offset(0, 17) = ActiveCell.Offset(10, -1)
ActiveCell.Offset(0, 18) = ActiveCell.Offset(11, -1)

ActiveCell.Offset(0, 19) = ActiveCell.Offset(12, -3)
ActiveCell.Offset(0, 20) = ActiveCell.Offset(12, -1)
ActiveCell.Offset(0, 21) = ActiveCell.Offset(13, -1)

ActiveCell.Offset(0, 22) = ActiveCell.Offset(14, -3)
ActiveCell.Offset(0, 23) = ActiveCell.Offset(14, -1)
ActiveCell.Offset(0, 24) = ActiveCell.Offset(15, -1)

ActiveCell.Offset(0, 25) = ActiveCell.Offset(16, -3)
ActiveCell.Offset(0, 26) = ActiveCell.Offset(16, -1)
ActiveCell.Offset(0, 27) = ActiveCell.Offset(17, -1)

ActiveCell.Offset(0, 28) = ActiveCell.Offset(18, -3)
ActiveCell.Offset(0, 29) = ActiveCell.Offset(18, -1)
ActiveCell.Offset(0, 30) = ActiveCell.Offset(19, -1)

ActiveCell.Offset(0, 31) = ActiveCell.Offset(20, -3)
ActiveCell.Offset(0, 32) = ActiveCell.Offset(20, -1)
ActiveCell.Offset(0, 33) = ActiveCell.Offset(21, -1)

ActiveCell.Offset(0, 34) = ActiveCell.Offset(22, -3)
ActiveCell.Offset(0, 35) = ActiveCell.Offset(22, -1)
ActiveCell.Offset(0, 36) = ActiveCell.Offset(23, -1)

ActiveCell.Offset(0, 37) = ActiveCell.Offset(24, -3)
ActiveCell.Offset(0, 38) = ActiveCell.Offset(24, -1)
ActiveCell.Offset(0, 39) = ActiveCell.Offset(25, -1)

ActiveCell.Offset(0, 40) = ActiveCell.Offset(26, -3)
ActiveCell.Offset(0, 41) = ActiveCell.Offset(26, -1)
ActiveCell.Offset(0, 42) = ActiveCell.Offset(27, -1)

ActiveCell.Offset(0, 43) = ActiveCell.Offset(28, -3)
ActiveCell.Offset(0, 44) = ActiveCell.Offset(28, -1)
ActiveCell.Offset(0, 45) = ActiveCell.Offset(29, -1)

ActiveCell.Offset(0, 46) = ActiveCell.Offset(30, -3)
ActiveCell.Offset(0, 47) = ActiveCell.Offset(30, -1)
ActiveCell.Offset(0, 48) = ActiveCell.Offset(31, -1)

ActiveCell.Offset(0, 49) = ActiveCell.Offset(32, -3)
ActiveCell.Offset(0, 50) = ActiveCell.Offset(32, -1)
ActiveCell.Offset(0, 51) = ActiveCell.Offset(33, -1)

ActiveCell.Offset(0, 52) = ActiveCell.Offset(34, -3)
ActiveCell.Offset(0, 53) = ActiveCell.Offset(34, -1)
ActiveCell.Offset(0, 54) = ActiveCell.Offset(35, -1)

ActiveCell.Offset(0, 55) = ActiveCell.Offset(36, -3)
ActiveCell.Offset(0, 56) = ActiveCell.Offset(36, -1)
ActiveCell.Offset(0, 57) = ActiveCell.Offset(37, -1)

ActiveCell.Offset(0, 58) = ActiveCell.Offset(38, -3)
ActiveCell.Offset(0, 59) = ActiveCell.Offset(38, -1)
ActiveCell.Offset(0, 60) = ActiveCell.Offset(39, -1)

ActiveCell.Offset(40, 0).Range("a1").Select '---Selects the cell 40 rows below the current one (start of the next block of data)
Wend
End Sub

Delete original data and redundant lines.
Compute bootstrap means and intervals.
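
One way to carry out this last step outside Excel is sketched below in Python. It assumes that the DIF estimates for one item and class have already been collected into a list, one value per bootstrap sample. The function name `bootstrap_summary` and the percentile interval are choices made for this illustration, not part of the procedure above.

```python
import math
import statistics


def bootstrap_summary(estimates, level=0.95):
    """Return (mean, lower, upper) for a list of bootstrapped estimates.

    The interval is a simple percentile interval: the values that cut off
    (1 - level) / 2 of the sorted estimates in each tail.
    """
    ordered = sorted(estimates)
    n = len(ordered)
    alpha = (1 - level) / 2
    lower = ordered[math.floor(alpha * (n - 1))]
    upper = ordered[math.ceil((1 - alpha) * (n - 1))]
    return statistics.mean(ordered), lower, upper


# Hypothetical DIF measures for one item and class from 10 bootstrap samples
dif_samples = [-0.51, -0.26, -0.40, -0.33, -0.61, -0.18, -0.47, -0.29, -0.55, -0.38]
mean, lower, upper = bootstrap_summary(dif_samples)
print(f"mean={mean:.2f} interval=({lower:.2f}, {upper:.2f})")
```

With only 10 samples the percentile interval is simply the range of the estimates, so many more bootstrap samples would normally be wanted before reporting intervals.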


All the best,


PS: probably 8 ) will be read as 8), so change accordingly :)

604. Which software should I use to process data?

dachengruoque August 31st, 2010, 2:05am: I designed a survey to investigate anxiety levels in response to some prompts under different contexts.
The questionnaire runs like the following
Prompt sentence A
a) context A 1 2 3 4 5 6
b) context B 1 2 3 4 5 6
h) context H 1 2 3 4 5 6

Prompt sentence B
a) context A 1 2 3 4 5 6
b) context B 1 2 3 4 5 6
h) context H 1 2 3 4 5 6

I have some 13 prompt sentences. Under each prompt there are the same eight contexts (A-H), which run through all 13 prompt sentences. The questionnaire attempts to elicit the respondents' differential anxiety reactions to the different prompts under contexts A-H, where 1 represents strongly disagree and 6 strongly agree.

I would like to find 1) the differential anxiety reaction to the 8 different contexts for all the respondents as a whole
2) the differential anxiety reaction to different contexts for each respondent

I feel that I should use Facets to explore these two questions because the endorsability of the anxiety reactions to different prompts under different contexts is pretty much like eliciting different ratings on different facets of performance (like accuracy, fluency etc. of speaking) of different test takers. There are three facets: namely respondents (95), contexts (8) and prompts (13). By the way, I doubt that 95 respondents will suffice for processing the data. Probably I need more respondents. Am I right? Thanks a lot!

MikeLinacre: Dachengruoque, yes, this is a complicated situation.

The choice of software follows from the choice of measurement model.


Respondent overall anxiety + Prompt + Context + (respondent x context) = {data}

This is a Facets (or similar software) analysis

dachengruoque: Thanks a lot!

605. why can't bubble chart pop up?

dachengruoque August 30th, 2010, 7:43am: I find that the bubble chart doesn't pop up in Excel 2007 when I follow the instructions on page 292 of Applying the Rasch Model (2nd ed.). What should I do? My Winsteps version is 3.58. Thanks a lot!

MikeLinacre: Dachengruoque, Winsteps 3.58 was released in 2005. Sorry, it is probably not compatible with Excel 2007.

dachengruoque: thanks a lot!

606. Is that a mistake?

dachengruoque August 28th, 2010, 7:45am: https://www.winsteps.com/bond6.htm
The above link is the chapter 6 data file for the first edition of Bond and Fox's book.
The second line of the data file reads "Item1 = 3 ; Response to first item is in column 5". I guess that 5 should be 3. Looking at the data below, the first column seems to be gender information (2 for female, 1 for male, as the convention goes), and the second column is 7 in every row of data, so I think that 7 is a kind of delimiter that separates the personal information from the response to the first item. But the data file doesn't say what the personal information and the delimiter (7) stand for. Am I right? Thanks a lot for your patient replies to such questions from a green hand at Winsteps.

MikeLinacre: Thank you for pointing out the mistake in bond6.htm, Dachengruoque.

Yes, that line should be:
Item1 = 3 ; Response to first item is in column 3

This is a file from Trevor Bond. The first data line is:
The person identification (person label) is "27"
The response string is "23345363635515124556665541"
If the person label is to the left of the item responses, then the person label is in the columns from NAME1= to ITEM1= - 1.

dachengruoque: I think 27 is not the person label, because in the following data lines I found 27 and 17 repeatedly popping up at the very beginning of every data line. I think that 1 or 2 represents gender information (a wild guess) and 7 is a delimiter, but I found no information about the meaning of 27 or 17 either in the data file or in the first ed. of the book. If 27 or 17 are the person labels, does that mean that there are only two respondents to the survey? Thanks a lot for your patient replies, Dr Linacre!

dachengruoque: Dear Dr Linacre,
You said in the above reply that "If the person label is to the left of the item responses, then the person label is in the columns from NAME1= to ITEM1= - 1." Now in the example data file, person label ( 27 or 17) is on the left of the response string, so why do you say that "the person label is in the columns from NAME1= to ITEM1= - 1". I found that there is no minus 1 in the example data file. Thanks a lot!

MikeLinacre: Thank you for your post, Dachengruoque.

The person identification is in columns 1 and 2 of this data file.
ITEM1 = 3
ITEM1=3 - 1 = 2
So the first person label is in columns 1-2 and is "27".

"27" is probably a demographic indicator for the person, such as "Gender=2 + Ethnicity=7".

7 is definitely not a delimiter.
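
To make the column arithmetic concrete, here is a small Python sketch that splits a fixed-column data line the same way, using 1-based column numbers like NAME1= and ITEM1=. The function `split_record` is hypothetical, written only for this illustration, and the example line is assembled from the label and response string quoted above.

```python
def split_record(line, name1, item1):
    """Split a fixed-column data line into (person label, response string).

    name1 and item1 are 1-based column numbers. The label runs from column
    name1 to column item1 - 1, and the responses start at column item1.
    """
    label = line[name1 - 1:item1 - 1]
    responses = line[item1 - 1:]
    return label, responses


# Label "27" in columns 1-2, followed by the responses starting in column 3
line = "27" + "23345363635515124556665541"
label, responses = split_record(line, name1=1, item1=3)
print(label)       # 27
print(responses)   # 23345363635515124556665541
```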

dachengruoque: Thanks a lot, Dr Linacre for your timely and patient replies to my basic questions!

607. Processing of Missing values in likert scale  

dachengruoque August 28th, 2010, 12:54am: Before processing participants' responses to my survey, I found that some of them missed filling some surveys but not so seriously, so do I have to leave the missing values there or replace them all with the average response? Thanks a lot!

MikeLinacre: Dachengruoque, Rasch analysis of missing data is usually no problem. Rasch measures are estimated from the observed responses. Missing responses are ignored.

dachengruoque: OK, thanks a lot.

dachengruoque: Dear Mike, you said that it is usually no problem. Can I interpret that to mean that if the missing responses do not exceed 3% of all the responses, then it is OK to leave the missing responses there? Do I have to replace the missing values with some specific number, say 0, to tell WINSTEPS that a respondent failed to answer the item? Thanks a lot for patiently answering my questions.

MikeLinacre: Thank you for your email, Dachengruoque.

In computer-adaptive testing, we often have 90%+ missing data.

To tell Winsteps that a response is "missing", enter a non-numeric code in the data file, or omit the numeric code from CODES=.

For instance, if the valid data codes are 1,2,3,4 and 9 is missing data, then

dachengruoque: I see, Dr Linacre! Thanks a lot!

608. Manually Calculating Rasch Score

max August 4th, 2010, 8:52pm: Hi all, I think this should be relatively straightforward but I'm not sure how to do it:

I have a growing database of completed questionnaires that I've done a Rasch analysis on, and I'd like to be able to get a quick update when a bunch of new questionnaires get entered. I have item and structure files that I've used to anchor the calibrated items, but I'd like to be able to apply them directly within my SAS dataset instead of exporting the data to Winsteps, running with the anchored items, then importing the person files back into SAS.

For instance, if I have the structure file:

; CATEGORY Rasch-Andrich threshold
0 .000
1 -1.319
2 -.668
3 .393
4 1.594

and the item file (I took out a lot of the extra info, which I don't think is needed here?):

1 -.701
2 -.203
3 .396
4 .002
5 .809

can I calculate the Rasch score of a person who answers:

item1: 4
item2: 2
item3: 0
item4: 2
item5: 1

? I don't even care about the person fits, since I had already dealt with any fit issues during the item calibration. All I want is a quick way to look at the updated scores without a whole bunch of importing and exporting. Once all of the questionnaires are collected and entered I'll probably recalibrate the items, but even then it would be easier to import the item and structure files into SAS and apply them to the raw scores than to import the person file...


MikeLinacre: Max, use the algorithm at https://www.rasch.org/rmt/rmt122q.htm
In that algorithm, R is the raw score on the test, so, for your example, R = 4+2+0+2+1 = 9

Alternatively, these numbers are provided in Winsteps Table 20.
0 .000
1 -1.319
2 -.668
3 .393
4 1.594
1 -.701
2 -.203
3 .396
4 .002
5 .809
11111 ; dummy data for 5 items, so that Winsteps runs

And then output Table 20. It has the Rasch measures for every possible raw score.
You can then do a table look-up instead of a computation.

max: That was exactly what I was looking for, thanks! I do have 2 questions though. In order to reproduce the results I get directly from Winsteps, I need to set the iteration change far lower in SAS than I do in Winsteps; I used 0.005 in Winsteps but needed 0.00001 in SAS (I know you recommend 0.01 on the web page). Also, where does the score correction of 0.3 come from? Is that something that I should definitely leave in place, or could I play with it to make some of my extreme values slightly less extreme?

I included my entire code here just so anyone can see what I did, and included a few of my measures along with the item difficulties, the step calibrations, and the results I got from Winsteps.

data manual ;
input item1 item2 item3 item4 item5 item6 item7 item8 item9 item10 item11 item12 winsteps ;
format winsteps 8.3 ;
label winsteps = 'Winsteps Generated Rasch Score' ;
cards ;
0 4 4 4 4 0 0 0 0 4 4 0 0.02903
2 4 4 4 0 4 4 2 2 4 4 1 0.90047
1 0 0 3 2 0 3 1 1 4 4 1 -0.27077
4 2 0 3 3 4 3 0 3 4 4 0 0.47787
1 2 0 1 1 4 1 0 2 2 1 1 -0.59564
1 2 2 2 2 1 0 2 2 2 2 2 -0.27077
1 0 0 2 0 3 0 0 2 2 0 0 -1.2047
1 0 0 1 0 3 1 1 1 1 2 1 -0.97655
3 0 2 3 3 3 2 4 3 4 3 4 0.80909
0 1 0 1 1 1 3 1 1 1 0 1 -1.0864
1 0 0 0 0 1 0 0 0 1 0 0 -2.5874
0 0 0 0 0 0 0 0 0 0 1 0 -3.7481
0 0 0 0 0 0 9 0 0 0 0 0 -4.8744
0 0 0 1 0 1 0 0 1 1 0 0 -2.271
1 1 0 0 0 0 0 0 1 1 0 0 -2.271
0 0 0 1 0 1 9 0 2 0 0 0 -2.1614
1 0 1 0 0 0 0 0 0 3 0 1 -1.812
0 0 0 1 0 1 0 1 0 0 0 1 -2.271
2 0 1 3 2 3 4 0 2 3 1 0 -0.19438
2 2 1 2 2 4 4 4 3 2 2 4 0.63791
0 0 1 2 2 0 2 0 1 1 1 0 -1.2047
0 0 0 2 0 1 0 1 2 2 2 1 -1.0864
0 0 0 2 3 3 9 9 4 4 3 9 0.11744
2 2 1 1 2 1 0 1 3 3 3 2 -0.19438
0 0 1 3 4 0 9 0 1 1 0 0 -1.0836
run ;

%let indata = manual ;
%let lconverge = 0.00001 ;
%let extremes = 0.3 ;
%let items = 12 ;
%let itemlist = item1 item2 item3 item4 item5 item6 item7 item8 item9 item10 item11 item12 ;
%let diffs = -0.33056 -0.03599 0.1975 0.11119 0.51657 0.08079 -0.17053 0.1753 0.40039 -0.57683 -0.45401 0.08618 ;
%let categories = 5 ;
%let choices = 0,1,2,3,4 ;
%let steps = 0 -1.245 -.08782 .22220 1.11064 ;

data &indata (drop=Dmean Rmin Rmax i e_score variance M_ j k h k0jsum h0tsum k0hsum j0tsum_a j0tsum_b) ;
set &indata ;
label item_count = '# of Items Answered'
R = 'Score'
M = 'Measure'
SE = 'Standard Error' ;
format M SE 8.3 ;
array items{&items} &itemlist ;
array diffs{&items} _TEMPORARY_ (&diffs) ;
array steps{&categories} _TEMPORARY_ (&steps) ;
array Pnij{&items,&categories} _TEMPORARY_ ;
Rmin = . ; Rmax = . ; R = . ; item_count = 0 ;
do i = 1 to dim(items) ;
if items{i} in (&choices) then
if R ne . then do ;
Rmax = Rmax + dim(steps)-1 ;
R = R + items{i} ;
Dmean = Dmean + diffs{i} ;
item_count = item_count + 1 ;
end ;
else do ;
Rmin = 0 ;
Rmax = dim(steps)-1 ;
R = items{i} ;
Dmean = diffs{i} ;
item_count = 1 ;
end ;
else items{i} = . ;
end ;
if R = 0 then R = &extremes ;
if R = Rmax then R = R - &extremes ;
M = Dmean + log( (R-Rmin) / (Rmax-R) ) ;

M_ = M + &lconverge * 10 ;

do until (abs(M_ - M) le &lconverge ) ;

M = M_ ;

do i = 1 to dim(items) ;
do j = 1 to dim(steps) ;
k0jsum = 0 ;
do k = 0 to j-1 ;
k0jsum = k0jsum + steps{k+1} ;
end ;
h0tsum = 0 ;
do h = 0 to dim(steps)-1 ;
k0hsum = 0 ;
do k = 0 to h ;
k0hsum = k0hsum + steps{k+1} ;
end ;
h0tsum = h0tsum + exp( h*(M-diffs{i}) - k0hsum ) ;
end ;
Pnij{i,j} = (exp( (j-1)*(M-diffs{i}) - k0jsum )) / h0tsum ;
end ;
end ;

e_score = 0 ;
do i = 1 to dim(items) ;
if items{i} in (&choices) then
do j = 1 to dim(steps) ;
e_score = e_score + (j-1)*Pnij{i,j} ;
end ;
end ;

variance = 0 ;
do i = 1 to dim(items) ;
if items{i} in (&choices) then
do ;
j0tsum_a = 0 ;
do j = 1 to dim(steps) ;
j0tsum_a = j0tsum_a + (((j-1)**2) * Pnij{i,j}) ;
end ;
j0tsum_b = 0 ;
do j = 1 to dim(steps) ;
j0tsum_b = j0tsum_b + ((j-1)*Pnij{i,j}) ;
end ;
/* model variance = E[X**2] - (E[X])**2 : square the summed expectation, not each term */
variance = variance + (j0tsum_a - j0tsum_b**2) ;
end ;
end ;

if abs((R-e_score)/variance) le 1 then M_ = M + ((R-e_score)/variance) ;

end ;

M = M_ ;
SE = 1/(variance**.5) ;

run ;

MikeLinacre: Congratulations on your success, Max!

You wrote: "In order to reproduce the results I get directly from Winsteps, I need to set the iteration change far lower in SAS than I do in Winsteps; I used 0.005 in Winsteps but needed 0.00001 in SAS (I know you recommend 0.01 on the web page)."

Reply: Fine! With modern computers a few thousand extra iterations are no problem! This algorithm uses Newton-Raphson iteration (which is simple and general). Winsteps uses iterative curve-fitting (which is intricate but more robust).

You wrote: Also, where does the score correction of 0.3 come from? Is that something that I should definitely leave in place, or could I play with it to make some of my extreme values slightly less extreme?

Reply: Please play with it. Statistical experts recommend anything between 0.1 and 0.5.
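
For readers without SAS, the same person-measure estimation can be sketched in Python. This is a minimal sketch, not the Winsteps algorithm: it assumes the Andrich rating-scale model with anchored item difficulties and step calibrations (with a placeholder 0 for the bottom category, as in the `steps` macro variable above), marks missing responses as `None`, and caps each Newton-Raphson step at 1 logit. All function names are illustrative.

```python
import math

def category_probs(measure, difficulty, steps):
    """Rating-scale category probabilities; steps[0] is the placeholder 0."""
    logits, cumulative = [], 0.0
    for k, tau in enumerate(steps):
        cumulative += tau
        logits.append(k * (measure - difficulty) - cumulative)
    top = max(logits)  # subtract the maximum to keep exp() stable
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def score_moments(measure, difficulties, steps):
    """Expected raw score and model variance summed over the items."""
    expected = variance = 0.0
    for d in difficulties:
        probs = category_probs(measure, d, steps)
        mean = sum(k * p for k, p in enumerate(probs))
        mean_sq = sum(k * k * p for k, p in enumerate(probs))
        expected += mean
        variance += mean_sq - mean ** 2
    return expected, variance

def estimate_measure(responses, difficulties, steps,
                     extreme=0.3, tol=1e-5, max_iter=100):
    """Newton-Raphson person measure and standard error."""
    used = [d for x, d in zip(responses, difficulties) if x is not None]
    raw = float(sum(x for x in responses if x is not None))
    max_raw = (len(steps) - 1) * len(used)
    if raw == 0:                      # extreme-score correction
        raw = extreme
    elif raw == max_raw:
        raw = max_raw - extreme
    measure = sum(used) / len(used) + math.log(raw / (max_raw - raw))
    for _ in range(max_iter):
        expected, variance = score_moments(measure, used, steps)
        change = max(-1.0, min(1.0, (raw - expected) / variance))
        measure += change
        if abs(change) <= tol:
            break
    _, variance = score_moments(measure, used, steps)
    return measure, 1.0 / math.sqrt(variance)
```

Calling `estimate_measure` with one row of Max's data, his `diffs` and his `steps` gives a measure whose expected score matches that row's raw score. Changing `extreme` shows how the 0.3 correction moves the estimates for zero and perfect scores.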

max: Hi Mike, I have another question about this; how do I calculate the scores when I have collapsed categories? The structure files look a good bit different than those from the non-collapsed analyses, and the only thing I've tried so far is just recoding all the data in SAS then using the sfile from the recoded Winsteps analysis (along with the new ifile), which isn't working. These are my original (working) and collapsed (not working) sfiles, with the category changes in between:

Sfile for original data:
0 .000
1 -.898
2 -1.156
3 -1.120
4 -.634
5 -.369
6 -.105
7 .355
8 .452
9 1.275
10 2.200

Old New
1 0
2 0
0 0
3 3
4 3
6 5
5 5
7 7
8 7
9 9
10 9

Sfile for re-scored data:
Step Difficulty
0 .000
1 41.780
2 .000
3 -43.545
4 41.780
5 -42.185
6 41.780
7 -41.389
8 41.780
9 -40.000

MikeLinacre: Max, collapsing categories changes everything. The original raw scores are also collapsed. This is confusing for everyone, so please only collapse categories when there is compelling and convincing evidence to combine categories!

I notice that your new scoring has unobserved intermediate categories. How do you want these analyzed?

a. Maintain unobserved categories in the category hierarchy. Even though they are unobserved, they are conceptual levels of performance: STKEEP=YES.

b. Squeeze unobserved categories out of the category hierarchy. These categories do not exist as conceptual levels of performance: STKEEP=NO
For b. it is usually easier to understand if we do this directly:
Old New
1 0
2 0
0 0
3 1
4 1
6 2
5 2
7 3
8 3
9 4
10 4

The analytical process is:
A. Original categories => collapsed categories
B. Original raw scores => collapsed raw scores
C. Collapsed data => Rasch measures
D. Rasch measures => collapsed raw scores
E. Rasch measures => original raw scores

Since there is not a 1-for-1 relationship between original raw scores and collapsed raw scores, regress the original raw scores on their corresponding collapsed raw scores or Rasch measures. This will give you a prediction equation that you can use in E. to obtain the expected original raw score from the collapsed raw scores or Rasch measures.

How does this work for you, Max?

609. Raw scores and Rasch scores

rsiegert August 22nd, 2010, 7:05pm: Dear Folks

I recently read a paper from 1999 that completed a Rasch analysis of a rehabilitation measure called the FIM+FAM (Functional Independence Measure + Functional Assessment Measure).

The authors found a modest number of misfitting items. They also correlated the raw scores with the Rasch scores and as these were very highly correlated concluded that users of this instrument could confidently continue totalling the raw scores and using these as there was little advantage in the Rasch scores.

I felt uneasy about this conclusion but am relatively new to RA and not a statistician by profession. I wondered what the position of Rasch experts was on this?

Warm Regards

Richard Siegert 8)

MikeLinacre: Richard: you probably received my response on the MBC listserv, but that is a great question which deserves to be answered every time!

Your question takes us back to 1989 and the start of Rasch work on the FIM.

The correlations between complete raw scores and Rasch measures are high. It would be surprising if they were not, because the raw scores are the sufficient statistics for the Rasch measures.

But .....

The inferences based on FIM raw scores can be misleading, particularly close to extreme scores. In 1989, there was a large graph displayed at Conferences which showed "FIM gain" vs. "Rehab. effort". It was based on raw scores and showed that Rehab. effort decreased in raw-score effectiveness as the patients neared independent status. Finding: "Rehabilitation is no longer cost-effective for close-to-independent patients. Discharge the patients (just before they become independent)!"

This finding contradicted the experience of therapists. When Ben Wright redrew the graph using Rasch measures, he reported that Rehab. effort increased in effectiveness with increasing patient independence. The problem was that the FIM raw scores (like almost all raw scores) showed a strong ceiling effect.

If you are a statistician, perhaps there is no advantage to Rasch measures. If you are a patient, there is a huge advantage. Rasch measures make the difference between ending therapy for a patient in an almost independent state, and ending therapy when the patient is truly independent. Raw scores bias findings relating to treatment effectiveness. Treatments in the center of the raw score range produce bigger changes in raw scores compared to similarly effective treatments near the extremes of the raw-score range.
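
The ceiling effect can be seen with a small Python sketch. It is a deliberate simplification, not the FIM analysis: it assumes a test of equally difficult dichotomous items, so that the measure is just the log-odds of the raw score. The function names are illustrative.

```python
import math

def logit_measure(raw, max_raw):
    """Log-odds measure for a non-extreme raw score."""
    return math.log(raw / (max_raw - raw))

def logit_gain(before, after, max_raw=100):
    """Gain in logits corresponding to a raw-score gain."""
    return logit_measure(after, max_raw) - logit_measure(before, max_raw)

central_gain = logit_gain(50, 60)   # 10 raw-score points in the middle
ceiling_gain = logit_gain(85, 95)   # 10 raw-score points near the top
```

Under these assumptions the same 10 raw-score points correspond to a much larger gain in logits near the ceiling than in the center, which is why raw-score gains understate progress for nearly independent patients.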

And Rasch measures have many other advantages. We can look at the history of thermometry. We could use raw-score thermometers (like Galileo thermometers http://en.wikipedia.org/wiki/Galileo_thermometer ). They correlate highly with Celsius thermometers. But who seriously thinks that a Galileo thermometer is a reasonable substitute for a truly additive thermometer?

610. recommended number of judges and ratees in Facets

pjiman1 August 19th, 2010, 3:45pm: Hello, thank you for your help in the past.

I am writing a grant and was wondering about recommended sample sizes for ratees and judges for a Many Facet Rasch Measurement. I have a 25 item rating scale that teachers use to rate students. I was wondering what would be the number of judges and ratees necessary to establish a sound network to link every parameter by some connecting observations. How many ratees per item should there be, how many judges per ratee and per item should there be, how many judges should there be so I can establish that the judges are interchangeable?

much appreciated


MikeLinacre: Thank you for your questions, pjiman1.

We definitely want a linked network of judges. The practical constraints are usually time and money. See https://www.rasch.org/rn3.htm which illustrates some situations.

Provided that a linked network of judges exists, then the more judges there are for each ratee, the more is the statistical power to detect aberrant rater behavior. From a Rasch perspective, the cost-benefit curve probably peaks at 3 judges rating each performance.

611. Rasch advantages over other conventional methods

lovepenn August 11th, 2010, 4:23am: Dear Mike,

I'm trying to clearly explain several advantages of using the Rasch model in developing a construct from a set of items, and wonder if my understanding is right.

Let's say I have 10 items.
If I use sum scoring or principal factor analysis, every item will be considered as contributing equally to the construct under investigation.
If I use the Rasch model, it takes into account the different levels of item difficulty when estimating person abilities. Am I saying this correctly?
Say item #7 is more difficult than item #4, and there are two persons who answered 9 items correctly, but one got #7 wrong and the other got #4 wrong. If I use sum scoring, then the two persons will have the same score (i.e., 9). But the Rasch model will take into account the different response patterns between the two persons.

I think I understand the procedure of how the Rasch model transforms the raw scores into equal-interval, linear measure, but I am not sure if I correctly understand how it handles item response patterns when estimating person ability.
If I say that we would know from the Infit Mean Square how much each respondent's response pattern is consistent with the hierarchical ordering of the items, would it be correct? If I say, that way the Rasch model handles the pattern of item responses, would it be correct?

As always, your advice will greatly help me to proceed. I would deeply appreciate any comment. Thanks, -lovepenn

MikeLinacre: This is a brave endeavor, lovepenn. Rasch is better, but for other reasons.

First, let's remove a misconception. The Rasch measures are the same for the same raw scores on the same sets of items. Rasch does not use "pattern measurement". Some IRT models could do that, but it has proved too difficult to explain to parents, lawyers and juries. So everyone reports: "same raw score on same items = same scale score".

So why is Rasch better? Many reasons:

1. The person abilities and the item difficulties are additive (linear) measures on the same latent variable, so the spacing between the measures is meaningful (unlike person raw scores or item p-values).

2. Rasch estimates are robust against missing data.

3. The hierarchy of person abilities supports predictive validity: "Do the person measures have the meaning that we intended them to have?"

4. The hierarchy of item difficulties supports construct validity: "Are we measuring what we intended to measure?"

5. The precision of each ability and difficulty is known. "How well does this test classify our person sample?"

6. The expected performance by any person on any item can be inferred from the abilities and difficulties. "What is the next thing to teach this child?"

7. The validity of the patterns of responses (item fit or person fit) can be investigated.
This is usually the starting point in a Rasch analysis. "Are the responses meaningful?"
This is where mean-squares come in.
High mean-squares = the response string is too chaotic.
Mean-square near 1.0 = the response string concurs with the Rasch model
Low mean-squares = the response string is too predictable. Each new response is not telling us something new about the person or the item.

And more .....
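
Lovepenn's two-person example can be worked through in Python to show where the response pattern does matter: in the fit statistics, not in the measure. This is a minimal sketch for dichotomous data. The ten item difficulties and the person measure of 2.5 logits are assumed values for illustration (both persons score 9, so both get the same measure), and the function name is hypothetical.

```python
import math

def fit_mean_squares(responses, ability, difficulties):
    """Infit and outfit mean-squares for one dichotomous response string."""
    sum_sq_resid = sum_info = sum_z_sq = 0.0
    for x, d in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(d - ability))   # Rasch success probability
        info = p * (1.0 - p)                      # model variance of x
        sq_resid = (x - p) ** 2
        sum_sq_resid += sq_resid
        sum_info += info
        sum_z_sq += sq_resid / info               # squared standardized residual
    infit = sum_sq_resid / sum_info               # information-weighted
    outfit = sum_z_sq / len(responses)            # unweighted average
    return infit, outfit

difficulties = [-2.25 + 0.5 * i for i in range(10)]  # easiest to hardest
ability = 2.5                                         # assumed measure for a score of 9

missed_hardest = [1] * 9 + [0]   # expected kind of pattern
missed_easiest = [0] + [1] * 9   # surprising pattern

fit_expected = fit_mean_squares(missed_hardest, ability, difficulties)
fit_surprising = fit_mean_squares(missed_easiest, ability, difficulties)
```

With these assumed values the person who missed the easiest item has clearly higher infit and outfit mean-squares than the person who missed the hardest item, although their raw scores and measures are identical.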

lovepenn: Thank you so much, Mike.

It makes things clear. It is often stated in the studies applying the Rasch model that the Rasch model focuses on the patterns of responses. That's why I was confused.
Given the Rasch procedure of converting raw total scores for each item and for each person into odds and then making log transformations of those raw data odds, I thought that the Rasch person measures should be the same between the two persons in the example mentioned above. So, I wondered if so, how the Rasch model handles the different patterns of responses.

Thanks for your clear points and additional comments on the Rasch advantages.

612. Nominal Data in Surveys

uve August 7th, 2010, 1:48am: Dear Mike,

I was recently given a 30 question survey to analyze (attached). My experience so far has been strictly with dichotomous student achievement testing data. This survey is messy, but that's another issue. Anyway, there are 4 different Likert scales, leaving 11 questions each with its own scale. Of these 11, questions 11, 17, 21, 25 and 29 are not on a scale. That is, they are strictly nominal data. I've attached the survey and will attach the Winsteps file in a reply.

Several questions:

1) How should nominal data be handled in such a survey? How would you assign scores, if at all?
2) I attempted to reverse score item 28 but apparently with no success. What did I do wrong?
3) How can I get Winsteps to display the titles of the item distractors?
4) What is the purpose of the CLFILE commands?

Sorry for all the questions. As always, I greatly appreciate your help.

uve: Here's the Winsteps file.

MikeLinacre: Thank you for these questions, Uve.

1) How would you assign scores, if at all?
A. Define the latent variable you want to measure with your survey.
B. Score each item so that "higher score = more of what you want to measure".
C. If you are uncertain about the scoring of an item.
C.1. Guess the best scoring you can.
C.2. Give the item a weight of 0 using IWEIGHT=
C.3. Analyze the data
C.4. Look at Table 14.3. This will tell you which option has the highest average ability measure = more of the variable.
C.5. Change the scoring (IVALUE=) or omit the item (IDELETE=) if no scoring makes sense.

2) I attempted to reverse score item 28 but apparently with no success. What did I do wrong?
A. All your items appear to be positively oriented (positive correlations). Reverse-scoring is not needed.
B. The reverse scoring is successful when NEWSCORE= is omitted. Please use IVALUE= or NEWSCORE=, but not both.

3) How can I get Winsteps to display the titles of the item distractors?
4) What is the purpose of the CLFILE commands?
A. CLFILE= specifies the title of the distractor:
(item number) (distractor code) (title)
1 A Not at all

uve: Mike,

Thanks for your replies. I've reviewed the Winsteps help on CLFILE and IVALUE and have 3 additional questions:

1) CLFILE help provides examples of how to label different groups of items that have different scales or individual items that have different scales; however, I can't quite seem to find examples of how to do both. In my attached file, Winsteps understands that groups 1 and 2 each have different scales. It also correctly assigned the unique scales for the questions assigned to group 0. However, it did not assign groups 3 and 4 correctly. What did I do wrong here?

2) IVALUE help provides examples of how to rescore items assigned to groups, but how do I tell Winsteps that I want each question assigned to group 0 to have its own scoring?

3) How do I modify the IVALUE and CLFILE to function the same should I decide not to group any questions at all? I assume I would have to provide scales and labels for each individual question to handle the CLFILE issue--is that correct? And would deleting the groups option have any effect on IVALUE?

As always, thanks for your patience and guidance.

MikeLinacre: Thank you for your questions, Uve.

Based on your CLFILE=
1. Every specified item-option has been correctly labeled.
2. Group 1 items 26, 27 have been labeled by item 1
3. Group 2 items 5, 6, 18, 19 have been labeled by item 2
4. Group 3 items 7, 8, 9, 10 - none of these items are specified in CLFILE=
5. Group 4 items 12, 13, 14, 15, 16 - none of these items are specified in CLFILE=

Suggestion: It is helpful to put the group number in the item label, and the item number in the CLFILE= label. These numbers can be removed later. For instance:

&END ; group numbers in item labels
01 1Comfort using data as instructional leader


CLFILE = * ; item numbers in option labels
1+A 1Not at all
1+B 1Somewhat comfortable

2. "how do I tell Winsteps that I want each question assigned to group 0 to have its own scoring?"

Group 0 is a special group which says "this item is grouped by itself".
GROUPS = 0000
is the same as
GROUPS = 1234

3. "How do I modify the IVALUE and CLFILE to function the same should I decide not to group any questions at all?"

Use IREFER= to control IVALUE=. IREFER= does not group items for analysis.
CLFILE= can be used to define the options for each item. It does not require GROUPS=

uve: Mike,

I think I've really botched this one up then. Here is what I intended in terms of groups:

Group 1: items 1,26,27
Group 2: items 2,3,4,5,6, 18, 19
Group 3: items 7, 8, 9, 10
Group 4: items 12,13, 14, 15, 16

Items 11, 17, 20-25, 28-30 each have their own unique Likert scale and so are "grouped" as 0.

I was attempting to have the CLFILE recognize the numbers 1-4 as grouping functions so I wouldn't have to duplicate the scales for each question that belongs to each group. For example, I wanted Winsteps to say, "I see the number 3 but this isn't referring to question #3, it means group 3, so I'll assign the same Likert scale to items 7 through 10. Now I see number 11, which doesn't really belong to a group, so I'll refer to this as the actual question and assign it the Likert scale for question 11."

My problem then flows into how I have IVALUE score items 11, 17, 20-25, and 28-30 each in a unique way that best measures the variable as you suggested.

So I've made two mistakes:

1) it seems I am asking CLFILE to do something it can't, and
2) I need to combine IVALUE with IREFER. I'm assuming IREFER applies the scoring I state in IVALUE to the item I tag in IREFER.

I would greatly appreciate it if you could provide me a solution to 1 and provide a brief example for 2.

On a side note: I initially grouped items as I did so that I could analyze table 3.6 and see how the 4 primary category structures in this survey behaved in terms of the Structure Calibration. However, I do eventually want to undo this so that I can look at the probability category curves for each question on an individual basis. That is, see what the "peaks" on each question look like. I hope I'm on the right track here. Thanks again as always.

MikeLinacre: Uve, yes, it is a good idea to have a CLFILE= for every item. Then if you change the grouping, etc., the item maintains its option descriptors.

If you want to label each group with the same labels, then you can use one item from each group as an example item:
26+A ; for group 1
3+A ; for group 2
7+A ; for group 3
12+A ; for group 4
11+A ; by itself
17+A ; by itself

IREFER= controls IVALUE=, but if IREFER= is omitted, then IVALUE= is controlled by GROUPS=.
If you specify both IREFER= and GROUPS=, then they operate independently.
IREFER= and IVALUE= do the rescoring.
GROUPS= groups the items for the analysis
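
The rescoring role of IREFER=, CODES= and IVALUE= can be illustrated with a short Python sketch. It is not Winsteps code: it only mimics the lookup logic as described here, assuming one response character per item, and the function name and the three-item example are hypothetical.

```python
def rescore(response_string, irefer, codes, ivalues):
    """Rescore one person's response string.

    irefer  : one rescoring letter per item, like IREFER=
    codes   : the valid data codes, like CODES=
    ivalues : maps each rescoring letter to its score string, like IVALUEx=
    """
    scored = []
    for observed, refer in zip(response_string, irefer):
        position = codes.find(observed)
        if position < 0:
            scored.append(None)                # not a valid code: treat as missing
        else:
            scored.append(int(ivalues[refer][position]))
    return scored

# Three items, each rescored in its own way (hypothetical example)
ivalues = {"A": "00011", "B": "01010", "4": "01230"}
scores = rescore("EDD", irefer="AB4", codes="ABCDE", ivalues=ivalues)
```

Here `scores` is `[1, 1, 3]`: the code E is the fifth valid code, so item 1 takes the fifth digit of the "A" scoring, and the code D takes the fourth digit of the "B" and "4" scorings. Grouping for the analysis is a separate decision and is not represented in this sketch.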

If you need to score each item in a different way, then each needs a different code letter in IREFER=. For instance, for 30 items, each scored in a different way
IREFER=ABCDEFGHIJKLMNOPQRSTUVWXYZ1234 ; 30 different item rescorings
CODES= ABCDE ; 5 valid codes in the data
IVALUEA=00011 ; scoring for item 1
IVALUEB=01010 ; scoring for item 2
IVALUE4=01230 ; scoring for item 30

then we can group the items which share the same rating-scale structure


if you want to look at the model probability curves for each item separately, then


613. has anyone used construct map alpha?

harbor_x August 6th, 2010, 11:33pm: I've been using this program for a few months now. It's somewhat complicated, but I feel like I'm getting a pretty good grasp on it. However, it says that you can add various sets of tests (I've only been able to run one test at a time).

Does anyone else have any experience with this?

link http://bearcenter.berkeley.edu/GradeMap/download.php

MikeLinacre: Harbor_x, there is a section in the FAQ about multiple constructs. Is that what you mean by "sets of tests"? http://bearcenter.berkeley.edu/wiki/index.php/ConstructMap_FAQ

harbor_x: Wow, thanks Mike. I've actually been looking all over the user manual for that info (it's huge).

614. grants or funding for Rasch Research?

pjiman1 August 4th, 2010, 4:37pm: I was wondering if the message board members happened to know of grant or funding sources for investigators who want to pursue Rasch analyses of measurement issues. I've looked at AERA, Spencer Foundation, and William T. Grant Foundation. Was wondering if others knew of sources. I should add that my areas of inquiry are measures of children/adolescent social and emotional issues and school-based prevention programs.

Thanks for your assistance.

MikeLinacre: Pjiman1, you will probably have more success looking for research money for the substantive area (e.g., children's behavioral problems), and then using Rasch as a research tool.

615. Is ICORFILE based on standardized residuals?

william.hula August 4th, 2010, 5:30pm: Hello,

I have a couple of questions about the WINSTEPS ICORFILE and PCORFILE output options.

1. Are the correlations based on the standardized or score residuals?

2. I've tried computing the correlations on both the standardized and score residuals obtained from the IPMATRIX option, and for both I get results that are different from the ICORFILE option. The differences aren't especially large, but neither are they trivial. Should the results obtained from running correlations on the residuals obtained from the IPMATRIX option match those obtained from ICORFILE?


MikeLinacre: Will,
in Winsteps, ICORFILE= is controlled by PRCOMP=, so
PRCOMP=R, then ICORFILE= inter-item correlation matrix of raw residuals
PRCOMP=S, then ICORFILE= inter-item correlation matrix of standardized residuals

ICORFILE= omits extreme scores in the correlation computation. These can be included or excluded in IPMATRIX=
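
To check an ICORFILE= result by hand, the computation can be sketched in Python. This is a minimal sketch under assumptions, not the Winsteps implementation: it assumes dichotomous data, known person measures and item difficulties, a plain Pearson correlation between item columns of residuals, and that persons with zero or perfect scores are dropped first. The function names are illustrative.

```python
import math

def residual_rows(data, abilities, difficulties, standardized=True):
    """Residuals per person, skipping persons with extreme scores."""
    rows = []
    for responses, b in zip(data, abilities):
        if sum(responses) in (0, len(responses)):
            continue                                   # extreme score: omit
        row = []
        for x, d in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(d - b))          # expected response
            resid = x - p                              # raw (score) residual
            if standardized:
                resid /= math.sqrt(p * (1.0 - p))      # standardized residual
            row.append(resid)
        rows.append(row)
    return rows

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def item_correlations(data, abilities, difficulties, standardized=True):
    """Inter-item correlation matrix of the residuals."""
    columns = list(zip(*residual_rows(data, abilities, difficulties, standardized)))
    return [[pearson(a, b) for b in columns] for a in columns]
```

Running `item_correlations` once with `standardized=True` and once with `standardized=False` corresponds to the PRCOMP=S and PRCOMP=R choices. Leaving the extreme persons in is one way a hand computation can come out different.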

616. numbers & correlations

arthur1426 August 2nd, 2010, 9:32pm: Hi Mike,

I'd like to clarify two points. 1) Most authors seem to suggest that 200 is the minimum number of subjects required for IRT-type analysis. However, in a review of functional status scales that employ IRT, I've come across multiple manuscripts that fall short of this number (e.g., Dubuc et al., 2004 applied the Rasch Partial Credit model to a sample of 91; Haley et al., 2002 used the Rasch Rating Scale model on a sample of 150; Kempen & Suurmeijer, 1990 applied nonparametric Mokken scaling to a sample of 100). Might it be that the adequacy of test targeting influences sample size, and thus a well-targeted test may produce adequate location precision with fewer than 100 subjects?

2) In discussing the utility of Rasch methods, it was suggested that the benefits of the Rasch method can be overstated (as compared to CTT). For instance, the correlations between Likert scores and IRT person parameters are very high (~ r=0.95). Am I right in thinking that this may be true for the middle of the distribution, but that the benefits become more apparent at the higher and lower ends of any particular scale?

Any input concerning these two points would be much appreciated.

MikeLinacre: Thank you for your email, Rob.

1. To answer this, we need to distinguish between descriptive IRT models (which need big samples) and prescriptive Rasch models (which function well with small samples). For robust Rasch estimation under normal conditions, a sample size of 30 is sufficient: https://www.rasch.org/rmt/rmt74m.htm
For a general discussion of "What is IRT?", see https://www.rasch.org/rmt/rmt172g.htm

2. Yes, very high and very low raw scores are non-linear with the central raw scores. However, the correlations between Rasch measures and raw scores can be above r=0.99. So we can say, "if the raw score correlation with the Rasch measures is very high, then the raw scores approximate linearity, but if not, then they don't." The problem is that we don't know which alternative is true until we have estimated the Rasch measures! Further, Rasch analysis reports many other features of the data, not usually discovered in a CTT analysis.
More importantly, Rasch produces a map of the item difficulties and person abilities, located on the latent variable. CTT rarely reports raw scores and p-values coherently together. This map is important when investigating the Construct validity and the Predictive validity of the Rasch measures (or raw scores).
And, when there are missing data, CTT collapses ....
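
On point 1, the link between targeting, sample size and precision can be sketched in Python. This is a rough illustration, not a sample-size rule: it assumes dichotomous items and that every person in the sample has the same probability of success on the item, so that the standard error of the item difficulty is 1 / sqrt(N * p * (1 - p)). The function name is illustrative.

```python
import math

def item_se(n_persons, success_rate):
    """Approximate S.E. (logits) of a dichotomous item difficulty."""
    information = n_persons * success_rate * (1.0 - success_rate)
    return 1.0 / math.sqrt(information)

well_targeted_30 = item_se(30, 0.5)    # item matched to the sample
off_target_30 = item_se(30, 0.9)       # item far too easy for the sample
well_targeted_200 = item_se(200, 0.5)  # larger sample, same targeting
```

Under these assumptions a well-targeted item reaches with 30 persons a precision that an off-target item needs a considerably larger sample to match.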

arthur1426: Thanks for this Mike. The links as well should prove informative.

Best regards,


howard July 29th, 2010, 4:53am: In IPMATRIX, whenever I try to save a matrix of "response values after scoring" or most other, but not all, matrices, and I request "include extreme persons " I get the extreme high values (all items correct) but not the extreme low values (all items incorrect).

Is there a way I can get all subjects in the output files?

Thank you

MikeLinacre: Howard, this is strange. I have just tested this instruction again, and it works correctly for me in Winsteps 3.70.

What version of Winsteps are you using?

Please edit Exam1.txt, add a line at the end of the file which is all 1's and another which is all 0's. Then analyze these data. Does it work correctly for you?

howard: Thank you for your prompt reply. I am using Winsteps 3.70 and after getting your response I tried saving the output as an EXCEL file and ASCII file, both worked fine. The problem was when I had the output go directly into SPSS. I am using an older version of SPSS. In any case, as far as I am concerned the problem is solved.

Thank you :)

MikeLinacre: Thank you for asking about this problem, Howard. There is a bug in the SPSS output from IPMATRIX= which will be remedied in the next update to Winsteps.

618. person estimates

nike2357 July 19th, 2010, 10:20am: Dear Mike,

I'm a bit confused about the person estimates output in Table 18.1. I did an analysis of the same items and persons in a different program (WINMIRA) and the person estimates it gives me differ substantially from the ones in the Winsteps output. Winmira uses WLE estimates, while Winsteps uses JMLE, I think. Could this be the reason? But how can the estimates differ so much (for some persons by more than one logit)?
Thanks for your help!

MikeLinacre: There are two stages here, Nike. So let's identify where the problem is:

Let's assume your data are dichotomous, scored 0-1.

1. The item difficulties
Please cross-plot the item difficulties for Winsteps and Winmira. Are they noticeably different? In what way?
Does setting STBIAS=YES in Winsteps bring the item difficulties into closer alignment?

2. The person estimates.
Please anchor the Winsteps items at the Winmira item difficulties using IAFILE=, and then compare the person estimates. Are they the same or different?

This will isolate the problem.

nike2357: Dear Mike,

thanks for your ideas. My data are actually polytomous (5-point scale).

1. Item difficulties in Winsteps are consistently a bit higher than in Winmira (between 0.01 and 0.15). Using STBIAS=YES they are aligned closer, the highest difference now is 0.12.

2. Person estimates still differ after anchoring, but less so. For middle scores they now differ around .1, towards extreme scores the differences increase (up to .37 for the maximum score). Here too estimates by Winsteps are always higher compared to the ones reported by Winmira.

MikeLinacre: Nike, does "higher" mean more extreme (away from zero)?

If so, then that result is expected for the items. The JMLE estimation method used in Winsteps produces more extreme estimates than the CMLE estimation method used in Winmira. The difference is usually much less than the standard error of the estimates, and is usually corrected by STBIAS=Yes.

For the persons, we expect the person estimates from the anchored run of Winsteps to agree with the MLE estimates produced by Winmira, but since the data are polytomous, both the items (IAFILE=) and the rating scale (SAFILE=) must be anchored. The WLE estimates of Winmira will be more central than the MLE estimates.
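
The difference between MLE and WLE person estimates can be seen in a small Python sketch. It is a simplification of the situation discussed here: it assumes dichotomous items with known (anchored) difficulties, uses Warm's weighted likelihood correction for the WLE, and solves both estimating equations by bisection. It reproduces neither Winsteps nor Winmira, and the function names are illustrative.

```python
import math

def _terms(theta, difficulties):
    """Expected score, information, and Warm's correction numerator."""
    probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
    expected = sum(probs)
    info = sum(p * (1.0 - p) for p in probs)
    skew = sum(p * (1.0 - p) * (1.0 - 2.0 * p) for p in probs)
    return expected, info, skew

def _solve(f, lo=-10.0, hi=10.0):
    """Bisection for a function that is positive at lo and negative at hi."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def mle(raw, difficulties):
    """Maximum likelihood: observed score equals expected score."""
    return _solve(lambda t: raw - _terms(t, difficulties)[0])

def wle(raw, difficulties):
    """Warm's weighted likelihood: adds skew / (2 * info) to the score equation."""
    def equation(t):
        expected, info, skew = _terms(t, difficulties)
        return raw - expected + skew / (2.0 * info)
    return _solve(equation)

difficulties = [-2.25 + 0.5 * i for i in range(10)]   # assumed anchor values
high_mle, high_wle = mle(8, difficulties), wle(8, difficulties)
low_mle, low_wle = mle(2, difficulties), wle(2, difficulties)
```

For a raw score of 8 out of 10 the WLE lies between zero and the MLE, and for a raw score of 2 it lies between the MLE and zero, so a WLE-to-MLE comparison would show Winmira's WLE values as the more central ones.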

Does this help explain the situation, Nike?

nike2357: Dear Mike,

thanks for the quick reply.

Yes, by "higher" I meant more extreme.

Anchoring the rating scale as well as the items didn't produce estimates that are the same as the ones computed by Winmira. On the contrary, compared to the estimates I received when only anchoring the items, these are actually a bit more extreme (0.01 to 0.04 above the previous estimates). How much of a difference between the MLE and WLE estimates would you call substantial?

I have another quick question: If I want person estimates based only on the items in one scale (by using ISELECT or IDELETE), why does Winsteps still take all items into account?

MikeLinacre: Winmira has two options, Nike, MLE and WLE for the person estimates. Be sure that you are selecting MLE.

Substantial differences are bigger than the S.E. of the person measure estimate.

ISELECT= and IDELETE= in your Winsteps control file should definitely eliminate those items from the estimation. Please produce Winsteps Table 14. Those items should be flagged as "DELETED".

nike2357: Dear Mike,

yes, using ISELECT those items are flagged in Table 14. However, person estimates are still based on the complete item set, not only on the items I selected. E.g. Table 18 still has all items (84) in the "total count" column instead of the 20 I selected. Is there any way to get person estimates based only on the items in one specific scale instead of all items?

One more question: I'm comparing item scales constructed according to CTT rules vs. IRT rules. In the CTT framework I compute z-scores for the persons' scores. Does it make sense to do this with person estimates as well or is there an IRT-equivalent?
Also, for the CTT norming with z-scores I used the mean and standard deviation of a different (larger) sample. Is something similar possible with the person estimates?

Thanks so much for your help!

MikeLinacre: Thank you for your questions, Nike.

Please be sure that ISELECT= is in your Winsteps control file, or entered at the "Extra specifications" prompt. It must be actioned before estimation. Look at the summary at the bottom of your analysis screen. It shows how many items are active for estimation. Does it say: " ITEM 84 INPUT 20 MEASURED"?

Once the person measures are estimated, you can transform them in any way you wish. In fact, the transformations will be more accurate than those based on raw scores, because the measures are additive (equal-interval) but the raw scores are ordinal.
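
A norm-referenced transformation of the person measures can be sketched in Python. This is a minimal sketch, assuming that the reference sample and the persons being normed were measured in the same frame of reference (for example, with the same anchored item difficulties). The function name and the values are hypothetical.

```python
import statistics

def z_scores(measures, reference_measures):
    """Standardize person measures against a reference sample."""
    mean = statistics.mean(reference_measures)
    sd = statistics.stdev(reference_measures)   # sample standard deviation
    return [(m - mean) / sd for m in measures]

reference = [0.0, 1.0, 2.0]          # measures of the norming sample (logits)
normed = z_scores([2.0, 0.5], reference)
```

Because the measures are on an interval scale, the same reference mean and standard deviation can be reused for any later sample that is measured in that frame of reference.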

nike2357: Dear Mike,

thanks for the clarification. I didn't realize ISELECT had to be used before estimation.
So if I let the estimation run with all items and later select items using the "Specification" menu this only affects which items are shown in the tables but it doesn't change any of the estimations?

MikeLinacre: Correct, Nike. ISELECT= before estimation selects the data to use in estimation. After estimation, it selects the items to be reported.


mayoni2007 May 29th, 2010, 11:00am: Hi.
I was introduced to the Rasch Model in a seminar and got really interested. I have registered for masters classes in Education Measurement and Evaluation. For my research, I must use the Rasch Model in analysis of constructed response items. Where do I start? Is there any progressive help out there?

MikeLinacre: The book "Applying the Rasch Model" www.amazon.com/Applying-Rasch-Model-Fundamental-Measurement/dp/0805854622/ref=tmm_pap_title_0 by Bond & Fox is a good place to start, Mayoni. For online material, see www.rasch.org/memos.htm

connert: Mayoni,

Mike may have been too polite to mention this but you can take an on-line course at www.statistics.com, which he teaches. There are several courses, depending on your skill and knowledge level. You can get the excellent advice he offers here and more in a systematic format. Check it out.


MikeLinacre: Thank you, Tom. The Courses are summarized at www.winsteps.com/courses.htm

620. Group-anchoring or not?

wantednow July 2nd, 2010, 3:13am: Dear Mike,

I want to compare how raters rate subjects of different 'typicality', i.e., subjects whose performance is more or less in line with the rating scale, based on their scores. How can I do this? Should I introduce, say, "typicality" as a dummy facet? I've tried group-anchoring "typicality", but this invalidates the specification that examinees are the non-centered facet. Many thanks!


MikeLinacre: Liann, it sounds like "typicality" is a dummy facet in a Facets software analysis.

Here is what you could do.
A. Put a code in as first character of each subject label:
1 = Typicality type 1, 2= Typicality type 2 etc.

B. Let's say you have 3 facets, so typicality becomes facet 4
4, Typicality, D ; a dummy facet
1 = Typicality type 1
2 = Typicality type 2

C. Add this to the model statement, let's say that the raters are facet 3.
Models = ?,?,?B,?B, R ; interaction between raters and typicality

D. No need to change the data. Let's say the subjects are facet 1
Dvalues= 4,1,1,1 ; the element number for facet 4 is in the element label of facet 1, column1 with a width of 1

E. Do the analysis!

wantednow: Mike, thank you for your prompt reply. I'll give it another try.

wantednow: Dear Mike,

I've tried your method, and here is what I got:

Bias/Interaction: 1. Raters, 3. Typicality

There are empirically 20 Bias terms
| Iteration Max. Score Residual Max. Logit Change |
| Elements % Categories Elements Steps |
| BIAS 1 -3.3750 -10.5 .3899 |
| BIAS 2 -.0524 -.2 -.0065 |

I don't quite understand it. And what should I do in order to find out about the interaction between individual rater and this dummy facet? Should I run FACETS a second time with the following model:
?,?,?,Performance ; raters, examinees, rater x typicality, produce ratings on "Performance". And how should the element of 'rater x typicality' be written?

Thanks a lot!

MikeLinacre: Liann, that Table tells you that the bias/interaction terms have been estimated.
Please look at Facets Tables 13 and 14, and the Excel plots, to see the terms.
There is no need to change your Facets analysis.
Facets "Output Tables" menu, "Tables 13-14", Choose the facets, click on the Excel boxes, https://www.winsteps.com/facetman/index.htm?t13menu.htm

621. PCM and error message

ong July 4th, 2010, 9:10am: Dear Mike,

I used winsteps version to analyse a dataset that consists of dichotomies and polytomies.

To code that, ISGROUP =0, for PCM and ISGROUP =D for dichotomous.


What does this warning message mean? Should I be concerned over it?

Thank you.


MikeLinacre: Yes, you should be concerned about this message, Ong.
The good news is that it usually has nothing to do with ISGROUPS=
The message indicates that the missing data in your dataset has made the dataset look like:

If you don't see this in your data, please email me directly.

Later: Ong emailed me the Winsteps control and data file. I was wrong. There were no missing data. It was the ISGROUPS= :(
Be sure to confirm that all items in the same ISGROUPS= group have the same range of categories.
We live and learn! :)

622. Reversing measure question

DAL July 2nd, 2010, 5:03am: Hi there!
Some minor weirdness, maybe a bug that's been fixed since I'm using winsteps 3.66, but after a quick search I couldn't find other posts with the key words.

For a study I'm doing on presentation proposals I did a facets analysis, created a control file for winsteps and ran that without a problem. One of the raters did not fit, so I took him out of the calculation by using pdelete = 6 and ran winsteps again.

The second time I ran winsteps when I checked out the output tables presentation measures had reversed!

I don't think I'm doing anything wrong, but is there any reason for the measures to switch order?


MikeLinacre: DAL, this is a strange situation.
In Winsteps,
the measures for the rows are always: higher score -> higher measure
the measures for the columns are always: higher score -> lower measure
In Facets,
positive facets: higher score -> higher measure
negative facets: higher score -> lower measure

Please do email me directly if this does not explain the situation ...

DAL: Ach. sorry, I didn't get the story right.

The original file from Facets has the measures in the right order, with the highest scoring presentation first. I then got Facets to produce a Winsteps file, but it's the Winsteps file that has the measures reversed, with the highest scoring presentation now having a negative measure.

And your reply is the reason why: when creating the control file for winsteps in Facets I selected 'raters' as the row and this put the measures in reverse order.

'Mystery' solved. Sorry for posting a silly question!

MikeLinacre: Don't worry, DAL. I have been in the audience when the presenter has got his own measures reversed, telling us that the easiest item was the hardest. The result was confusion all round the crowd, followed by arguments between those who supported and opposed the presenter's interpretation of the analysis ::)

623. Workflow: 20 scales, 1 Winsteps Cntrl File?

RaschNovice July 1st, 2010, 3:50am: I have 700 subjects and 2500 items in an SPSS file. I would like to develop maybe 30 personality scales from the data.

Winsteps works great when all the data refer to a single scale. But in this case, I'd like to easily manage multiple scales from a single Winsteps control file. Otherwise, I have to create a new Winsteps file every time I want to work with a new scale.

Can I just have multiple iSelect lines, and just change the one I comment out, like so:

; iSelect 1 2311 56 713 94 : Extraversion
; iSelect 534 234 1175 345 943 : Impulsivity

and so on? Looking at the help file, I don't think this will work.

How can I manage the workflow for multiple scales easily using a single Winsteps control file?


MikeLinacre: Yes, you can do this with Winsteps, Roger.
In column 1 of each item label (between &END and END LABELS) put a scale code: A-Z 1-4 = 30 codes.
You can then use ISELECT= to obtain the scales you want to analyze
ISELECT=B ; analyze scale B
then in another analysis
ISELECT={S4} ; analyzes scales S and 4 together.

RaschNovice: Ah Okay...this is helpful.

May I suggest a new feature to the already most awesome Winsteps, Mike?

Perhaps a new keyword like iScale, after which I could just list the items and their scoring direction. Something like

iScale = 17F 102R 45R 19F 21F 89F 100R

where F = forward, and R = reversed.

Perhaps that could also include item weight, since you have the option to weight items in Winsteps already. Something like:

iScale = 17F1 102R2 45F2 19F2 21F2 89F2 100R

Working with such a large dataset, it's hard to track which items are assigned to what scale, and their scoring directions, and their multiple scale assignments. This would make it easier.

Having 30 sets of like 15 to 20 items scattered randomly through a set of 2500 items, some of them reverse scored, some of them assigned to multiple scales...it's a bit of a nightmare.

These are the kinds of things computers are good at. :)


MikeLinacre: Roger: A current possibility with Winsteps is to use an SPFILE= for each scale-configuration containing its information:

In your Winsteps control file:
SPFILE=SCALEA.txt
where SCALEA.txt is a separate text file which contains:


IDELETE=+17,+102,+45,+19,+21,+89,+100 ; active items

IREFER=* ; the scoring group for the items
17 F
102 R
45 R
19 F
21 F
89 F
100 R
CODES = 12345 ; the scoring
IVALUEF = 12345
IVALUER = 54321

IWEIGHT=* ; the item weighting
17 1
102 2
45 2
19 2
21 2
89 2
100 1
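A specification file like the one above could be generated rather than typed by hand. The following Python sketch is only an illustration: the token format ("17F1", "102R2") is the hypothetical iScale notation from Roger's post, not a Winsteps keyword, and the function name scale_spec is invented here. The closing "*" lines are an assumption about how Winsteps terminates lists opened with "=*", so please check the Winsteps Help before relying on the output.

```python
import re

def scale_spec(items, codes="12345"):
    """Build Winsteps-style specification lines from tokens like '17F1' or '102R2'.

    Each token is an item number, a direction (F or R), and an optional
    weight (default 1). The token format is hypothetical.
    """
    parsed = []
    for token in items.split():
        match = re.fullmatch(r"(\d+)([FR])(\d*)", token)
        if match is None:
            raise ValueError(f"bad token: {token!r}")
        number, direction, weight = match.groups()
        parsed.append((int(number), direction, int(weight or 1)))

    lines = ["IDELETE=" + ",".join(f"+{n}" for n, _, _ in parsed) + " ; active items"]
    lines.append("IREFER=* ; the scoring group for the items")
    lines += [f"{n} {d}" for n, d, _ in parsed]
    lines.append("*")  # assumed list terminator
    lines.append(f"CODES = {codes} ; the scoring")
    lines.append(f"IVALUEF = {codes}")
    lines.append(f"IVALUER = {codes[::-1]}")  # reversed scoring
    lines.append("IWEIGHT=* ; the item weighting")
    lines += [f"{n} {w}" for n, _, w in parsed]
    lines.append("*")  # assumed list terminator
    return "\n".join(lines)

spec = scale_spec("17F1 102R2 45R2 19F2 21F2 89F2 100R")
print(spec)
```

The text could then be saved as SCALEA.txt, with one such file per scale.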

RaschNovice: Wow...Thank you for the great example, Mike. You are one helpful dude. :)

624. How to adjust for guessing

RS June 21st, 2010, 12:46am: Hi Mike

I am planning to use a Mathematics test with 50 multiple-choice items for a large-scale testing program. I have trialed 76 items and used the Rasch model to calibrate them. The results of this analysis indicate that 13 items did not fit the Rasch model. As a result, these items were removed. Thus, the job is to choose 50 items out of 63 to form the final test. The problem is that there are only three hard items (with a location greater than 2 logits) while I need at least eight items. Guessing is one possible reason for this issue, as students had a 25% chance of getting even the hardest item correct.
In order to resolve this issue, I sorted all students by their raw scores, removed the 5% of students with the lowest raw scores, and recalibrated the items. The results of this analysis show a 10% increase in the item standard deviation. That is, easy items became easier but the position of hard items did not change significantly. I was wondering if there is another way to adjust for the guessing factor. Many thanks. RS

MikeLinacre: RS, a useful approach is to trim out the observations that could be subject to guessing: https://www.rasch.org/rmt/rmt62a.htm

This is implemented with CUTLO= in Winsteps https://www.winsteps.com/winman/index.htm?cutlo.htm

Try different CUTLO= values in the range 1.0 to 2.0 logits to balance between losing wanted information and losing unwanted noise.

RS: Thanks a lot Mike. I found this approach very interesting.
The only concern is: Does the process of trimming out the aberrant responses compromise the probabilistic nature of the Rasch model? Regards, RS.

MikeLinacre: Insightful question, RS.
The trimming accords with the Rasch model, provided that the trimming is independent of the observed values.
So trim all the observations for low-probability situations, not merely the guesses.
CUTLO= does this.
The process is equivalent to making the test "adaptive" instead of "complete".
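The trimming idea can be pictured with a small Python sketch. This is not how Winsteps implements CUTLO= internally; it only illustrates the principle that an observation is dropped because of where the person sits relative to the item, never because of the score observed. The measures and responses are hypothetical, the cut value is written as a negative number of logits, and the strict "<" comparison is an assumption.

```python
def trim_low(responses, abilities, difficulties, cutlo):
    """Blank out every observation where ability - difficulty < cutlo.

    responses[p][i] is person p's score on item i; None marks missing.
    Trimming depends only on the measures, never on the observed score.
    """
    trimmed = []
    for p, row in enumerate(responses):
        new_row = []
        for i, x in enumerate(row):
            too_hard = abilities[p] - difficulties[i] < cutlo
            new_row.append(None if too_hard else x)
        trimmed.append(new_row)
    return trimmed

abilities = [-1.5, 0.0, 2.0]       # hypothetical person measures (logits)
difficulties = [-1.0, 0.5, 2.5]    # hypothetical item difficulties (logits)
responses = [[1, 0, 1],            # low performer succeeds on the hardest item
             [1, 1, 0],
             [1, 1, 1]]
trimmed = trim_low(responses, abilities, difficulties, cutlo=-1.5)
print(trimmed)
```

In this example the low performer's success on the hardest item is removed, but so is the same person's failure on the middle item, which is what keeps the trimming independent of the observed values.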

RS: Great. Thanks a lot Mike.

625. model comparison between facets models

suping June 16th, 2010, 4:32am: Dear Prof. Linacre
I am using Facets to compare different models.
I am confused about where I can get the final deviance and the number of estimated parameters.
I can only see the data log-likelihood chi-square in Table 5 of the output,
but I cannot find the number of estimated parameters.
Where can I get this information? Or does the output file not report it directly, so that I have to calculate it myself?
Thanks a lot

MikeLinacre: Thank you for your email, Suping.

The number of free parameters in a Rasch analysis depends on your experimental design, but an upper limit is the count of all the elements (persons+items+raters+...).

626. Comparing standard deviations

humphrey June 3rd, 2010, 9:08am: Dear Mike,

I have several tests which have different numbers of questions (from 20 to 100), so the maximum possible score on each test form is different. Therefore, I cannot directly compare the standard deviations of the raw scores of the samples who have taken each test form, because the score ranges differ.

What can I do to bring raw scores' standard deviations onto the same scale and make them comparable? I want to know which test form has spread the subjects more widely.

Can I use percent correct instead of sums and then compare the standard deviations of percent corrects?

If I Rasch-analyse my data separately and get the standard deviations of the samples' logit measures, will they be comparable across tests which don’t share a common raw score scale? Don’t I need some sort of equating?


MikeLinacre: Thank you for your email, Humphrey.

You want to compare the "standard deviations of the samples raw scores" for tests with different numbers of items.

An immediate solution is to divide the S.D.s by their maximum possible scores. This will give a standardized S.D.

Since the tests have different items, there is an equating problem which applies equally to Rasch measures and raw scores. For instance, a test of 20 high-discrimination items and a test of 100 low-discrimination items require equating for either the raw-scores or the Rasch measures to be comparable.

humphrey: Thanks a lot Mike,
How about if I compute percent correct for each test-taker as their score (instead of the sum of raw scores) and then compute the SD of the percent corrects?

MikeLinacre: Yes, that would be another way of standardizing the raw scores, Humphrey.
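Both standardizations can be sketched in a few lines of Python. The scores are hypothetical, and the population S.D. (pstdev) is used only for simplicity. The sketch shows that the two approaches differ by a constant factor of 100, so they rank the test forms identically.

```python
from statistics import pstdev

def standardized_sd(raw_scores, max_score):
    """S.D. of raw scores divided by the maximum possible score."""
    return pstdev(raw_scores) / max_score

def percent_correct_sd(raw_scores, max_score):
    """S.D. of the percent-correct scores."""
    return pstdev([100.0 * s / max_score for s in raw_scores])

form_a = [8, 10, 12, 15, 18]      # hypothetical scores on a 20-item form
form_b = [40, 50, 60, 75, 90]     # hypothetical scores on a 100-item form

print(standardized_sd(form_a, 20), standardized_sd(form_b, 100))
print(percent_correct_sd(form_a, 20), percent_correct_sd(form_b, 100))
```

As Mike notes above, this only puts the spreads on a common numerical footing. It does not remove the equating problem that arises from the forms having different items.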

627. INFIT

howard June 3rd, 2010, 6:54pm: Given that conceptually, "infit is an information-weighted statistic , sensitive to the misfit of responses from persons whose estimated level on the latent variable is near the location of the item on the latent continuum"

Does it make any sense to use infit when an item's difficulty is much higher or lower than the abilities of the persons being tested?

MikeLinacre: Howard, infit for the item will report on the few(?) persons close to the item difficulty - usually within about 1.5 logits of the item difficulty.
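To see why infit is dominated by the persons near the item, here is a Python sketch using the usual textbook mean-square formulas for a dichotomous item: outfit averages the squared standardized residuals, while infit divides the summed squared residuals by the summed model variances. The abilities and responses are hypothetical, and this is a sketch of the formulas, not Winsteps code.

```python
import math

def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    sq_resid = []
    variances = []
    for x, b in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(b - difficulty)))
        variances.append(p * (1.0 - p))      # information in this response
        sq_resid.append((x - p) ** 2)
    infit = sum(sq_resid) / sum(variances)   # information-weighted
    outfit = sum(r / w for r, w in zip(sq_resid, variances)) / len(sq_resid)
    return infit, outfit

# Hypothetical data: the person at -3 logits unexpectedly succeeds.
abilities = [-3.0, -0.5, 0.0, 0.5, 3.0]
responses = [1, 0, 1, 1, 1]
infit, outfit = item_fit(responses, abilities, difficulty=0.0)
print(round(infit, 2), round(outfit, 2))
```

The surprising success far from the item inflates outfit much more than infit, because that response carries little information and so receives little weight in the infit statistic.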

628. double key items in Winsteps

dmt1 June 3rd, 2010, 6:44pm: Is it possible, when entering a scoring key into Winsteps, to designate items that have more than one key?

MikeLinacre: Dmt1, do you mean that some items have two correct response codes, or are partial credit? If so, you can use
KEY1=, KEY2=,... https://www.winsteps.com/winman/index.htm?key.htm
or IVALUE=... https://www.winsteps.com/winman/index.htm?ivalue.htm

629. Confidence interval for cut-off

harmony April 24th, 2010, 6:48am: Hi all:

I want to set a confidence interval with 95% certainty for the cut-off point on a crt and want to confirm that I am doing this correctly.

I look at table 20 in winsteps and find the S.E. for the cut-off score. I then calculate 1.96(S.E.). I then apply this figure, not to the raw score, but to the measure.

Is this correct?

MikeLinacre: Yes, that is correct, Harmony. For the measure you use the measure S.E.

If your criterion-referenced test includes all the items, then you can look up the equivalent raw-score range in Winsteps Table 20. Or the raw score S.E. at the cut-point is approximately 1/(measure S.E. at cut-off point) for a dichotomous test - https://www.rasch.org/rmt/rmt204f.htm

harmony: Thanks again Mike! Your replies are much appreciated.

Just to confirm that I've got this, a S.E. of .30 = a raw score S.E. of 1/.3 = 3.33.

A 95% confidence interval then is 1.96(3.33)= 6.52 or approximately +/- 7.

This isn't too good for an exit test from a language program! If this is correct, any advice on the best way to apply this information to the cut-off?

Thanks again for your help.

MikeLinacre: Harmony, you have a measure S.E. of 0.30 logits.

Let's assume this is a dichotomous test with L items, then
S.E. = 1 / sqr ( 0.2 * L)
L = 1 / (S.E.**2 * 0.2) = 56 items.
Confirming ...
S.D. of a binomial distribution of 56 items with 0.7 probability = sqr (0.7 * 0.3 * 56) = 3.4
So score S.E. of 56 items = 3.4 = your finding of 3.3 = 100 * 3.3/56 % = 6%

Let's suppose we doubled the number of items to 112 = 2 * 56, then the measure S.E. = .30/1.4 = .21 (smaller), and the score S.E. = 4.8 (bigger) = 100*4.8/112 = 4.3% (smaller)
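The arithmetic above can be checked with a short Python sketch. It uses only the numbers in this reply (measure S.E. of 0.30, the 0.2 factor, success probability 0.7). The variable names are invented for the illustration.

```python
import math

measure_se = 0.30                                  # logits, from the thread
n_items = 1.0 / (measure_se ** 2 * 0.2)            # from S.E. = 1 / sqrt(0.2 * L)
items = round(n_items)
score_sd = math.sqrt(0.7 * 0.3 * items)            # binomial S.D. at p = 0.7

doubled_items = 2 * items
doubled_measure_se = measure_se / math.sqrt(2)     # S.E. shrinks by sqrt(2)
doubled_score_sd = math.sqrt(0.7 * 0.3 * doubled_items)
doubled_percent = 100 * doubled_score_sd / doubled_items

print(items, round(score_sd, 1))
print(doubled_items, round(doubled_measure_se, 2),
      round(doubled_score_sd, 1), round(doubled_percent, 1))
```

Doubling the test length makes the raw-score S.E. larger in score points but smaller as a percentage of the maximum score, while the logit S.E. shrinks.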

Most Examination Boards cannot face this situation. They imagine that their pass-fail decisions are precise to the nearest score-point, when really those decisions are fuzzy.

If we do take imprecision into account, then Ben Wright pointed out there are two perspectives:
1. Only pass the definitely qualified: Pass-fail point = cut-point + ?*S.E.
2. Only fail the definitely unqualified: Pass-fail point = cut-point - ?*S.E.

harmony: Thanks again Mike. I truly appreciate your thorough reply. I was actually applying the formulas to another test with much lower standard errors and was perplexed when I discovered that the raw score confidence interval increased. I thought I was doing something wrong. But as you have shown, the actual percentage of variance decreases and the increase of raw score confidence intervals is a factor of the higher number of questions. Your help is much appreciated.

harmony: Hi Mike:

I have another question related to this issue. Why is the S.E in Rasch so much higher than the S.E in CTT?

Looking at another test, I have a cut point standard error of .12 which -using your formula- translates into a raw score S.E of 8.33 (1/.12=8.33).

The CTT calculated S.E. for the same test, however, is 3.9. I understand that this score is calculated for the test as a whole with the assumption that scores vary equally at any point in the test (which we know is not true), yet wouldn't the average error be more, rather than less?

MikeLinacre: Harmony, true, most analysts use an average raw-score S.E. obtained from the raw score reliability, but we can compute the raw-score S.E. for any score in the test. Here is a useful approximation:

Raw score S.E. ≈ (1 / Rasch logit S.E.)

This is pictured at https://www.rasch.org/rmt/rmt204f.htm. So what does happen?
1. More items on the test = bigger raw score S.E. = smaller Rasch S.E.
2. More extreme (less central) score on the test = smaller raw score S.E. = larger Rasch S.E.
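Both effects can be reproduced with a Python sketch based on the model variance of dichotomous responses: the raw-score S.E. is the square root of the summed binomial variances, and the logit S.E. is its reciprocal. The item difficulties and abilities are hypothetical, and these are model-based values, so empirical results will only approximate them.

```python
import math

def standard_errors(ability, difficulties):
    """Return (raw-score S.E., logit S.E.) for a dichotomous test."""
    info = 0.0
    for d in difficulties:
        p = 1.0 / (1.0 + math.exp(-(ability - d)))
        info += p * (1.0 - p)          # binomial variance of one response
    return math.sqrt(info), 1.0 / math.sqrt(info)

short_test = [0.0] * 20                # hypothetical: 20 items at 0 logits
long_test = [0.0] * 80                 # hypothetical: 80 items at 0 logits

raw_short, logit_short = standard_errors(0.0, short_test)
raw_long, logit_long = standard_errors(0.0, long_test)
raw_extreme, logit_extreme = standard_errors(3.0, short_test)

print(round(raw_short, 2), round(logit_short, 2))
print(round(raw_long, 2), round(logit_long, 2))
print(round(raw_extreme, 2), round(logit_extreme, 2))
```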

630. P-value

Raschmad May 26th, 2010, 11:26am: Dear Mike,
How are classical p-values (item difficulty) computed in Winsteps for polytomous items?

MikeLinacre: Anthony, p-values for dichotomies are the average score on the item by the sample of persons, so Winsteps does the same computation for polytomous items: average score on the item by the sample of persons. Does this make sense to you?

Raschmad: Hi Mike,
You mean adding up the item raw scores and dividing them by the number of persons?
Is it then possible to compare the difficulty of items which have different number of response categories?

MikeLinacre: Anthony:
"You mean adding up the item raw scores and dividing them by the number of persons?" - Yes.
"Is it then possible to compare the difficulty of items which have different number of response categories?" - Yes. But you must define what you mean by "item difficulty" in a way that allows for the different numbers of categories. The standard Rasch definition is "the point on the latent variable at which the top and bottom category are equally probable". Another definition is "the point on the latent variable at which the expected score on the item is half-way between the top category and the bottom category". Then there are criterion definitions based on the content of the categories, such as the transition from "failure to success" on the rating scale.

Raschmad: Dear Mike,
I don't mean Rasch difficulty. Just within classical model. The question is "how do we compute p-values for items which are NOT scored 0 and 1 in such a way that we can compare difficulties of items which don't have equal number of categories?"
The procedure of adding up item raw scores and dividing by the number of test-takers doesn't give comparable results across items which have different maximum possible scores. Consider an item with 5 categories and an item with 10 categories, both in the same instrument. A sample of 100 persons has taken the test. The first item can have a maximum raw score of 500 and the second 1000. Dividing by 100, the maximum p-values will be 5 and 10. However, a p-value of 5 for the first item means easier than a p-value of 8 for the second, because 5 for the first item means all correct but 8 for the second doesn't. A transformation is needed to make them comparable.

MikeLinacre: Anthony, the "define item difficulty" problem is the same for Rasch and for CTT. But an immediate solution is to divide the average response to the item by the number of categories, so that all p-values have the range 0-1.
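A minimal Python sketch of such a rescaling follows. Note that it is a variant of Mike's suggestion, chosen here as an assumption: it also subtracts the lowest category before dividing, so that the result runs exactly from 0 to 1 when categories are numbered from 1, as in Anthony's example. The response data are hypothetical.

```python
def rescaled_p_value(scores, lowest, highest):
    """Average item score rescaled to the range 0-1."""
    mean_score = sum(scores) / len(scores)
    return (mean_score - lowest) / (highest - lowest)

# Hypothetical responses from five persons.
item_5_cats = [5, 5, 5, 5, 5]        # categories 1-5, everyone in the top category
item_10_cats = [8, 8, 8, 8, 8]       # categories 1-10, everyone in category 8

print(rescaled_p_value(item_5_cats, lowest=1, highest=5))
print(round(rescaled_p_value(item_10_cats, lowest=1, highest=10), 3))
```

On this common range, the first item (everyone in the top category) comes out as easier than the second, which matches Anthony's intuition.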

Raschmad: Thanks Mike,
Great help!

631. Rasch Expert

tutorjk May 26th, 2010, 7:34pm: Hi All,

Is there a Rasch Model expert out there who is willing to talk to me about my project? I am based in Dublin, Ireland, but I will travel to anywhere in Ireland or the UK. I have studied 'Applying the Rasch Model' but I plan to use it in a particular context in my doctoral thesis. I would love the opportunity to either confirm or correct my expectations with the advice of an expert, but I don't think that I could achieve that by email.

Best regards, John.

MikeLinacre: John, please contact relevant experts on www.winsteps.com/consult.htm and/or post your request on the Rasch listserv accessible from www.rasch.org/rmt/

tutorjk: Thank you Mike,

I will try that right away. best regards, John

632. Can culture be a facet?

Seanswf May 11th, 2010, 1:48am: Dear Mike,
I am analysing training course evaluation data for a global organization. We see consistent trends across courses in the severity/leniency of student ratings by country. This makes it impossible to compare evaluation scores on any given course across countries. Do you think it is possible to adjust for this by having country be a facet in a multi-facet Rasch analysis?

MikeLinacre: Yes, Sean. Yes, you can model "culture" as a facet. The students or raters are grouped within culture, so you would need to constrain the analysis in some way. For instance, group-anchor the raters by culture, which assumes that, after adjusting for culture, the average rater severity is the same across cultures.

Seanswf: Thanks Mike that is encouraging! I am new to Facets would you know of an example which is available to show me how to group-anchor the students by culture?

MikeLinacre: Seanswf, here is a simple approach to set up group-anchoring:
1. Run the Facets analysis including "country" as an unanchored facet.
2. You should see "subsets" reported.
3. "output files" menu, "subset"
4. In the "subset" file, copy the labels= for "raters" elements
5. Paste the "raters" into your original specification file, replacing its "raters" elements.
6. Redo the analysis. It is now group-anchored.

https://www.winsteps.com/facetman/index.htm?subsetdata.htm gives more details

Seanswf: Many thanks!

633. item and person reliability

tugba May 11th, 2010, 1:34pm: hi,
May I ask about an issue I have (item reliability=.00, person reliability=1.0) with a data set of 666 students and 9 scales (each scale has about 5 items)?
Did you ever figure out what was causing this problem?

MikeLinacre: Tugba,
person reliability=1.0 with a data set of 666 students - this indicates that the students have a wide ability range.
item reliability=.00 for 9 scales (each scale has about 5 items) - this indicates that the item difficulty range is narrow.
If you want me to verify the computations, Tugba, please email me the table of person abilities (and their standard errors), and also the table of item difficulties (and their standard errors).
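For anyone wishing to check such figures by hand, the usual separation reliability is "true" variance divided by observed variance, where the true variance is the observed variance of the measures minus the average squared standard error. The Python sketch below uses hypothetical measures and standard errors, and the use of the population variance is an assumption. It shows how a wide spread gives a reliability near 1.0 and a narrow spread gives .00.

```python
from statistics import pvariance

def separation_reliability(measures, standard_errors):
    """Reliability = (observed variance - error variance) / observed variance."""
    observed_var = pvariance(measures)
    error_var = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    true_var = max(observed_var - error_var, 0.0)   # negative values are set to 0
    return true_var / observed_var if observed_var > 0 else 0.0

# Hypothetical wide-ranging measures with small standard errors.
wide = separation_reliability([-4.0, -2.0, 0.0, 2.0, 4.0], [0.3] * 5)
# Hypothetical tightly clustered measures with larger standard errors.
narrow = separation_reliability([-0.1, 0.0, 0.1], [0.2] * 3)

print(round(wide, 2), round(narrow, 2))
```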

634. Halkitis (1993) Computer-adaptive testing al

Tetsuo May 8th, 2010, 3:26pm: Mike,
I'm preparing to make a CAT module on moodle. While I was reviewing MESA Memorandum No. 69 and Rasch Measurement Transactions 6:4, I found a confusion in the figure to explain CAT algorithm of Halkitis (1993). At the left bottom of the fig, there are two arrows after the branch point "More than 5 items administered?"

Shouldn't the left arrow be labeled "Yes" not "No" because it goes to "Target difficulty = competence measure"? And the right arrow should be labeled "No" because it goes to "Target difficulty = competence -0.5"?

In the text, it says "Beginning with the sixth item, the difficulty of items is targeted roughly directly at the person competency, rather than 0.5 logits below. "

Anyway, I like the idea of giving a student an extra opportunity for success at the beginning of a CAT, which presumably lowers their test anxiety, although it sacrifices CAT efficiency. Are there any other examples using the same idea in a CAT algorithm? I don't remember who it was, but somebody proposed ending a CAT with a few rather easy items so that students could end the CAT with satisfaction.

MikeLinacre: Thank you, Tetsuo. I have corrected the Halkitis flowchart. https://www.rasch.org/rmt/rmt64cat.htm

The idea of ending CAT with easy items is new to me, but 50% targeting is severe. 70% targeting prevents examinees "giving up" or response sets.
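The targeting percentages translate into logit offsets through the log-odds. Here is a brief Python sketch, with function names invented for the illustration, that computes how far below the person's ability the item difficulty should sit for a chosen success rate under the dichotomous Rasch model.

```python
import math

def target_offset(success_probability):
    """Logits by which item difficulty sits below ability for a given success rate."""
    return math.log(success_probability / (1.0 - success_probability))

def target_difficulty(ability, success_probability):
    """Item difficulty that gives the person the chosen probability of success."""
    return ability - target_offset(success_probability)

print(round(target_offset(0.5), 2))
print(round(target_offset(0.7), 2))
print(round(target_difficulty(1.0, 0.7), 2))
```

Under the same model, 70% targeting corresponds to items about 0.85 logits below the person, and the 0.5 logit offset in the Halkitis algorithm corresponds to roughly a 62% chance of success.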

CAT efficiency used to be seen as crucial, but now we see that test security, item exposure, examinee behavior and content coverage are more important. Test length is not so important.

635. Testing CAT in Winsteps?

rblack April 25th, 2010, 12:32pm: Hi Mike and others,

I'm full of questions these days! Anyway, suppose I constructed a Rasch scale using ~90 dichotomous items using data from ~500 people. Approximately 300 of these people responded to all items. Scores on this Rasch scale correlate highly with a behavioral outcome.

Now I want to test if a CAT version of this Rasch scale would correlate as highly with the behavioral outcome. Is there any way to test this using Winsteps using the data from the subset of people who responded to all items (N~300). I realize this is not an ideal situation, but I am limited to this data.

Anyway, is there any way to do this in Winsteps? If yes, are there any examples I could follow? Ultimately, I'd like to have a data set as follows:

Person_ID Rasch_Score CAT_Score Outcome

I could then see if the correlation between the CAT score and outcome are comparable to the correlation between the Rasch score and outcome.

Thanks again for all your help!


MikeLinacre: Ryan, you could approximate a CAT dataset by using CUTLO= and CUTHI= to trim your current data set to the targeted items.
Usually a CAT analysis uses anchored (fixed) item values so the process would be:
1. Analyze the full data set. Output the person statistics (PFILE=) to Excel from the full analysis, output the item statistics to a text file: IFILE = if.txt
2. Analyze the full dataset trimmed with CUTLO=, CUTHI= using anchored item measures IAFILE=if.txt. Output the person statistics (PFILE=) to Excel from the trimmed analysis.
3. Combine the two Excel spreadsheets to investigate the relationship between the two sets of person statistics.
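Step 3 could also be done outside Excel. The Python sketch below assumes the two sets of person measures and the outcome have already been read into dictionaries keyed by person ID. The IDs and values are hypothetical, and reading the actual PFILE= output is not shown.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical person measures (logits) and outcome values keyed by person ID.
full = {"P01": -1.1, "P02": 0.2, "P03": 0.9, "P04": 1.8}      # full analysis
trimmed = {"P01": -1.3, "P02": 0.4, "P03": 0.7, "P04": 2.1}   # CUTLO=/CUTHI= analysis
outcome = {"P01": 2.0, "P02": 3.5, "P03": 4.0, "P04": 6.5}

# Keep only persons present in all three sources.
ids = sorted(set(full) & set(trimmed) & set(outcome))
r_full = pearson_r([full[i] for i in ids], [outcome[i] for i in ids])
r_cat = pearson_r([trimmed[i] for i in ids], [outcome[i] for i in ids])
print(round(r_full, 3), round(r_cat, 3))
```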

rblack: Hi Mike,

Thank for responding so quickly. I'm sorry for the follow up but I'm not sure I fully understand in terms of the concrete steps.

First, I run the analysis with the full data set, and save the person file (pfile). Then I save the item file ("ifile") as "if.txt", which looks like this:


Then, I analyze the data set after modifying the Winsteps file by adding the following statements:

IAFILE=if.txt; "if.txt" is the item file generated from the first analysis
CUTLO=?; <-What should this value be??
CUTHI=?; <-What should this value be??

Then, output person statistics and compare this against person statistics from the original analysis.

Thank you,


MikeLinacre: Ryan: reasonable values are
for 50%-success targeting
CUTLO = -1.0
CUTHI = 1.0
for 75% success targeting

If the resulting tests are too short, widen the interval, and vice-versa.

When you look at the Scalogram in Winsteps Table 22, you should see blanks for the top left and bottom right responses. If you don't see this:
1. Please look at the "Output files", "Control variables" to verify that CUTLO=, CUTHI=
If that is correct, then
2. Please email me your Winsteps version number for special instructions.

Tetsuo: Mike & Ryan

When I calibrate item difficulty and make item banks for CAT, I use CUTLO for eliminating lucky guesses and CUTHI for eliminating careless mistakes before utilizing Outfit/Infit MNSQ & ZSTD for eliminating misfit items/persons.

In my case, all the questions have four choices. So I set CUTLO = -2 (only 12% success, which is far less than 25% random success), and CUTHI = 3 (95% success: 5% of truly wrong responses seem to be statistically rejectable?). Couldn't it be a kind of rule of thumb?

I used to eliminate misfitting items/persons only with Outfit ZSTD. I eliminated items/persons with Outfit ZSTD > 1.96, and lost a lot of items. After I learned in your online course that, if the d.f. are too large (greater than 300), substantively trivial misfit is statistically significant, I now look at the Outfit MNSQ first. When MNSQ is greater than 1.3 and at the same time ZSTD is greater than 1.96, I eliminate the items from the analysis. However, I explored my data more closely and realized that only a small portion of lucky guesses and careless mistakes has a rather large impact on the fit analysis. So I now apply the rule using CUTLO & CUTHI first in my analysis.
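The success percentages quoted for these cut values follow from the dichotomous Rasch model, where the probability of success depends only on the difference between ability and difficulty. A short Python sketch to check them (the function name is invented for the illustration):

```python
import math

def success_probability(ability_minus_difficulty):
    """Rasch probability of success for a given logit difference."""
    return 1.0 / (1.0 + math.exp(-ability_minus_difficulty))

for label, logit in [("CUTLO", -2.0), ("CUTHI", 3.0)]:
    percent = 100 * success_probability(logit)
    print(f"{label} = {logit:+.1f} logits -> {percent:.0f}% expected success")
```

This gives about 12% at -2 logits and about 95% at +3 logits, in line with the figures in Tetsuo's post.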

636. how to model this data through facets?

limeijuan March 24th, 2010, 2:30am: I have been analyzing data with the many-facet Rasch model, but I have encountered a problem that I don't know how to solve. My question is how to model the data when each criterion has its own rating scale (there are 4-category scales, 3-category scales, and also dichotomous responses). Thank you very much.
You can also find a data example in the attachment. Thanks again!

MikeLinacre: Limeijuan:

Facet 4 are the criteria. Criterion 1 has 4 categories. Criterion 2 has 3 categories. All the other criteria are dichotomies:


limeijuan: model=?,?,1,R3

but when I run this program, there is also report:

Data line in data file is:
Error F23 in line 9446: Invalid element number. Must be integers in range 0 - 2147483647
Execution halted

In my data file there are 9445 lines. I don't know why this happens.
Thanks again!

MikeLinacre: limeijuan, your data are in an Excel file. Please make sure that there are no data lines with invisible codes after your last data line. The easiest way is to select all the rows in the spreadsheet after your last data row and delete them. The rows will stay on your screen but you will know that they are blank.

limeijuan: Mike, Thanks very much for helping me solving these two questions.

I also want to obtain consensus and consistency indices of interrater reliability for rater pairs, including exact agreement, Cohen's weighted kappa, Pearson's r, and Kendall's tau-b, as well as cross-classifications of rating frequencies for rater pairs.
But I don't know which syntax produces these results.
Thank you very much again.

MikeLinacre: Thank you for your question, limeijuan.

Those agreement statistics are "classical" descriptive statistics. They are not computed by Facets, but can be computed by conventional statistical programs. You can probably compute them with Excel. A place to start is http://en.wikipedia.org/wiki/Inter-rater_reliability
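As an illustration of computing two of these statistics outside Facets, here is a Python sketch of exact agreement and Cohen's kappa with linear disagreement weights for one rater pair. The ratings are hypothetical, the linear weighting is one common choice among several, and the kappa function assumes the raters use more than one category between them.

```python
def exact_agreement(a, b):
    """Proportion of paired ratings that are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def weighted_kappa(a, b, categories):
    """Cohen's kappa with linear disagreement weights |i - j|."""
    n = len(a)
    index = {c: k for k, c in enumerate(categories)}
    # Observed mean disagreement between the paired ratings.
    observed = sum(abs(index[x] - index[y]) for x, y in zip(a, b)) / n
    # Expected mean disagreement if the raters were independent.
    freq_a = [sum(x == c for x in a) / n for c in categories]
    freq_b = [sum(y == c for y in b) / n for c in categories]
    expected = sum(fa * fb * abs(i - j)
                   for i, fa in enumerate(freq_a)
                   for j, fb in enumerate(freq_b))
    return 1.0 - observed / expected

# Hypothetical ratings by two raters of the same six performances.
rater_a = [1, 2, 3, 4, 2, 3]
rater_b = [1, 2, 3, 3, 2, 4]

agreement = exact_agreement(rater_a, rater_b)
kappa = weighted_kappa(rater_a, rater_b, categories=[1, 2, 3, 4])
print(round(agreement, 3), round(kappa, 3))
```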

limeijuan: Hello, Mike! Thanks for your patient answers.

There are some problems in my results.

in rater-criteria interaction:
table 5 measurable data summary
sample resd is 0.4
table11.1 bias/interaction measurement summary
sample resd is 0.4
so the variance the rater-criteria interaction can explain is 0
however,Fixed (all = 0) chi-square: 2372.7 d.f.: 285 significance (probability): .00

in rater * criteria* examinee interaction:
table11.1 bias/interaction measurement summary
sample resd is 0.2
table 5 measurable data summary
sample resd is 0.4
so the variance the rater-criteria interaction can explain is 12%
however,Fixed (all = 0) chi-square: 8895.3 d.f.: 24040 significance (probability): 1.00
these two results are Contradictory

Why? According to the other result, the rater-criteria interaction is significant, but the variance the rater-criteria interaction can explain is only 0.

MikeLinacre: Limeijuan, yes, there is something paradoxical about these findings.

1. Fixed (all = 0) chi-square: 2372.7 d.f.: 285 significance (probability): .00
This one looks reasonable. The rater-criteria interaction does not explain much variance, but enough to reject the hypothesis that there is no rater-criteria interaction

2. Fixed (all = 0) chi-square: 8895.3 d.f.: 24040 significance (probability): 1.00
This looks unusual: a very low chi-square with a very high d.f. This suggests that each rater-criteria interaction has only a few observations, so we may be observing accidental interactions. The interactions do explain the data, but not in a statistically robust way, because the rater-criteria standard errors are high.
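To see why the two reported significance values differ so sharply, it can help to compare each chi-square with its degrees of freedom (the expected value of a chi-square variable equals its d.f.). The sketch below uses the Wilson-Hilferty normal approximation to the chi-square upper tail; this is an approximation chosen for illustration, not the computation Facets itself performs, and the function name is invented.

```python
import math

def chi_square_upper_tail(chi_sq, df):
    """Approximate P(X >= chi_sq) for X ~ chi-square(df), Wilson-Hilferty approximation."""
    h = 2.0 / (9.0 * df)
    z = ((chi_sq / df) ** (1.0 / 3.0) - (1.0 - h)) / math.sqrt(h)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Values quoted in the thread.
ratio_two_way = 2372.7 / 285        # far above 1: more variation than chance predicts
ratio_three_way = 8895.3 / 24040    # far below 1: less variation than chance predicts

p_two_way = chi_square_upper_tail(2372.7, 285)
p_three_way = chi_square_upper_tail(8895.3, 24040)
```

The first ratio is well above 1 and the second well below 1, which matches the reported probabilities of .00 and 1.00.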

If these explanations are not correct, please zip your Facets specification and data files, then email them to me.

limeijuan: Hello,Mike:
Thanks very much. I have checked: the standard error of the rater-criteria interaction is very small, but the standard errors of the others are high.
So I am emailing my Facets specification and data files to you.

limeijuan: Hello, Mike:
Is there any error in my specification and data? Next Monday I will give an oral presentation on this research at the psychometric annual conference, so I am very anxious now. I hope for your answer and help.
Thank you very much.

MikeLinacre: Limeijuan, the standard errors are influenced by the number of observations. If an element has many observations, then its standard error will be small. If it has few ratings, then its standard error will be large.

Looking at your specification files:
This is correct, but not needed:
It is the same as this:
model=?B,?B,1,R3 ; any combination of B's automatically applies to all model specifications

You specified 3-way interaction. This is usually too detailed:
You probably want to model every possible 2-way interaction.
?B,?B,1,R3 ; any combination of B's automatically applies to all model specifications
?B,?,2B,R3 ; second interaction analysis
? ,?B,3B,R2 ; third interaction analysis
?, ? ,4 ,D
? ,? ,5 ,R2

Otherwise all looks good to me. Congratulations!

637. problem with person reliability

mats April 29th, 2010, 9:45am: Hi everybody

I am analyzing data from a depression treatment trial. I constructed two Winsteps files from Excel, one from the initial measurement and one from the final. Winsteps won't calculate person reliability (it gives the figure .00). Item reliability is calculated and seems OK. I can't figure out the problem. Can Mike or someone else help me?

MikeLinacre: Mats, there are three main influences on reliability:
1. Person sample spread (measure S.D.)
2. Number of items
3. Number of rating-scale categories for each item.

A reliability of zero happens when the average measurement error of the person measures (dominated by the number of items and number of categories) is greater than the observed person measure S.D. Then all of the variance in the sample could be due to measurement error.

Let's suppose that we have a sample with an observed measure S.D. of 0.5 logits, and our test has 10 3-category items. Then the average measure S.E. = 0.6, and the reliability = 0. This is the situation if only the last 10 items of the "Liking for Science" data (example0.txt) are analyzed.
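The arithmetic behind this example can be sketched in a few lines of Python. This assumes the usual definition of reliability as "true" variance divided by observed variance, with "true" variance taken as observed variance minus error variance and floored at zero; the function name is invented for illustration.

```python
def person_reliability(observed_sd, average_se):
    """Reliability = (observed variance - error variance) / observed variance, floored at 0."""
    observed_var = observed_sd ** 2
    error_var = average_se ** 2
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var

# The example from the thread: S.D. = 0.5 logits, average S.E. = 0.6 logits.
rel_small_spread = person_reliability(0.5, 0.6)

# A wider sample measured with the same precision.
rel_wide_spread = person_reliability(1.3, 0.6)
```

Here the error variance (0.36) exceeds the observed variance (0.25), so the estimated "true" variance is zero and so is the reliability. Widening the sample spread while keeping the same standard error raises the reliability.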

mats: Thanks for the reply, Mike. It makes sense: the person sample spread is very small in the pre-treatment sample. Post-treatment, the spread is greater, and the person reliability is 0.79.


638. Spec file and analysis

miet1602 April 29th, 2010, 12:34pm: Hi Mike,

Thanks for the previous response!
I have another question, this time about setting up a facets specification file for an interaction analysis.
I attach an excel file where I set up the data (invented) the way I think it should be. Could you maybe have a look and let me know if this is fine - in particular with respect to using element ranges. The explanation of the facets and some basic info about the study setup are in the excel file as well.

Thanks again,

MikeLinacre: Milja, in your Excel file you ask: "does it matter that I am using element ranges for different facets in the same data file?"
Reply: No. Each data line is decoded independently. Only one range in each data line, and the observations must match the range.

miet1602: Thanks Mike!

639. Achieving connectedness

miet1602 April 28th, 2010, 2:04pm: I was wondering if you think the following judge/candidate/item design will produce enough connectedness for valid results:

Let’s say, there are 6 judges, a test with 15 polytomous items for each candidate, and 60 candidates. The paper is marked item by item. If:

(a) each judge marks some items from each candidate (e.g. each judge marks 10 candidates on item 1 (so all six judges mark item 1 but for different 10 candidates each), similarly on item 2, etc.),
(b) no judge marks all the items for the same candidate, and
(c) no two judges mark the same items from the same candidate - will this give enough connectedness?

I am not sure whether without the (c) feature the network will be sufficiently connected...

Am I right in thinking that this design is similar to what you had in your 1994 paper in Objective Measurement Vol. 2?


MikeLinacre: Milja, yes, your design is the "minimal effort" judging plan at https://www.winsteps.com/facetman/index.htm?judgingplans.htm. It is sufficiently connected.
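A design like this can be sanity-checked before data collection. The Python sketch below builds one hypothetical rotation that satisfies conditions (a) to (c) and checks whether judges and candidates form a single linked network. It is a simplified graph check that assumes every candidate responds to every item; it is not the subset-detection procedure that Facets itself runs, and the rotation rule and function name are invented for illustration.

```python
from collections import Counter

N_JUDGES, N_ITEMS, N_CANDIDATES = 6, 15, 60

# Hypothetical rotation: on item i, candidate c is marked by judge (c + i) mod 6.
design = [((c + i) % N_JUDGES, c, i)
          for c in range(N_CANDIDATES) for i in range(N_ITEMS)]

def judges_and_candidates_linked(observations):
    """True if the judge-candidate network forms one connected component (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for judge, candidate, _item in observations:
        parent[find(("C", candidate))] = find(("J", judge))
    return len({find(node) for node in list(parent)}) == 1

# Condition (a): every judge marks 10 candidates on every item.
per_judge_item = Counter((judge, item) for judge, _c, item in design)
# Condition (b): no judge marks all 15 items of one candidate.
per_judge_candidate = Counter((judge, c) for judge, c, _i in design)

linked = judges_and_candidates_linked(design)

# Contrast: a nested plan in which each judge marks all items for "their" 10 candidates.
nested = [(c // 10, c, i) for c in range(N_CANDIDATES) for i in range(N_ITEMS)]
nested_linked = judges_and_candidates_linked(nested)
```

In the rotated plan each candidate is seen by several judges, so the network is linked. In the nested plan judge severity is confounded with the ability of each judge's own candidates, and the check fails.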

640. Dimensionality - Contrast Plot

rblack April 19th, 2010, 3:30pm: Hi Mike et al.,

I fit a Rasch model using Winsteps, and obtained the following results:

Raw Variance explained by Measures: 43%
Unexplained variance in 1st contrast: 2.5 (Eigenvalues)

I suspected all along that this construct was multidimensional, so I'm not surprised by the results.

I looked at the standardized residual contrast 1 plot, and there's a scatter of items across all four quadrants. All of the items in the upper right-hand quadrant are items that I thought belonged to the unidimensional construct of interest.

I refit the Rasch model on new data for just those items in the upper right hand quadrant and results are much more consistent for a single dimensional construct:

Raw Variance explained by Measures: 57%
Unexplained variance in 1st contrast: 1.5 (Eigenvalues)

Does what I did sound reasonable/acceptable? Any thoughts would be most appreciated.



MikeLinacre: Ryan, sounds good.

It is the up-down contrast that is important, not the right-left (which indicates item easiness/hardness). So please try including all the upper items. This should increase the "variance explained" and also the person "test" reliability.

rblack: Hi Mike,

Thank you so much for responding and recommending that I include all upper variables. In general, would you suggest I include the items that fall right on the middle line as well?

Thanks again!


MikeLinacre: Ryan, "the items that fall right on the middle line" ?

We usually want to retain as many items as are reasonably possible, because more items = greater measurement stability and precision (test reliability). So you could measure the people with and without those items, and then cross-plot the pairs of person measures. Probably almost everyone is close to the diagonal, indicating that the dimensional differences between the items are probably too small to matter.

Also please don't lose sight of what you want to measure! Statistical analysis only provides guidance, it does not make decisions. So, look at the content of those middle items. Are they what you want or not, Ryan?

rblack: Hi Mike,

Thank you again for continuing to respond. After reviewing the items that fall on the middle line, I've decided to exclude them. They fit the construct loosely and, when kept in, have poor fit statistics.

One more question, if you have the time. Have you ever come across a situation where it was more important to be able to predict future behavior (i.e. predictive validity) from a set of items than have a unidimensional construct? I ask because I'm dealing with this very issue right now--to build a unidimensional construct which ends up removing items that clearly have predictive power. To deal with this, here's what I'm thinking about doing:

(1) Fit the Rasch model with the reduced set of items, and save person scores (in logits)
(2) Fit a logistic regression model with the person scores as an independent variable, and treat the other variables that fell out of the Rasch model as separate independent variables. <--I will check to see that these variables do not represent another single construct. I don't think they do.
(3) Save predicted values from the logistic regression model.
(4) Test the overall accuracy, sensitivity and specificity of the predicted values from the logistic regression model by constructing a receiver operating characteristic (ROC) curve with the outcome of interest as the state variable.

Does this approach make any sense to you? Would you approach this from a different angle?

Thanks again & sorry for the longwinded question.


rblack: Sorry for the double post but I have a small correction. For (2) most of the remaining items do in fact fit into another single construct. So, I have two unidimensional constructs, and I would end up having two independent variables for my logistic regression: (1) person scores derived from a Rasch model with X number of items and (2) person scores derived from another Rasch model with k number of items. Does this sound reasonable? I'm wondering if I should be considering a MIRT [or even CFA] at this point. Thanks again. -Ryan

MikeLinacre: Ryan, you ask: "Have you ever come across a situation where it was more important to be able to predict future behavior (i.e. predictive validity) from a set of items than have a unidimensional construct?"

Yes, definitely! The Netflix Challenge: https://www.rasch.org/rmt/rmt233d.htm - point 7.

But the history of science indicates that the strongest predictions are based on decomposing phenomena into unidimensional variables and then making predictions based on the values of those variables. Physical science, which features this approach, is advancing speedily. Social science, which largely ignores this approach, is barely advancing.

CFA and MIRT describe the current dataset. Their success at prediction depends on how closely the idiosyncrasies of the future dataset match idiosyncrasies of the current dataset.

Rasch tries to construct from the current dataset measures which are robust against the idiosyncrasies of both the current and the future datasets.

CFA and MIRT may "get lucky" and predict the future better than Rasch, but, since the future is always different from the past in unforeseeable ways, we would expect Rasch to have more success in general.

rblack: Thank you, Mike. This is invaluable. So, for the moment let's stick within the realm of Rasch modeling. If after fitting the Rasch model based on the contrast plots, I observe two clearly defined constructs (items clustered at top versus items clustered at bottom). Based on what I observe, I decide to fit the Rasch model on data from the second year for the first cluster of items only and then I refit the Rasch model on the same 2nd year data including only the second cluster of items only. By this point, I have obtained person scores for each of the two constructs. The data set looks like this:

ID   Construct 1            Construct 2            Outcome
     Person_Scores (X1)     Person_Scores (X2)     (Y)
1    -0.378                 -0.456                 0
2     2.609                  8.76                  1

Then I'd fit a logistic regression model:

logit[y] = B0 + B1*X1 + B2*X2

where X1 = person scores from Rasch model a
X2 = person scores from Rasch model b

Then I'd use the predicted values from the logistic regression to run a receiver operating characteristic (ROC) curve analysis to measure sensitivity/specificity and overall accuracy on a specified outcome.

Do you see where I'm headed? Does this sound reasonable?--Treating person scores from each construct as a separate variable to be entered in the logistic regression equation.
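For readers who want to see the mechanics of this plan, here is a minimal, self-contained Python sketch. The data are simulated (not from any real study), the coefficients used to generate them are arbitrary assumptions, and the fitting routine is a plain gradient-ascent stand-in for whatever statistical package would be used in practice. The area under the ROC curve is computed as the probability that a randomly chosen positive case receives a higher predicted value than a randomly chosen negative case.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit logit[y] = B0 + B1*X1 + B2*X2 by batch gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, t in zip(scores, outcomes) if t == 1]
    neg = [s for s, t in zip(scores, outcomes) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Simulated person measures (logits) on two constructs, and a binary outcome.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
y = [1 if rng.random() < sigmoid(1.0 * x1 + 0.5 * x2) else 0 for x1, x2 in X]

beta = fit_logistic(X, y)
predicted = [sigmoid(beta[0] + beta[1] * x1 + beta[2] * x2) for x1, x2 in X]
area = auc(predicted, y)
```

Note that this area is computed on the same cases used to fit the model, so it will tend to be optimistic; a held-out sample would give a fairer estimate.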



MikeLinacre: Ryan, it is always useful to think about the same situation in physical measurement. Imagine that we are measuring NFL offensive tackles. X1 is the Rasch measure of "weight" and X2 is the Rasch measure of "height" and the outcome is "invited to play in the Pro Bowl". Then we can imagine what logistic regression models make sense ... :)

rblack: Thank you very much for your help.

harmony: "But the history of science indicates that the strongest predictions are based on decomposing phenomena into unidimensional variables and then making predictions based on the values of those variables. Physical science, which features this approach, is advancing speedily. Social science, which largely ignores this approach, is barely advancing."

I am interested in seeing social sciences advance and find this quote very intriguing. Can you point me in the right direction of literature on this topic, Mike?

MikeLinacre: Glad to hear you are intrigued by the advance of science, Harmony!

The most conspicuous proponent of this viewpoint is William P. Fisher, Jr. - he has numerous pieces in Rasch Measurement Transactions and elsewhere http://www.livingcapitalmetrics.com/researchpapers.html.

Here is an interesting thought: http://wiki.answers.com/Q/Why_is_it_important_that_you_test_only_one_factor_at_a_time_when_you_perform_an_experiment

And here is a comment regarding the centrality of measurement:

"The systematic, careful collection of measurements or counts of relevant quantities is often the critical difference between pseudo-sciences, such as alchemy, and a science, such as chemistry or biology. Scientific measurements taken are usually tabulated, graphed, or mapped, and statistical manipulations, such as correlation and regression, performed on them. The measurements might be made in a controlled setting, such as a laboratory, or made on more or less inaccessible or unmanipulatable objects such as stars or human populations. The measurements often require specialized scientific instruments such as thermometers, spectroscopes, or voltmeters, and the progress of a scientific field is usually intimately tied to their invention and development." http://en.wikipedia.org/wiki/Scientific_method

harmony: Thank you Mike. ;)

641. Multilevel Logistic Regression = Rasch?

rblack April 23rd, 2010, 3:15am: Hi Mike and others,

I have another question...

From what I've read online I believe that one could consider a Rasch model [using items that have dichotomous response options only] as a binary logistic regression model with person random effects. Is that correct? I tested this out employing the GLIMMIX procedure in SAS 9.2, using MLE based on an adaptive Gaussian quadrature method, and observed what appeared to be similar results to those produced from Winsteps. Is my thinking off here? Also, what type of model is being fit in Winsteps when the response options are, say, ordered categories (i.e. 1=low, 2=middle, 3=high)? Would this be an ordinal logistic regression with person random effects?

Thanks and sorry for all the questions!


MikeLinacre: Yes, we can formulate the Rasch model as a logistic regression model as you have done. Congratulations!

In the 1980s we expected the standard statistical packages to introduce Rasch modules. There were some attempts. But the standard statistical packages are not designed to produce the detailed parameter-level reporting required for effective Rasch analysis. So Rasch work done using standard statistical packages has tended to be impractical, for instance https://www.rasch.org/rmt/rmt54n.htm.

"person random effects" - if this parameterizes the person distribution as normal, N(m,v), then the closer the true sample distribution is to normal, the more accurate will be the estimates.

Rasch rating scale models correspond to "transition odds" logistic models, also called "adjacent-category" logistic models. Most "ordinal logit models" are "cumulative-category" models. These are not Rasch-type models, but correspond to "graded response" models.

http://support.sas.com/kb/22/871.html identifies a SAS procedure for "adjacent-category" models.
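The "adjacent-category" property can be checked numerically. The Python sketch below computes category probabilities under the Andrich rating scale model for one person and one item, then confirms that the log-odds of each pair of adjacent categories equals ability minus difficulty minus the threshold. The function name and the numerical values are invented for illustration.

```python
import math

def rsm_probabilities(ability, difficulty, thresholds):
    """Category probabilities for categories 0..m under the Andrich rating scale model."""
    # Cumulative sums of (ability - difficulty - threshold_k); the bottom category is 0.
    logits = [0.0]
    for f in thresholds:
        logits.append(logits[-1] + (ability - difficulty - f))
    top = max(logits)  # subtract the maximum for numerical stability
    expd = [math.exp(v - top) for v in logits]
    total = sum(expd)
    return [e / total for e in expd]

ability, difficulty = 1.0, 0.2
thresholds = [-0.8, 0.8]  # a 3-category item (e.g. low, middle, high)

probs = rsm_probabilities(ability, difficulty, thresholds)
adjacent_log_odds = [math.log(probs[k] / probs[k - 1]) for k in range(1, len(probs))]
model_log_odds = [ability - difficulty - f for f in thresholds]
```

A cumulative-category model would instead put the linear expression on the log-odds of "category k or above" versus "below k", which is a different parameterization.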

rblack: Thank you for the historical information and clarification. I must say that I've yet to find a program to provide as much detailed output as Winsteps for Rasch models.

Regarding the person random effects, I was assuming a normal distribution. If that assumption didn't hold, I suppose one alternative would be to assume a gamma distribution, for instance. But then one would need to use the NLMIXED procedure instead of the GLIMMIX procedure in SAS.

Thanks again!


MikeLinacre: Ryan: "assuming distributions" is a challenge throughout statistical analysis. When the assumed distributions approximate the true distributions, then the findings are usefully accurate, but when they don't match, the findings can be inaccurate. For instance, if the true distribution is strongly bimodal, then normal-distribution-based inferences can be misleading.

In Rasch estimation, CMLE (conditional MLE), PMLE (pairwise MLE) and JMLE (joint MLE) do not assume parameter distributions, but MMLE (marginal MLE), PROX (double-marginal MLE) and most IRT estimation methods do.

You could try this:
1. estimate the item difficulties using "person random effects"
2. estimate the person abilities from the estimated item difficulties: https://www.rasch.org/rmt/rmt102t.htm
3. test the hypothesis that the estimated person abilities are normally-distributed
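Step 2 above can be sketched in Python for dichotomous items. This is a generic Newton-Raphson maximum-likelihood routine written for illustration, with an invented function name; it is not claimed to reproduce the exact procedure at the linked page. It treats the item difficulties as known and finds the ability at which the expected raw score equals the observed raw score.

```python
import math

def estimate_ability(responses, difficulties, tol=1e-8, max_iter=100):
    """MLE of a person's ability (logits) from 0/1 responses and known item difficulties."""
    n, score = len(responses), sum(responses)
    if score == 0 or score == n:
        raise ValueError("Extreme scores have no finite maximum-likelihood estimate.")
    # Starting value: mean item difficulty plus the log-odds of the raw score.
    ability = sum(difficulties) / n + math.log(score / (n - score))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties]
        expected = sum(probs)
        variance = sum(p * (1.0 - p) for p in probs)
        step = (score - expected) / variance
        step = max(-1.0, min(1.0, step))  # damp large steps
        ability += step
        if abs(step) < tol:
            break
    probs = [1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties]
    standard_error = 1.0 / math.sqrt(sum(p * (1.0 - p) for p in probs))
    return ability, standard_error

# Invented item difficulties and one person's responses.
item_difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
person_responses = [1, 1, 1, 0, 0]
measure, se = estimate_ability(person_responses, item_difficulties)
```

Applying this to every person gives a set of estimated abilities whose distribution can then be tested for normality in step 3.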

rblack: Mike,

You are a wealth of information!

Thank you,


642. RSM

Raschmad April 24th, 2010, 8:25am: Dear Mike,
Can the rating scale model be applied if the number of response categories differs across the items of an instrument? Some of the items in my instrument have 3, some have 5, some have 7 and some have 2 response categories.

MikeLinacre: Thank you for your question about the Andrich rating-scale model, Raschmad.
1. This model assumes that all the items have the same response-structure. Your response-structures differ across items, so you can use either:
2. The partial-credit model: this specifies that each item has a different response structure
3. The grouped rating-scale model: you specify which subsets of items share the same rating scale structure.
In Winsteps, 1. is specified by omitting (IS)GROUPS=, 2. is specified with GROUPS=0, and 3. is specified with GROUPS=(selection indicators)
In Facets, 1. is specified with all "?". 2. is specified with "#", and 3. is specified with "Models=" and "Rating scale=R,General" specifications


harmony April 19th, 2010, 6:41am: Hello everyone:

In classical test theory we can calculate the standard error for skew and kurtosis and check to see if our calculated value for these statistics on a test is two times that value in order to know if our data are significantly skewed or peaked such that the assumption of normality has been violated.

In this case I am referring to a placement test instrument that is significantly negatively skewed.

Are there statistics in Winsteps that correspond to skew and kurtosis? (the item and person's maps are helpful representations, but are not statistics.)

Also, is there a precise way to interpret the effect of skew on the reliability of cut-off decisions at the skewed points in the data?

MikeLinacre: Harmony, Winsteps does not report skew or kurtosis. Probably the easiest way to obtain these is to output the person measures to Excel, and then use the Excel functions =SKEW and =KURT.
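For those working outside Excel, the same sample statistics can be computed in Python. The sketch below uses the bias-adjusted sample formulas that Excel's SKEW and KURT functions are documented to use, together with the usual standard errors for the "twice the standard error" rule of thumb mentioned in the question. The function name and the example measures are invented for illustration.

```python
import math

def sample_skew_kurtosis(values):
    """Sample skewness, excess kurtosis, and their standard errors (needs at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = [(v - mean) / sd for v in values]
    skew = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, kurt, se_skew, se_kurt

# Invented person measures (logits) with a long lower tail.
measures = [-3.1, -1.2, 0.4, 0.8, 1.0, 1.1, 1.3, 1.4, 1.5, 1.6]
skew, kurt, se_skew, se_kurt = sample_skew_kurtosis(measures)
negatively_skewed = skew < 0
exceeds_rule_of_thumb = abs(skew) > 2 * se_skew
```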

Non-normal skew and kurtosis generally lower test reliability. See http://jalt.org/test/PDF/Brown1.pdf for their effect on cut-point reliability, which may be to increase or decrease reliability. A simulation study should indicate their impact on reliability coefficients.

harmony: Thanks Mike

644. common item linking without the items?

harmony April 19th, 2010, 5:58am: Hi all:

I've been told that it is possible to link tests via common item equating (as in a sub-test that both tests share), and that it is possible to then remove that sub-test from each linked test and administer the two tests without it, and they are both still equated. Is this correct?

Somehow the logic of that escapes me. When the subtest is removed from either test, the item measures change as does the relative difficulty of each test. How can tests be linked without a link?

Thanks for any help.

MikeLinacre: Harmony, this is probably what they are thinking:
1. Analyze and equate the two tests. This produces an equated measurement scale on which every item has a known "equated" difficulty.
2. Estimate the equated item difficulty for each item.
3. Remove the common items.
4. Administer the two tests without the common items. Collect the data.
5. Anchor (fix) the remaining items at their equated difficulties.
6. Analyze and report the measures on the two tests. These will be on the equated measurement scale.


harmony: Now it makes sense! :D


645. examinee(Infit/outfit statistics)

Tak_Fung April 16th, 2010, 3:50pm: Hi all:

I am very new to Rasch analysis. I know how to interpret the Infit and Outfit statistics reported for items in Winsteps output. However, I have trouble interpreting the Infit and Outfit statistics for persons in Winsteps output. Are there any reading materials available that can guide me in such interpretations? Thank you very much.


MikeLinacre: Thank you for your question about infit and outfit statistics for persons, Tak Fung.

The Table in https://www.winsteps.com/winman/index.htm?dichotomous.htm is a good starting point.

Tak_Fung: Dr. Linacre: Many thanks for the prompt reply. I appreciate it very much.

646. Is Rasch model better choice for analyzing survey?

dachengruoque April 5th, 2010, 6:18am: As a newbie in statistics, I have found that a lot of attitude surveys are analyzed by factor analysis to extract factors from the responses. In reading Fox, C. M., & Bond, T. G. (2001). Applying the Rasch model: fundamental measurement in the human sciences. Mahwah, N.J.: L. Erlbaum, I found that the authors cited a paper by Ben Wright: Wright, B. D. (1996). Comparing Rasch measurement and factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 3(1), 3-24.
He suggested that Rasch is a better choice than factor analysis for seven reasons, as he concluded in the last part of the paper. Is this view widely accepted, or is it only acclaimed by Rasch gurus? I ask because factor analysis seems to be the first choice when it comes to analyzing surveys or questionnaires. Thanks a lot for your views!

MikeLinacre: Thank you for your question, dachengruoque.

The answer to your question is philosophical. If you are interested in describing a dataset, then factor analysis is as good a method as any other. If you are interested in using a dataset to construct additive measures along unidimensional variables, then Rasch is the only method.

Most statisticians are satisfied with describing the data, so factor analysis is the most common technique.

For some criticisms of factor analysis from a measurement perspective, please see https://www.rasch.org/rmt/rmt81p.htm and https://www.rasch.org/rmt/rmt142m.htm

dachengruoque: thanks a lot, Dr Linacre, for your succinct explanation and literature pointer!

klangenhoven: Since I am starting out in analysing a large survey data set and getting used to the terminology of statistics, could you, in layman's terms, provide an example to illustrate your statement "using a dataset to construct additive measures along unidimensional variables"?

MikeLinacre: Thank you for your email, klangenhoven.

"using a dataset to construct additive measures along unidimensional variables" - the history of the development of the thermometer is a great example of this. It took about 200 years for physical scientists to use their dataset of heat indicators to construct additive measures (temperatures) along the unidimensional variable of "heat quantity".

In survey research, we have a similar process. For instance,
and the details of a Rasch analysis ...

dachengruoque: thanks a lot for your explanation, Dr Linacre!

647. Anchoring with subgroups

JCS April 12th, 2010, 3:56pm: Hi Mike,

I am investigating whether a subgroup of examinees--who exhibit extremely low effort on the test--influence the item parameter estimates. I want to determine whether the inclusion/exclusion of this subgroup is warranted based on whether the item estimates are affected. In other words, I want to calibrate twice: first using the high-effort group and again using everyone (both high and low effort groups).

I'm curious as to whether the low-effort group will affect item-fit and/or the difficulty estimates. I suppose, in theory, this is like examining high ability vs. low ability groups but I want to go beyond looking at correlations. I want to examine the changes in item fit statistics.

My question involves whether I need to anchor the scale somehow between the first and second calibrations. Since the entire sample from the first calibration is also included in the second calibration, should I anchor on that group?


MikeLinacre: JCS, a more immediate solution would be to do a DIF (differential item functioning) study of high vs. low performers. If you have a code in the person label for low performers, then you could do the DIF report (Winsteps Table 30) based on this. Alternatively you could do it based on a sample stratification (DIF=MA2 in Winsteps).

In the approach you are suggesting, we need to decide what to keep the same (anchor) between the low-group and both-groups analyses, and what to allow to change. If we want to investigate item fit statistics, then allowing the person measures or the item difficulties to change will alter the fit statistics. So perhaps we want to keep them unchanged:
1. Analyze complete dataset - output person measures, item difficulties, response structures (if polytomies). (Winsteps PFILE=p.txt, IFILE=i.txt, SFILE=s.txt)
2. Analyze only low group - anchor the measures (Winsteps PAFILE=p.txt, IAFILE=i.txt, SAFILE=s.txt)
3. Compare the item fit statistics between 1. and 2.

JCS: Thanks, Mike. I considered DIF analysis and I will probably run that, too. I am using preliminary data to examine item quality. I want to know whether my decisions regarding item quality change based on whether I include/exclude those low-effort examinees.

MikeLinacre: JCS, you wrote "whether I include/exclude those low-effort examinees."

OK - so ignore my advice about anchoring. Please compare two free analyses: one analysis of everyone and one of only the high group.

JCS: Thanks for the reply. I calibrated the high effort group first, then used PAFILE and IAFILE to anchor the entire group (both low and high effort together). I did this because I wanted to actually look at the differences in infit-outfit values. My hunch was that the amount of variation increased, which is what I found--particularly for a few really difficult items.

From a practical/operational standpoint, your most recent suggestion is certainly appropriate. However, I don't believe the two "free" analyses allow me to quantify change in item fit.

MikeLinacre: JCS: when the anchor values from one analysis are applied to another analysis, the fit in the anchored analysis is almost always worse.

So, please do a simulation study as a baseline. Simulate both groups (Winsteps SIFILE=) to match your situation, but to fit the Rasch model. Then do the analysis according to your anchoring design. We know that the fit of both analyses should be the same, so any increase in misfit in the anchored analysis is an effect of anchoring. The size of this anchoring effect is the baseline for your findings with the anchored analysis of the empirical data.
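To make the idea of a model-fitting baseline concrete, here is a small Python sketch that simulates dichotomous responses that fit the Rasch model for two groups of persons. Winsteps can generate such data itself (SIFILE=, as mentioned above); this sketch only illustrates what "data that fit the model" means. The function name, group sizes, ability values and item difficulties are all invented.

```python
import math
import random

def simulate_rasch(abilities, difficulties, seed=0):
    """Simulate 0/1 responses with P(success) = 1 / (1 + exp(-(ability - difficulty)))."""
    rng = random.Random(seed)
    return [[1 if rng.random() < 1.0 / (1.0 + math.exp(-(b - d))) else 0
             for d in difficulties]
            for b in abilities]

# Invented item difficulties (logits), evenly spread from -2 to +2.
difficulties = [-2.0 + 4.0 * i / 19 for i in range(20)]

# Two invented groups of 50 persons each, both generated to fit the model.
group_rng = random.Random(1)
high_group = [group_rng.gauss(1.0, 0.8) for _ in range(50)]
low_group = [group_rng.gauss(-1.0, 0.8) for _ in range(50)]

high_data = simulate_rasch(high_group, difficulties, seed=2)
low_data = simulate_rasch(low_group, difficulties, seed=3)

high_total = sum(map(sum, high_data))
low_total = sum(map(sum, low_data))
```

Running the anchored design on data like these shows how much misfit the anchoring alone produces, since by construction neither group misfits.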

648. DIF

jjswig April 13th, 2010, 3:57pm: Hi Mike:

I recently took one of your online courses and found it excellent. I study quality of life (QOL) in a particular lung disease. Certain authors have found differences in QOL between men and women. Your course got me thinking about this: is there a true difference in QOL between genders, or is it that the instrument is just not behaving the same for each gender.

If I perform Rasch analysis on the instrument and find DIF for certain items, what is that really telling me? If I do find that, then I'll certainly investigate to see if what I know about the item and the disease might explain it, but might I simply be left with differences between genders that I am unable to explain by means other than "it is what it is"

MikeLinacre: Yes, you have met a conundrum often encountered in scientific investigation. Is it a cause? Is it an effect? Or is it an accident?

The statistics report DIF. But does that indicate a higher ability for the group or a lower difficulty for the item? Statistics cannot tell us.

For instance, for DIF on a language test, if a group is specifically taught something, we would say "the group has higher ability on those items". But if some items happen to align with the life experiences of a group, we would say "those items are easier for the group". But obviously this is an arbitrary evaluation of the DIF finding.

649. Why objective measurement?

dachengruoque April 9th, 2010, 6:53am: Dr Linacre,
In reading Applying the Rasch Model (2001), I found that a serial publication named Objective Measurement is listed among the classic readings for Rasch analysis. Why is it named objective measurement? Does it imply that other means of measurement are subjective? Please forgive me if this is too stupid a question! ;D Thanks a lot!

MikeLinacre: Dachengruoque, the term "Objective Measurement" has changed its meaning over the years. It was introduced to differentiate between "objective" tests with predetermined correct/incorrect answers, such as multiple-choice tests, and "subjective" tests, such as essay-writing, which required the expertise of a rater. It was thought that removing raters from testing would lead to better tests. Unfortunately those "objective" tests were found to have their own, but different, problems. An obvious problem has been the trend toward "teaching to the test".

Now "objective measurement" means "measurement which is independent of the specific choice of test questions". This idea (not the terminology) goes back to E.L. Thorndike (1904), and from this idea the Rasch model can be derived, see https://www.rasch.org/rmt/rmt143g.htm

The opposite of the current meaning of "objective measurement" is something like "test-dependent scoring", even if the scores are transformed using a statistical device such as the 3-PL IRT model.

dachengruoque: Thanks a lot, Dr Linacre! Can I understand it this way: the current meaning of Objective Measurement refers to those measurements that cannot satisfy the requirement that the scores of test candidates are related to context effects, which are taken for granted in the traditional IRT approach? Therefore, can I say that the testlet theory approach is opposite to Objective Measurement? Thanks a lot, Dr Linacre!

MikeLinacre: dachengruoque, "Objective Measurement" does not limit the item-types. So any item-type can support Objective Measurement, including testlets (item clusters). But we must always verify that the items or item-clusters are co-operating to support Objective Measurement.

There are situations where testlets can be advantageous. For instance, on a reading-comprehension test, a cluster of items may relate to one piece of text. In this situation, the Objective Measurement criterion is that the objective measure is independent of which pieces of text (each with its own cluster of items) are administered.

dachengruoque: thanks a lot, dr Linacre, for your patient answering!

650. The error information

dachengruoque April 10th, 2010, 1:12am:

This is a post by somebody else, but I am interested in it, so I cross-posted it for any comments, though I know this is not relevant to Winsteps or Facets. If it is not within the scope of this board's questions and answers, please delete it and forgive my intrusion. Thanks a lot!

A script was rewritten from a Parscale example to test and run a partial credit model, but the following error message often pops up.

The script I rewrote from the example reads as follows:
>FILES DFNAME='H01.DAT', SAVE;
>BLOCK1 REPEAT=3,NIT=1, NCAT=12, ORIGINAL=(1,2,3,4,5,6,7,8,9,10,11,12);
>CALIB LOGISTIC, PARTIAL, NQPT=21, CYCLES=(100,1,1,1,1,1), NEWTON=2, CRIT=0.005, SCALE=1.7;

The error message reads:

>BLOCK1 REPEAT=3,NIT=1, NCAT=12, ORIGINAL=(1,2,3,4,5,6,7,8,9,10,11,12);


MikeLinacre: dachengruoque,

Please check that you do not have spaces where Parscale expects tabs, or tabs where Parscale expects spaces.

Is your ITEM= correct? I expected something like: ITEM=(1(1)12)
Also, what about BNAME=MYBLOCK

dachengruoque: Thanks a lot, Dr Linacre! You mean that the rewritten script has something important missing, right? So the software doesn't work.

MikeLinacre: Yes, dachengruoque. Parscale tells us that the script is incorrect.

dachengruoque: Thanks a lot, Dr Linacre!

651. how to interpret SVD and RMS residual

seol February 10th, 2010, 6:14am: Dear Dr. Linacre

Hi! In the updated version (3.69.0) of the WINSTEPS program, you added two statistics (RMSR and SVD of residuals). I am just curious about the advantage of RMSR compared with the fit statistics, and about how to interpret it or what guidelines apply.

Further, how can I interpret the SVD of residuals as an indicator of multidimensionality?

I would really appreciate your kind reply. :)


MikeLinacre: Seol, thank you for asking.

These features were included in Winsteps for the Netflix Prize - see "Rasch Lessons" in www.rasch.org/rmt/rmt233.pdf - so I made them available for everyone.

I am learning about them myself :-)

howard: Dear Dr. Linacre

What values are acceptable for RMSR?

Howard ::)

MikeLinacre: Howard, I don't know. Perhaps you will be the person who discovers those values, at least for your own data. :-)

652. Why the results are different?

dachengruoque April 5th, 2010, 6:25am: I ran Winsteps on the dataset provided by Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, N.J.: Lawrence Erlbaum Associates. On page 56 there is a Table 5.1 which shows slightly different results from mine. Why are the results different? Is it because of different versions of the software, or different software? Thanks a lot for your explanation! My results are as follows:
ITEM PRTIII for Chapter Five Mar 30
1 -3.43 1 143.0 127.0 .30 1.25 1.35 1.08 .33 .00 .44 1.00 86.0 89.5 .79 1 R Item 1
2 3.67 1 143.0 6.0 .44 .91 -.16 .64 .05 .00 .34 1.00 96.5 95.8 1.05 1 R Item 2
3 .86 1 143.0 41.0 .22 1.17 1.48 1.09 .40 .00 .47 1.00 74.1 79.4 .78 1 R Item 13
4 -.66 1 143.0 75.0 .21 .94 -.58 .98 -.04 .00 .61 1.00 81.1 76.3 1.07 1 R I0004
5 -.40 1 143.0 69.0 .21 .93 -.72 .90 -.51 .00 .62 1.00 81.1 76.5 1.11 1 R I0005
6 1.02 1 143.0 38.0 .23 1.17 1.47 1.29 .98 .00 .45 1.00 76.2 80.0 .75 1 R I0006
7 -.83 1 143.0 79.0 .21 .82 -1.82 .75 -1.40 .00 .67 1.00 81.8 76.4 1.29 1 R I0007
8 -3.03 1 143.0 122.0 .27 1.09 .64 2.98 2.65 .00 .48 1.00 88.1 86.7 .83 1 R I0008
9 .58 1 143.0 47.0 .22 1.04 .40 1.30 1.20 .00 .52 1.00 79.7 78.5 .88 1 R I0009
10 -.87 1 143.0 80.0 .21 1.00 .03 .86 -.73 .00 .61 1.00 75.5 76.5 1.04 1 R I0010
11 .72 1 143.0 44.0 .22 .85 -1.43 .72 -1.09 .00 .60 1.00 84.6 78.8 1.22 1 R I0011
12 -.44 1 143.0 70.0 .21 .97 -.29 .83 -.96 .00 .61 1.00 73.4 76.5 1.09 1 R I0012
13 2.81 1 143.0 12.0 .33 .78 -.97 .66 -.24 .00 .45 1.00 92.3 91.9 1.15 1 R I0013

MikeLinacre: Thank you for your question, dachengruoque.

Bond & Fox (1st edn) Table 5.1 p. 56 was estimated using the Quest Rasch-analysis software. Are you using Winsteps? If so you may discover that STBIAS=YES brings the estimates into closer alignment.

Notice that the differences in the numbers are not of statistical importance, because they are less than the standard errors of measurement (the imprecision in the estimates).

dachengruoque: Yes, I ran the data through Winsteps, which you authored! I see, thanks a lot, Dr Linacre!

653. Items revealing DIF

lovepenn April 3rd, 2010, 1:42am: Hi Mike,

If some items exhibit differential item functioning, what can be done with those items? Do I have to omit those items from the item set? Is there any way to retain those items? I found a study employing the "equating" procedure suggested by Andrich and Hagquist to retain the DIF items while reducing the effect of DIF. But I couldn't find the literature (Andrich and Hagquist).
Is this procedure commonly used by researchers? Do you know in which journal I can find this article? Is there any other way to deal with the DIF items?
Could you recommend articles or chapters that address this issue? Thanks always, --Lovepenn

MikeLinacre: Thank you Lovepenn.

There are several tactics when DIF is discovered.

1. Ignore it as inherent in the measurement system. For instance, in a test of English as a Second Language, different items will exhibit DIF for speakers of different native languages, depending on the relationship of those languages to English. When constructing such tests, content experts should try to balance the DIF across the major groups, but this can never be done perfectly.

2. Remove the item (perhaps rewriting it). This is when the DIF is seen to be a confounding factor which overwhelms the intention of the item. For instance, a mathematics word problem which uses a technical cooking or car-mechanics term better known to girls or boys.

3. Treat the data for one DIF group as missing data. For instance, if a generally good item is misunderstood by speakers of a specific dialect of English, then make the item missing data for them.

4. Split the item. Make the item two items, each with active data for one group and missing data for the other group. This maintains the same raw scores, but produces different measures for each group.
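Tactic 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response matrix is a list of per-person lists, `None` marks missing data, and the function name `split_item` and the group labels are hypothetical, not part of Winsteps or Facets.

```python
def split_item(responses, groups, item_index, focal_group):
    """Replace one item by two group-specific copies.

    The original column is removed and two new columns are appended:
    first the copy answered by everyone outside focal_group, then the
    copy answered only by focal_group. The inactive copy is None
    (missing data), so each person's raw score is unchanged.
    """
    split = []
    for row, group in zip(responses, groups):
        kept = row[:item_index] + row[item_index + 1:]
        value = row[item_index]
        if group == focal_group:
            kept = kept + [None, value]
        else:
            kept = kept + [value, None]
        split.append(kept)
    return split


# Hypothetical data: 2 persons, 3 items, item at index 1 shows DIF.
data = [[1, 0, 1], [0, 1, 1]]
labels = ["A", "B"]
print(split_item(data, labels, item_index=1, focal_group="B"))
```

Setting the item to missing for one group only (tactic 3) is the same idea with the focal-group copy dropped.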

The decision about which approach to use is often driven by political and legal considerations. So the literature either focuses on the mathematics of DIF remedies or contains case studies of the resolution of DIF under specific conditions. I am not familiar with the Andrich and Hagquist paper, but it appears to belong to the mathematical genre.

Most important in resolving DIF is that we can communicate our DIF-resolution process simply, clearly and precisely to our audience. This will remove any feeling of unfairness or bias. "Mathe-magical" approaches to DIF resolution may make some in the audience wonder whether something underhanded is being perpetrated upon them.

lovepenn: Thank you so much, Mike.
It greatly helps.
The Andrich and Hagquist approach taken by the study that I found seems to be #4 in your summary. The study describes it this way: divide each DIF item into three items (for three groups) and construct the new data set by equating, so that some items have missing data for particular groups.
Thanks again, --Lovepenn

dachengruoque: Thanks a lot!

654. Negative PtExp

drmattbarney April 2nd, 2010, 9:36am: Dear Mike

I'm analyzing a big dataset and came across something I don't think I've seen before. I've got negative expected point-measure correlations, as high as -1.00.

I've double-checked my dataset, and the reverse coding for the negatively worded items is all correct.

Is this correct and I'm just not thinking clearly, or did I miskey something?

MikeLinacre: Matt, thank you for your question.

What type of data are you looking at?

A negative correlation of -1 can indicate that the marginal score is fixed across persons or items. An example of this is when everyone rank orders 6 objects. Then the marginal score for every person is 6+5+4+3+2+1 = 21. A similar situation arises with paired-comparisons (Coke vs. Pepsi), where one object is scored 1 and the other object is scored 0. The marginal scores of the comparisons are 1 + 0 = 1.

If you are looking at standard MCQ test or survey data, then something has gone wrong. The scoring may have been double-reversed: the negative items may have been reverse-scored at the data entry stage, and then reverse-scored again at the analysis stage.

drmattbarney: Mike, thank you for the fast reply

These are multisource data on personality, values, cognition, affect, and behavior. Likert-type 1:strongly disagree; 2:disagree; 3:slightly disagree; 4:Neutral; 5:slightly agree; 6:agree; 7: Strongly agree.

I'll look into the double-reversed scoring. You're right, that's a possibility.

655. questionnaires

tutorjk March 23rd, 2010, 3:27pm: Hi Mike and all :),

I am a newbie conducting research on the actual vs. the perceived use of mathematics in the workplace. I have in mind something like the Fennema-Sherman Likert items for math anxiety, but used to compare the observed use with the self-reported use of mathematics in the workplace. I will be using the Rasch RSM to establish validity, but I can only access a relatively small population.

Are there any members who could point me in the direction of existing questionnaires that I can adapt to my research project please?

Best regards, John K. Ireland.

MikeLinacre: Tutorjk, please help us understand what you want:
1. Questionnaires using Likert scales?
2. Questionnaires for comparing observed behavior with self-reported behavior?
3. Questionnaires about use of mathematics?
4. Questionnaires about the workplace?
5. ... ?

tutorjk: Hi Mike,

Thanks very much for your reply. I am looking for:
statistics concerning the use of mathematics in the workplace, and
the questionnaires that were used to gather those statistics, whether a Likert scale was used or not. In a nutshell, items 2 and 3 in your reply to me.

My research is focused on the mathematics knowledge, skills and competence that underpin contextualisations of mathematics in the workplace, to the extent that they are dismissed as mere common sense - anything but maths. If this type of survey has been done before, I can cite it as authority and adapt it for local conditions. If not, I will have to take a larger sample than the scope of the project allows.

I hope you can help me.

Best regards,


MikeLinacre: Thank you for clarifying your needs, John.
Here is a start .... :)
Please Google "A survey of use of mathematics in industry in Hong Kong"
Also Google: observed "self-reported" questionnaires

tutorjk: Hi Mike,

These leads look perfect on the surface! I had found them previously but was slow to use my meagre research budget on the off-chance of their being suitable enough to warrant the expense. It seems that I have no option but to ask if my institute will purchase them on my behalf.

Thanks for your help and for being so generous with your time. Congratulations on your software and this site too.

Best regards,


tutorjk: Hi Mike,

Unfortunately, neither of the leads mentioned above was of use - both discussed the findings without providing details of the questionnaire. It seems that they did not use Rasch (or Winsteps) to establish validity either. All a bit disappointing really. I don't suppose you have any other ideas?

Best regards,


MikeLinacre: Tutorjk, you wrote "without providing details of the questionnaire" - Yes, many studies report few details of the questionnaire wording because the authors of the questionnaire sell the questionnaires.

Have you Googled "Winsteps anxiety" ? Perhaps this will find a helpful paper.

657. p-values

Ivy March 23rd, 2010, 11:30am: Hi Mike
Can either Winsteps or Facets generate item p-values (item difficulty)?

MikeLinacre: Ivy, item p-values are reported by Winsteps when PVALUE=YES, or by selecting the p-value field for IFILE=
In Facets, Table 7, the "Observed Average" for the items is the p-value.

Ivy: Thanks! This forum is helpful, and Mike is truly helpful!

658. scores with decimals

Ivy March 21st, 2010, 1:37pm: Hello Mike
If I have scores with decimals, should I just enter the data as normal (.5; 1.9), or do I need to make adjustments?
btw, missing data was coded as 'X', not a dot.

MikeLinacre: Ivy, Rasch programs expect integer data. So multiply all your data by 10 to eliminate the decimals.
But you may need to think about what your data signify. Rasch expects the data to be counts of qualitative advances up the latent variable. Several decimal values may need to be "chunked" into one integer value for this to apply. Category tables, such as Table 3.2 in Winsteps or Table 8 in Facets, tell us whether the data are acting as qualitative counts or something else.
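The rescaling step can be sketched in Python. This is a minimal sketch, assuming the scores have at most one decimal place (so a factor of 10 suffices) and that missing data are coded 'X' as in Ivy's post; the function name `rescale_scores` and the use of `None` for missing values are illustrative assumptions. Any "chunking" of values into fewer categories would still be a separate, substantive decision.

```python
def rescale_scores(raw_scores, factor=10, missing_code="X"):
    """Convert decimal scores (as text) to integers by multiplying by factor.

    Values equal to missing_code are returned as None. round() guards
    against floating-point results such as 18.999999999999996.
    """
    rescaled = []
    for value in raw_scores:
        if value == missing_code:
            rescaled.append(None)
        else:
            rescaled.append(round(float(value) * factor))
    return rescaled


print(rescale_scores([".5", "1.9", "X", "2"]))
```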

Ivy: thanks, Mike, for the reply.

659. polytomous data

Raschmad March 9th, 2010, 3:48pm: Dear Mike,
I have a polytomous dataset and I want to analyse it with PCM. The items each have 25 categories. The response vector for a person is like "2231125". This person has scored 2 on the first item and 23 on the second. I was just wondering how the programme knows this. Doesn't Winsteps assume that this person has got 22 on the 1st item and mix up all the other item scores as well? Although this doesn't cause any problem for person ability estimation, since the total person raw score will be the same, it does cause a problem for item difficulty estimation since it changes item raw scores.

MikeLinacre: Thank you, Anthony.
My first guess is www.winsteps.com/winman/xwide.htm Example 3 - but it is usually much easier to reformat the original data into two columns for each response.

Raschmad: Dear Mike,
The coding commands you gave work for Likert-type data when one knows the width of each item beforehand. The test I want to analyse is an educational one. 25 is the highest possible score but many students certainly score lower than 10 on many items. So a vector like "2231125" (for 4 items), where a person has scored 2 on the first item, 23 on the second, and 11 and 25 on the rest, cannot be read by the programme. In fact the vector can be read in many different ways: 22, 3, 11, 25; 2, 23, 1, 1, 25; and so on.

MikeLinacre: Anthony, this is mysterious.
Does anyone know how to decode "2231125"?
If so, then we can tell the computer to do it the same way.
If not, then we need a different data format.

Raschmad: Dear Mike,
Sorry for the misunderstanding. I had a look at a control file which had been constructed by Winsteps from SPSS, and worked it out.
Winsteps uses two columns for each response when the width of the data is set to 2. If a response is 1 column wide, it leaves the other column blank, so it can distinguish the item responses. The response pattern for the third person in the following set (5 items) is: 0, 3, 13, 1, 0.
My apologies.
0 313 1 0
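The fixed-width reading Anthony describes can be sketched in Python. This is only an illustration, not how Winsteps is implemented: the function name `parse_fixed_width` is hypothetical, and the record is assumed to start with a blank (" 0 313 1 0"), since a one-digit first response would be padded to two columns.

```python
def parse_fixed_width(record, width=2):
    """Split a response string into fields of equal width.

    Each field is converted to an integer; a field that is entirely
    blank is treated as missing and returned as None.
    """
    fields = [record[i:i + width] for i in range(0, len(record), width)]
    return [int(field) if field.strip() else None for field in fields]


print(parse_fixed_width(" 0 313 1 0"))  # five 2-column responses
print(parse_fixed_width(" 2231125"))    # the earlier example, padded to width 2
```

With a fixed width there is only one way to read each record, which removes the ambiguity of an unpadded string such as "2231125".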

MikeLinacre: Glad you solved that problem, Anthony. It was mystifying me ... :-)

660. ask a question

jianrong March 10th, 2010, 3:28pm: Thank you very much! I am a doctor in China. I need Rasch in my study, but I am new to it. I would like to ask you a question and hope you can give me an answer. If I use Rasch measurement to assess a scale about stroke, how many statistical methods should I master? Thanks a lot. ;)

661. Test information Function

mats March 7th, 2010, 5:26pm: I am analyzing a study where the items of a test seem badly targeted for the population. About 50% of respondents lie below the position of the items' maximum item information, according to the item-person map in Winsteps.

In preparing a discussion about the implications for measurement of respondents below the items, I am considering this:

"If a substantial part of the population lies outside the range of the items, measurement in the uncovered part of the dimension will be imprecise and existing changes in the dimension will cause small effects on the ratings."

Is my interpretation of the implication of poorly targeted items correct? Any comments?


MikeLinacre: Thank you for your question, Mats.

You write: "Seems badly targeted for the population. About 50 % of respondents lies below the position of the items maximum item information ..."

If the sample was exactly targeted on the test, then half of the respondents would be above the maximum test information and half below. But this situation is unusual for an educational test. Typically we expect average success rates to be around 75% or even higher. This would be 1 logit or so above the mean item difficulty.

Loss of measurement precision becomes a problem when the sample is 2 logits above or below the mean of the test, and the test is not long enough. Then the average success rate is about 90% or 10%.
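The link between success rates and logit distances quoted above can be checked with a short Python sketch. It assumes a single dichotomous Rasch item, so that the log-odds of success equal the distance between person and item; the function name `logit_offset` is illustrative. For a whole test with spread-out items the averages will differ somewhat, so these are rough guides only.

```python
import math


def logit_offset(success_rate):
    """Distance in logits between person and item for a given success rate.

    For a dichotomous Rasch item, log(p / (1 - p)) = B - D.
    """
    return math.log(success_rate / (1.0 - success_rate))


for rate in (0.10, 0.50, 0.75, 0.90):
    print(f"success rate {rate:.0%}: {logit_offset(rate):+.2f} logits")
```

Under this assumption a 75% success rate corresponds to about 1.1 logits, and 90% or 10% to about 2.2 logits above or below, in line with the "1 logit or so" and "2 logits" figures in the post.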

If much of the sample are below the mean item difficulty on an educational test, then we are more concerned with accuracy (misfit) than with precision (standard errors). A test that is too hard for the sample provokes unwanted behavior such as guessing, response sets or failure to finish the test.

Does this help?

mats: Hi, Mike and thanks for your reply.

The population is actually a depression sample before the start of a cognitive therapy. But about half the sample has a Rasch-modeled "ability" below the maximum item functioning for the least difficult item.

Since most of my colleagues are only aware of Classical Test Theory and raw score measurement, I am trying to figure out the consequences for measuring improvement from the therapy when the rating scale is poorly targeted for the population. My hypotheses are (after you added the fit problem):

1. Equal changes in the depression dimension will have less effect on ratings (raw score) for the individuals below the item levels, compared to individuals well targeted by the items.

2. That SE would be greater for the individuals poorly targeted by the items.

3. That fit would be more problematic for the poorly targeted individuals.

Is this correct?

MikeLinacre: Thank you for telling us about this situation, Mats.

1. and 2. are correct.

3. Fit depends on the characteristics of the individuals. There is less statistical power to detect misfit in poorly targeted individuals, but poorly targeted individuals may have more idiosyncratic features. So misfit depends on the situation ...

662. Item information function

Ivy March 4th, 2010, 10:05pm: Hi Mike
I have a question on the item information function in WINSTEPS. I know it produces graphs showing each individual item's information as well as the overall test information, but how can I obtain their values at certain theta levels? Are they in any of the output tables, or does one need to calculate them manually according to the formula I = P*Q for the one-parameter Rasch model?

The numeric values of item information at designated theta levels (ability estimates) can be useful in selecting appropriate items. Can you please advise? Did I overlook any of the Winsteps outputs?


MikeLinacre: Thank you for your question, Ivy.

There are several approaches in Winsteps, depending on exactly what you want to plot. Here are two:

1. Graphs window. Absolute scaling. Information function. Copy data to clipboard. Then paste into Excel.

2. GRFILE= to Excel, and then add the item difficulty to the x-axis value.

Ivy: thanks, Mike, for the prompt reply.
But suppose one wants to know the amount of item information an item can produce at theta level -2.00, how can he get it? The two suggested graph options produce information at certain pre-determined measure estimates (see below). Is there an option of obtaining information at ANY desired level/point?

1 -3.00 .05 .05 .95 .05
1 -2.94 .05 .05 .95 .05
1 -2.88 .05 .05 .95 .05

thanks again for your kind help


MikeLinacre: Certainly, Ivy. If you know the person ability = B and the item difficulty = D, then the easiest thing would be to use Excel to compute the exact item information. For a dichotomous item this would be:
Item information = exp(B-D) / (1+exp(B-D))^2
Test this with B = D, information = 0.25
B = D + 1.1, information = 0.19
But reading the value off a graph would be precise enough for practical purposes. This is because the person ability and the item difficulty are both estimates, so we never know the "true" information. The best we can do is an approximate estimate of the information.

Ivy: thanks, Mike.
Two follow-up questions: 1) Is the formula for polytomous items the same? 2) For the purpose of equating test forms, keeping the test information estimates (sum of the item information estimates) at key levels more or less comparable would be an aim. I wonder what the normal practices are for doing that, instead of comparing the test information graphs visually?

MikeLinacre: Following-up, Ivy.

1) The information function for a polytomous item is more complicated to compute. It is the term labeled "variance" in https://www.rasch.org/rmt/rmt34e.htm

2) For Rasch dichotomous items, a uniform item distribution produces approximately equal measurement precision over the range of item difficulties.
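Point 1) can be illustrated with a Python sketch in which the information of a polytomous item is the model variance of its category scores. The parameterization is an assumption made for the example: a partial-credit form in which category x has log-numerator equal to the sum of (B - D - F_k) for k = 1..x, with F_k standing for the threshold values. The function names are hypothetical.

```python
import math


def category_probabilities(ability, difficulty, thresholds):
    """Category probabilities for one polytomous Rasch item.

    Category 0 has log-numerator 0; category x adds (B - D - F_k)
    for each threshold k up to x.
    """
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (ability - difficulty - f))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def polytomous_information(ability, difficulty, thresholds):
    """Item information = variance of the category scores 0..m."""
    probs = category_probabilities(ability, difficulty, thresholds)
    expected = sum(x * p for x, p in enumerate(probs))
    return sum((x - expected) ** 2 * p for x, p in enumerate(probs))


# With a single threshold at 0 this reduces to the dichotomous case.
print(polytomous_information(0.0, 0.0, [0.0]))
# Hypothetical 3-category item with thresholds at -1 and +1.
print(polytomous_information(0.0, 0.0, [-1.0, 1.0]))
```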

663. Common Item Equating

Raschmad March 2nd, 2010, 9:08am: Dear Mike,
I don’t understand the logic of “common item equating”.
Suppose we have two test forms: Form H(ard) and Form E(asy). Some items are common between the two forms. We compute the mean difficulty of the common items in the two forms, compute their difference and add this difference to the ability measures of all the persons who have taken Form H or subtract it from the ability measures of those who have taken Form E. Am I right?
But we expect the common items to have the same difficulty level in both forms, i.e., their mean difficulty difference is negligible. I don’t understand how this procedure compensates for difficulties in forms. Can you please explain this?

MikeLinacre: Anthony, yes, common-item equating depends on items having the same difficulty on both test forms, but "the same difficulty" does not mean "the same logit difficulty estimate". The value of the difficulty estimate depends on the choice of local origin (zero) of the logit measurement scale for each form. Usually, the zero point is the average difficulty of the items on the test form. The result will be that zero for the Easy form will be lower down the latent variable than zero for the Hard form. Common-item equating tells us the distance between the two zero points.

OK, Anthony?

Raschmad: Hi Mike,
Thanks. Does the average mean difficulty of a few common items represent the real difference in the origins of the two scales?

MikeLinacre: Yes, Raschmad, the difference between the average mean difficulty of the common items is an estimate of the real difference between the scale origins. This difference is the "equating constant".
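The arithmetic of the equating constant can be sketched in Python. All numbers are made up for illustration, and the function names are hypothetical. The sketch assumes each form was analyzed separately, so the common items have one set of difficulty estimates in Form E's frame of reference and another in Form H's.

```python
from statistics import mean


def equating_constant(common_in_target, common_in_source):
    """Estimated distance between the two scale origins.

    Computed as mean common-item difficulty in the target frame minus
    mean common-item difficulty in the source frame.
    """
    return mean(common_in_target) - mean(common_in_source)


def to_target_frame(measures, constant):
    """Shift source-frame measures into the target frame."""
    return [m + constant for m in measures]


# Hypothetical common-item estimates. The same items look harder
# (higher logits) relative to the Easy form's lower zero point.
common_on_form_e = [0.5, 1.0, 1.5]
common_on_form_h = [-1.0, -0.5, 0.0]

constant = equating_constant(common_on_form_e, common_on_form_h)
print(constant)
print(to_target_frame([0.2, -0.7], constant))  # Form H persons in Form E's frame
```

Here the common items are equally difficult on both forms; only the zero points differ, and the constant estimates that difference.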

664. Help with results

PsychRasch February 28th, 2010, 7:12pm: Good day respectable community

I have these results. My interpretation is that there is no multidimensionality or off-dimensional data. Nevertheless, the unexplained variance in the first contrast is 2.3, bigger than the 2 expected by chance... is that 0.3 something to worry about? Is there anything in the percentages that indicates some kind of linear dependency? There is no outfit problem in the data, but I'd like to know if the percentages explained are good or bad. To me, they're too small. Please help!

                                         -- Empirical --    Modeled
Total raw variance in observations     =  13.9  100.0%         100.0%
  Raw variance explained by measures   =   7.9   56.9%          53.1%
    Raw variance explained by persons  =   3.1   22.5%          21.0%
    Raw Variance explained by items    =   4.8   34.4%          32.1%
  Raw unexplained variance (total)     =   6.0   43.1% 100.0%   46.9%
    Unexplned variance in 1st contrast =   2.3   16.6%  38.6%
    Unexplned variance in 2nd contrast =   1.3    9.3%  21.7%
    Unexplned variance in 3rd contrast =   1.0    7.3%  17.0%

MikeLinacre: Thank you for your questions, PsychRasch.

Your "explained variance" is reasonable (>50%). https://www.rasch.org/rmt/rmt201a.htm Figure 3. The variance explained would be higher if you could obtain a wider-performing sample.

The empirical "variance explained" (56.9%) is higher than the model "variance explained" (53.1%), which indicates that there is some unmodeled dependency in your data.

Your test has only 6 items. So an eigenvalue of 2.3 indicates that 2 or 3 of the items are opposing the other 4 or 3 items in some aspect of the data. Please look at Table 23.2 to see which items they are, and also at Table 23.4 to see the impact of this contrast on the persons. With only 6 items, we would need very strong misfit or substantive dimensionality to omit an item or split the test in two. In fact, our motivation would be in the other direction. PsychRasch, can we think of more items (on the intended latent variable) to add to the instrument?

PsychRasch: Dr. Linacre, thanks for your kindness and your knowledge. What happens is that I have a test, but it is divided into two parts that are not related. If I analyze all the parts together, there's multidimensionality, but of course that is because they're not evaluating the same thing. The internal validity of the instrument is not homogeneous. These are the results for the second part.

Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                         -- Empirical --    Modeled
Total raw variance in observations     =  12.8  100.0%         100.0%
  Raw variance explained by measures   =   2.8   21.7%          21.9%
    Raw variance explained by persons  =   1.2    9.3%           9.3%
    Raw Variance explained by items    =   1.6   12.4%          12.5%
  Raw unexplained variance (total)     =  10.0   78.3% 100.0%   78.1%
    Unexplned variance in 1st contrast =   1.9   14.6%  18.6%
    Unexplned variance in 2nd contrast =   1.4   11.0%  14.1%
    Unexplned variance in 3rd contrast =   1.4   10.6%  13.5%
    Unexplned variance in 4th contrast =   1.2    9.1%  11.6%
    Unexplned variance in 5th contrast =   1.1    8.4%  10.8%

I'd like to know more about unmodelled dimensionality.
thank you.

PsychRasch: This is Table 23.4 for the second part

TABLE 23.4 C:\Users\ZOU456WS.TXT Feb 28 17:26 2010


| 0 3 0 | 0 1 2 | 8 00008P
| 0 3 0 | 0 1 2 | 29 00029P
| 0 3 0 | 0 1 2 | 42 00042P
| 0 3 0 | 0 1 2 | 66 00066P
| 0 3 0 | 0 2 1 | 9 00009P
| 0 3 0 | 0 2 1 | 30 00030P
| 0 3 0 | 0 2 1 | 35 00035P
| 0 3 0 | 0 2 1 | 38 00038P
| 0 3 0 | 0 2 1 | 51 00051P
| 0 3 0 | 0 2 1 | 54 00054P
| 0 3 0 | 0 2 1 | 63 00063P
| 0 3 0 | 0 2 1 | 67 00067P
| 0 3 0 | 0 2 1 | 70 00070P
| 0 3 0 | 0 2 1 | 71 00071P
| 0 2 1 | 2 1 0 | 36 00036P
| 0 2 1 | 2 1 0 | 52 00052P
| 0 2 1 | 1 2 0 | 46 00046P
| 0 1 2 | 0 3 0 | 68 00068P
| 0 1 2 | 0 3 0 | 74 00074P
| 0 2 1 | 0 3 0 | 12 00012P
| 0 2 1 | 0 3 0 | 33 00033P
| 0 2 1 | 0 3 0 | 39 00039P
| 0 2 1 | 0 3 0 | 40 00040P
| 0 2 1 | 0 3 0 | 43 00043P
| 0 2 1 | 0 3 0 | 49 00049P
| 0 2 1 | 0 3 0 | 75 00075P

PsychRasch: And this is Table 23.4 for the first part of the test, the one with an unexplained variance of 2.3 in the first contrast. Honestly, I understand nothing about the meaning of this table. I'd appreciate your help, Dr. Linacre.

TABLE 23.4 C:\Users\ZOU864WS.TXT Feb 28 17:41 2010


| 1 2 0 | 0 1 2 | 20 00020P
| 1 2 0 | 0 2 1 | 4 00004P
| 1 2 0 | 0 2 1 | 8 00008P
| 1 2 0 | 0 2 1 | 25 00025P
| 1 2 0 | 0 2 1 | 29 00029P
| 1 2 0 | 0 2 1 | 34 00034P
| 1 2 0 | 0 2 1 | 37 00037P
| 1 2 0 | 0 2 1 | 50 00050P
| 1 2 0 | 0 2 1 | 53 00053P
| 1 2 0 | 0 2 1 | 62 00062P
| 1 2 0 | 0 2 1 | 68 00068P
| 0 2 1 | 1 2 0 | 39 00039P
| 0 2 1 | 1 2 0 | 74 00074P

MikeLinacre: Thank you for sharing your output, PsychRasch.

Table 23.4 is meaningless unless you know something about your people. In your output there are only sequence numbers. Do you have any demographics or other information you can place in the person labels? For instance, if you placed the persons' ages and genders in the person labels, we might see that older men contrast with younger women.

Raw variance explained by persons = 1.2 9.3%

This is a very low value. It suggests that the 75 persons are clumped in a narrow range of raw scores on the test. The person reliability is probably low. Please inspect the item difficulty hierarchy. Does it match what you expected to see before the data were collected? Does the hierarchy represent a meaningful latent trait (construct validity), PsychRasch?

PsychRasch: Thanks for your help, Doctor Linacre. Every time I ask you something I learn a lot. Well... construct validity in this case is a recurrent topic. What I'm doing is applying a non-standardized test (not standardized in my geographical area, though it is standardized in other countries), with a level of difficulty that has not been tested in my region (and apparently doesn't discriminate between ages). So the answer to your question about the item difficulty is no, it doesn't match what I and other people expected before the data were collected. This is a pilot procedure for a manual foreign test which I converted to digital, so I'm now asking whether these analyses can show me something about the efficiency and pertinence of the algorithms. I added personal variables to the data. At the right of the table there are three columns. The second one (middle) is sex: 0's are females and 1's are males. The rightmost column is age group: 1, 2 and 3 are what you could call "young" people (below 38), 4 and 5 are middle-aged, and 6 and 7 are "old" people (over 60). I don't like to call them "old" but I don't know any other word in English for that purpose. I hope this information is useful in drawing some conclusions. Thanks, Dr. Linacre.

| 0 3 0 | 0 1 2 | 8 1 1
| 0 3 0 | 0 1 2 | 29 1 3
| 0 3 0 | 0 1 2 | 42 0 4
| 0 3 0 | 0 1 2 | 70 0 6
| 0 3 0 | 0 2 1 | 9 0 1
| 0 3 0 | 0 2 1 | 10 1 1
| 0 3 0 | 0 2 1 | 30 0 3
| 0 3 0 | 0 2 1 | 35 0 3
| 0 3 0 | 0 2 1 | 38 1 4
| 0 3 0 | 0 2 1 | 39 0 4
| 0 3 0 | 0 2 1 | 41 1 4
| 0 3 0 | 0 2 1 | 51 0 5
| 0 3 0 | 0 2 1 | 54 1 5
| 0 3 0 | 0 2 1 | 58 0 5
| 0 3 0 | 0 2 1 | 63 1 6
| 0 3 0 | 0 2 1 | 66 0 6
| 0 3 0 | 0 2 1 | 71 1 6
| 0 2 1 | 2 1 0 | 45 0 4
| 0 0 3 | 0 3 0 | 74 0 7
| 0 1 2 | 0 3 0 | 43 0 4
| 0 2 1 | 1 1 1 | 36 1 4
| 0 2 1 | 1 1 1 | 52 0 5
| 0 2 1 | 0 3 0 | 68 0 6

PsychRasch: Another thing... the test was devised for aphasic people, but we ran the pilot with normal people.

MikeLinacre: Thank you, PsychRasch.

My choice of demographic variables was only an example. Please choose demographic indicators relevant to your study. For instance, mental state (depressed, etc.), alertness, intelligence, ....

"we made the pilot with normal people." - then your results are what we expect. Normal people would all be squashed together at one location on the latent variable. We see this often with Rasch analysis of clinical data. Unfortunately this makes it almost impossible to make inferences about the functioning of this instrument (reliability, dimensionality, etc.) for an aphasic sample.

You have made a good start, but you have some way to go before you can establish the validity of your instrument. You need a sample in which the subjects exhibit stronger and weaker symptoms of aphasia. You can then code these into your person labels, and verify the concurrent (predictive) validity of the instrument. You can also verify that the items correctly signal stronger and weaker symptoms through their difficulty hierarchy (construct validity).

PsychRasch: Many thanks, Dr. Linacre. You've helped me a lot. I'll send the link to this forum to my teachers so they can see what the maker of Winsteps has said. You do a great job with this forum, enlightening the darkened. Hehehe. See you.

PsychRasch: One last question... is it possible that evaluating in different environments (different places, not necessarily with proper conditions for evaluation) tends to generate off-dimensional data?

MikeLinacre: PsychRasch, different environments definitely change Rasch measures. A common situation is the same test given in a low-stakes situation (doesn't alter decisions relating to the test-taker) and a high-stakes situation (does alter decisions relating to the test-taker). Low-stakes data are more noisy with a less precise item-difficulty hierarchy and person-ability hierarchy. Paradoxically, low-stakes data are often used for making the "big" policy and budgetary decisions ... :-(

665. Are residual values ordinal?

bluesy February 22nd, 2010, 6:41am: I'm wondering about the nature of the residual matrix values. I conceptualize the observed matrix as containing ordinal values and the modeled matrix as containing interval values. When the modeled (interval) values are subtracted from the observed (ordinal) values, are we left with residual (ordinal) values?

MikeLinacre: Thank you for your question, Bluesy.

The modeled values are real numbers, but they are not interval numbers, because they have a restricted range. This type of number (and also many other types of number) is not defined in S.S. Stevens' classification.

For a critique of S.S. Stevens' classification, see http://www.cs.uic.edu/~wilkinson/Publications/stevens.pdf

bluesy: Dr Linacre, thank you for your prompt and clear reply. I will start reading this article immediately!

bluesy: Dear Dr Linacre,

A further question related to this matter if you don't mind...

How about the two estimated parameters (item difficulty calibrations and person ability measures)? When the observed ordinal-level values are transformed by the natural logarithm these values become interval-level? Right?

Is it only after the two estimated (interval-level) parameters are entered into the Rasch model equation to form the modeled matrix that the values are not classified as interval (i.e., they have a restricted range)?

MikeLinacre: Bluesy,

The additive (interval) Rasch measures have a conceptually infinite range.

The Rasch measures are used to compute the expected value of each observation. The expected value has a limited range.

Residual = observed value - expected value.

The residual also has a limited range.
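
To make the limited range concrete, here is a minimal Python sketch. It assumes the dichotomous Rasch model (the thread does not say which model is in use), and the ability and difficulty values are invented for illustration. The measures can be any real number, but the expected value stays between 0 and 1, so the residual stays between -1 and +1.

```python
import math

def expected_score(ability, difficulty):
    """Dichotomous Rasch model: probability of success, which is also the expected score."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical abilities in logits; the item difficulty is fixed at 0 logits.
for ability in (-6.0, -1.0, 0.0, 1.0, 6.0):
    expected = expected_score(ability, difficulty=0.0)
    for observed in (0, 1):
        # Residual = observed value - expected value
        residual = observed - expected
        print(f"ability={ability:+.1f} observed={observed} "
              f"expected={expected:.3f} residual={residual:+.3f}")
```

For a polytomous rating scale the same argument applies, with the expected value bounded by the lowest and highest categories.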

bluesy: Dear Dr Linacre,

Thank you for your explanation - very clear and concise. So, let me recap what has gone before. The observed values are ordinal, while both the expected values and the residual values are non-interval. However, as always, I have another question!

For argument's sake, what if I wanted to conduct a parametric statistical procedure (for example, an ANOVA or a Pearson's correlation) after running an initial Rasch model analysis? One of the assumptions of parametric statistical procedures is to use interval-level data.

So, which Rasch model values could I use? Even though the expected values and residual values are non-interval, they are the closest to interval that I can get through the Rasch model. Therefore, can I use either the expected values or the residual values as the data for a parametric statistical procedure?

Many thanks in advance.

MikeLinacre: Bluesy, yes, "One of the assumptions of parametric statistical procedures is to use interval-level data."

But statistical assumptions are ideals. Real data never exactly meet those ideal assumptions. For instance, many statistical procedures assume normal distributions, but real data are never exactly normally distributed.

So the question becomes "Are the data close enough to meeting the assumptions for practical purposes?" - and we find out by doing an analysis and seeing if the results make sense.

We see a parallel in building a house. The architect assumes that all the wood, brick, steel components exactly accord with architectural specifications. The builder knows they do not. But the builder builds the house anyway!

666. Setting judge calibrations in rating scale?

ImogeneR February 19th, 2010, 12:18am: Hi
I was just reading in the MFRM chapter in Smith & Smith (eds) "Introduction to Rasch Measurement" that “judge severity was calibrated at the logit value where the probability of awarding category 2 equals prob of awarding category 3..(rather than 0 or 3) to prevent perturbations in the infrequent awarding of 0 ratings from disturbing judge estimations of severity...”
I'm just wondering how this can be set up in Facets? Are there things to look at in the data first ? (Ie look at the 2 most frequently used rating scale levels?)

MikeLinacre: Yes, Imogene, we definitely want to look at the data and also do our first Facets analysis to verify that everything is correct.

Then the Rasch-Andrich threshold between the two most frequent categories will be the most stable point in the rating-scale structure. So it can be convenient to anchor (fix) this point at 0 logits relative to the item difficulty.

Here is how to do this in Facets:
Models = ?,?,...,MyRatingScale

Rating Scale = MyRatingScale, R9
1 = bottom category
2 = second category
3 = third category, 0, A ; 0,A anchors the threshold between categories 2 and 3 at 0

ImogeneR: Thanks very much Mike, i will give it a go.

667. Ability estimates and standard error

lovepenn February 14th, 2010, 3:18am: Hi Mike,
I see that one of the advantages of Rasch analysis is that we can obtain the precision of person and item measures, that is, standard errors for every person and item.
In my further analysis using each person's ability measure, I wonder how I could take advantage of S.E.s. I found that you said, "Most SPSS routines are not configured to take advantage of SEs." Do I need special software or statistical methods to take account of standard errors of measurement?

Thank you always for your helpful comments and advice.

MikeLinacre: Lovepenn, most statistical software packages process their original data points as though they are exactly precise points, even though they are often estimates, "observed with error". For instance, how many turtles are there on each of the Galapagos islands? How many widgets does a machine produce each day? The errors in the estimates don't matter when they are much smaller than the differences between the numbers, but, as statistical modeling has become more elaborate (e.g., data mining) then we are making estimates from estimates from estimates and the error terms compound. So even small errors can have an influence on our ultimate findings. We see this in the climate-change computations, and also the financial-crisis computations. The models they are using do not seem to produce robust error terms, but are reported as point estimates: "1 million new jobs". "Glaciers melt by 2050."

At present, we have to hunt for error-aware statistical software routines, and often don't find any that is suitable for our purposes :-(

lovepenn: Thank you, Mike, for your response.
Although we can't find any error-aware statistical software, I was thinking that there might be some ways to take advantage of standard errors in modeling procedure (like measurement modeling) or by using the precision measures as a sort of weight in analysis procedure. I crossplotted person measures against standard errors, and found it to take a U-shape, like smaller SE for persons with middle-range ability measures and larger SE for persons in extreme range. Then we may want to give more weights to persons with higher precision and less weights to persons with lower precision. Well, I don't have enough knowledge about measurement, so don't know how to proceed. I know this is not exactly Rasch-related question, but if you have any advice or recommendation for me, it will be most appreciated. Thanks,

MikeLinacre: Lovepenn, here are two approaches which utilize standard errors:

1. standardized size (effect size) = measure / S.E.
Example: https://www.rasch.org/rmt/rmt131f.htm

2. information weighting = measure / S.E.^2
Example: https://www.rasch.org/rmt/rmt62b.htm
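
As a rough Python sketch of these two approaches: this assumes one common reading, in which the standardized size is measure / S.E. and each measure is weighted by its information, 1 / S.E.^2, so that each term in the weighted sum is measure / S.E.^2. The measures and standard errors are invented for illustration.

```python
import math

def standardized_size(measure, se):
    """Measure expressed in units of its own standard error."""
    return measure / se

def information_weighted_mean(measures, standard_errors):
    """Mean in which each measure is weighted by its information, 1 / S.E.^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_information = sum(weights)
    mean = sum(w * m for w, m in zip(weights, measures)) / total_information
    # Standard error of the weighted mean, assuming independent errors.
    se_mean = math.sqrt(1.0 / total_information)
    return mean, se_mean

# Hypothetical person measures (logits): extreme measures have larger S.E.s.
measures = [-2.1, 0.3, 0.5, 3.0]
standard_errors = [0.9, 0.4, 0.4, 1.1]

for m, se in zip(measures, standard_errors):
    print(f"measure={m:+.2f} S.E.={se:.2f} standardized={standardized_size(m, se):+.2f}")

mean, se_mean = information_weighted_mean(measures, standard_errors)
print(f"information-weighted mean={mean:+.3f} S.E.={se_mean:.3f}")
```

Precisely measured persons (small S.E.) dominate the weighted mean, which is the effect lovepenn describes wanting.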

668. Unidimensionality (PCA or fit statistics)

haugen_ida@hotmail.com February 3rd, 2010, 12:45pm: Hey!
I am hoping someone can help with my results (am a little confused). I have analysed a questionnaire with 5 pain items and 9 physical function items.
PCAR indicated multidimensionality with loading of the pain items in contrast to the physical function items. However, when plotting the person measures from the two subsets against each other, > 40% of the persons had significantly different estimates, but the person measures were indeed highly significantly correlated, with a Pearson correlation coefficient of 0,81. The fit statistics were low for all physical function items.
I am wondering: is this multidimensionality or overlapping questions?
Kind regards, Ida

MikeLinacre: Folks, Ida has already asked me this directly, and I have responded.

But she must need more advice. Please help her!

connert: Ida,

Did you do a separate analysis of the pain items and the physical function items? Or did you use the person measures from the combined analysis? In my experience it would be expected that the combined items would not scale but the pain items and physical function items would scale separately and the person measures on each separate scale would correlate.


669. Tables 6.0.1,6.0.2 and 6.0.3

bahrouni January 31st, 2010, 9:55am: Hi Mike,
Thank you very much for replying so soon.
Do I understand that the data from my 3 groups were incomparable, and that this incomparability was expressed in those tables that show how each group behaved? How did FACETS place them on the logit scale then?
I'm not sure I have understood you well regarding rectifying that? I consulted FACETS help menu, unfortunately it confused me even further.
Could you please repeat that in a simpler way? I apologize for adding to your plate.
I attach the file to have a look at.
Thank you
Farah Bahrouni

MikeLinacre: Farah, when there are disconnected subsets of the data, Facets estimates them independently.
My apologies, but I do not understand what you want to report. Farah, please email me directly: mike ~at~ winsteps.com

670. Tables 6.0.1, 6.0.2, and 6.0.3

bahrouni January 30th, 2010, 6:45am: Hi Mike,
You might recall from earlier correspondence that I’m conducting research on the effects of raters’ L1 on their rating behavior when scoring EFL essays. My participants consist of 60 raters divided into 3 groups according to their L1: 20 native speakers of English (NSE), 20 native speakers of Arabic (NSA), and 20 from a third linguistic background (NNSEA) (18 Indians, 1 Polish, and 1 Spanish). I want to look at the bias analysis at 2 levels: (a) group level, i.e. how raters interact as groups with the other facets; (b) individual level, i.e. how individual raters within each of the groups behave and interact with the other facets. I would like to deal with all the facets (e.g.: Experience, Education, Age, Gender) in this way. For example, if we take the rater experience variable as a facet, it will also have 3 groups: high experienced, low experienced, and novices. I would like to see how each of these groups (as a whole) interacts with the other facets, and then how each individual rater within each of the experience groups interacts with the other facets.
In an earlier piloting last year, I succeeded to have table 6.0 (the vertical ruler) followed by three subset tables, 1 for each of the groups: 6.0.1, 6.0.2, and 6.0.3. How I did that, I have no clue. I retrieved the old file and compared it to the present one, they appear identical, and I could not see any difference in the control lines. Yet, I still get those 3 tables for the old file but not for the new one.
Could you please tell me how to get those subset vertical rulers, I mean tables 6.0.1, 6.0.2, and 6.0.3; or if there is another way of getting where I want to.
Thank you
Farah Bahrouni

MikeLinacre: Farah Bahrouni, thank you for your questions.

In Tables 6.0.1, etc., Facets was reporting disconnected subsets in your data. These indicate that the measures are not comparable. 6.0.1 cannot be compared with 6.0.2, etc. If you have disconnected subsets in your data (shown by subset numbers in Table 7) then your analysis needs additional constraints (see Facets Help).

One way of producing Facets Tables showing only the elements that you want to see:
1. Perform your complete analysis with Facets
2. Write out an Anchorfile=
3. Comment out with ";" all the elements you do not want in Table 6.
4. Analyze the Anchorfile with Facets.
5. Produce Table 6. It will show only the elements you do want.
6. Repeat steps 3, 4 and 5 for each Table 6 that you want.

But perhaps what you want are separate analyses for each "experience" or interaction analyses for "experience" and some other facet.

671. Different number of items across time points

lovepenn January 28th, 2010, 10:17pm: Hi Mike,

I'm trying to construct a parent-child interaction variable from a number of items. My data is longitudinal and parents have been surveyed at Time 1, Time 2, and Time 3.

What I was planning to do was to obtain the item difficulties from the first time point, and then to anchor the item difficulties for subsequent time-points at those values. But my problem is that several items were asked only in one or two of three time-points. So, I got different number of items across multiple time points. In this case, should I include only items that were surveyed at all three time points? Or is there any way to deal with this problem??

Your help would be greatly appreciated... Thanks much,


MikeLinacre: Lovepenn, use the Time 1 anchors for all items that you can. Leave the other items unanchored. The anchored items will force the other items, and the people, to be measured in the Time 1 frame-of-reference.

lovepenn: Thank you, Mike, for your answer.

672. Pattern of probability curve

wlsherica January 28th, 2010, 7:23am: Dear Mike:

I have a question about probability curve in Winsteps. Now I have a questionnaire with 5-point Likert scale, and I tried to run its probability curve. The curve is as JPG file attached. In this figure, I found category 3 and 4 are very close. Therefore I think I could combine these two category into one category. Here are the problems:

Which control variable in control file I should use, IVALUE? If I have to combine them(3 and 4) , which number (3 or 4) I should use.

Thank you very much. I really appreciate with your help.

MikeLinacre: In your picture, categories 3 and 4 are relatively low-frequency curves, Wlsherica, but please look at the definitions of the categories before you decide to combine them. Are their meanings almost the same or different?
If you decide to combine the categories, you will only have 4 categories, not 5.
CODES = 12345
NEWSCORE = 12334
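
The effect of this recoding can be pictured with a small Python sketch. It only illustrates the character-by-character mapping that the two lines above describe (each code in CODES is replaced by the score in the same position in NEWSCORE); it is not how Winsteps itself is implemented, and the response string is invented.

```python
CODES = "12345"
NEWSCORE = "12334"

# Map each original code to the score in the same position: 3 and 4 both become 3, 5 becomes 4.
recode = str.maketrans(CODES, NEWSCORE)

# Hypothetical response string for one person on ten items.
responses = "1234554321"
print(responses.translate(recode))  # prints 1233443321
```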

wlsherica: Oh!!! Eureka!! Thank you for your suggestion.

Category 3 means "sometimes impact my life", and category 4 means "Impact my life very seldom". I guess it's ok to combine these categories then check the curve again.


MikeLinacre: Yes, Wlsherica, and how about keeping the "sometimes" definition for the combined category? Then your rating scale is something like: Always, often, sometimes, never - which will make sense to your audience.
Especially because the "never" of conversation can include "very seldom" - https://www.rasch.org/rmt/rmt81b.htm

673. Linear logistic test model

ong January 23rd, 2010, 2:48pm: Dear Mike,
I am trying to figure out how to model linear logistic test model i.e. to explain item difficulty by item property.

Can Winsteps model linear logistic test model? If yes, how?

If No, why?

Many thanks


MikeLinacre: Ong, the Linear-logistic-test-model is a Rasch model in which each item is hypothesized to contain various components:
1. Item difficulty = a*component A difficulty + b*component B difficulty + c*....
where a, b, c are indicator variables of the degree to which the component is hypothesized to influence the item difficulty.
2. Person ability - Item difficulty = {data}
A 2-step approach is to compute the item difficulty for equation 2. using the original data and Winsteps, and then export those difficulties to Excel or a statistics package. In Excel or the statistics package, use the indicator variables in a multiple regression with the item difficulties as the dependent variable in order to estimate the component difficulties in 1.
You can then use the component regression coefficients to estimate the predicted item difficulties. Anchor these values in Winsteps. This analysis will tell you how well the components fit the original data. Use UASCALE= to adjust the item anchor values so that the average mean-square of the data is 1.0.
If you can do this successfully, please publish your research.
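
The regression step of this two-step approach can be sketched in Python. Everything numeric here is invented for illustration: in practice the item difficulties would be the ones exported from Winsteps, and the indicator matrix would come from the researcher's hypotheses. The sketch adds an intercept term, which is an assumption not stated in equation 1, and it uses a small hand-written least-squares routine in place of Excel or a statistics package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def least_squares(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical item difficulties (logits) from the first Rasch analysis.
item_difficulty = [-1.2, -0.4, 0.1, 0.6, 1.3, 1.9]

# Hypothetical indicators: does each item involve component A, B, C? (1 = yes)
indicators = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]

# Assumption: an intercept column is included in the regression.
design = [[1.0] + [float(v) for v in row] for row in indicators]

coefficients = least_squares(design, item_difficulty)
predicted = [sum(c * x for c, x in zip(coefficients, row)) for row in design]

print("intercept and component difficulties:", [round(c, 3) for c in coefficients])
for observed, fitted in zip(item_difficulty, predicted):
    print(f"observed={observed:+.2f}  predicted={fitted:+.2f}")
```

The `predicted` values are what would then be anchored in Winsteps for the second analysis.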

674. MC distractor characteristic curves

miet1602 January 19th, 2010, 2:35pm: Hi,
It's been a while since I used Winsteps so cannot remember exactly which of the options under graphs displays multiple-choice distractor characteristics curves. Is it the "option-curves" buttons?

If it is, I found the graphs somewhat confusing. For instance, only two options out of 4 (C and D) were chosen for my Question 1, yet, the "option-curves" graph displays curves for options A and B, while C/D don't appear on the graph...

I have tried comparing this with the distractor curves that I get when I analyse the same data in RUMM2020. The results turn out to be different for some questions, whereas for some they are similar to those produced by Winsteps. Also, RUMM displays the correct curves for my Q1...

Any ideas what I might be doing wrong? Or am I looking at the wrong graphs in Winsteps?


MikeLinacre: Yes, "Option curves" is the correct button in the Winsteps graphs window, Miet1602

But it sounds like Winsteps is malfunctioning when all the curves do not appear. I had not tested this, so I am investigating ...

You could also try adjusting the "empirical interval" slider beneath the graph. This might explain some of the Winsteps-RUMM2020 difference.

I will analyze a test dataset in both RUMM2020 and Winsteps to compare the results.

Thank you, Miet1602, for reporting this bug in Winsteps.

miet1602: Thanks, Mike.
I should perhaps mention that I was analysing my data and looking at option curves using Ministep rather than the full Winsteps version.

MikeLinacre: Milja, thank you for reporting this Winsteps deficiency. There are two bugs in Winsteps and the same in Ministep:
1. The distractors may be incorrectly identified when there are unobserved distractors.
2. Bogus distractor curves may be displayed for unobserved distractors.

Also, after correcting the distractor plots in Winsteps, I compared them with RUMM2020. When the intervals on the x-axis are approximately the same (which can be difficult to achieve), then the distractor curves appear to be approximately the same.

miet1602: Thanks for your response, Mike. So, can this bug be fixed permanently, or is there a way in which I can fix these graphs myself? I suppose it happens relatively often that there are some unobserved distractors in tests, so it would be good to be able to see reliable distractor curves in those cases too.

MikeLinacre: Milja, please download Winsteps using the same instructions as Winsteps 3.69.1

675. Dealing with Missing Data

lovepenn January 21st, 2010, 5:57am: I understand that WINSTEPS can handle missing data very well and no deletion or imputation for the treatment of missing data is needed since it does not bias the measure estimates.

But I could find some previous studies using multiple imputation or EM algorithm to impute missing data before implementing Rasch analysis in WINSTEPS.

I'd like to seek your advice regarding how to handle missing data in my study.

I'm conducting Rasch analysis to construct a latent variable using a set of items in my data, which will be included as an independent variable in further analyses. So, I have (1) a set of items for Rasch analysis and (2) a bunch of other variables for further analyses in my data, and they contain non-small amount of missing data.

I was planning to use multiple imputation for my study and it will produce multiply imputed datasets (let's say 5). Since the WINSTEPS does not support pooling of results from analysis of multiply imputed datasets, it looks like I have to conduct separate analyses for each of 5 datasets and then export the results to other programs and then combine the results manually to display how the Rasch model turns out.

So, I am considering an easier way: first conduct Rasch analysis without any data imputation or deletion and obtain the Rasch measures file and merge it with my original data; and then perform multiple imputation on this merged data.

But I am not sure what kind of difference it will make. Is it better to do multiple imputation before Rasch analysis? Or does it make little difference?

Your advice would be very much appreciated. Thanks,

MikeLinacre: Thank you for your questions, lovepenn.

There are usually no advantages (and considerable disadvantages) to imputing missing response-level data in a Winsteps analysis.

But missing data in the "other variables" (for instance, age or gender) are a problem. They may lead to bias in your findings if the missing data are related to the latent variable being investigated.

This looks correct: "WINSTEPS does not support pooling of results from analysis of multiply imputed datasets,". Pooling data often requires meta-analysis, which has its own specialist software.

676. help newbie

hragaey January 16th, 2010, 10:27pm: I'm doing a research on job satisfaction and I want to use Rasch model
help me to build a questionnaire and need help to explain Rasch model I read a lot but i don't understand
I'm a beginner in statistical analysis

MikeLinacre: Hragaey, the best place to start is the book "Applying the Rasch Model" by Bond & Fox.

Omara: Hragaey, if you are a beginner you can also read the book "The Basics of Item Response Theory" by Frank Baker.
This book is useful and its exercise will be good to understand the basics of the theory. This, of course, after reading the book recommended by Dr. Mike, who always refers to the best ways.
Ehab Omara

hragaey: thanks a lot, I'll read and tell u

hragaey: will u please send me a pdf copy ... [of a copyrighted book]

admin: Please do not ask us to contravene the Copyright Laws. Please contact the authors or publishers directly with requests regarding copyrighted material.

hragaey: thanks and sorry

hragaey: I'm just asking about free books
sorry for misunderstanding

677. What Is the best sample size?

Omara January 17th, 2010, 1:58am: I want to know the best size of the sample, which allows the use of dichotomous item response models in general and Rasch model in particular.

MikeLinacre: Omara, this is the type of information you can find with a Google search. For instance, https://www.rasch.org/rmt/rmt74m.htm

678. Help me in data generation

Omara January 13th, 2010, 11:23pm: Plz I need to understanding all about data generation which used frequently in researches in item response theory. Is there any book or guide to understand this point

MikeLinacre: Omara, thank you for asking for help.
www.rasch.org/rmt/rmt213a.htm explains how to simulate data for Rasch models.
www.rasch.org/software.htm lists some free software for simulating data. Their documentation may answer some of your questions.
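
As a minimal illustration of what such simulation involves, here is a Python sketch for the dichotomous Rasch model only. The function name, the seed argument and the ability and difficulty values are all invented for this sketch; it is not taken from the linked article or from any of the listed software.

```python
import math
import random

def simulate_rasch(abilities, difficulties, seed=0):
    """Simulate 0/1 responses: one row per person, one column per item."""
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    data = []
    for ability in abilities:
        row = []
        for difficulty in difficulties:
            # Rasch probability of success for this person on this item
            p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
            # Score 1 with probability p, otherwise 0
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

# Hypothetical generating values in logits.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
difficulties = [-1.5, -0.5, 0.5, 1.5]

for person_row in simulate_rasch(abilities, difficulties, seed=42):
    print("".join(str(x) for x in person_row))
```

Other models are simulated the same way, by replacing the probability formula with the one that defines the model of interest.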

Omara: Thank u Dr Mike for ur mail and for ur comment here, and I will go to these links to see their content. And hope that you have time if I go back to you again and asked for assistance in some problems. Sorry for the inconvenience

Omara: Dr Mike
Is it possible to generate data for items that differ in difficulty, discrimination and guessing parameters?


Is it possible to do it for items functioning differentially for two groups, with varying magnitudes of DIF?

MikeLinacre: Certainly it is possible to simulate data for any condition that can be defined mathematically, Omara. If you don't want to write the software yourself, then Google to find software.

Omara: Thank you very much Dr Mike for help. I am very happy to talk with an expert like you
Ehab Omara

679. WINSTEPS Problem with Scatterplot

lovepenn January 15th, 2010, 3:01am: Hi Mike,

I was trying to produce the scatterplot to compare item measures from two separate Rasch analyses.

To do this, I selected Compare Statistics on the Plots pull-down menu and clicked "Browse" button to find out the IFILE that contains item measures statistics.

As soon as I clicked Browse, WINSTEPS stopped and just closed the program. Whenever I tried this, this problem happened. Is this some kind of program bug or did I do something wrong?

MikeLinacre: Yes, Lovepenn, this is a known bug in Winsteps 3.69. Now repaired in Winsteps. See www.winsteps.com/wingood.htm - Please download the current version of Winsteps. If you do not have the link, please email me.

lovepenn: Thanks, Mike.

Problem solved, but I confronted another problem. My data contains about 15000 cases and I wanted to scatterplot person measures at time 1 against person measures at time 2. When I try to do this, Winsteps says, "Excel will display soon. Please continue.." but it never displays the result. Is this because my data is too big?
Thanks for your answer, in advance..

MikeLinacre: Lovepenn, Winsteps has handed the data over to Excel. Excel 2007 should be able to plot 15000 points, but earlier versions of Excel will probably fail. Please try a smaller subset and see if that works.
For the first 100 cases, in the Winsteps Specification Menu dialog box:
then request the scatterplot.
BTW, plotting 15000 points is likely to be very slow in Excel. You could look at the Task Manager (Ctrl+Alt+Delete) to see if Excel is running ....

680. Correlation of Person measures

lovepenn January 13th, 2010, 4:57am: Hi Mike,

I have six items measured at four different time points. I stacked the datasets together and gained four sets of person measures for each case.

I would like to know if a person's performance at Time 1 is highly correlated to a person's performance at Time 2.
What would be the best way to calculate the correlation between these four sets of person measures? Just doing correlational analysis in SPSS is okay?

MikeLinacre: Sounds good to me, Lovepenn.
Yes, output the PFILE= to SPSS (or Excel) and use SPSS (or Excel) to correlate the Time 1 measures with the Time 2 measures across persons.
If you are really enthusiastic, you could disattenuate the correlation for measurement error. This would tell you if the correlation is truly low, or only low because the measurement is imprecise.

lovepenn: Thank you, Mike, for your reply.

I find the formula to obtain the correlation coefficient disattenuated of measurement error, Rxy, is as below:
Rxy= observed correlation between x and y / sqrt (reliability(x)*reliability(y))

I wonder if the reliability of each measure here is a REAL PERSON reliability that we can obtain from a Rasch analysis (Table 3.1).

If so, I would obtain only one reliability since I stacked the data together.

If I want to disattenuate the correlation, should I do four separate analyses (one for each of four time points)?

Thanks much for your help.

MikeLinacre: Lovepenn, you have the measures you are correlating and their standard errors, so you can compute the reliabilities without doing further analyses:
Reliability = (Measure S.D.^2 - RMSE^2)/(Measure S.D.^2)
Use the "model" S.E. because it will be more conservative for disattenuation.
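
The two formulas in this thread can be combined in a short Python sketch. The measures and standard errors are invented for illustration, and the sketch uses the population variance of the measures; whether to use the population or the sample S.D. is a choice made here, not something stated in the thread.

```python
import math
import statistics

def rasch_reliability(measures, standard_errors):
    """Reliability = (Measure S.D.^2 - RMSE^2) / (Measure S.D.^2)."""
    observed_variance = statistics.pvariance(measures)  # assumption: population variance
    mean_square_error = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    return (observed_variance - mean_square_error) / observed_variance

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def disattenuated_correlation(r_xy, reliability_x, reliability_y):
    """Rxy = observed correlation / sqrt(reliability(x) * reliability(y))."""
    return r_xy / math.sqrt(reliability_x * reliability_y)

# Hypothetical measures (logits) and model S.E.s for the same five persons.
time1 = [-1.5, -0.5, 0.2, 0.9, 1.8]
time1_se = [0.45, 0.45, 0.45, 0.45, 0.45]
time2 = [0.4, -1.0, 1.2, -0.3, 1.3]
time2_se = [0.50, 0.50, 0.45, 0.45, 0.60]

r_observed = pearson(time1, time2)
rel1 = rasch_reliability(time1, time1_se)
rel2 = rasch_reliability(time2, time2_se)
print(f"observed r={r_observed:.3f}, reliabilities={rel1:.3f}, {rel2:.3f}")
print(f"disattenuated r={disattenuated_correlation(r_observed, rel1, rel2):.3f}")
```

With stacked data, each time point's reliability is computed from that time point's own measures and standard errors, so no separate Rasch analyses are needed.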

681. Construct under-representation

arthur1426 January 12th, 2010, 11:09pm: Dear Mike,

I want to determine whether the distance between item calibration values for adjacent items is statistically significant. Handley et al. (2008) used a z-test to evaluate the distance between the calibration values of successive pairs of items. I should note that I'm analysing a polytomous item scale (functional status). Also, Haley et al. (1994) indicated that adjacent items did not overlap by plus or minus 0.15 logits, thus providing evidence that items are effectively separated. Is this a valid criterion?
Lastly, could you recommend a literature source that might cover this topic?
Thanks for your assistance!
Kind regards,

MikeLinacre: In this situation we need to define exactly what we mean by "distance between", Robert.
1. Each polytomous item usually has an operational range of several logits, often much wider than the sample, so, in the sense of "operational range", there may be only a "distance between" the most extreme items, if any.
2. Another definition is based on the point-estimate of the difficulty of the polytomous item. The item difficulty is the point on the latent variable where the probability of observing the highest and lowest categories is equal. This is reported by Rasch software as an "item difficulty" with a measure estimate and its standard error. Statistically items 1 and 2 have different difficulties if: |(M1 - M2) / sqrt(SE1^2 + SE2^2)| > 1.96. The standard error depends on the sample size and the rating-scale structure.
3. Another definition is based on "Are the items observably different?" Usually this requires item difficulties which produce expected ratings at least 0.5 score-points different at the center of the rating scale. In many situations this is 0.5 logits or more. It depends on the rating-scale structure.
In "Best Test Design", Wright and Stone suggest more definitions of "different" for dichotomous items.
So, the Rasch literature is unlikely to be helpful to you, Robert, because each author has a unique perspective on this. Haley et al.'s "0.15 logits" must be based on their definition of "difference" combined with their sample size and the structure of their rating scale. Your situation is almost certainly not theirs.
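
The statistical test in definition 2 can be written as a short Python sketch. The two item measures and standard errors are invented for illustration; the 1.96 criterion is the one given above.

```python
import math

def difficulty_difference_z(m1, se1, m2, se2):
    """z = (M1 - M2) / sqrt(SE1^2 + SE2^2) for two item difficulty estimates."""
    return (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical adjacent items: measures and standard errors in logits.
z = difficulty_difference_z(m1=0.85, se1=0.12, m2=0.55, se2=0.11)
print(f"z = {z:.2f}")
print("statistically different" if abs(z) > 1.96 else "not statistically different")
```

Because the standard errors shrink as the sample grows, the same 0.30-logit gap can be significant in one study and not in another, which is why a fixed criterion such as 0.15 logits does not transfer between studies.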

arthur1426: Dr. Linacre,

Thanks for this.
I'm very pleased to have the forum as a resource.

Best regards,

682. Rasch Analysis on Separate Subscales

rblack January 6th, 2010, 4:27pm: Hi,

I ran a principal components analysis (PCA) using a promax rotation (permitting covariation across factors). The PCA yielded five components explaining approximately 62% of the total variance. It is worth noting that these five components made conceptual sense--It was clear what they represented. I then ran a confirmatory factor analysis (CFA) on the five component/factor structure on the cross-validation sample. Fit Indexes from the CFA revealed a relatively "good fitting model."

I would like to step away from the classical test theory (CTT) approach, and use IRT instead. Based on what I've learned from the CTT models, I'm thinking about running a Rasch model [in Winsteps] for each set of items that make up each of the five components/factors. I'm wondering if this sounds like a reasonable approach or if there might be a better approach. I could certainly start by running a Rasch model with all of the items--I suspect I will continue to see that the model is not unidimensional. If I continue to find that the model is multidimensional, should I run separate Rasch analyses or perhaps run a multidimensional IRT?

How might you proceed?



MikeLinacre: Ryan, thank you for this question.

Factor-analytical methods (such as PCA) tend to report spurious factors caused by nodes in the response distributions: see www.rasch.org/rmt/rmt81p.htm.

So please confirm your factor structure by analyzing all your data together in Winsteps, and then performing a PCA of residuals -Winsteps Table 23 - www.winsteps.com/winman/table23_0.htm. Since the PCA is of the residuals, the distributions of the original variables and cases (items and persons) have been removed and will not cause spurious factors.

Looking at the first "Contrast" plot - Winsteps Table 23.2 - we expect to see the biggest factor contrast with all the other factors. You can then remove the items for this factor from the analysis (with IDFILE= or IDELETE= in your Winsteps control file). Then reanalyze, and now you expect to see the second factor contrast with the other factors, etc. But, overall, we will not be surprised to see fewer distinguishable factors (sub-dimensions) of meaningful size.

rblack: Mike,

Thank you for replying. So, after going through this iterative process, let's say I arrive at 4 unidimensional constructs. I assume the next step would be to run a Rasch model for each set of items belonging to each construct and review item/person fit statistics, discrim etc. Once I am satisfied with the final 4 subscales, I'm then going to consider developing a computer adaptive test (CAT) version. I don't want to get ahead of myself. Any further advice regarding the previous steps and converting to 4 CAT versions would be most appreciative! I will keep you posted as I go through this soon.



MikeLinacre: Ryan, it is unusual to set up 4 different CAT systems, unless you are going to make 4 different pass-fail decisions. If you are going to make one pass-fail decision, then it would be better to treat the 4 sub-scales as 4 strands (like addition, subtraction, multiplication, division) and so make one "arithmetic" decision. In other words, analyze all the sub-scales together, even though the fit to the Rasch model is not perfect. This would be easier than trying to combine the 4 sub-scale measures into one "decision" measure at the end of the measurement process.

rblack: Understood. Thank you!

rblack: Hi Mike,

I took your advice and included all items in the Rasch analysis. I'm trying to interpret the "Contrast" plot to decide which items belong to the first factor before rerunning, but I'm having a difficult time interpreting it. Might you have an example you could share? I looked at the Winsteps manual and I'm still a bit confused. Any help would be greatly appreciated.



MikeLinacre: Ryan, the "first factor" (in the traditional sense) is the Rasch dimension. By default all items are in the "first factor" until proven otherwise. The first contrast plot shows a contrast within the data between two sets of items orthogonal to the Rasch dimension. We usually look at the plot and identify a cluster of items at the top or bottom of the plot which share most strongly some substantive off-Rasch-dimension attribute. These become the "second factor".

683. Interpreting the result of DIF analysis

lovepenn January 13th, 2010, 5:01am: Hi Mike,

I have six items measured at Time 1, 2, 3, and 4, and I wanted to verify the invariance of the item difficulties across the four time points. Thus, I stacked the four datasets together into one analysis and did a "Year*Item" DIF analysis. Part of the result is attached. I'm having a hard time interpreting it.

I have read the manual, but still wonder what I should look at to figure out whether the item difficulties have significantly changed over time: DIF Contrast and/or Prob.?

The manual says: (1) The DIF CONTRAST should be at least 0.5 logits for DIF to be noticeable; (2) For statistically significant DIF on an item, Prob. < .05.

In my result, Table 30.1, the DIF Contrast of item ATTENB for time points 1 and 3 is .32, but the Prob. is .0000. Should I conclude that the difficulty of this item has significantly changed?

Since I am comparing item difficulties across four time points, is it better to look at Table 30.2 rather than 30.1? If so, what should I look at to determine if the item difficulties have significantly changed over time?

MikeLinacre: DIF analysis can be confusing, Lovepenn, so we need to conceptualize exactly what we mean by "invariance".
You have identified two of the three components:
1. The change must be big enough to be meaningful, e.g., 0.5 logits, but this depends on your situation. How big a change is enough to matter to you? Look at the spread of your items on an item map. How much of a change of item difficulty would change the meaning of the map?
2. The change must be unlikely enough not to happen too often by accident. This is the probability. Probability is dominated by the sample size, so with large samples, every change (however small) is reported as improbable.
3. What is the baseline?
Is everything compared with Time 1? (Table 30.1)
With the previous time-point? (Table 30.1)
With the average of all time-points? (Table 30.2)

The "correct" answers to these questions are the ones that will make the most sense when you explain them to your non-technical audience.
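
A minimal Python sketch of how size and probability can disagree, as in components 1 and 2 above. It assumes a normal approximation to the significance test (Winsteps itself reports a t-statistic with degrees of freedom, so its probabilities may differ slightly). The function names are illustrative, the 0.5-logit and .05 defaults are the guideline values quoted from the manual, and the difficulties and standard errors are hypothetical values chosen only to give a contrast of .32.

```python
import math

def dif_contrast(d1, se1, d2, se2):
    """Difference between two difficulty estimates of one item, with its joint SE.

    Returns (contrast, joint_se, z, two_sided_p) under a normal approximation.
    """
    contrast = d1 - d2
    joint_se = math.sqrt(se1 ** 2 + se2 ** 2)
    z = contrast / joint_se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal probability
    return contrast, joint_se, z, p

def judge(contrast, p, min_size=0.5, alpha=0.05):
    """Report the size criterion and the probability criterion separately."""
    return {
        "meaningful_size": abs(contrast) >= min_size,
        "unlikely_by_chance": p < alpha,
    }

# Hypothetical item: difficulty 0.10 at Time 1 and -0.22 at Time 3,
# each with a small standard error, as happens with large samples.
contrast, joint_se, z, p = dif_contrast(0.10, 0.04, -0.22, 0.04)
verdict = judge(contrast, p)
print(f"contrast={contrast:.2f} joint SE={joint_se:.3f} z={z:.2f} p={p:.4f}")
print(verdict)
```

With small standard errors the probability is tiny even though the contrast is below the 0.5-logit guideline, so the two criteria have to be weighed separately. The choice of baseline (component 3) decides which pair of estimates is passed to the function.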

684. Creating a scoring system

rblack January 9th, 2010, 3:41pm: Hi Mike et al.,

I'm curious if there are any standard approaches to creating a scoring system for a new scale developed using a Rasch model in Winsteps. It seems to me that, at least conceptually, one approach would be to develop cut-off scores in units of logits based on the person-item map. For instance, one might find that there are three places vertically along the continuum that are clearly separated. As a specific example on a scale of depression, let's say we observe that items on suicide and hopelessness are at the top of the map, items such as feelings of guilt and crying are located in the middle, and items such as feelings of sadness are at the bottom. Again, let's assume that there is a clear space along the continuum between these three sets of items. Could we then use the logits associated with the spaces as cut-off points? Am I headed in the right direction here? If yes, how then would I obtain predicted logits for new persons in order to determine where they fall (i.e., mild range, moderate range, severe range)?

Any advice would be most appreciated.


MikeLinacre: This is a "standard-setting" problem, Ryan. A good approach is to have your content experts construct dummy data records corresponding to persons of different competence levels. You can then compute the raw scores for these records and mark their locations on the latent variable (item map). This is usually an iterative process because showing the content experts what happens will probably make them revise their opinions about the dummy performances.

rblack: Hi Mike,

Thank you for sending me down the right path. I have begun researching "standard-setting." I found great articles online with a simple Google search!

I have another question, if you have the time. Suppose I run a Rasch analysis in Winsteps on a set of polytomous items, and I'm happy with person, item, and overall model fit. I would like to give others (i.e. researchers) the opportunity to use my scale and obtain "predicted scores" [based on the initial Rasch analysis] on the latent construct for new people (i.e. research study participants). What would be the general steps involved in doing this? I'd be delighted for a reference if a response to this would be too lengthy.

Thanks again,


MikeLinacre: There are several levels to this, Ryan, depending on the depth of involvement of the other researchers. The most basic level is to provide your instrument and a score-to-measure table (Winsteps Table 20).
More elaborate is to give them a keyform. See https://www.rasch.org/pm/pm1-55.pdf - Winsteps can produce simple versions of these from the Plots menu.
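
For readers curious about what lies behind a score-to-measure table, here is a hedged Python sketch. It assumes complete data, an Andrich rating-scale model whose item difficulties and thresholds have already been estimated, and it solves "expected score = raw score" by damped Newton-Raphson. The calibration values and function names are hypothetical, and this is not Winsteps code. Extreme scores (zero and maximum) are left out because their maximum-likelihood measures are infinite.

```python
import math

def category_probs(theta, b, thresholds):
    """Rating-scale model probabilities for categories 0..len(thresholds)."""
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + theta - b - tau)
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_and_variance(theta, difficulties, thresholds):
    """Model expected raw score and its variance at measure theta."""
    expected = variance = 0.0
    for b in difficulties:
        probs = category_probs(theta, b, thresholds)
        e = sum(k * p for k, p in enumerate(probs))
        expected += e
        variance += sum((k - e) ** 2 * p for k, p in enumerate(probs))
    return expected, variance

def measure_for_score(raw, difficulties, thresholds, tol=1e-8):
    """Solve expected score = raw for theta; returns (measure, standard error)."""
    theta = 0.0
    for _ in range(200):
        expected, variance = expected_and_variance(theta, difficulties, thresholds)
        step = max(-1.0, min(1.0, (raw - expected) / variance))  # damped step
        theta += step
        if abs(step) < tol:
            break
    _, variance = expected_and_variance(theta, difficulties, thresholds)
    return theta, 1.0 / math.sqrt(variance)

def score_to_measure_table(difficulties, thresholds):
    """Measures and standard errors for every non-extreme raw score."""
    max_score = len(difficulties) * len(thresholds)
    return {raw: measure_for_score(raw, difficulties, thresholds)
            for raw in range(1, max_score)}

# Hypothetical calibrations: three items scored 0-2 with two thresholds.
difficulties = [-1.0, 0.0, 1.0]
thresholds = [-0.5, 0.5]
table = score_to_measure_table(difficulties, thresholds)
for raw, (measure, se) in table.items():
    print(f"raw {raw}: measure {measure:+.2f} logits, SE {se:.2f}")
```

Another researcher who administers the same items with complete responses could then look up each new person's raw score in such a table, which is the role Table 20 plays in the reply above.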
