Old Rasch Forum - Rasch on the Run: 2011

403. Identity & Empirical Lines

uve January 25th, 2011, 12:03am: Mike,

When checking to make sure common anchored items are functioning the same from form to form, I use the scatterplot to get a visual. I've been using the empirical plot to make these decisions, but remembered our recent course in which you went into more depth concerning the identity lines. I must admit that this was the most difficult part of the course and I am still struggling with how to best use both of these lines to make decisions about which items are truly common. Below is an example of the scatterplot output of a 30 item test with 12 common items. Only three of the items fell within the confidence bands on the empirical line.

Mean 0.031666667
S.D. 0.650484862
Identity trend 1.704415832
Identity trend -1.641082498
Empirical trend 1.33263639
Empirical trend -1.269303057
Identity intercept 0.135833333
Identity slope 1
Empirical intercept with x-axis 0.09794976
Empirical intercept with y-axis -0.153932164
Empirical slope 1.571542034
Correlation 0.763832287
Disattenuated Correlation 0.774444052


MikeLinacre: Thank you for those numbers, Uve, though I am having trouble interpreting them. I usually look at the pictures, not the numbers :-)
Empirical slope 1.571542034
This is a big departure from 1.0. We really do need to see the pattern of the common items.

uve: Here is the scatterplot. Again, this is a common item non-equivalent groups situation. Up to this point, I have been creating the scatterplots, then anchoring only the common items that fall within the confidence interval bands. I then use Table 20 to report the new scoring results.


MikeLinacre: Uve, it looks to me that all your most influential outliers are on one side of the trend line. This is not unusual if the test is off-target (easy or hard) for one of the samples. How about dropping items 5 and 20 as common items, then redrawing this plot? My guess is that a better trend line goes from item 39 to item 36.

Fortunately you don't have the "two trend line" problem, with a plot that looks like those in https://www.rasch.org/rmt/rmt72b.htm - Choosing the correct trend line can be a real head-scratcher!

uve: I think there may be a large fundamental hole in my equating understanding. My process until now has been to remove items outside the confidence bands, then just run the analysis again using the IAFILE function under the Extra Specifications option. I just report the new Table 20 and that's it. But it sounds like I should be doing more to investigate trend lines. I'm not understanding how to use Identity and Empirical lines together in a more meaningful manner. It appears there are two primary methods of equating in Winsteps: using the IAFILE function or using the USCALE and UMEAN functions. In the introductory course, you focused on IAFILE and in the further topics course you focused on USCALE/UMEAN. I'm not sure which method is more appropriate given my situation.

MikeLinacre: Uve, test-equating is never easy. The history of test-equating tells us that automatic approaches must be carefully monitored. Frederic Lord suggested an automatic approach (somewhat similar to yours). It looks great in theory, but can produce obviously incorrect results in practice.

For common-item equating (IAFILE=) we need to look at the content of the items and their distribution. Trend lines and confidence-intervals help, but they are not decisive. We really want an equating line parallel to the identity line (so there is only an equating constant), but when we can't do that, then the empirical line provides the equating constant and equating slope. The same applies to common-person equating (PAFILE=).

For random-equivalence equating, we match sample distributions (UMEAN=, USCALE=). There are usually no common items or common persons. In this process we need to verify that the two samples of persons (or items) really are equivalent. UMEAN= is based on the equating constant, and USCALE= is based on the equating slope.
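As a rough illustration of the equating constant and equating slope, here is a minimal Python sketch. It assumes the empirical line is drawn through the means of the two sets of common-item difficulties with slope S.D.(y)/S.D.(x). That definition, the function name empirical_line, and the item difficulties are assumptions made for illustration, not values from this thread.

```python
from statistics import mean, pstdev


def empirical_line(x, y):
    """Line through the two means with slope SD(y)/SD(x) (assumed definition)."""
    slope = pstdev(y) / pstdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


# Hypothetical common-item difficulties (logits) from two separate analyses.
old_form = [-1.2, -0.5, 0.0, 0.4, 1.3]
new_form = [-0.9, -0.2, 0.3, 0.7, 1.6]

equating_slope, equating_constant = empirical_line(old_form, new_form)
print(f"equating slope    = {equating_slope:.3f}")
print(f"equating constant = {equating_constant:.3f}")
```

With these made-up values the line is parallel to the identity line, so only an equating constant is needed. A slope well away from 1 is the signal that an equating slope is needed as well.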

A recent published suggestion for an automatic equating method is
1) compute the best fit lines for every possible combination of two or more common items
2) the equating line is the weighted average of the lines.
Again, this sounds good, but I suspect that an influential outlier would skew the computation.

uve: How about this then: I'll start by removing the obvious common item outliers and see how well the empirical slope approaches the identity slope. If it doesn't, I'll keep at it until I get as close as possible, making sure to keep at least 5 common items. I'll then run the IAFILE with the new smaller set of common items and post the new Table 20. What do you think?

MikeLinacre: Yes, we need a balanced approach, Uve. There is usually a sequence of items that correspond to the trend line that any reasonable person would accept.

uve: Now here is an odd one. Item 27 seems like it needs to go, but if I remove it the empirical and identity lines depart from each other. The slope departs from 1 also. This equating thing is definitely an art.

MikeLinacre: Yes, this is a challenging situation. There appear to be two parallel equating lines of items, roughly following the two confidence bands. It looks like it is a judgement call where to place the equating line between them.

Saturn95: Greetings,

I have read through this and other threads on this forum related to equating, and have done an extensive search through the journals and web-based resources. I cannot seem to find clear-cut answers to the following questions:

1) When equating a new set of items onto the scale of an existing item pool using common-item equating (i.e., IAFILE), what is the most defensible way to determine the stability of the anchor item set?

Winsteps Help and the information on this forum seem to recommend plotting the separate calibrations first, and looking for outliers (although I'm still not entirely clear how to use the Winsteps-generated Excel empirical and identity plots together in this manner).

In past projects, I have typically performed a calibration of the "new" form with the common items anchored to their bank values, and then referred to the displacement statistic in Winsteps to determine which item(s) should be dropped from the anchor set. Items with displacement of 0.3 logits or more are iteratively dropped from the anchor item set until a stable set is obtained. However, at least one more recent reference I found (https://www.rasch.org/rmt/rmt213g.htm) questions whether the displacement statistic is the best index to use in this case.

Many other sources I've found reference Huynh's robust-Z statistic. I have never used this but am wondering if other Rasch practitioners have found it useful.
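For readers who have not met it, the robust-Z statistic is usually described as the difference between an item's two calibrations, centered on the median difference and divided by 0.74 times the interquartile range of the differences. A minimal Python sketch under that description follows. The item values and the function name robust_z are hypothetical, and the quartile method is an assumption. Any flagging threshold is left to the analyst.

```python
from statistics import median, quantiles


def robust_z(old, new):
    """Robust z for each common item: (d - median(d)) / (0.74 * IQR(d))."""
    diffs = [n - o for o, n in zip(old, new)]
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1  # zero IQR (identical differences) would need special handling
    centre = median(diffs)
    return [(d - centre) / (0.74 * iqr) for d in diffs]


# Hypothetical bank values and new-form calibrations (logits); the last item has drifted.
bank = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
new_form = [-0.9, -0.45, 0.1, 0.55, 1.1, 2.6]

z_values = robust_z(bank, new_form)
for item, z in enumerate(z_values, start=1):
    print(f"item {item}: robust z = {z:6.2f}")
```

Because the statistic is built from the median and interquartile range, one badly drifted item does not distort the yardstick used to judge the others. It remains a statistical screen, to be read alongside the scatterplot rather than instead of it.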

2) When determining if common-item equating with IAFILE is appropriate, I am unsure about how far from 1 my best-fit line slope needs to be before F-C equating becomes a more appropriate method.

An RMT note I found on this topic (https://www.rasch.org/rmt/rmt223d.htm) states, "If we have two tests with common items that are supposed to be the same (such as alternate test forms, or pre-test and post-test forms), then we are reluctant to do F-C equating. We usually decide which form is the "correct" form (or combine the two forms) and use it as the basis for the equating." In my case, I am fairly confident in the calibrations contained in my large, historical item pool. I am placing new items onto that scale, so should I "force" the new items onto the scale using IAFILE, despite the fact that the slope of the best-fit line of the common items is not exactly 1? Or is it more appropriate to do F-C equating if the slope of the best-fit line of the common items is, say, .90 or worse?

Any references to other articles or texts for practical application of these rules would be most helpful and appreciated. Thank you!

Mike.Linacre: Thank you for your email, Saturn95.

1) Please always cross-plot the two sets of common-item difficulties from the two analyses before doing anchoring or anything else. This will tell you several things immediately, including

a) Have the common items been correctly identified? Look at off-diagonal points: are they truly the same item? Items that appear to be the same can really be different items, for instance when they are printed in different ways or have their response options in different orders.

b) Do the tests have the same discrimination? For instance, your historical item bank may be constructed from high-stakes tests, which usually have high test discrimination, but your new form may be administered in a low-stakes situation, which may cause low discrimination. If the slope of the best fit line through the scatter plot is not parallel to the identity line, then there is a Fahrenheit-Celsius situation. We need to adjust the item difficulties for the different test discriminations before eliminating outliers.

c) Is there only one equating line? The items may stratify into two parallel groups, each with its own equating line. We must decide which equating line is the true equating line.

d) After visual inspection, it is usually obvious which items belong in the equating set of stable items. We have performed an "inter-ocular traumatic test" (Google this if this term is new to you.) We do not need statistics to tell us. We may use a statistical screen to reassure our audience. If we require a statistical screen to identify the stable items, then we are in a fog. Accidents in the data may dominate our decisions. A slight skew in the original item distribution may magnify into a considerably incorrect choice of stable items.

2) We try to avoid F-C equating unless the discrepancy is obvious. Remember that slight changes in item difficulties rarely have large substantive consequences. We expect each administration of a test to have slightly different test discrimination.

If you decide to use IAFILE= from the old data on the new data with different discrimination, then use USCALE= to bring the new estimates onto the same measurement scale as the old estimates.

For example, in our scatterplot, we notice that the new item difficulties have twice the dispersion of the old item difficulties. So, in an anchored analysis of the new data:
IAFILE= (old item difficulties) ; [this has been revised] so that the old anchor values are directly comparable with the new item difficulties
USCALE=0.5 ; so that the reported numbers from the new analysis are on the measurement scale of the old data.
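The arithmetic behind a value such as USCALE=0.5 can be sketched in Python. This is only an illustration of the rescaling idea, not a description of what Winsteps does internally. The difficulties are hypothetical and are centered at zero, so no shift of origin is involved.

```python
from statistics import pstdev

# Hypothetical common-item difficulties (logits).
old_difficulties = [-1.0, -0.4, 0.2, 1.2]   # established values
new_difficulties = [-2.0, -0.8, 0.4, 2.4]   # free calibration of the new data

# The new estimates have twice the dispersion, so the scaling factor is 0.5.
scale_factor = pstdev(old_difficulties) / pstdev(new_difficulties)
rescaled = [scale_factor * d for d in new_difficulties]

print(f"scale factor = {scale_factor:.2f}")
print("rescaled new difficulties:", [round(d, 2) for d in rescaled])
```

After rescaling, the new difficulties line up with the old ones, which is the situation in which anchored values and reported measures share one measurement scale.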

Google would be your best source for references. Can anyone reading this suggest any?

uve: Mike,

I think the issue for a novice such as myself is that there seem to be many paths to equating, but signs and guides are very few. Because of many factors, which in my case are workload, number of tests, and timeliness of equating completion so teachers know they are looking at the final reported scores, plus much more, it is very tempting to ask for a common set of rules that will help in the decision-making process. But such a rule set might be too restrictive given all the possible situations.

What would be extremely helpful would be more case study resources detailing processes: data and circumstances, the equating process chosen and the results of those decisions. Perhaps I've been looking in the wrong places, but I have found very little.

Clarification: when you state IAFILE *2, I assume you mean that the old form common items are doubled first, then saved as the IAFILE, then used to anchor to the new form. After this is done, then the USCALE is implemented, or is this also done during the IAFILE run?

Thanks again as always.

Mike.Linacre: Uve, the problem is that sets of rules and automatic processes have proved to be misleading. Frederic Lord, the father of IRT, invented one of the earliest automatic procedures. It can produce incorrect results, but analysts blindly following that procedure did not notice the problems. They thought "Since Fred Lord invented this procedure, its results must be correct!"

The biggest challenges to automatic processes and sets of rules are:
1. Skewed distributions of outliers. Trimming the outliers from the "far" side pushes the equating line towards the outliers on the "near" side. Finally, the outliers on the "near" side define the equating line.
2. Two or more equating lines in the data. The final equating line is an accidental choice of one of those equating lines or a compromise between them. Inspection of the items reveals that one of the equating lines in the data is the true one. The other equating lines are caused by DIF, off-dimensional items, changes in item presentation, ceiling and floor effects, etc.

Sadly, there is no substitute for looking at the situation and thinking about it carefully. So, scatterplot first!

Usually the IAFILE= values are already well established. So in the new analysis which produces more dispersed item difficulty estimates:

(item entry number) (established value)
USCALE = 0.5

But please verify that this is working correctly by looking at the item displacements.

uve: Thanks Mike.

Yes, I very much agree that a set of guidelines will likely do more harm than good. I hope more equating research studies are published so that a wider variety of rationales and processes become the vocabulary of novices like me.

Mike.Linacre: Uve, the problem with most published "equating" research studies is that they are based on simulated data with unrealistic generators, such as true normal distributions. Real data are messy, and it is difficult to simulate a realistic, but replicable, mess. I have not seen a study that simulates skewed outlier distributions or false equating lines.

426. RPCA = help

tmhill April 3rd, 2011, 1:17am: I am having some trouble looking at the output for item dimensionality. I don't know which number to look at to determine which is over 60%.
There is the measure value, which is made up of the people and items (it is not over 60%).
Then there is the unexplained variance, which is over 60%, but it has several contrasts which are over 4%. This is where I need help. Is the figure I am supposed to be looking at the unexplained variance? And how exactly does it get worded in a write-up? Can anyone help me?

MikeLinacre: Thank you for your post, Tmhill.

Are you looking for "Explained variance should be more than 60%"? Then that is the item+person explained variance.

Explained variance is dominated by the variance of the person measures, the variance of the item measures and the targeting of the items on the persons. If your "explained variance" is low, then either the person measure S.D. or the item difficulty S.D. or both are small. See https://www.rasch.org/rmt/rmt201a.htm Figure 4.

The important aspect, from a Rasch perspective, is the size (Eigenvalue) of the 1st Contrast in the PCA of residuals. If this is more than 2, then there may be a substantive secondary dimension.

For writing this up, Googling "Rasch PCA explained variance" may point you to a paper with a suitable text.

tmhill: So in the attached table I can say that 37.6% of the variance can be accounted for by the measure.
62.4% is in the first contrast and it looks like there is more than one construct, making this not a unidimensional measure.
Is that correct?

MikeLinacre: Tmhill:
Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                       -- Empirical --      Modeled
Total raw variance in observations =   40.1  100.0%          100.0%
Raw variance explained by measures =   15.1   37.6%           37.8%
Raw variance explained by persons  =    3.9    9.8%            9.9%
Raw Variance explained by items    =   11.2   27.8%           28.0%
Raw unexplained variance (total)   =   25.0   62.4%  100.0%   62.2%
Unexplned variance in 1st contrast =    2.7    6.7%   10.8%
Unexplned variance in 2nd contrast =    2.1    5.1%    8.2%
Unexplned variance in 3rd contrast =    1.9    4.6%    7.4%
Unexplned variance in 4th contrast =    1.5    3.7%    6.0%
Unexplned variance in 5th contrast =    1.4    3.4%    5.4%

This Table shows 37.6% explained variance. If the data fit the Rasch model perfectly, then that number would have been 37.8%. So the explained variance is about as good as we can get with those measure variances and item-person targeting.

The variance explained by the first contrast is 6.7%, about 1/5 of the Rasch explained variance. More important, its eigenvalue is 2.7, about the strength of 3 items. So we would definitely want to investigate the substance of items A, B, C and a, b, to discover if any of those items are so far off-dimension as to merit exclusion from the analysis.
BTW, it really helps interpretation if the item labels indicate item content; then we won't have to dig around for the original test paperwork to discover the item content.
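The percentages in the table can be recomputed from the eigenvalue column, which makes it easier to see which base each percentage uses. A minimal Python sketch, using the eigenvalues quoted in the table; because those eigenvalues are printed to one decimal place, recomputed percentages can differ from the table in the last digit.

```python
# Eigenvalues copied from the table of standardized residual variance.
total = 40.1
explained_by_measures = 15.1
unexplained = 25.0
first_contrast = 2.7

pct_explained = 100 * explained_by_measures / total
pct_contrast_of_total = 100 * first_contrast / total
pct_contrast_of_unexplained = 100 * first_contrast / unexplained
contrast_vs_explained = first_contrast / explained_by_measures

print(f"explained by measures:          {pct_explained:.1f}% of total variance")
print(f"1st contrast:                   {pct_contrast_of_total:.1f}% of total variance")
print(f"1st contrast:                   {pct_contrast_of_unexplained:.1f}% of unexplained variance")
print(f"1st contrast / explained ratio: {contrast_vs_explained:.2f}")
```

The first contrast is a share of the unexplained variance, not all of it, and its size in eigenvalue units (about 3 items' worth here) is the number to carry into the investigation of item content.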

tmhill: Mike, your help is invaluable...
It was my understanding that if the data fit the model it would have a better than 60% variance explained by the measure. Is this incorrect?

MikeLinacre: Tmhill:
We would certainly like to have 60% explained variance, but 60% as a criterion value is an urban legend. The percentages for different person measure S.D.s, item difficulty S.D.s and targeting are given in https://www.rasch.org/rmt/rmt221j.htm. In practical situations, 40% variance explained is more likely.

saidfudin: hi Mike.
I'm clear on the 40% Variance Explained.

What about the 15% maximum limit for the unexplained variance in 1st Contrast ?
Appreciate your kind reference and explanation.


Mike.Linacre: Saeid: the expected unexplained variance is an eigenvalue (size) not a percent. See https://www.rasch.org/rmt/rmt233f.htm

15% would be for a test of approximately 10 items.

saidfudin: Hi Mike..
I'm a bit confused here. Fisher reported in terms of %. see: https://www.rasch.org/rmt/rmt211m.htm
                                              Poor    Fair      Good      V.Good    Excellent
Variance in data explained by measures        <50%    50-60%    60-70%    70-80%    >80%
Unexplained variance in contrasts 1 of PCA    >15%    10-15%    5-10%     3-5%      <3%

Correct me, pls.

Mike.Linacre: Saeid: there is no "absolute truth" in measurement. There is only "utility". Please choose the conceptualization of measurement that is most useful for solving your problems.

Larry Laudan has an interesting quote about scientific "truth". It is the first quote at www.rasch.org/rmt/rmt73p.htm

Gideon: Hi, in a similar vein...
My explained variance is 40.6% and is higher than modelled (34.0%). This appears to be acceptable from what Mike has said in this thread.
However, the contrasts seem to be really high:
Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)
                                       -- Empirical --      Modeled
Total raw variance in observations =  128.0  100.0%          100.0%
Raw variance explained by measures =   52.0   40.6%           34.0%
Raw variance explained by persons  =   19.7   15.4%           12.9%
Raw Variance explained by items    =   32.3   25.2%           21.1%
Raw unexplained variance (total)   =   76.0   59.4%  100.0%   66.0%
Unexplned variance in 1st contrast =    4.3    3.4%    5.7%
Unexplned variance in 2nd contrast =    3.3    2.6%    4.4%
Unexplned variance in 3rd contrast =    3.1    2.4%    4.0%
Unexplned variance in 4th contrast =    3.0    2.3%    3.9%
Unexplned variance in 5th contrast =    2.7    2.1%    3.6%

With the first contrast, does the Eigenvalue of 4.3 matter if it only contributes 3.4% of the variance (1/12 of the explained variance)?

Or, because there are many items, does the 4.3 actually imply more than 4 or 5 items are off-dimension?

The distribution of items on my contrast 1 plot looks as if it would be better off rotated 45°! The positive residuals are mostly for the easier items (the first ones asked) but the negative residuals appear to be a mix of items.

:-/ I think I need a coffee!

Mike.Linacre: Gideon: It sounds like you have a situation in which the latent variable is changing its meaning as it advances. This is a common situation. For instance, as "mathematics" moves from "addition and subtraction" to "trigonometry" to "tensor analysis", its nature changes considerably. The result is the 45° plot that you see.

Look at the plot for the 1st contrast: the items at the bottom are at one end of the latent variable, the items at the top are at the other end. The contrast between the items is showing the changing nature of the latent variable. Since this change is systematic, and not random, there is an implicit Guttman structure to the data, causing the empirical explained variance to be higher than expected.

There are various stratagems for dealing with this situation, but they all make the measures more difficult to interpret. So there is a trade-off: statistical perfection or utility?

Gideon: Thank you! I shall work on this shortly.

Gideon: Mike, you were spot on! The top contrasts did all have a common theme after all, and the bottom contrasts all had a different, common theme.

Well, this does give me food for thought, and plenty of ideas for test development...

timstoeckel: Hello,

I am new to this Forum and have a question quite similar to Gideon's post of Aug 22 last year. I have gotten the following data while developing a test of vocabulary knowledge for second language learners:

                                       -- Empirical --      Modeled
Total raw variance in observations =  136.9  100.0%          100.0%
Raw variance explained by measures =   50.9   37.2%           37.4%
Raw variance explained by persons  =   18.5   13.5%           13.6%
Raw Variance explained by items    =   32.5   23.7%           23.8%
Raw unexplained variance (total)   =   86.0   62.8%  100.0%   62.6%
Unexplned variance in 1st contrast =    7.2    5.3%    8.4%
Unexplned variance in 2nd contrast =    5.0    3.7%    5.8%
Unexplned variance in 3rd contrast =    4.5    3.3%    5.3%
Unexplned variance in 4th contrast =    4.2    3.0%    4.8%

The test is limited to the most frequent 2000 words plus academic words, and examinees were all Japanese university students. I therefore expected a small range of ability and am not worried about the relatively small percent of variance explained.

I am less sure about the eigenvalue of 7.2 listed for the first contrast. I compared items A, B, and C with a, b, and c but could see no meaningful distinction. A, B, and C were all very easy, but a, b, and c were not particularly challenging and did not seem to differ on any obvious dimension.

Should I be concerned about this? Do you have any suggestions?

Because points A, B, C, (and D) correlated perfectly (just one person answered these easy items incorrectly) would it be appropriate to remove the person (this person was not otherwise flagged as misfitting the model) to see whether that would improve unexplained variance?

Thanks in advance!

Mike.Linacre: timstoeckel, thank you for your question.

Your "variance explained" numbers look reasonable for a test where most students are scoring around 75% correct.

You may have encountered a "distributional" factor (component) in your PCA analysis - see "Too Many Factors" https://www.rasch.org/rmt/rmt81p.htm

So please do delete that one person:
PDELETE = (person entry number)
and see what happens!

timstoeckel: Mike,

Thanks for your quick reply. I will take your advice and see where it leads me.



441. Rasch standard error

aaronp808 March 4th, 2011, 7:48pm: I have a question about how to use the standard error reported from the Rasch model. I have a rating-scale pre- and post-test which I would like to compare (9 items, 160 responses). I first went the traditional way and did a paired-samples t-test on each item's differential between the item difficulty and person's ability. When I do so, I find that some items show significant differences and some do not. However, when I look at the Rasch standard error for all the items, they all show change well outside of the Rasch error.

Attached to this post is an illustration of what I mean. The error bars are the reported Rasch error for each item. The items with the asterisks are significant according to the paired-samples t-test. Some of the non-significant (via t-test) items show change way outside of the reported Rasch error. I would expect some difference since the t-test is based on the persons. But I didn't expect such a difference!

So my question is conceptual. Which is a better test of change for this set of items, comparing Rasch errors or a paired t-test on the difs?

Thanks for any advice!

MikeLinacre: Aaron, thank you for your post.

It sounds like you are doing two different types of DIF analysis:
1. paired-samples - comparing the estimates from two separate analyses.
2. combined analysis - DIF analysis for each item.

1. is "Differential Test Functioning". 2. is "Differential Item Functioning". We expect their findings to be somewhat different. https://www.rasch.org/rmt/rmt163g.htm

Or are you talking about something else, Aaron?
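One common way to compare two Rasch estimates of the same item is to divide their difference by the joint standard error, the square root of the sum of the two squared standard errors. The Python sketch below shows that arithmetic only. The function name difference_t and the numbers are hypothetical, and it is not claimed to be the computation behind the figure in this thread.

```python
from math import sqrt


def difference_t(measure_1, se_1, measure_2, se_2):
    """Difference between two estimates in units of their joint standard error."""
    joint_se = sqrt(se_1 ** 2 + se_2 ** 2)
    return (measure_1 - measure_2) / joint_se


# Hypothetical pre- and post-test estimates for one item (logits) with their S.E.s.
pre_measure, pre_se = 1.0, 0.3
post_measure, post_se = 0.0, 0.4

t_value = difference_t(pre_measure, pre_se, post_measure, post_se)
print(f"t = {t_value:.2f}")
```

A change that lies outside a single error bar is not automatically significant, because both estimates carry error. This is one reason a comparison of error bars and a paired t-test on persons can disagree.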

aaronp808: Thank you for the reply and the reference. I think I am mixing apples and oranges and opted to stick with DIF. That is what I do in the paper so it makes the figure align more with the manuscript (and it seems conceptually simpler to be consistent!). Thanks again.

454. Creating Domain TCC's

uve December 3rd, 2011, 11:20pm: Mike,

Suppose a math test has two primary domains, Number Properties and Algebra. I can certainly code the questions within each domain so that I can view how the ICC's for each question within a certain domain function and then select the questions in the Multiple Item ICC list. I'm also aware that the TCC is for the entire test, but I was wondering if there was a way to produce two TCC's of sorts: one for the Number Properties domain and the other for the Algebra domain. Is this possible?

Mike.Linacre: Yes in Winsteps, Uve.
1. Put a domain code in the item labels column 1
2. ISELECT= domain code
3. Produce the TCC (Graphs menu, Table 20, etc.) - this will be for the selected domain

uve: Thanks Mike. Clarifying question: is the analysis run first and then ISELECT chosen in the Specification option, or is ISELECT chosen at the Extra Specifications stage before the analysis runs?

Mike.Linacre: Uve, if you want all the TCCs to be in the same frame-of-reference, then
first, analyze everything together
second, "Specification" menu box: ISELECT=
third, TCC

uve: Thanks Mike, that worked. However, I was hoping to chart multiple domains on the same graph. This would be interpreted a bit differently in that given an ability level, staff could see the likely percent correct a student might receive on one domain versus another. In other words, where the ability level meets the domain TCC curve, the probability is not one of getting the domain correct on a MC test, but getting a percent correct on that domain. With multiple domains on the same chart, we could see the difficulty function of the domain in relationship to another.

Mike.Linacre: Uve, Winsteps only goes so far :-(
That's when Excel comes to the rescue!!
On the Winsteps Graph screen: TCC
Click on "Copy Data to Clipboard", then paste (ctrl+v) into Excel
Do the same thing for the other TCC
Excel can then put all the data into the same plot.
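For dichotomous items, a domain TCC can also be computed directly from the item difficulties, since the expected score at an ability level is the sum of the Rasch success probabilities of the items in the domain. The Python sketch below assumes dichotomous (right/wrong) items calibrated in one frame of reference. The domain names come from this thread, but the difficulties and function names are hypothetical.

```python
from math import exp


def p_correct(theta, difficulty):
    """Rasch probability of success on a dichotomous item."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))


def domain_tcc(thetas, difficulties):
    """Expected percent correct on the domain's items at each ability level."""
    n_items = len(difficulties)
    return [100.0 * sum(p_correct(t, b) for b in difficulties) / n_items
            for t in thetas]


# Hypothetical item difficulties (logits), all from the same analysis.
domains = {
    "Number Properties": [-1.0, -0.5, 0.0, 0.5],
    "Algebra": [0.0, 0.5, 1.0, 1.5],
}
thetas = [x / 2 for x in range(-8, 9)]  # -4.0 to +4.0 logits in steps of 0.5

curves = {name: domain_tcc(thetas, bs) for name, bs in domains.items()}
for theta, easy, hard in zip(thetas, curves["Number Properties"], curves["Algebra"]):
    print(f"{theta:5.1f}  {easy:6.1f}%  {hard:6.1f}%")
```

Each column can be plotted as one curve on a shared ability axis, so the expected percent correct on each domain can be read off at any ability level, such as a cut point.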

uve: Mike,

Much attention is focused on performance levels for our exams, but within those exams can be many different domains. Our data system requires performance levels of sorts on these as well, which I have mixed feelings about, considering that some may consist of as few as 4 questions. Still, the setting of these seems arbitrary to some degree, based more on percentages than probability. The attached graph took a bit of work. I followed your instructions, coded the items under domain codes, then used Specification and exported the TCC's for them. If our staff likes this, they may request I do it for 100+ additional assessments. That would take quite a lot of time that I'm not sure I have. I know you get a lot of requests, but perhaps adding this type of functionality to Winsteps, including the ability to add several vertical bars to represent the cut point for the overall exam, as I did here, might come in handy. So using this graph, I can tell staff that it would be reasonable to expect a person who scored proficient on the exam to have gotten about 60% of the questions correct under the Structural Features of Literature domain.

Mike.Linacre: Thank you for sharing your efforts, Uve.

Outputting a TCC for each domain from a Winsteps analysis can be done using Winsteps Batch mode. In BATCH=YES mode:
1. Do the Winsteps analysis. Output the IFILE=if.txt and PFILE=pf.txt SFILE=sf.txt
2. Then run numerous anchored Winsteps analyses with IAFILE=if.txt and PAFILE=pf.txt SAFILE=sf.txt.
For each analysis, ISELECT= the items in the domain, SCOREFILE=TCC-for-domain.txt

An EXCEL VBA macro could be used to load the TCC-for-domain.txt files into an Excel worksheet and then to draw the TCCs. Alternatively, this could be done using the R Statistics or similar package.
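When there are many domains, the per-domain specifications for the anchored runs can be generated rather than typed. The Python sketch below only builds the specification text from the keywords named in the steps above. The function name, the single-letter domain codes, and the per-domain output file names are hypothetical, and how the lines are passed to Winsteps is left to the analyst.

```python
def domain_run_specs(domain_codes):
    """Build the extra specifications for one anchored run per domain code."""
    specs = {}
    for code in domain_codes:
        specs[code] = [
            "BATCH=YES",
            "IAFILE=if.txt",   # item anchor values from the full analysis
            "PAFILE=pf.txt",   # person anchor values from the full analysis
            "SAFILE=sf.txt",   # structure anchor values from the full analysis
            f"ISELECT={code}",  # domain code in column 1 of the item label
            f"SCOREFILE=TCC-for-domain-{code}.txt",  # hypothetical file name
        ]
    return specs


# Hypothetical domain codes, e.g. N = Number Properties, A = Algebra.
for code, lines in domain_run_specs(["N", "A"]).items():
    print(f"--- domain {code} ---")
    print("\n".join(lines))
```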

460. Partial Credit Rasch Model

alicetse December 29th, 2011, 10:00am: Dear all,

I am currently doing the analysis of a reading test, in which there are multiple-choice items, short questions and commentary questions. Some answers carry 2 marks each, with full marks given for complete and accurate answers, 1 mark for partially correct information and 0 for completely incorrect answers.

I know the partial-credit Rasch model can analyze items that are scored with more than two values (not just 0 and 1) per item. May I know how I should enter and analyse the data for the statistical analysis?

Thanks a lot!


Mike.Linacre: Alice, the format of the data depends on the software that will analyze it. For instance, for Winsteps,
1) the data would be the student's responses (or scored responses) for the MCQ items, and the scored responses for the other questions.
2) a scoring key for the MCQ items
3) ISGROUPS=0 to specify the partial credit model

See https://www.winsteps.com/winman/index.htm?key.htm Example6
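To show what the partial credit model does with a 0-1-2 item, here is a minimal Python sketch of the category probabilities. Each item has its own set of thresholds, and the probability of category k is proportional to the exponential of the sum of (ability - threshold) over the thresholds up to k. The threshold values and function names are hypothetical.

```python
from math import exp


def pcm_probabilities(theta, thresholds):
    """Category probabilities for one partial-credit item scored 0..len(thresholds)."""
    numerators = [1.0]  # category 0: empty sum, exp(0) = 1
    cumulative = 0.0
    for delta in thresholds:
        cumulative += theta - delta
        numerators.append(exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]


def expected_score(theta, thresholds):
    """Expected mark on the item at ability theta."""
    return sum(k * p for k, p in enumerate(pcm_probabilities(theta, thresholds)))


# Hypothetical thresholds (logits) for an item marked 0, 1 or 2.
item_thresholds = [-1.0, 1.0]
probabilities = pcm_probabilities(0.0, item_thresholds)
print([round(p, 3) for p in probabilities])
print(round(expected_score(0.0, item_thresholds), 3))
```

A dichotomous item is the special case with a single threshold, which is why multiple-choice items and 0-1-2 items can be analyzed together in one run.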

alicetse: Dear mike,

Thank you very much for your reply.


461. Beginner question about the rating scale model

Mar104 December 30th, 2011, 10:33pm: Hi, I'm new to Rasch as well as this board (just joined today) so I apologize in advance for writing anything obvious or stupid.
I attended the Rasch workshop in October held in Maple Grove MN and have been trying out Rasch analyses using Winsteps on my own. I'm not a psychometrician and I'm probably in way too deep & way over my head, but what the heck.

I have some data and I'm wondering if a rating scale model would make sense to answer my questions, but I have a few concerns.

Some context: We ask individuals who participate in a family group decision making meeting to complete a measure of fidelity to the group model. There are 16 items that are positively worded and use a strongly agree, agree, disagree, strongly disagree rating scale (also DNA and missing). All of the items are in the same direction (no reverse coding). So the Smith family participates in a family meeting, along with the child welfare caseworker and therapist, and everyone in the team completes their own measure of their perceptions of the experience of the group. So individuals are "nested" within their group.

It would be very useful to look at responses at the item level particularly looking at the responses of the young adults and the caseworker. I'd also like to look at item fit statistics.

However, I am concerned about the nested nature of the data. These are not independent observations since the youth and the caseworker as well as the rest of the members are giving their perception of how their group functioned.
Also, these data are heavily skewed to the positive (strongly agree and agree) although some items have more variance than others.

Should I be concerned about the lack of a normal distribution and the nested nature of the data in doing Rasch, and how would I correct for this? Or just forget about it?
thanks in advance


Mar104: I love that I am listed as a "baby member". So true!!

Mike.Linacre: Glad you have joined us, MBR :-)

We certainly don't want to forget about anything, but we also don't want to worry about anything until we have discovered that it is truly a problem.

Your design is often used, for example: patient self-rating, patient rating by therapist, patient rating by physician. The first step is to be sure to code each data record with the ids of everyone involved and suitable demographic codes. Then we can analyze all the data together (using the rating-scale model), and then look to see whether the results make sense or are skewed. For instance, in the patient example, we expect the therapists to be the most optimistic (it's their job to be!). We expect the physicians to be conservative (neither too optimistic nor too pessimistic) and the patients to be all over the place (depending on their personalities, case histories, social situations, etc.)

In summary, you need to formulate some theories about what you expect your data to report before you analyze the data. Then you can see what has been confirmed, what has been challenged, and where there may be problems .... OK?

Mar104: Mike, yes it does. I have worked with pilot data on this measure (with a much smaller N and smaller number of categories of types of individuals) and I do have some theories about what I expect to see with this larger data set. Thanks I'm sure that I'll be using the board again.


aiman December 28th, 2011, 12:57pm: Hi Mike,
I'm Aiman. How can I split the Likert scale categories 12345 into 123456 using a command in the PRN file?

Mike.Linacre: Thank you for your question, Aiman.

It is easy to go from 123456 to 12345, but difficult to go from 12345 to 123456.

Let's suppose that we want to split category x into categories x and x+1. Then one approach would be to analyze the original data. Then output response-level information. Then use this to convert every instance in the data where category "x" is a lower-category than expected to category "x+1". This approach would be highly influenced by accidents in the data, but might tell us whether it was worth the effort to do something more strongly based on the meaning of the categories.
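
A minimal sketch of the recoding step described above, assuming the response-level output has already been read into (observed, expected) pairs. The function name and the data are hypothetical, not Winsteps output, and this is only one reading of "lower than expected":

```python
def split_category(responses, x):
    """Split category x into x and x+1.

    responses: list of (observed, expected) pairs, where `expected` is the
    model-expected score for that response from the original analysis.
    Categories above x are shifted up by one to make room for the new x+1.
    An observed x that is lower than its expected value is recoded to x+1.
    """
    recoded = []
    for observed, expected in responses:
        if observed > x:
            recoded.append(observed + 1)   # make room above x
        elif observed == x and expected > x:
            recoded.append(x + 1)          # x was lower than expected
        else:
            recoded.append(observed)       # unchanged
    return recoded


# Hypothetical responses on a 1-5 scale, splitting category 3.
data = [(3, 3.6), (3, 2.4), (5, 4.1), (2, 2.9), (4, 3.2)]
print(split_category(data, 3))
```

As the reply notes, a recoding driven this way is highly sensitive to accidents in the data, so it is better treated as a quick feasibility check than as a final rescoring.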

aiman: ok....Thank you ... :)

463. Scoring Dichotomous & Polytomous Items Together

uve December 18th, 2011, 8:50pm: Mike,

I was asked to add two dichotomous yes/no response items to a 4 choice Likert survey scored 1-4. I'm having difficulty deciding how these two should be scored. Currently, it's No=1, Yes=4. I felt that agreement on either one would represent a very confident high aptitude. In fact, I felt most of the respondents would choose No to the questions. It was split 50/50 on one item but about 76% No on the other. How is this scenario generally dealt with?

Mike.Linacre: This situation can be perplexing, Uve. Here are some options:

1. Strict Rasch model: each ordinal category represents a higher level on the latent variable. The dichotomy is scored 1-2.

2. Weighting. Each dichotomy is scored 1-2, but is given the influence of a 1-4 category item, so the dichotomy is weighted 3. In Winsteps: IWEIGHT=3

3. Structural zeroes: the dichotomy is conceptualized as a 1-2-3-4 item but with 2 and 3 unobserved. Score 1-4, with STKEEP=Yes

4. Analyze the dichotomies and polytomies separately, and then combine the measures. This is impractical with only two dichotomies.

A similar problem is discussed at www.rasch.org/rmt/rmt84p.htm

uve: Thanks Mike,

The weighting procedure seemed to work well. I was wondering if the structural zeros function could be applied only to the 2 dichotomous items or would it apply to all?

Mike.Linacre: Uve, structural zeroes apply whenever a polytomous item has a numbered category that cannot be observed. There are also incidental zeroes when an observable category is not observed (usually because the sample is too small). In both cases we must decide what analysis makes the most sense: maintain the unobserved category number in the category hierarchy or squeeze it out by renumbering the higher categories.

464. about winsteps programme

krong11 December 15th, 2011, 6:11am: I have 25 items, including 15 items scored 0,2; 10 items scored 0,3; I want to know which programme is right? such as:
Programme 1:
ISGROUPS ="1111111111111111111111111"
CODES = "ABCD 012345"
IVALUEA = 10000******
IVALUEB = 01000******
IVALUEC = 00100******
IVALUED = 00010******

Programme 2:
ISGROUPS ="2222222222222223333333333"
CODES = "ABCD 012345"
IVALUEA = 20000******
IVALUEB = 02000******
IVALUEC = 00200******
IVALUED = 00020******
IVALUEX = 30000******
IVALUEY = 03000******
IVALUEZ = 00003******

If neither of these programmes is right, could you tell me how to correct it? Thank you!

Mike.Linacre: Krong11, programme 2 is almost correct.

krong11: Thank you for your reply, and I know how to use it. Thanks again.

465. Lower & Upper Asymptote Guide

uve December 12th, 2011, 1:51am: Mike,

I've read through your explanations of how Winsteps displays the lower and upper asymptotes, but I am still a bit unsure how to interpret the output. Do you have any guiding suggestions about what values could be interpreted as signifying guessing or carelessness? For example, what lower asymptote value would likely equate to a 25% chance of guessing on a 4 distractor MC test based on the Winsteps output?

Mike.Linacre: Uve: Winsteps shows an estimated value of a lower asymptote as an expected score at the guessing level, so 25% chance of guessing = 0.25.

Similarly for carelessness. 10% chance of carelessness = 90% chance of not carelessness = 0.9 expected score on a dichotomous item.
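
One way to picture asymptotes expressed as expected scores is a logistic curve squeezed between a lower and an upper bound. This is only an illustrative sketch: the four-parameter form and the function name are assumptions made here, not a description of how Winsteps estimates its asymptotes.

```python
import math

def expected_score(ability, difficulty, lower=0.0, upper=1.0):
    """Expected score on a dichotomous item with asymptotes.

    lower: expected score for very low performers (e.g. 0.25 for guessing
           among 4 options).
    upper: expected score for very high performers (e.g. 0.9 if 10% of
           answers are lost to carelessness).
    """
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return lower + (upper - lower) * p


# Expected scores far below, at, and far above the item difficulty.
for b in (-6, 0, 6):
    print(b, round(expected_score(b, 0.0, lower=0.25, upper=0.9), 3))
```

Far below the item difficulty the expected score settles near 0.25, and far above it settles near 0.9, which is how the two reported values can be read.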

uve: Thank you for the clarification. I had thought that the asymptote output was reported using an index like the discrimination output.

I have a Likert scale scored 1-4 and one item has the lower asymptote as 0 and the upper as 2.83. Another item has the lower at 1.82 and the upper at 0. Guessing is obviously not an issue with a survey, but perhaps carelessness might be. How then would one interpret asymptote output for these two polytomous items?

Mike.Linacre: Uve, "guessing" and "carelessness" are unlikely to be relevant to a Likert scale, but there can be other aspects of "misbehavior", such as "social pressure", "central tendency". Winsteps is reporting that there may be a ceiling-effect in the data that is preventing full use of the upper category. One method of avoiding this in survey design is to include deliberately outrageously extreme categories, such as "absolutely always". These super-extreme categories allow respondents to use the intended extreme categories freely. Any observations in a super-extreme category are automatically recoded into the intended extreme category.

466. Person Statistics: Measure Order

fara03 December 4th, 2011, 1:10pm: Dear Mike,

I am having difficulty understanding "Person Statistics: Measure Order".
I have already analyzed the data, and I found the value "Maximum Measure" in the Infit and Outfit columns. [Attachment]

I don't understand which columns I should focus on, or what range of values should be considered when analyzing these statistics.

If possible, could you guide me through the process of analyzing these data?

Your cooperation is highly appreciated,

Mike.Linacre: Fara, in your Table, the column labeled "Measure" is the estimate of the Rasch-model parameter corresponding to the raw score. This is the raw score transformed into additive units on the latent variable. The estimates corresponding to maximum and minimum possible scores are theoretically infinite, so Winsteps reports a reasonable value for the extreme Measure. There are no fit statistics for these values because they must fit the model perfectly, so "Maximum" is shown instead of fit values.

For understanding Rasch analyses, please look at the book "Applying the Rasch Model" (Bond and Fox) or at the material on https://www.rasch.org/memos.htm#measess - for Measures see Chapters 5 and 6.
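
To see why extreme scores need special handling, here is a rough sketch for the simplest case: dichotomous items that all have the same difficulty (0 logits), where the measure for raw score r out of n is ln(r / (n - r)). The function name and the 0.3 score-point adjustment are illustrative assumptions, not a statement of what Winsteps does internally.

```python
import math

def score_to_logit(raw_score, max_score, extreme_adjust=0.3):
    """Rough logit measure for a raw score on equal-difficulty items.

    Zero and perfect scores have infinite estimates, so they are pulled
    inward by `extreme_adjust` score points (an illustrative value) to
    give a finite, reportable measure.
    """
    if raw_score <= 0:
        raw_score = extreme_adjust
    elif raw_score >= max_score:
        raw_score = max_score - extreme_adjust
    return math.log(raw_score / (max_score - raw_score))


for r in (0, 5, 10, 15, 20):
    print(r, round(score_to_logit(r, 20), 2))
```

Without the adjustment, scores of 0 and 20 would require dividing by zero or taking the log of zero, which is the "theoretically infinite" case mentioned in the reply.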

fara03: Thanks Mike,

I will go through the chapter & the link that you have provided.
If I have any doubts, I will get back to you.


467. Bias & 2nd Dimension

uve December 3rd, 2011, 1:57am: Mike,

There seem to have been several posts recently speaking of both bias and dimensionality being used in the same context to describe the same thing.

I understand that an item biased against a group, say, a reading comprehension question, will function differently for an English learner, for example. The 2nd dimension might be the idiomatic use of prepositions, as in "I live on 2nd Street." Well, I really don't live on top of the actual street, but if the EL is not aware of the idiomatic usage of English prepositions, then the comprehension failure on the item could be due to the idiomatic factor, which the question was not intended to measure.

My big problem with all of this is that whenever I find what appears to be a biased item, it rarely surfaces as an item loading high either negatively or positively. So that has always made me think bias and dimensionality are very different. Then again, perhaps I'm not picking the right group/s to measure the bias since bias is dependent on the groups I choose.

So, is bias merely a yes/no or strength of bias test while PCA is intended to provide some possible guidance as to the reason? When could bias exist without a 2nd dimension, or is this even possible?

Thanks as always for your insight.

Mike.Linacre: Uve, in Rasch theory, every item has two dimensions. One is the shared latent variable, correlated with the shared latent variable of all the other items. The other is unique to the item, uncorrelated with the unique dimension of any other item. This situation is called "local independence". Similarly for every person.

Of course, perfect local independence is never observed. There are somewhat-shared subdimensions among the items and somewhat-shared subdimensions among the persons.

For convenience, we identify some of these subdimensions with a name: "guessing" and "guessability", "carelessness", "gender bias". We have also devised specific techniques for investigating and quantifying these subdimensions.

But there are always unidentified subdimensions in a dataset, so we use a general-purpose technique, such as PCA, to investigate these. When we find a sub-dimension, we may be able to identify it and so use a technique designed for it. For instance, we may discover a cluster of highly-correlated items that we can identify as a "testlet". We can then collapse this cluster of items into a polytomous super-item.

A PCA analysis may discover a "boy vs. girl" subdimension, in which some items favor boys and some items favor girls, causing the boy items to be slightly correlated in the opposite direction to the slight correlation among the girl items. When we see this situation, we look for confirmation in a DIF analysis.

If only one item has a boy vs. girl subdimension, PCA cannot separate that subdimension from the expected item-unique dimension in the item. We can identify the subdimension because we have information outside of the response-level data, namely the genders of the students. OK, Uve?

uve: Yes, thanks again Mike. This helps.

uve: Mike,

I thought of something else along these lines. You wrote: "If only one item has a boy vs. girl subdimension, PCA cannot separate that subdimension from the expected item-unique dimension in the item."

Supposing we use DGF? I realize if we only have one item, then this may do us no good. But if the bias has deeper roots, it may permeate other items in ways which DIF may not detect. Then again, it seems if DGF could pick up on potential bias, then PCA would likely do the same.

Mike.Linacre: Uve, "if DGF could pick up on potential bias, then PCA would likely do the same."

Yes, but here we have a difference often seen in statistical analysis between a general test and a specific test. The specific test identifies aspects of the data that the general test may not be able to distinguish from background noise.

468. 2nd Dimension Mystery

uve December 4th, 2011, 8:33pm: To All,

I invite everyone to give me their opinions on the nature of the 2nd dimension of this analysis. I have been struggling to define what this might be, but nothing jumps out. I've provided data on the 1st and 2nd contrasts with top and bottom favoring data on the 1st contrast. This is followed by a DGF analysis, item measures ordered by fit and difficulty, and finally a person measure plot of the high loading items versus the low.

Background information: This is a MC 50-item 4 distractor English test given to about 100 5th graders. The 1st contrast reports an unusually high 2nd dimension eigenvalue of 4.5, but I am at a loss as to what it might be. There are no distinct clusters of items at the top or bottom that I can see.

There is the obvious negative correlation problem with item 27 and the very large misfit of item 16. Removing these items only reduced the eigenvalue to 4.2. A simulation run of the data produced an eigenvalue of 2.8, so 4.5 is well above what we should expect. The top and bottom favoring of items by persons shows no particular group affected, so I can't seem to come up with a reasonable interpretation of a problem for a specific group. The 2nd contrast did not uncover anything helpful in my opinion.

I ran several DGF analyses on the different groups, but only the gender version seemed to reveal something. Some of the big loading items are in the Writing Strategies and Vocabulary & Concept Development domains. So perhaps the contrast is related to gender in some way, but I still can't think of what the construct might be.

The person plot of the measures for the items I felt should be plotted doesn't seem to reveal much either.

It is a real mystery to me that such a large departure from unidimensionality should be so vague at the same time. I normally don't see eigenvalues as high as on this exam; however, I have to say that the contrasts and plots I generate are very typical of this one, so this is a problem I run into very often with most of our tests when attempting to assess the nature of a possible 2nd construct.

I'd greatly appreciate any comments from all of you.
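
The simulation benchmark mentioned in the post (an eigenvalue of about 2.8 for data that fit the model) can be sketched in plain Python. This is a simplified illustration, not the Winsteps procedure: it assumes normally distributed abilities and difficulties, uses the generating values in place of estimates, and the function name and defaults are hypothetical.

```python
import math
import random

def first_contrast_eigenvalue(n_persons=100, n_items=50, seed=1, iters=300):
    """Largest eigenvalue of the item-residual correlation matrix for
    dichotomous data simulated to fit the Rasch model."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    difficulties = [rng.gauss(0.0, 1.0) for _ in range(n_items)]

    # Standardized residuals: (observed - expected) / sqrt(model variance)
    resid = []
    for ability in abilities:
        row = []
        for difficulty in difficulties:
            p = 1.0 / (1.0 + math.exp(difficulty - ability))
            x = 1 if rng.random() < p else 0
            row.append((x - p) / math.sqrt(p * (1.0 - p)))
        resid.append(row)

    # Standardize each item column, then correlate every pair of items
    cols = []
    for j in range(n_items):
        col = [row[j] for row in resid]
        mean = sum(col) / n_persons
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n_persons)
        cols.append([(v - mean) / sd for v in col])
    corr = [[sum(u * w for u, w in zip(cols[i], cols[j])) / n_persons
             for j in range(n_items)] for i in range(n_items)]

    # Power iteration approximates the largest eigenvalue
    vec = [rng.random() for _ in range(n_items)]
    eig = 0.0
    for _ in range(iters):
        new = [sum(c * v for c, v in zip(corr_row, vec)) for corr_row in corr]
        eig = math.sqrt(sum(v * v for v in new))
        vec = [v / eig for v in new]
    return eig


print(round(first_contrast_eigenvalue(), 2))
```

Running this for several seeds gives a feel for how large a first eigenvalue can be by chance alone with 100 persons and 50 items, which is the yardstick against which the observed 4.5 is being judged.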

469. Apologies: Messages since Sept. 8 2011 deleted

December 2nd, 2011, 10:58pm: Oops :-(

470. Question about Linacre & Wright 2002 paper

hasezhang June 15th, 2011, 5:12pm: I am reading the paper "Construction of measures from many-facet data" by Linacre and Wright 2002. The equations 1, 2, and 3 make sense to me, but I am not sure how to realize these three models in Facets program. Now my model has 3 facets, examinee ability, rater severity, and item difficulty. The task difficulty is confounded with rater severity, so it's not listed as an extra facet. I set up the code similar to Example 5 presented in Facets Help. Could you please let me know how I can specify for the other two models?


MikeLinacre: Thank you for your questions, hasezhang.

You write: "Now my model has 3 facets, examinee ability, rater severity, and item difficulty. The task difficulty is ..."

My question: Is this 4 facets? Examinees, Raters, Items, Tasks ?

You wrote: "The task difficulty is confounded with rater severity"

My question: Does this mean that each rater rates only one task, and for each task there is only one rater? If so, "rater+task" is only one facet.

You wrote: "how I can specify for the other two models?"

Facets = 3 ; examinee, rater, item

Equation 1 is a 3-Facets rating-scale model:

Models = ?,?,?,R

Equation 2 models each rater to have a personal rating-scale structure

Models = ?,#,?,R ; Each element of facet 2, rater, has its own rating-scale structure

Equation 3 models each rater to have a personal rating-scale structure for each item:

Models = ?,#,#,R ; Each element of facet 2, rater, has its own rating-scale structure for each element of facet 3, item
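
For orientation, the three specifications above can be written in log-odds form. This is a sketch consistent with the descriptions in the reply; the notation is assumed here rather than quoted from the paper: $B_n$ is examinee ability, $C_j$ rater severity, $D_i$ item difficulty, and $F$ the rating-scale threshold between categories $k-1$ and $k$.

```latex
\begin{align}
% Models = ?,?,?,R : one rating scale shared by all raters and items
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) &= B_n - C_j - D_i - F_k \\
% Models = ?,#,?,R : each rater j has its own rating-scale structure
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) &= B_n - C_j - D_i - F_{jk} \\
% Models = ?,#,#,R : each rater j has its own structure for each item i
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) &= B_n - C_j - D_i - F_{ijk}
\end{align}
```

The only thing that changes from one line to the next is the subscript on $F$, which mirrors where the `#` appears in the `Models=` specification.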

hasezhang: Hi Mike,

Thank you so much for the models. It's very clear.

Yes, in our study most of the raters only rate on one task, so we decide to have rater and task treated as one facet.


MikeLinacre: Good, hasezhang. Happy analyzing!

hasezhang: Hi Mike,

I have more questions about my FACETS study. Besides looking at examinee ability, rater severity, and item difficulty effects, we also want to see whether gender of rater and types of tasks play a role in scores. We are interested in whether female raters are more lenient than male raters, and whether raters are more lenient in certain tasks but not others. Can I incorporate these two effects as two extra facets? For the gender effect, I think it's more like a differential item functioning issue, but I don't know how to handle this in FACETS. Or should I use it as a dummy facet and look at its interaction with other facets? How about types of tasks? Another issue concerns my previous post: I said most of the raters in my study rate only one task, but some of them rate two tasks and some tasks are rated by two raters. Each examinee completed only a certain number of randomly selected tasks, not all of them. I tried to use rater and task as separate facets, but there is a data connectivity problem, so I have to combine them as one facet.


MikeLinacre: Thank you for your questions, Hasezhang.

Dummy facets with interaction analyses answer the male/female and rater/task questions.

"I tried to use rater and task as separate facets, but there is data connectivity problem, so I have to combine them as one facet."

Reply: It sounds like you have a nested design. So raters and tasks are separate facets, but group-anchor the raters or the tasks. If the raters are randomly assigned to the tasks, then group-anchor the subsets of raters at 0.

hasezhang: Hi Mike,

Thanks for your suggestions! Raters in my data are not randomly assigned to tasks. Some raters only rate one task, some raters rate two tasks. Similarly, some tasks have only one rater each, while others have two raters each. There is no pattern for the assignment. In this case, is it still a nested design, and can I still use the group-anchor method?

Could you please direct me to some references about this nested design?

MikeLinacre: Hasezhang, sorry, I don't know any references for your design. There may be some in the G-Theory literature. Most judging plans start out from an ideal design, but are altered to match the realities of the situation. Something always seems to go wrong in practice! Even in a tightly-controlled study conducted by ETS they discovered a mistake in its administration too late to remedy it.

hasezhang: Hi Mike,

I have a difficulty explaining bias analysis results. Here are my results. The first two and last two items in the list show significant bias. F means Female group and M means Male group.
Does it mean that Female raters are more stringent than males on the first two items and more lenient for the last two items? But how to explain their Obs-Exp averages are in a different direction? E.g. for Item 5, Female target measure is higher than males, but female's Obs-exp is negative, but positive for males.

| Ta | Target      Obs-Exp Context | Target      Obs-Exp Context | Target   Joint  Welch               |
| N  |  Measr S.E. Average N Gende |  Measr S.E. Average N Gende | Contrast  S.E.      t   d.f.  Prob. |
| 5  |    .42  .01    -.04 1 F     |    .37  .01     .08 2 M     |      .05   .01   6.43  39854  .0000 |
| 3  |   -.08  .01    -.02 1 F     |   -.11  .01     .11 2 M     |      .03   .01   3.27  39848  .0011 |
| 6  |   -.26  .01     .00 1 F     |   -.26  .01     .11 2 M     |     -.01   .01   -.64  39840  .5230 |
| 1  |    .08  .01     .01 1 F     |    .10  .01     .07 2 M     |     -.02   .01  -1.92  39844  .0552 |
| 2  |    .13  .01     .01 1 F     |    .16  .01     .08 2 M     |     -.03   .01  -2.80  39836  .0050 |
| 4  |   -.30  .01     .03 1 F     |   -.24  .01     .09 2 M     |     -.06   .01  -5.91  39821  .0000 |


MikeLinacre: Hasezhang, the "Obs-Exp" difference tells you the direction of the bias. If Obs-Exp is positive, then the bias is causing the observed scores to be higher than expected.

The signs of the measures are controlled by whether the table conceptualizes the bias as "persons are locally more able" or "items are locally more difficult" for this group. So, in the first line, the observed score on item 5 is less than expected for the females, and more than expected for the males. Overall, the males are .05 logits more able on this item than the females (relative to their overall performances on the test), or this item is .05 logits more difficult for the females than the males (relative to the overall difficulties of the items comprising the test).
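
The contrast columns in the table above can be sketched as follows, assuming the contrast is the difference between the two group-specific measures, the joint S.E. combines the two standard errors, and t is their ratio. The function name is hypothetical.

```python
import math

def bias_contrast(measure_a, se_a, measure_b, se_b):
    """Contrast between two group-specific measures, its joint standard
    error, and the t-statistic (contrast / joint S.E.)."""
    contrast = measure_a - measure_b
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return contrast, joint_se, contrast / joint_se


# Item 5, using the rounded values printed in the table (F first, then M).
contrast, joint_se, t = bias_contrast(0.42, 0.01, 0.37, 0.01)
print(round(contrast, 2), round(joint_se, 3), round(t, 2))
```

With the two-decimal values shown in the table, t comes out near 3.5 rather than the reported 6.43. Presumably the report works from unrounded measures and standard errors, so the sketch illustrates the arithmetic rather than reproducing the table exactly.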

hasezhang: Dear Mike,

I recently revisited my previous questions about bias and your kind responses, and they make a lot of sense to me now. However, I came across another question that I don't know how to explain. For the same data analysis as in the previous question, the observed scores for females are consistently higher than for males on every item. My instinct is that overall females do better than males. In general, the females should have a positive obs-exp score difference, and males should have a negative obs-exp score difference. But the bias results gave different directions for different items, e.g. the Target contrast is positive for item 5, while it is negative for item 4. How should I explain this discrepancy? Does it mean that the bias direction has no relationship with observed scores?

Thanks a lot!

Mike.Linacre: Hasezhang, it sounds like females are, on average, more able than the males. This is not bias! On individual items, Facets adjusts for overall male and female abilities before investigating bias. We expect about half the items to favor males and half the items to favor females. There may be bias if the size of the "favor" is large.

472. winsteps item discrimination

kulin July 12th, 2011, 6:51am: Dear forum members,

For the purposes of linking two tests (common item design; 2PL applied for each of the tests separately) following equations can be used:

Theta (transf)=A*Theta+B ; Difficulty (transf)=A*Difficulty+B;
Discrim (transf)=Discrim/A .

My question is: Can we use equations of the same form, for the purposes of linking two tests for which the Rasch model has been applied whereby WINSTEPS first approximation of 2PL item discrimination has been calculated, too (can we use the equation WinstepsDiscrim (transf)=WinstepsDiscrim/A).

If the answer is negative, then my question is:

How can we perform the linking for the Winsteps item discrimination parameter (two tests;common item design)???

Please help! :'(
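
The 2PL transformation quoted in the question leaves the model probabilities unchanged, which is why the discrimination must be divided by A when ability and difficulty are multiplied by A. A small numerical check, with hypothetical values for the person, the item, and the linking constants A and B:

```python
import math

def transform_2pl(theta, difficulty, discrim, A, B):
    """Apply the linear linking transformation to 2PL quantities."""
    return A * theta + B, A * difficulty + B, discrim / A

def p_2pl(theta, difficulty, discrim):
    """2PL probability of success."""
    return 1.0 / (1.0 + math.exp(-discrim * (theta - difficulty)))


# Hypothetical person, item, and linking constants.
theta, b, a = 0.8, -0.3, 1.4
A, B = 1.25, 0.4
theta_t, b_t, a_t = transform_2pl(theta, b, a, A, B)
print(p_2pl(theta, b, a), p_2pl(theta_t, b_t, a_t))
```

The two printed probabilities agree because (a / A) * ((A * theta + B) - (A * b + B)) simplifies to a * (theta - b).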

MikeLinacre: Kulin, the "item discrimination" reported by Winsteps is not a parameter (unlike 2-PL). The Winsteps item discrimination is a description of the slope of the empirical item characteristic curve. It is computed in the same way as the 2-PL item discrimination, but not used to adjust the theta estimates (unlike 2-PL where item discrimination adjusts the theta estimates). So the item discrimination is not part of Winsteps linking. Winsteps linking uses only the item difficulty parameters.

But test discrimination may be part of Winsteps linking. This is like "Celsius-Fahrenheit" linking in physics. We cross-plot the Winsteps difficulties of common items, and the relative discrimination of the two tests is indicated by the slope of the best-fit line.
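
A minimal sketch of the "Celsius-Fahrenheit" line through a cross-plot of common-item difficulties. It assumes the best-fit line is taken through the two means with slope S.D.(y) / S.D.(x); an ordinary regression line would be flatter whenever the correlation is below 1. The function name and the difficulties are hypothetical.

```python
import statistics

def linking_line(x_difficulties, y_difficulties):
    """Best-fit line through the common-item cross-plot, taking the slope
    as S.D.(y) / S.D.(x) and passing through the two means."""
    slope = statistics.stdev(y_difficulties) / statistics.stdev(x_difficulties)
    intercept = (statistics.mean(y_difficulties)
                 - slope * statistics.mean(x_difficulties))
    return slope, intercept


# Hypothetical common-item difficulties from two analyses (exaggerated).
form_a = [-1.2, -0.4, 0.1, 0.7, 1.5]
form_b = [2 * d + 0.5 for d in form_a]
slope, intercept = linking_line(form_a, form_b)
print(round(slope, 3), round(intercept, 3))
```

A slope near 1 suggests the two tests discriminate about equally, so only a shift is needed; a slope well away from 1 suggests the logit units of the two analyses differ in size.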

kulin: Thank you for providing me with a fast answer.

Only one more question: How can I make WINSTEPS item discriminations from two tests (with common items) comparable? Can I do this by means of the "test discrimination linking" procedure as described in your previous post (i.e. by dividing all of the WINSTEPS item discriminations within one test by the slope of the best-fit line from the item difficulties cross-plot)?

My final goal is to use the item discriminations from the whole pool of items (test1+test2) within a single linear regression procedure (dependent variable: item discrimination; goal: identifying predictors of item discrimination). Therefore I am trying to establish a link between the two tests (with the purpose of obtaining a larger item pool for regression analysis).

I am sorry for bothering you with my questions, but there is no one else in my environment who could give me the corresponding answers (and I can't find them in the literature)... :-/

MikeLinacre: Kulin, not sure what you are looking for. The average item discrimination in every Rasch analysis is set to "1.0" if the analysis is reported in logits (or 1/1.7 if the analysis is reported in probits).

Test discriminations differ because of the environment of the test. For instance, the same test given in high-stakes situations usually discriminates more (= has higher test reliability) than the same test given in low-stakes situations. So we need to do a Fahrenheit-Celsius equating to bring the two different test discriminations into agreement. But in both analyses (high-stakes and low-stakes) the average item discrimination will be reported as 1.0.

kulin: Thanks Mike, I finally got it ;D

473. Test Targeting

TAN July 13th, 2011, 4:41am: Hi Mike
I was wondering if there is a classification of test targeting. For example, if the difference between average item difficulty (δ) and average person ability (β) is zero, it will be perfect targeting. And if the absolute value is .5 logits, it will be relatively good targeting, and so on. Thanks very much. TAN

MikeLinacre: Tan, most educational tests (that I have seen) are targeted at around 75%-80% success. 50% success tends to give most students the feeling that they have failed, even before they see the scores.

From the perspective of statistical information, the maximum information in a dichotomous item is 0.25, when there is zero difference between person ability and item difficulty (50% success). For 70% success, the information is 0.21 and the logit distance is ln(.7/.3) = 0.85 logits. For 80% (or 20%) success, the information reduces to 0.16 and the logit distance is ln(.8/.2) = ln(4) = 1.4 logits.
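
These figures follow from two standard formulas for a dichotomous Rasch item: information = p(1 - p), and logit distance between person and item = ln(p / (1 - p)). A short sketch to check them (the function names are chosen here for illustration):

```python
import math

def item_information(p):
    """Statistical information in a dichotomous item at success rate p."""
    return p * (1.0 - p)

def logit_distance(p):
    """Distance (ability minus difficulty) that gives success rate p."""
    return math.log(p / (1.0 - p))


for p in (0.5, 0.7, 0.8):
    print(p, round(item_information(p), 2), round(logit_distance(p), 2))
```

So targeting a test at 80% success instead of 50% costs roughly a third of the information per item, which is the trade-off against the examinees' sense of having succeeded.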

474. urgent questions

dijon2 July 13th, 2011, 3:54am: I've got very urgent questions.
I'd appreciate it if you would help me ASAP.

I've got five facets: examinee(33), gender(2), level(2, i.e., graduate/undergraduate), rater(5), criteria(16).
After running Minifac, my results seem to have problems: 1) they show only undergraduate in Table 6.0; 2) they show only 25 examinees (from 1-25); 3) it reads "there may be 2 disjoint subsets" under Table 3.

I attach my data file here.

MikeLinacre: Quick answers, Dijon2:

1) they show only undergraduate in Table 6.0; 2) they show only 25 examinees (from 1-25);

Yes, that is correct. You have 2,640 ratings. Minifac is limited to 2,000 ratings.

3) it reads "there may be 2 disjoint subsets" under Table 3.

Yes, that is correct, and with 2,640 ratings there are 4 disjoint subsets. This is because your students are nested male/female and also graduate/undergraduate. Please define your measurement model precisely. Now it is:

examinee + gender + level + rater + criteria -> rating

Minifac does not know how you want to partition each examinee's ability between "ability", "gender" and "level"
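
The rating counts explain why exactly 25 examinees appear. A quick arithmetic check, assuming a fully crossed design (every rater rates every examinee on every criterion) with the data ordered by examinee:

```python
examinees, raters, criteria = 33, 5, 16

ratings_per_examinee = raters * criteria            # 80 ratings each
total_ratings = examinees * ratings_per_examinee    # 2,640 ratings in all
minifac_limit = 2000                                # limit stated in the reply

# Whole examinees that fit inside the first 2,000 ratings
examinees_read = minifac_limit // ratings_per_examinee

print(total_ratings, examinees_read)
```

The total matches the 2,640 ratings quoted in the reply, and 2,000 ratings cover exactly the first 25 examinees. If the data file lists undergraduates first, that would also account for only one level appearing in Table 6.0.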

475. Another Application for Disattenuated Correlation?

uve July 12th, 2011, 7:11pm: Mike,

If you recall, I had asked previously about the use of disattenuated correlation process using residuals from items loading high versus low on the 1st component in the dimensionality analysis.

I was wondering if one could use the same process to distinguish item groups as an alternative to using Table 27. If you recall, I had a survey in which the first 9 questions asked about attitudes about writing before taking a writing course and the next 9 asked about attitudes about writing at the end of the course. If I interpret Table 27 correctly, it's providing a statistical measure of the average difference in item difficulty of the two item groups.

That's fine, but I also decided to split the test in two. I created ability measures based on the first 9 and another set of measures on the next 9. I also recorded the test reliabilities for both. I plotted the ability measures of both these subtests for a visual, then correlated the measures. I then followed the formula and divided this correlation by the square root of the product of the subtest reliabilities.

My question is: can I interpret this disattenuated correlation as the degree to which both item groups are telling the same story?

MikeLinacre: Uve, the disattenuated correlation has removed the effect of measurement error. We can also imagine the perfect "same" story. This is the "true" latent trait to which both sets of measures are maximally, but independently, correlated. If so, then the "true" (=disattenuated) correlation of each set of measures with the "true" same-story latent variable = square-root (the disattenuated between-measures correlation).
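
A minimal sketch of both calculations: the disattenuation formula described in the question, and the square-root step described in the reply. The correlation and reliabilities are hypothetical, not taken from the survey.

```python
import math

def disattenuate(correlation, reliability_x, reliability_y):
    """Correlation between two sets of measures with the effect of
    measurement error removed."""
    return correlation / math.sqrt(reliability_x * reliability_y)


# Hypothetical observed correlation and subtest reliabilities.
r_true = disattenuate(0.60, 0.80, 0.75)

# Correlation of each subtest with the shared "same-story" latent variable.
each_with_common = math.sqrt(r_true)

print(round(r_true, 3), round(each_with_common, 3))
```

Because the reliabilities are themselves estimates, a disattenuated correlation can come out above 1.0; in that case it is usually reported as 1.0.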

476. chi square item fit indices

kulin July 11th, 2011, 6:14pm: Dear forum members,

I applied a 2PL model in order to obtain information on item discrimination for
the items from a physics achievement test (4220 students;59 items). For this purpose I have used BILOG-MG.
However, the chi-square item fit statistics proved to be significant for most of the used items (in other words, most items are misfitting).

Earlier, the same achievement data set has been analyzed by using the Rasch model within WINSTEPS.
For this analysis, the values of INFIT statistics proved to be within the 0.8-1.2 interval (in other words, items seemed to fit the Rasch model very well).
The correspondent analysis within BILOG (fitting the 1pl model and choosing the Rasch option) leads to approximately same difficulty parameters, but again the
chi square item fit statistics suggests that most of the items don't fit the model.

According to Deng&Hambleton ( 2008 ) , "...there is not much confidence in these (chi square item fit) statistics because they are very dependent on sample size. With
very big samples as we used in this study, it will appear that none or only a few items will actually fit the data." Further, they recommend to plot a distribution
of standardized residuals in such cases, in order to evaluate item fit. "An item would show good fit with the model if its SR falls into the range (-2, 2)..."

BILOG provides standardized posterior residuals. For my 2PL model, these residuals (for each item) are between 0 and 1 (is my interpretation correct???).

Mislevy and Bock (1990) indicated that "a standardized residual greater than 2.0 may indicate failure of an item response model to fit the data at that point."

Question: Can I use the standardized posterior residuals data (see Table stand_resid_table) as an evidence of satisfying item fit for my 2pl model-
is my interpretation of "stand_resid_table" data correct?

MikeLinacre: Kulin, sorry, I don't know how to interpret that table, perhaps someone else does ... If your sample size is much above 500, then you are probably "over-powering" the fit tests. Suggestion: divide the chi-square by its d.f., if the value is in the range 0 - 2, then the fit is probably OK. If the value is bigger than 2, then this suggests that there is considerable unmodeled behavior in the data. Since 2-PL attempts to fit a model to the data, this suggests that 2-PL is an unsuitable model for those hugely misfitting data.
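
The rule of thumb in the reply can be written as a small helper. The function name and the chi-square values below are hypothetical; the 0 - 2 range is the one suggested above.

```python
def mean_square_flag(chi_square, df, upper=2.0):
    """Return chi-square / d.f. and whether it falls in the 0 - upper range."""
    ratio = chi_square / df
    return ratio, ratio <= upper


# Two hypothetical items, each with 9 degrees of freedom.
for chi_square in (12.6, 35.2):
    ratio, ok = mean_square_flag(chi_square, 9)
    print(chi_square, round(ratio, 2), "probably OK" if ok else "considerable misfit")
```

Dividing by the degrees of freedom turns the significance test into an effect-size-like index, which is less sensitive to a sample of 4,220 students than the chi-square probability is.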

kulin: Thank you for providing me with useful suggestions.

477. Displaying pathway plots

max June 27th, 2011, 8:29pm: Hi all,

I'm trying to generate figures for a publication of a Rasch analysis and need some advice on creating the pathway plots. I'll be doing the item and person plots separately since they're each very busy (I have almost 250 people, and my questionnaires have 12, 11, and 20 items, with 5, 11, and 5 response categories, respectively).

My first question is about the x-axes; I was advised to use the mean square values for the fit of the items because I have enough patients that even reasonable mean squares are statistically misfit. However, to calibrate the items I excluded persons with absolute infit zstd's >3, so is it reasonable to plot the item infit mean square in one plot, and the person infit zstd in another (since I actually did make decisions based on person misfit, but not item misfit)? I plan on including the persons excluded from the calibration in the plot (using values derived from a calibrated anchored analysis), so I think it would look better if there was a definite region along the x-value that was included, and the mean square and zstd scores aren't perfectly correlated...

My other question is about the y-axes; how should I be scaling the model standard errors? I noticed the pathway plots produced by excel don't scale based on the actual error measures, but rather just the relative error measures, so I'm creating my own plots from scratch. I also noticed the errors for the items are much smaller than the errors for the persons, but I'm not sure if the item errors are completely correct since I don't have errors associated with each category threshold, only with the item as a whole (the item means and errors I'm getting from the IFILE, and threshold values from the SFILE). Currently I'm plotting each threshold using the error for the whole item. If I make the diameter of the bubbles equal to the error I get pretty large bubbles in my person pathway plot (mean error=.376), but very small bubbles in my item pathway plot (mean error=.077), is this correct?


uve: Hi Max,

Usually Dr. Linacre responds to all our posts. I hesitate answering simply because I am a mere Rasch model mortal compared to him :) But seeing as how you haven't gotten a reply yet, I'll take a chance and give it my best.

If you type in Misfit Diagnosis in Winsteps Help, you'll see that we usually use Outfit before Infit, and Mean Square before ZSTD. So from my interpretation, this would mean that your Infit ZSTD would be the least important of the 4 measures. You can resample your persons choosing to use, say, only 100 and thus reduce your ZSTD to something more reasonable, but you might want to do this first and then rethink your criteria for removing persons. A caution about resampling: the item measure order should not change significantly. If it does, then it won't work. You can try different methods in Winsteps until you get something acceptable. If not, then perhaps stick with mean square using the help table as a guide. I'm not sure why you want to report persons in ZSTD in one plot and mean square for items in the other. I would keep them the same for easier interpretation. Both mean square and ZSTD are telling the same story in different ways.

You really don't need to start from scratch with your plots. Simply click on one of your bubbles and select the Edit Data Series option. You can rescale your data to be more realistic and do so separately for persons and items in your case. It is perfectly reasonable to have standard errors for persons larger than items. Remember, you've got 40+ items on which to base ability measures for persons, but several hundred persons to build difficulty measures for items, so those standard errors will be much lower.

If you're plotting persons and items then you do want the item error, not structure error, in my humble opinion. However, I suppose you could use the Structure Measures, which are basically the item measures added to the calibration measures. They're reported with standard errors. But it's my understanding that the bubble chart's primary purpose is to provide an item/person map that incorporates fit and/or mean square. If I see that I've got items that are targeted for, say, a high ability level, but the size of the bubble is large (S.E.), then accuracy is called into question for this item. So item error is not a threshold error. It's a measure of the accuracy of the item difficulty on the latent variable. Thresholds are the transition points of the categories for an item. So in a bubble chart, it makes more sense to me to plot it that way.

Anyway, I hope that helps.


MikeLinacre: Thanks, Uve. Max, conceptualizing items with long rating scales is challenging. There are many ways to do this, which is why there are many subtables to Winsteps Table 2 and Winsteps Table 12. Rasch-Andrich thresholds (SFILE=) are not really "category" thresholds. The Rasch-Andrich thresholds are pairwise between categories, but better category thresholds (for most audiences) are the Rasch-Thurstone thresholds that dichotomize the latent variable between higher and lower categories at each category boundary. www.rasch.org/rmt/rmt54r.htm www.rasch.org/rmt/rmt233e.htm

The standard errors of the individual Rasch-Andrich thresholds are awkward because they are interdependent across the rating scale and highly dependent on the category frequencies of the adjacent categories. Most audiences would be confused about them because there is no equivalent in Classical Test Theory or common experience. My advice is to impute the item S.E. to all the thresholds (with a suitable notation) rather than to apply a unique S.E. to each threshold.

max: Thanks for the replies!

With respect to the MNSQ vs ZSTD scores, I'm looking at a few questionnaires and basically just want to report their Rasch properties without being influenced by a few outlier responders, so I excluded based on a test of fit to the model (though I used an arbitrary |ZSTD|>3), but I'm completely up for changing that. I can't find much about appropriate person-fit MNSQ values for exclusion, it seems there's much more about item-fit. I'd worry about excluding 0.5 MNSQ < 2 since that would result in removing a lot of overfit persons, and relatively fewer underfit... Do you have any suggestions?

I was then planning on reporting the item MNSQ regardless of what they are, since I don't necessarily plan on dropping any items (if they're bad, they're bad!).

So I guess my rationale for using ZSTD in one place and MNSQ in another was that I'm actually using it for decision making in one, and not in another. At the same time, I understand that it creates more confusion and would be up for using MNSQ as exclusion criteria if I can :)


max: Hmm, reading back over the page, I see I'm reading it wrong; I thought 2 and .5 were the limits before "degrading", but it looks like there's no lower limit (which I guess makes sense because the lower values are overfit?). Would it be reasonable to exclude <.25 anyway just so they don't overly influence which people are being excluded at the high end MNSQ's?


MikeLinacre: Max, it is unlikely that overfit (low mean-squares) is causing underfit (high mean-squares). Much more likely the other way round! This is because the arithmetic is asymmetric. But please do experiment!
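
One way to see the asymmetry is to compute an outfit-style mean-square by hand. The sketch below assumes dichotomous responses and the usual definition of outfit as the average squared standardized residual; the function name and the probabilities are illustrative, not Winsteps output.

```python
def outfit_mean_square(responses, probabilities):
    """Average squared standardized residual for dichotomous (0/1) responses.

    responses: observed 0/1 scores.
    probabilities: model probabilities of success for the same responses.
    """
    z_squared = [
        (x - p) ** 2 / (p * (1.0 - p))
        for x, p in zip(responses, probabilities)
    ]
    return sum(z_squared) / len(z_squared)


# An expected success on an easy item contributes almost nothing...
print(outfit_mean_square([1], [0.95]))
# ...while an unexpected failure on the same item contributes a very large value.
print(outfit_mean_square([0], [0.95]))
```

Each term is bounded below by 0 but has no upper bound, so a few surprising responses can push a mean-square far above 1, while overfit can only pull it down toward 0.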

478. step distances and slopes of ICCs

lovepenn July 11th, 2011, 11:37am: Mike,

I have two more questions.

As step distances get narrower, the slope of the Item Characteristic Curve gets steeper.
What do the slopes of the ICCs mean? How would you interpret the slopes of ICCs in relation to step distances?

When evaluating response categories, I examined average measures, structure calibrations, step distances, category fit statistics, and overall person and item reliability. After collapsing one category into another because of step distances less than 1.00, I reanalyzed the data, and found that category fit improved and step distances became satisfactory, but that overall person reliability decreased slightly (about 0.01 to 0.03). Is this change considered negligible? Can I still say collapsing categories improves measurement? Or, should I stick to the original scale?

Thank you always for your help.

MikeLinacre: Lovepenn, we are in a trade-off situation.

More categories = higher test discrimination, higher reliability, smaller standard errors, more sensitive fit statistics.

Fewer categories = better fit, easier interpretation, more stable inference.

So we must choose what we want. It is the same in physical measurement. We can weigh ourselves to the nearest gram or ounce. Is that better or worse than weighing ourselves to the nearest kilogram or pound? The choice depends on our purposes.

479. Rasch-Andrich vs Rasch-half-point thresholds

lovepenn July 11th, 2011, 9:02am: Dear Mike,

I'm having trouble understanding the difference between Rasch-Andrich thresholds and Rasch-half-point thresholds. I've read the manual, where you gave a definition for each, but still I just don't get it. I understand that they represent different ways of conceptualizing thresholds, but I don't understand how they differ.

Would you help me to understand the differences between the two and how they are related to each other?

MikeLinacre: Lovepenn, there are several ways of conceptualizing the category-intervals on the latent variable for rating scales.
1) Rasch-Andrich thresholds: category x starts where the probability of observing category x = probability of observing category x-1. Category x ends where the probability of observing category x = probability of observing category x+1.

Imagine 1,000 people whose abilities exactly align with the start of the category. Then, perhaps 300 people will be rated in category x, 300 people will be rated in category x-1, and the other 400 people will be in other categories.

2) Rasch-half-point thresholds: category x starts where the expected average score on the item is x - 1/2. Category x ends where expected average score on the item is x + 1/2.

Imagine 1,000 people whose abilities exactly align with the start of the category. Then some of those people will be observed in each category, but the average of all their ratings will be x - 1/2.

3) Rasch-Thurstone thresholds: category x starts where the probability of observing category x or higher = probability of observing category x-1 or lower. Category x ends where the probability of observing category x or lower = probability of observing category x+1 or higher.

Imagine 1,000 people whose abilities exactly align with the start of the category. Then, 500 people will be rated in category x or higher, 500 people will be rated in category x-1 or lower.

Relationship: the thresholds are mathematically related to each other. If we know the values of one set of thresholds, then we know them all. So the choice is purely conceptual.
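
The relationship can be sketched numerically. The code below assumes an Andrich rating-scale item with categories scored 0..m, item difficulty fixed at 0 logits, and a given list of Rasch-Andrich thresholds; it then solves for the half-point and Thurstone thresholds by bisection. All function names are illustrative and this is not how Winsteps computes them.

```python
import math


def category_probs(theta, taus):
    """Category probabilities for one rating-scale item (difficulty 0) at ability theta."""
    # Log-numerators: 0 for the bottom category, then a running sum of (theta - tau_k).
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + theta - tau)
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, taus):
    return sum(x * p for x, p in enumerate(category_probs(theta, taus)))


def prob_at_or_above(theta, taus, k):
    return sum(category_probs(theta, taus)[k:])


def solve_increasing(f, target, lo=-30.0, hi=30.0, tol=1e-9):
    """Bisection for f(theta) = target, where f increases with theta."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def half_point_thresholds(taus):
    """Abilities where the expected score is k - 1/2, for k = 1..m."""
    return [
        solve_increasing(lambda t: expected_score(t, taus), k - 0.5)
        for k in range(1, len(taus) + 1)
    ]


def thurstone_thresholds(taus):
    """Abilities where P(category k or higher) = 0.5, for k = 1..m."""
    return [
        solve_increasing(lambda t, k=k: prob_at_or_above(t, taus, k), 0.5)
        for k in range(1, len(taus) + 1)
    ]


andrich = [-1.0, 1.0]  # made-up Rasch-Andrich thresholds for a 3-category item
print("Andrich   :", andrich)
print("Half-point:", half_point_thresholds(andrich))
print("Thurstone :", thurstone_thresholds(andrich))
```

For a dichotomous item (a single threshold) all three definitions coincide; with more categories they spread apart, but each set is fully determined by the others, as described above.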

480. Fair averages and measures not lining up

aarti July 6th, 2011, 9:38am: Hi Mike,
Hope you remember me from the courses last year...
Well, I have a weird problem and Matt suggested this forum. We have a few datasets where we've seen this problem and are wondering if it is a bug in the software.
In the most recent case, we have data on 3 facets from a 360-degree leadership survey: leader, rater and items. While the leader and rater reports look ok, the items table is confusing:
1. The measures and fair averages are inversely related (i.e. lower the measure, higher the fair average and vice-versa). They are actually correlated -.78!! This is only for the items, not for the leaders or raters.
2. The ordering of the items doesn't make sense whether I sort them by fair averages or measures.
Please let us know if you have any insights into what might cause this problem. If you need a sample output file I'll send it to you privately by email.

MikeLinacre: Thank you for your questions, Aarti.

1. "lower the measure, higher the fair average" - this may be because of positive and negative facets. Suggestion: make all facets positive (Positive=1,2,3,4,5,...). Then the measures are for ability, easiness, leniency, ... and fair averages must align with measures.

2. "The ordering of the items doesn't make sense whether I sort them by fair averages or measures." - this may be because different items have different rating scales, or patterns of missing data. If different items have different rating scales, or you are using a partial credit model ("#"), then you may need to "pivot-anchor" the rating scales at their axial category, rather than use the Rasch default of the point on the latent variable where the highest and lowest categories are equally probable.

aarti: Hi Mike! Thanks for your responses...
1. We do have positive and reverse-scored (negative) items and have coded them as such. We'll try what you suggest.
2. The items all have the same rating scale but yes there are different patterns of missing data, because we use a computer adaptive method. If this explains it, and we want to use these items in a future administration, should we calibrate them afresh each time instead of using past logits as inputs for the adaptive technology?

MikeLinacre: Yes, Aarti.
1) The fair-averages are for the rescored ratings, not for the original ratings.
2) "The ordering of items does not make sense" - this is still problematic. The ordering will be based on the rescored ratings, but the ordering should definitely make sense with the design you describe. Please zip up and email me your specification and data file(s), along with an indication of what does not make sense.

481. When to use partial credit

uve July 5th, 2011, 5:17pm: Mike,

In my previous example of the poetry course survey, there were 3 item groups: opinion of poetry before and after the course, opinion of the teacher, and opinion of the mentor. I did not group the questions, so each had its own structure. But I could have created three groups for each distinct item category. Or I could even have created just one considering the fact that all items use the same 4 distractors. So this begs the question: what is the best method?

I chose no groups because I felt it was the only way to see how Winsteps compared to SPSS chi-square. But in creating measures, I'm wondering if it's better not to use partial credit and create one item group. I would greatly appreciate any suggestions, examples and resources you know of that could guide me. I'm sure every situation is different :)

MikeLinacre: Partial credit vs. Rating Scale model has many aspects, Uve: https://www.rasch.org/rmt/rmt143k.htm

Suggestion: model the 3 groups (ISGROUPS= in Winsteps), then plot the 3 model ICCs on the same graph (Winsteps, Graphs window, Multiple ICCs). Only treat them as different if they are obviously different.

uve: Thanks! I have plenty of observations for all categories for all questions thankfully. I followed your suggestions and I'm assuming we want Measure Relative to Item Difficulty. Is that correct?

MikeLinacre: Yes, Uve. "Measure relative to item difficulty" places the ICCs on top of each other, making it easier to compare their shapes. We use the statistical technique known as "IOTT" (inter-ocular traumatic test = hits you between the eyes). https://www.rasch.org/rmt/rmt32f.htm :-)

482. Splitting the Test for Dimensionality Analysis

uve June 20th, 2011, 5:09pm: Mike,

In Winsteps Help regarding dimensionality you state the following: "A straightforward way to obtain the correlation is to write out a PFILE= output file for each subtest. Read the measures into EXCEL and have it produce their Pearson correlation. If R1 and R2 are the reliabilities of the two subtests, and C is the correlation of their ability estimates reported by Excel, then their latent (error-disattenuated) correlation approximates C / sqrt (R1*R2). If this approaches 1.0, then the two subtests are statistically telling the same story."

So for example, suppose I have a 50 item test and after running the dimensionality table, 30 load positive and the other 20 load negative. Are you saying to split the test into a 30 item and a 20 item test, run Winsteps and get the ability and reliability measures?

Or after looking at the substantive nature of the items and determining that there are, say, 5 items loading positive that contrast strongly with 5 items that load negative, I create two 5-item subtests and run the correlation with them instead?

I know this is not an easy one to answer, but at what point working away from 1 would the correlation begin to warn us that we might have an issue?
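
The disattenuation formula quoted from Winsteps Help in this post, C / sqrt(R1*R2), can be sketched as follows. The person measures and reliabilities are made-up illustrative values, and the function names are assumptions; in practice the measures would come from the two PFILE= outputs.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists of person measures."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def disattenuated_correlation(c, r1, r2):
    """Approximate latent correlation: C / sqrt(R1 * R2)."""
    if r1 <= 0 or r2 <= 0:
        raise ValueError("reliabilities must be positive")
    return c / math.sqrt(r1 * r2)


# Hypothetical person measures (logits) on the two subtests.
subtest_1 = [-1.2, -0.4, 0.1, 0.6, 1.3, 2.0]
subtest_2 = [-0.9, -0.7, 0.3, 0.2, 1.0, 1.8]

c = pearson(subtest_1, subtest_2)
print(c, disattenuated_correlation(c, r1=0.80, r2=0.75))
```

Because the reliabilities are themselves estimates, the result can exceed 1.0; such values are usually read as "effectively 1.0".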

MikeLinacre: Uve, there are not any set rules about multidimensionality. We must think about the situation. Suppose we discover a test is multidimensional, then what will we do about it?
1) omit some of the items from the test? Example: a few geography items in an arithmetic test.
2) split the test into two, and report two measures? Example: a driving test with practical and written components. We must pass both components to be permitted to drive.
3) accept that the latent variable is multidimensional, but we must make one pass-fail decision on the combined overall performance. Example: certification tests in which strength in one topic area is allowed to compensate for weakness in another topic area. Example: arithmetic test where low performance in division can be compensated for by high performance on multiplication.

So the item split is at the "fault line" through the items. This is unlikely to correspond exactly to zero loading. When we examine the content of the items we are likely to see that there is one group of items (such as "long division") that splits off from all the other arithmetic items ("addition", "subtraction", "multiplication", ...).

Correlation size: if we are at all concerned about a low correlation, then we must look at a scatterplot of the person abilities. Who is impacted? High performers? Low performers? Alternative curriculum? Non-native speakers of the language? .... Example: on a "minimum competency" test, we are not worried about multidimensionality that only impacts high performers. On a "mastery" test we are not worried about multidimensionality that only impacts low performers, etc.

There are multidimensional models that build the correlation into the estimation of the final measures, but it is not clear (to me) that these produce more accurate measures than those estimated using the unidimensional assumption.

uve: Thanks Mike. So it appears the dividing line is very subjective. I can live with that :)

Once I've made my best estimation where that line should be and I run the correlation procedure, is there a point from 0 to 1 where you would begin to say that the two subtests are not statistically telling the same story? I imagine this could depend on the type of test being examined, but our tests are all criterion-referenced dichotomous exams.

MikeLinacre: Statistics can get us started, Uve, but then our decisions become substantive: the purpose for test and how fine-tuned we need the measurement to be. For instance, if we are giving a battery of educational tests, with one test for each of eight content areas, then we would want to verify that each test was confined to its own content area. We would not want a math word-problem to overlap the language-proficiency test or the geography test.

uve: Thanks Mike.

Is an inch tall? Well, if we're talking about a human, then no. But if we're talking about a microbe, then yes. But if I don't conceptually know the dimension of an inch, then I can't make the statement about either. I guess where I'm headed is clarifying what needs to be done at the point just before we begin the substantive process. I never expect statistics to provide definitive answers, merely to point me in a direction. Perhaps I should restate my initial question in this way: can we interpret the correlation value in the same manner as we do with any other such test?

MikeLinacre: That is interesting, Uve. Which comes first? Theory or statistical analysis? In physical science, theory comes first, then the data. A recent example was the hunt for the "top quark". My advice to psychometricians is to imitate the successes of physical science wherever possible. This indicates that we need to decide what we intend to measure (a construct theory) before we worry about reliability, correlations, etc.

"can we interpret the correlation value in the same manner as we do with any other such test?"
Let's assume we are talking about a significance test of a hypothesis. Then the choice of null and alternative hypotheses is crucial. An example in physical science is https://www.rasch.org/rmt/rmt111c.htm - the choice of the "wrong" null hypothesis would have caused Michelson and Morley to reject the hypothesis that the speed of light is a constant. Critics point out that many of the incorrect findings reported in medical journals result from poor choices of the null and alternative hypotheses.

483. Chi-Square & Winsteps Discrepancy

uve July 2nd, 2011, 9:40pm: Mike,

We are trying to assess the effects of a new poetry class on student attitude toward writing and speaking in public. Towards the end of the course, two groups of students were given a survey. The first 9 questions asked students about their attitudes towards writing, poetry, speaking in public, and other related questions before they took the course. The next 9 asked about these same issues now that the course was over. The next 6 questions centered on attitudes toward the teacher. In addition to the course teacher, one group had a mentor who would counsel and guide the students. This group had an additional 10 questions on their survey asking them about their attitudes towards the mentor. I coded the first 9 questions "B" for "before taking this course" and "A" for after the course or "now I..." The teacher questions were labeled "T" and mentor questions "M". Question distractors were, A: Totally Disagree, B: Kind of Disagree, C: Kind of Agree, D: Totally Agree, and were scored 1-4 respectively.

I combined both groups and coded the mentor group "M" and the non-mentor or comparison group "C". I ran the file in Winsteps with the last 10 questions deleted so both groups would be compared only on the 24 questions in common. I then generated Table 28 which confirmed no significant difference between M and C. I then ran a Chi-Square analysis in SPSS which also confirmed this. So it would appear that the mentor had no significant impact on students' attitudes. You can see the output in the attached Excel file.

I then wanted to see if each group had significant changes in attitudes based only upon the 18 questions coded "B" and "A". I created a separate file for the mentor group and one for the comparison group and ran Table 27 for each. The Winsteps output indicated no significance between the two question types for either group. However, when I ran this through the Chi-Square in SPSS, there were a few B/A question pairs that indicated a difference, though most didn't. Most intriguing, at the bottom of the Chi-Square table it seems to indicate that overall there was a difference in attitude. This seems to contradict Winsteps output.

So when comparing groups on the first 24 items, both Winsteps and SPSS agree, but when comparing Before/After items for each group, there is a discrepancy. I realize that with Chi we are comparing frequencies and with Winsteps we are comparing measures. Perhaps the Chi indicates that there is significant difference in the distractor response frequencies of Before & After items, and Winsteps indicates that there is no significant difference in the average difficulty of Before & After items, meaning that if the course had a positive effect, it would have been easier to answer higher on the scale for After items, thus producing a significant difference. But this still seems to be a contradiction between SPSS and Winsteps, or frequencies and difficulty levels. Can this be reconciled? I am curious to know if I am running into one of the weaknesses of using CTT.

Thanks again as always.


MikeLinacre: Uve, as you have noticed, the Winsteps and SPSS chi-square computations appear to be comparing somewhat different aspects of the data. According to the right-most columns in the Excel "Chi-Square Summary" worksheet, the SPSS chi-square is looking at the pattern across the 4 rating-scale categories. This suggests that SPSS is also reporting an internal change in the rating-scale structure. The equivalent in Winsteps would be to incorporate a comparison of Table 3.2 for separate analyses of the two groups.

It is difficult to draw inferences from internal changes in rating-scale structures. Consequently these tend to be ignored, and instead a "pivot point" in the rating scale is chosen between "good" and "bad" categories. Each 4x4 cross-tab then becomes a 2x2 cross-tab of "good" and "bad" responses. This makes inference from the cross-tabs much easier.
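
A minimal sketch of the collapsing step described above: a 4x4 cross-tab of Before/After responses is reduced to a 2x2 cross-tab of "bad" and "good" responses around a chosen pivot point. The counts are made up, and placing the pivot between "Kind of Disagree" and "Kind of Agree" is an assumption for illustration.

```python
def collapse_to_2x2(table, pivot):
    """Collapse a square cross-tab into bad/good by bad/good counts.

    table[i][j] is the count of respondents in category index i on the first
    item and category index j on the second (0-based indices).
    Category indices >= pivot are treated as "good", the rest as "bad".
    """
    collapsed = [[0, 0], [0, 0]]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            collapsed[int(i >= pivot)][int(j >= pivot)] += count
    return collapsed


# Rows: Before item (scores 1-4); columns: After item (scores 1-4). Made-up counts.
before_after = [
    [5, 3, 1, 0],
    [2, 6, 4, 1],
    [1, 3, 8, 5],
    [0, 1, 4, 9],
]

# pivot=2 puts scores 1-2 (disagree) in "bad" and scores 3-4 (agree) in "good".
print(collapse_to_2x2(before_after, pivot=2))
```

The 2x2 table keeps the same total count, so any standard 2x2 test can then be applied to it.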

uve: Mike,

When you say, "The equivalent in Winsteps would be to incorporate a comparison of Table 3.2 for separate analyses of the two groups." This table contains a lot of information, so what should I compare and at what point do differences become an issue?

MikeLinacre: Yes, "at what point do differences become an issue" - this is an unresolved issue in Rasch analysis. Probability curve plots are too much influenced by accidents in the data, so I plot the two model ICCs (for instance, Winsteps Graph screen, Multiple ICCs) and only consider action if the two ICCs are obviously different. Often these curves tell me that different ICCs are really the same one.

"What should I compare?" - We are comparing estimates based on estimates, so the basis of comparison is flimsy. This is another reason to dichotomize the rating scale to focus on the crucial decision-point when making this type of comparison.

484. Guidelines for testlets?

KY.Chiew June 26th, 2011, 3:50pm: Dear Mike,

You listed guidelines to check if a rating scale for polytomous item is functioning effectively.

Linacre, J. M. (2002). Understanding Rasch measurement: Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3 (1), 85 - 106.

Are there guidelines listed for ‘testlets’?


MikeLinacre: KY, testlets are artificial accumulations of items. Those rating-scale guidelines do not apply to them. We do not expect them to act like groups of independent dichotomous items. Thus we expect some or all of the Rasch-Andrich thresholds to be closer together than 1.5 logits. We are not surprised if some thresholds are reversed. We do expect the fit of the polytomous testlet item to be more central (mean-square closer to 1.0) than the fits (mean-squares) of the separate dichotomies. We do expect the average-measures of the respondents with each testlet score to advance with increasing score.

KY.Chiew: Thanks Mike. This is helpful.

I presume I can check this using WINSTEPS? Using the Diagnosis ‘C) Category Function’, ‘A) Item polarity’ and ‘E) Item misfit table’.

For 1) Expect some or all of the Rasch-Andrich thresholds to be closer together than 1.5 logits

In Diagnosis C) Category function, Table 3.2 onwards. The ‘STRUCTURE CALIBRATN’ is the Rasch-Andrich thresholds.

Q: Do you mean that some or all the ‘STRUCTURE CALIBRATN’ values will be less than 1.5 logits?

2) Thresholds can be reversed

This means that it does not matter if the ‘STRUCTURE CALIBRATN’ does not advance with increasing SCORE

3) Expect Fit for polytomous testlet item to be more central (MNSQ closer to 1.0) than fits of separate dichotomies.

In ‘Diagnosis E) Item misfit table’, Table 10.1. The OUTFIT MNSQ and INFIT MNSQ for the polytomous testlet item should be closer to 1.0.


In ‘Diagnosis A) Item polarity’, Table 26.1. The OUTFIT MNSQ and INFIT MNSQ for the polytomous testlet item should be closer to 1.0.

4) Expect Average-measures of respondents with each testlet score to advance with increasing score

In ‘Diagnosis C) Category function’, Table 3.2 onwards. The ‘Average-measures’ is the ‘OBSVD AVRGE’ here, and this should advance with increase ‘SCORE’.


In ‘Diagnosis A) Item polarity’, Table 26.3. The ‘Average-measures’ here is the ‘AVERAGE ABILITY’, and this should advance with increase ‘SCORE VALUE’.

Q: Should I check if the ‘OUTF MNSQ’ for each of the score is close to 1.0?

Q: What if the ‘AVERAGE ABILITY’ is not advancing with increase ‘SCORE’? Can I fix this? Or when is it serious enough that I definitely need to fix this?

Thanks. Much appreciated

MikeLinacre: Thank you for your questions, KY.

"For 1) Expect some or all of the Rasch-Andrich thresholds to be closer together than 1.5 logits
In Diagnosis C) Category function, Table 3.2 onwards. The ‘STRUCTURE CALIBRATN’ is the Rasch-Andrich thresholds.
Q: Do you mean that some or all the ‘STRUCTURE CALIBRATN’ values will be less than 1.5 logits?"

Reply: Yes, the ‘STRUCTURE CALIBRATN’ values (Rasch-Andrich thresholds) may not advance by as much as 1.5 logits for each category.

"2) Thresholds can be reversed
This means that it does not matter if the ‘STRUCTURE CALIBRATN’ does not advance with increasing SCORE"

Reply: Yes. We prefer STRUCTURE CALIBRATN to advance, but the situation is artificial. We already know that the situation has problems, so we accept that we will not be able to convert our imperfect data into perfect data. Our conversion is only into better data.

"3) ...."

Reply: Table 10.1 and Table 26.1 are the same numbers in different orders.

"4) Expect Average-measures of respondents with each testlet score to advance with increasing score"

Reply: Yes, definitely 3.2 (global for the "rating scale" model) and most of 26.3 (local).

"Q: Should I check if the ‘OUTF MNSQ’ for each of the score is close to 1.0?"

Reply: Yes.

"Q: What if the ‘AVERAGE ABILITY’ is not advancing with increase ‘SCORE’? Can I fix this? Or when is it serious enough that I definitely need to fix this?"

Reply: This depends on the situation, and the purpose for the test. First, we need enough observations of the category (at least 10) so that the report is unlikely to be caused by accidents in the data. Then we need to consider the inferences that we intend to make from the estimates. If the inferences are based on the test as a whole, then a few flaws in a few items will not alter those inferences. If we intend to make inferences based on individual items (as in diagnostic situations), then we require the individual items to function in accord with the intended inferences. Collapsing (combining) categories, or treating irrelevant categories as missing data, are common solutions. Another approach is to identify problematic persons in the data and omit them from the analysis (e.g., guessers on a MCQ test).
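
The first check in this reply can be sketched as a small helper: given the score, count, and average measure for each testlet score (as read by hand from a category table), it flags adjacent scores whose average measure fails to advance, skipping scores with fewer than 10 observations. The function name, the input layout, and the numbers are illustrative assumptions.

```python
def non_advancing_scores(rows, min_count=10):
    """Return (lower_score, higher_score) pairs where the average measure does not advance.

    rows: iterable of (score, count, average_measure) tuples.
    Scores observed fewer than min_count times are skipped, since their
    averages may reflect accidents in the data.
    """
    usable = sorted(r for r in rows if r[1] >= min_count)
    flagged = []
    for (s0, _, m0), (s1, _, m1) in zip(usable, usable[1:]):
        if m1 <= m0:
            flagged.append((s0, s1))
    return flagged


# Made-up values: (score, observed count, average ability in logits).
testlet = [
    (0, 40, -1.2),
    (1, 55, -0.3),
    (2, 8, -0.9),   # too few observations to trust
    (3, 60, 0.1),
    (4, 35, 0.0),   # does not advance beyond score 3
]

print(non_advancing_scores(testlet))
```

Whether a flagged pair needs fixing still depends on the intended inferences, as the reply explains; the helper only locates candidates.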

KY.Chiew: Thanks Mike. This is much appreciated. I think I got this now.

Will now read up on collapsing categories, etc.

485. When Person reliability is low: Unidimensional?

KY.Chiew June 24th, 2011, 2:44pm: I am getting confused.

Let’s say I have a 12-item scale. All the items fit the Rasch model (i.e., PTMEA > 0 and Infit and Outfit MNSQ values between 0.5 and 1.5). There was no indication of multidimensionality from a Principal component analysis of the standardised residuals (i.e., Contrasts with eigenvalues < 2.0).

Does it matter what the Person separation reliability or index is when we try to establish ‘unidimensional measurement’? I mean, is it any less unidimensional if the Person separation reliability is .30 than if it is .50 or .80?

And what does it mean when for example if 10% of the sample has a Person ability measure that yielded Person Outfit MNSQ > 2.0? Should I worry that this scale is not functioning well despite all the items fitting the Rasch model?


MikeLinacre: Thank you for your questions, Ky.

1) "Does it matter what the Person separation reliability or index is when we try to establish ‘unidimensional measurement’? I mean, is it any less unidimensional if the Person separation reliability is .30 than if it is .50 or .80?"

Reply: multidimensionality is almost independent of reliability. Reliability is controlled by (i) test length, (ii) sample raw-score variance, (iii) item type. So, if you combine an arithmetic test with a geography test, there will be more items. Reliability will probably increase.

2) "And what does it mean when for example if 10% of the sample has a Person ability measure that yielded Person Outfit MNSQ > 2.0? Should I worry that this scale is not functioning well despite all the items fitting the Rasch model?"

Reply: Perform two analyses:
1) Analyze all the sample. Save the item difficulties (IFILE= in Winsteps)
2) Omit the 10% of the sample with large mean-squares (PDFILE= in Winsteps).

Cross-plot 1) against 2). Off-diagonal items are impacted by the misfit. If this is consequential, then:
From 2) output the "good" item difficulties: IFILE=if.txt
Rerun 1) imposing the "good" item difficulties on the entire sample: IAFILE=if.txt

Now, abilities are estimated for all the sample, but the misfitting person performances do not distort the measurement of the fitting persons.


KY.Chiew: Thanks Mike. Let me understand this.
‘Multidimensionality is almost independent of reliability’ - so, when the data fits the Rasch model, it tells us that it complies with the structure of quantity. But it does not indicate if the unidimensional scale constructed (due to its test length, the item type included, and characteristics of the target sample) is a reliable scale used for decision making on the target sample.

It is the Separation reliability or index that tells us if the unidimensional scale constructed yields Person estimates that can distinguish ‘securely’ between persons with high ability or trait level from those with low ability or trait level. Separation reliability > .50 is decent, but ideally separation reliability > .80 is essential if the measure is used for important decision making.

In other words, the items may fit the Rasch model and comply with the structure of quantity and unidimensional measurement, but the scale may not be at the necessary levels of reliability for decision making. To reach those levels, should the right type of items, suitable for the target sample, be added to the scale (or wrong items removed)? Are the best types of items the ones that can successfully differentiate between the individuals in the target sample?

Still learning 2).

KY.Chiew: For 2)

I understand this as:

STEP A) Perform two analyses to retrieve the Item difficulties (?)
1) Analyze with all the Persons in the sample. Then save the Item difficulties. WINSTEPS ‘Output files’ -> ‘ITEM File IFILE = ‘ will do this. [TICK]

2) Omit the Persons with the large mean-squares (OUTFIT MNSQ > 2.0). Then analyze to retrieve the Item difficulties. ‘PDFILE = ‘ - I’ve done this by entering this directly into the Control file and saving it as a new file. [TICK]

“Cross-plot 1) against 2). Off-diagonal items are impacted by the misfit”
Do I need to do this using “Compare statistics: Scatterplot”? I’m not sure how to tell if the items have gone ‘off-diagonal’. :-/

“From 2) output the “good” item difficulties: IFILE=if.txt”
I assume I pick out the Item difficulties for the items that are not 'off-diagonal'?

“Rerun 1) imposing the "good" item difficulties on the entire sample: IAFILE=if.txt”
Read WINSTEPS help for Rasch Analysis on 'IAFILE = item anchor file'. Afraid this is beyond me ??)

More help?


MikeLinacre: KY, yes, this is complicated. It takes me hours to teach this in workshops and courses.

Any convenient method of comparing the item difficulties from 1) and 2) is OK. I find that cross-plotting the item difficulties is easiest.

"Off-diagonal" - this decision is guided by statistics, but is largely subjective. How tight do you need your measurements to be? We have the same situation in physical measurement. Our heights change during the day due to a "standing" dimension, and our weights vary during the day due to an "eating" dimension. We ignore those dimensions for most purposes, but medical clinicians may consider them to be important for diagnosing ailments.

“From 2) output the “good” item difficulties: IFILE=if.txt”
I assume I pick out the Item difficulties for the items that are not 'off-diagonal'?

Reply: Hopefully you have both substantive and statistical reasons for choosing the definitive set of item difficulties. Probably every item difficulty in 2) is a good difficulty estimate (not perturbed by idiosyncratic persons).

“Rerun 1) imposing the "good" item difficulties on the entire sample: IAFILE=if.txt”

Reply: Have you tried to do this? You will see that the item difficulties in this analysis are the item difficulties from 2), but the person measures are for the whole sample.
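
The comparison of the item difficulties from 1) and 2) can be sketched in Python. This is only an illustration: the item names, difficulties and standard errors are hypothetical values typed in by hand (they are not read from a Winsteps IFILE=), and the cut-off of 2 joint standard errors is an assumed screening value. Because the two analyses share most of the sample, the result is a rough guide, and the final decision remains the largely subjective one described above.

```python
import math

def off_diagonal_items(full, trimmed, se_full, se_trimmed, z_crit=2.0):
    """Return the items whose difficulty shifts noticeably between analyses.

    full / trimmed: item -> difficulty (logits) from analyses 1) and 2).
    se_full / se_trimmed: item -> standard error of that difficulty.
    z_crit is an assumed screening value, not a Winsteps default.
    """
    flagged = []
    for item in full:
        diff = full[item] - trimmed[item]
        joint_se = math.sqrt(se_full[item] ** 2 + se_trimmed[item] ** 2)
        if abs(diff) / joint_se > z_crit:
            flagged.append(item)
    return flagged

# Hypothetical item difficulties from the two analyses
full = {"I1": -1.00, "I2": 0.00, "I3": 1.00, "I4": 0.50}
trimmed = {"I1": -1.05, "I2": 0.05, "I3": 0.20, "I4": 0.55}
se = {"I1": 0.15, "I2": 0.15, "I3": 0.15, "I4": 0.15}

print(off_diagonal_items(full, trimmed, se, se))
```

Items returned by the function are the ones that would sit away from the diagonal in the cross-plot.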

486. Comparing Different Components

uve June 28th, 2011, 4:14am: Mike,

I'm becoming more familiar with the process of using the first component of residuals to examine items that load positive versus negative to investigate multi-dimensionality. My question has to do with the contribution of other components differently than you have mentioned previously.

Is there a possibility that items that load very positive or very negative on, say, component 2 can be causing multi-dimensionality when compared to component 1? Put another way, if I have 6 items loading very positive on component 1, can items loading positive on component 2 be contributing to the multi-dimensionality in addition to the negative loading items on component 1?

MikeLinacre: Uve, this interaction between dimensions could be happening, but a reason for using orthogonal PCA of residuals is to make each component as independent as statistically possible of the other components. Compare this with oblique axes, where factors are deliberately correlated in order to give the maximum discrimination between items loading on each factor.

487. control file in Winsteps - combine cases

Peter_Paprzycki June 27th, 2011, 9:07pm: I have a very basic question as I am beginning my adventure in Winsteps. I am trying to combine cases that relate to the same evaluator in the Winsteps control file, but I cannot find a command to do it, and I would not know where to include it if I had it.

3333333333333333333 A
2222233222232322222 B
2222233222222222222 C
2222222222222222222 D
2222232222322223222 E
3333233233332333332 E
3232223222322222222 E

E in my example is the SAME person, but in the output is being treated as 3 different raters.

Also, in output tables, when I run Table 3.2 - Rating (partial credit) scale, it gives me the breakdown of all individual items, but it does not provide me with the actual all results. If someone would help, I would greatly appreciate!!!

Peter :K)

MikeLinacre: Peter, let's understand the situation.

"E in my example is the SAME person, but in the output is being treated as 3 different raters."

Winsteps analyzes each row (case, subject, person) as a separate set of observations. For instance, E, E, E could be 3 different subjects rated by the same rater, E. If this is the case, then we can get a sub-total for rater E in Winsteps Table 28.

If E,E,E are three equivalent sets of observations, then Winsteps cannot analyze them as one case, but Facets can.

"it does not provide me with the actual all results." - What are the "actual results" that you want? If they are for the items, then Table 13, or for the cases, Table 17, or ....

488. DIF Classification

uve June 13th, 2011, 7:45pm: Mike,

I know something very similar to the following is found in the help menu, but I copied this from a technical report provided by our state education staff. I am trying to create a spreadsheet that will let me flag item DIF by the categories below. This may be more trouble than it's worth considering two things:

1) It appears MH values do not export to Excel from Winsteps
2) Significantly different from 1 is an odd concept for me. I'm not sure I know how I would get this information from Winsteps

I'd greatly appreciate any suggestions. Thanks.

DIF Category Definition (ETS)

A (negligible)
- Absolute value of MH D-DIF is not significantly different from zero, or is less than one.
- Positive values are classified as "A+" and negative values as "A-".

B (moderate)
- Absolute value of MH D-DIF is significantly different from zero but not from one, and is at least one; OR
- Absolute value of MH D-DIF is significantly different from one, but is less than 1.5.
- Positive values are classified as "B+" and negative values as "B-".

C (large)
- Absolute value of MH D-DIF is significantly different from one, and is at least 1.5.
- Positive values are classified as "C+" and negative values as "C-".

MikeLinacre: Thanks, Uve.

It is straight-forward to import values from Winsteps Tables into Excel.
1) Copy the Table to the Windows clipboard (ctrl c)
2) Paste into an Excel worksheet (ctrl v)
3) Excel: "Data", "Text to columns"
4) Adjust the column markers in the Excel wizard.
5) The Winsteps columns are now Excel columns

May we assume the DIF classification is similar to Zwick et al. shown in https://www.rasch.org/rmt/rmt203e.htm ?

"Significantly different from 1" for MH. Here is an approach:

Winsteps reports the MH size and its probability. So, from these we can go backwards using Excel functions:
MH Probability -> MH unit-normal deviate
MH size / Unit-normal deviate -> MH S.E.
(MH size - 1) / MH S.E. -> significance relative to 1 as a unit-normal deviate

uve: Thanks Mike. I got the data into Excel. Here's an example for an item:

[Uve: I have amended this. Mike L.]

MH Prob = .0355 (this is a two-sided probability)

MH Unit normal deviate = 2.1

MH Size = .44 [Uve: are you sure this is in the same units as the ETS criteria? Their criteria are usually in Delta units, not logits]

MH S.E. = Size / Unit-normal deviate = .44 / 2.1 = 0.21

.44 is less than 1.0 so the next test does not apply. Let's pretend the MH size = 1.44, then

(MH size - 1) / MH S.E. = (1.44 - 1) / 0.21 = 2.1. This is a unit-normal deviate.

We probably want a one-sided t-test of significance of MH size above 1

Probability of a unit-normal deviate of 2.1 or higher (one sided) = .02 which is highly significant
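
The same back-calculation can be sketched in Python instead of Excel. The function names are made up for this illustration, and the sketch assumes, as in the example above, that the reported MH probability is two-sided and that the MH size and the comparison value of 1 are in the same units.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()  # mean 0, standard deviation 1

def mh_standard_error(mh_size, mh_prob_two_sided):
    """Recover the MH S.E. from the MH size and its two-sided probability."""
    deviate = UNIT_NORMAL.inv_cdf(1 - mh_prob_two_sided / 2)
    return abs(mh_size) / deviate

def deviate_relative_to_one(mh_size, mh_se):
    """Unit-normal deviate for the hypothesis that |MH size| exceeds 1."""
    return (abs(mh_size) - 1) / mh_se

def one_sided_probability(deviate):
    """Probability of a unit-normal deviate this large or larger."""
    return 1 - UNIT_NORMAL.cdf(deviate)

# S.E. from the reported values, then the pretend size of 1.44 from the post
se = mh_standard_error(0.44, 0.0355)
z = deviate_relative_to_one(1.44, se)
print(round(se, 2), round(z, 1), round(one_sided_probability(z), 2))
```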

MikeLinacre: Uve, in case you did not notice, I responded to your post by editing it. :-)

uve: Thanks Mike. I guess I'm still a bit confused. The .44 is the MH Size reported in the Winsteps DIF output Table 30.1, and the .0355 is the MH Prob from the same output. When you provided the initial formula, I just assumed that's what I needed to get. So are you saying we need to convert the .44 into Delta units first?


MikeLinacre: Uve, we need to be sure we are comparing "apples" with "apples". Winsteps reports in logits. If you are using screening-values from a published source, then please verify that those values are also in logits. For MH, published values are often in Delta units, because those are what ETS used originally. There is a simple conversion from logits to Delta units: 1 logit = 2.35 Delta units.
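
For the spreadsheet Uve describes, the A/B/C rules quoted earlier in this thread could be sketched as follows. This is an illustration under stated assumptions, not an ETS or Winsteps routine: the function name is made up, the input size and S.E. are taken to be in logits and are converted with 1 logit = 2.35 Delta units, "significantly different from zero" is read as a two-sided test at 1.96, and "significantly different from one" as a one-sided test at 1.645. The sign of the label simply copies the sign of the input, so whether it matches the ETS sign convention depends on how the software reports the direction of DIF.

```python
DELTA_PER_LOGIT = 2.35  # conversion quoted in the post above

def ets_dif_category(mh_logits, se_logits, z_two_sided=1.96, z_one_sided=1.645):
    """Classify MH DIF as A, B or C (with sign) from values in logits.

    The critical values are assumptions made for this sketch.
    """
    size = abs(mh_logits) * DELTA_PER_LOGIT      # |MH D-DIF| in Delta units
    se = se_logits * DELTA_PER_LOGIT             # S.E. in Delta units
    differs_from_zero = size / se > z_two_sided
    exceeds_one = (size - 1.0) / se > z_one_sided
    if size >= 1.5 and exceeds_one:
        letter = "C"
    elif size >= 1.0 and differs_from_zero:
        letter = "B"
    else:
        letter = "A"
    sign = "+" if mh_logits >= 0 else "-"
    return letter + sign

print(ets_dif_category(0.44, 0.21))   # the item from the example above
```

With the example values, .44 logits is about 1.03 Delta units, so the item is compared with the Delta cut points rather than with .44 directly.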

uve: I guess what confused me here was the comment you made:

"MH Size = .44 [Uve: are you sure this is in the same units as the ETS criteria? Their criteria are usually in Delta units, not logits]"

All my data comes from Winsteps so this would be logits. As I understand it, the conversion table you provided: https://www.rasch.org/rmt/rmt203e.htm, shows the Delta unit cut points for DIF classification in logits. So we have the table in logits and the Winsteps Table 30.1 output in logits. So using table 30.1 and the ETS table as our guide, we have our apples to apples comparison. Or am I still missing a critical step here?

MikeLinacre: Fine, Uve. I don't have the ETS values memorized, but since it is easy to confuse units (such as probits and logits), it is always worth verifying that all the numbers are in the same units or else another space probe could crash into Mars! http://articles.cnn.com/1999-09-30/tech/9909_30_mars.metric_1_mars-orbiter-climate-orbiter-spacecraft-team?_s=PM:TECH

489. Scale construction: Full vs calibration sample

TAN June 23rd, 2011, 12:20am: Hi Mike

I have all data for Grade 6 students available (n=10,000) and would like to construct a Numeracy scale. Which of the following is the best way to do this?

1. Using the full sample for scale construction.
2. Using a small proportion (e.g. n=500) of the sample (calibration sample) for scale construction.


MikeLinacre: Tan, 500 is probably too small (depending on test length, test targeting, etc.) My guess is that 2,000 is the smallest robust sample size for this design.

But, with very large samples, we have the opportunity to verify the calibrations are the same for different target groups of students. If so, analyze all the students together and perform DIF studies. With a large sample, many DIF effects will be reported as significant, so the focus is on which DIF sizes (differences in item difficulty across student group) are so big that they would impact decision-making.

TAN: Thank you very much Mike. I found this approach very useful. TAN

490. Forming 'testlets': What does it mean?

KY.Chiew June 24th, 2011, 3:30pm: Let’s say I have a scale with 30 dichotomous items.

I found that 8 of the items showed local item dependency (LID) with each other (i.e., items with residual correlations .20 above that of the average residual correlations in the data set).

I elect not to remove the items, but form a ‘testlet’ for them - making a polytomous item with a score between 0 and 8.

Am I right to think that it should now be viewed as a 23-item scale (1 polytomous item + 22 of the remaining dichotomous items)? The ‘testlet’ formed now carries the weight of ONE item in the scale, rather than the weight of 8 items in the previous Model?

Is this the right way to understand and differentiate the two Models?


MikeLinacre: Ky, almost. The 0-8 testlet has the same weight as the original 8 dichotomous items. The range of raw scores is always 30:
1) Originally 30 dichotomous items
2) Now 22 dichotomous items + 1 polytomous item scored 0-8.

But the structure of the randomness in the data has changed. This changes the person estimates. Dependency in the data makes the data more Guttman-like (deterministic). This increases the logit range of the estimates. The analysis with the polytomous item will have a narrower range of person abilities than the original analysis. The order of the person estimates (which matches the raw scores) will not change.

Unless the dependency is huge, the change will not alter decision-making.
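
The recoding itself is simple and can be sketched in Python. The function name and the positions of the 8 dependent items are hypothetical. The sketch only shows that one person's 30 dichotomous responses become 22 dichotomous responses plus one 0-8 score, with the raw score unchanged.

```python
def form_testlet(responses, testlet_items):
    """Replace the listed dichotomous items by one summed polytomous item.

    responses: list of 0/1 scores for one person.
    testlet_items: positions (0-based) of the locally dependent items.
    Returns the remaining dichotomous responses followed by the testlet score.
    """
    members = set(testlet_items)
    testlet_score = sum(responses[i] for i in members)
    remaining = [x for i, x in enumerate(responses) if i not in members]
    return remaining + [testlet_score]

# Hypothetical person: 30 responses, dependent items assumed at positions 3-10
person = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1,
          0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
recoded = form_testlet(person, range(3, 11))

print(len(recoded), recoded[-1], sum(person), sum(recoded))
```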

KY.Chiew: Thanks Mike. I’m getting there.

When you say ‘the structure of randomness in the data’, is this the structure that is indicated by the residual correlations between the items? By forming a ‘testlet’ with items that show local item dependency, this “manages” the structure of randomness in the data more accurately and therefore produces more accurate Person estimates for the data and for the scale?

I guess I’m looking for the right way to explain what happens when we apply a testlet approach to the data.


MikeLinacre: Yes, that is the idea, Ky. Rasch theory expects the randomness to be distributed uniformly in the data. Dependence causes less randomness. When we combine the dependent items, we model the dependency as a rating scale, and accumulate the randomness, so that the randomness of the polytomous responses is closer to the average randomness in the data.

KY.Chiew: Brilliant. Thanks Mike. I got this now.

491. Comparing Two Values with Error

drmattbarney June 17th, 2011, 11:13am: Dear Mike and other Rasch Enthusiasts

I've found some important differences (non-overlapping confidence intervals) between two nice facets in a 4-Facet analysis I've completed. I'd like to report these more precisely in my write up (e.g. p<.05 or p<.001), and it appears that page 46 in the Facets manual's display of the Welch t-test may fit the bill. I'd look at the logit difference divided by the joint standard error (the square root of the sum of the squared standard errors).

Is this the correct way to compare two measures, and report the significance, or is there a better way?

MikeLinacre: Yes, that is the correct way, Matt. Each measure is like the mean of a distribution, and the measure S.E. is like the S.E. of the mean of the distribution. So this becomes a standard t-test. The Welch t-test is more precise than the Student t-test, but the Student t-test is good enough (and may be more familiar to your audience).
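
As a worked illustration, the comparison can be sketched in Python. The function name and the measures are hypothetical. The probability uses a unit-normal approximation, which assumes many observations behind each measure; with few observations, a t-distribution with the Welch degrees of freedom would be the closer match.

```python
import math
from statistics import NormalDist

def compare_measures(measure1, se1, measure2, se2):
    """Return (t, two-sided p) for the difference between two measures.

    The joint S.E. is the square root of the sum of the squared S.E.s.
    p is a normal approximation (an assumption of this sketch).
    """
    joint_se = math.sqrt(se1 ** 2 + se2 ** 2)
    t = (measure1 - measure2) / joint_se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Hypothetical elements: 1.5 logits (S.E. 0.3) versus 0.5 logits (S.E. 0.4)
t, p = compare_measures(1.5, 0.3, 0.5, 0.4)
print(round(t, 2), round(p, 3))
```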

drmattbarney: thanks very much, Mike

492. Two Groups (persons): one anchor group

sprogan June 22nd, 2011, 7:49pm: I need help setting-up my Winsteps command file to analyze my data; I hope someone can help me. I'm not even sure if I'm using the correct terms...

I have 35 test items in my pool of questions. I tried-out my questions on a sample of people to determine initial measure, infit, outfit, point-biserial characteristics, etc. I used this information to help make 3 test forms of 10 items each. Each test form has the same 3-items in common (#3, #13, #30). People in the population I am interested-in then completed one of the three test forms.

I want to re-analyze my data for the entire pool of test items using the two groups of persons; the initial groups who completed all of the items and the group from the population of interest who completed one of the test forms. The second group represents the population I am interested in, so I want to use the second group as the 'anchor' group.

Of course, my 'anchor' data have a lot of 'missing' or "NO DATA" values because I could not ask all 35 questions for all subjects. I've tried to represent the patterns in the data (below) with "0" for "NO DATA" and "1" for "present". The first three characters are the subject identifier. The "." between the subject identifiers indicates that there are more cases that are not shown.

Thank you for any help you can give me!


. 11111111111111111111111111111111111
. 11111111111111111111111111111111111
. 01100000100010000111000000001100001
. 01100000100010000111000000001100001
. 00100100001010101000010010000100010
. 00100100001010101000010010000100010
. 00110001000110010000000100100100100
. 00110001000110010000000100100100100

MikeLinacre: Sprogan, thank you for your question.

An approach is:
1) analyze all 4 datasets separately to verify that they all work correctly.
2) analyze the 4 datasets together using MFORMS= in Winsteps. In each person label put a group-membership code.
3) from 2)
a) PSUBTOTAL= to obtain group sub-totals
b) PSELECT= each group to report it
c) DIF= by group to obtain changes in item difficulty across groups


493. Appropriate use of Facets

neilad June 6th, 2011, 4:44pm: I have an interesting data set and am trying to figure out the best way to analyze the data. What I have are ~100 bilingual adults who have completed a naming test in both English and Spanish. The stimuli consist of 162 nouns and 100 verbs. I intend to analyze noun and verb results separately since they have distinct linguistic differences. However, I originally thought that perhaps I could run a Facets Analysis using English and Spanish as the two facets. My thinking is this would give me some measure of proficiency (i.e. ability) for naming in each language. Am I taking Facets in a direction it was not intended? Thank you for any input you may have. neilad

MikeLinacre: Neilad: a "language" facet of two elements (1=English, 2=Spanish) would support differential item functioning of stimulus-by-language, and differential person functioning of bilingual-adult-by-language. Is this what you want?

neilad: Yes, that is exactly what I was looking for. Many thanks.

MikeLinacre: Great, Neilad. Hope all goes well :-)

494. Directional vs Non-directional Approach

TAN June 17th, 2011, 12:09am: Mike
I am going to locate two Numeracy tests (Year 3 and Year 5) on a common scale. The two tests are linked using 10 common items. Upon checking the stability of the vertically linked items, I noticed that some of the link items appeared to be easier at the lower year level than at the upper year level and vice versa. I applied two different approaches to remove unstable link items:
1. Directional approach: In this approach, those link items that appeared to be easier at the lower year level than at the upper year level were removed from link and were treated as unique items. Applying this approach resulted in a significant growth (.62 logits) from the lower year level to the upper year level.
2. Non-directional approach: In this approach, those link items that were outside the 95% confidence bands were removed from link. This approach led to a small growth (.25 logits) from the lower year level of the scale to the upper year level.
In the context of vertical equating, in contrast to horizontal equating, it is expected that the vertically linked items function differentially between lower and higher year level. Thus, this raises the question of which approach (directional or non-directional) is most appropriate when using vertical equating.

MikeLinacre: Thank you for your question, TAN.

1. Directional: this makes sense, especially if you can explain the change in difficulty when you look at the content of the items.

2. Non-directional: this depends heavily on your choice of trend line around which the 95% confidence bands were drawn. Please carefully select the position of the line. There are only 10 data points, so the position of the line is easily influenced by accidents in the data. For instance, are the outliers all to one side of the line? Then the statistical best-fit line may not track any of the paired-item difficulties well.

In summary, as Albert Einstein remarked, "It is really strange that human beings are normally deaf to the strongest arguments while they are always inclined to overestimate measuring accuracies."
Albert Einstein in a letter to Max Born, quoted in Paul Feyerabend, "Against Method", 1975, p. 75

495. Facets ability estimation vs total raw score

Mokusmisi May 31st, 2011, 7:40pm: Hi there! The question I'm about to post is in connection with ability estimation in a language exam situation applying IRT using Facets with its JMLE procedure.
In simple English, if I'm not mistaken, Facets estimates candidate ability on the basis of the parameters of each candidate (ability) and each test item (difficulty). As the same total raw score can be achieved through different response patterns, candidates with the same total raw score can have different levels of ability. Hence the difference between CML and UCON.
Suppose that you enter two extra parameters into the system: Item discrimination (which really is sample dependent) and task difficulty. Suppose further that you have four tasks in one test.
Wouldn't it be logical to assume that candidates with the same total raw score will end up having different levels of ability in logits (or percentages for that matter)? I mean, as long as there is variance in item difficulty / discrimination / task difficulty and candidate ability, and in candidate response patterns?

MikeLinacre: Thank you for your email, Mokusmisi.

When candidates obtain the same raw scores on the same items on the same tasks rated by the same rater (so that their circumstances are identical), then they will obtain the same Rasch measures. But their fit statistics will differ.

This is fundamental to Rasch models, because raw scores are the sufficient statistics for the Rasch estimates of the Rasch-model parameters. See, for instance, https://www.rasch.org/rmt/rmt32e.htm - "Rasch Model from Counting Right Answers: Raw Scores as Sufficient Statistics"
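
The sufficiency of the raw score can be illustrated with a small Python sketch. It is not the Facets estimation routine: it assumes dichotomous items whose difficulties are already known (the values are hypothetical) and estimates one person's ability by maximum likelihood with Newton-Raphson. The likelihood equation only involves the sum of the responses, so two different response patterns with the same raw score give the same estimate.

```python
import math

def rasch_ability(responses, difficulties, tol=1e-8, max_iter=100):
    """Maximum-likelihood ability (logits) for one person, items known.

    Solves sum(x_i - P_i(theta)) = 0, i.e. raw score = expected score.
    Extreme scores (zero or perfect) have no finite estimate.
    """
    score = sum(responses)
    if score == 0 or score == len(responses):
        raise ValueError("no finite estimate for an extreme score")
    theta = 0.0
    for _ in range(max_iter):
        probs = [1 / (1 + math.exp(-(theta - d))) for d in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1 - p) for p in probs)
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical item difficulties
easy_items_right = [1, 1, 1, 0, 0]           # raw score 3
hard_items_right = [0, 0, 1, 1, 1]           # raw score 3, different pattern

print(rasch_ability(easy_items_right, difficulties))
print(rasch_ability(hard_items_right, difficulties))
```

The two patterns differ in fit (the second is the more surprising one), not in the estimated measure.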

Mokusmisi: Thank you very much for your response, I've found it very useful. However, I must confess it created more problems than it solved for me.
I see that in the Rasch-model, which I understand to be very similar to the one-parameter logistic model (although I know I'm making lots of people really angry now), the total raw score is the sufficient statistic for the estimation of latent ability measures, like you said, also it was in the article you gave me the link to, and as Verhelst, Glas, and Verstralen (1995) state it in their Manual to OPLM. All this might sound counterintuitive, but I think I understand why this should be so.
In the case I was writing about above, the test was made up of dichotomous items, organised into four task (so item independence might be violated to some extent). The items are proven to differ in difficulty AND discrimination. Also, as Facets tells me, the tasks are dissimilar in difficulty, too. Raters don't enter into the picture then.
Now, my problem is that I thought it all boiled down to the estimation procedure, i.e. that Facets uses JMLE where each individual enters the calculation with their ability level and each item with its difficulty level, discrimination index, plus the task where it belongs. I understand, and probably this is where I'm wrong, that these facets serve as weights; therefore, I thought the ability would be the same for each candidate with the same weighted score rather than the unweighted raw score.
Another thing that mixes me up is an article published in Modern Nyelvoktatas, 2009 (4). It's an article by Szabo, Gabor. Sadly, it's in Hungarian, but let me translate a few sentences: "And its importance lies in the fact that while the total raw score can be achieved through various response patterns, and it does not contain any information about which items the candidate got right, ability levels provide a more realistic picture of the candidate's attainment level". And then a little later a graph is attached with a line that stands for the total raw scores and another for the ability levels: And it clearly shows that the same raw score can result in different levels of ability. In a sentence explaining this, the writer says: "All this well illustrates that the same total raw score does not necessarily mean equal ability levels". (These are my translations, but you'll have to trust me on them).
Finally, and this is going to be my last point, if the same total raw score yields the same ability levels, i.e. we're basically applying the Rasch-model, we need to have acceptable levels of model-to-data fit. And in my experience, this almost never happens: The items discriminate to differing degrees (and as a consequence, item difficulty immediately becomes a painfully local measure, where a poorly discriminating but easy item can be more difficult to get right for a high-ability candidate).
I know it's a lot, and I'm afraid probably there's lots of nonsense in my words, but you'd help me a great deal with an answer (or a link).

MikeLinacre: Thank you for your question, Mokusmisi.

You wrote: "I thought the ability would be the same for each candidate with the same weighted score rather than the unweighted raw score."

My question: what are the weights, and how are they being modeled in Facets?

You wrote: "All this well illustrates that the same total raw score does not necessarily mean equal ability levels".

Comment: there are some estimation methods for Rasch models that do give different estimates for the same total raw scores on the same items. Most end-users find this confusing (even for 3-PL models). I have never seen pattern-scoring of ability applied in a high-stakes testing situation.

You wrote: "we need to have acceptable levels of model-to-data fit."

Comment: What is acceptable? In both physical measurement and psychometric measurement "acceptable fit" depends on the purpose of the measures. In physical measurement, "acceptable fit" (= tolerances) for high-precision engineering is much tighter than "acceptable fit" for measuring ingredients when cooking. "Reasonable mean-square fit statistics" - www.rasch.org/rmt/rmt83b.htm - suggests some acceptable fit ranges for different Rasch measurement situations.

Mokusmisi: Hi Mark, thank you again, esp. for responding to me so fast. I'll try to clarify what I meant and how.
First of all, in my terminology and in the case I described I thought the weights would be a) item discrimination, and b) task type. And I thought they would work as factors to multiply the difference between candidate ability and item difficulty by.
In an item-response matrix, I'd picture this as each dichotomously scored item gets a weight for item discrimination, and another one for task type. This way the total score would change from the simple sum (of say 40, in the case of 40 items) to the weighted sum, i.e. sum of item*discrimination*type. And then each person with the same total weighted score would get the same estimated ability in logits.
Second, when I said acceptable levels of fit, I basically meant a p value, like in the case of a t-test. A test to see if the model is valid. Since then, I think I'm getting more of what you are trying to point out to me: In the Manual to Facets I've found that the model is assumed to be valid, and if there are discrepancies, it means the test taker or the item has poor fit.
Thanks for the link, by the way, I find it very useful. One remark, though: In an article in Cambridge ESOL Research Notes (Issue 35: March, 2009) Vidakovic and Galaczi say this: "Only 3 raters out of 24 were found to be inconsistent, with the degree of infit higher than 1.5. Four raters had infit values lower than 0.5, which indicates that they exhibited the central tendency, i.e. mostly used the middle part of the scale." Apparently, their tolerance interval is 0.5 - 1.5. And although I see why somebody might puzzle over whether to use the values from the "rating scale" line in your table of reference, or the "judged", since both might be the case if you take oral proficiency testing with two examinees, 0.5 - 1.5 does not feature anywhere.
Next, I am aware that the 3PLM has a pseudo-chance parameter to compensate for guessing. I don't know why strictly speaking this is not a logistic model, but I can live with it (it's probably the mathematical properties).
Finally, let me attach this article I told you about: I know it's in Hungarian, but the graph itself is universal thank God. It's from a high-stakes test and its sample is relatively small as you'll find on the horizontal axis.
In the end, suppose I accept that everyone with the same total raw score gets the same ability level with possibly and probably differing degrees of fit: What do I do with the candidates that paid for the exam but fall out of the range of reasonable fit? While I can simply cross out an item and delete it from the bank and later use, I cannot tell a candidate (or their mum) to go home and try again with more credible responses next time, can I?
As always, thanks in advance for your very professional and informative comments on my issues.

Mokusmisi: So so sorry, I meant to say Mike, not Mark. Silly me.

MikeLinacre: Mokusmisi:

You wrote: "in my terminology and in the case I described I thought the weights would be a) item discrimination, and b) task type. And I thought they would work as factors to multiply the difference between candidate ability and item difficulty by."

Reply: Now I understand, but a model with this weighting is not a Rasch model. It would be an extended 2-PL IRT model. This model would produce different ability estimates for different response patterns.
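
A minimal Python sketch shows why. It assumes a plain 2-PL model with known item parameters (the difficulties and discriminations below are hypothetical, and the task-type weight is left out), and it is not something Facets estimates. In this model the likelihood equation involves the discrimination-weighted score, so two patterns with the same raw score but different weighted scores give different ability estimates.

```python
import math

def two_pl_ability(responses, difficulties, discriminations,
                   tol=1e-8, max_iter=200):
    """Maximum-likelihood ability under a 2-PL model with known items.

    Solves sum(a_i * (x_i - P_i(theta))) = 0 by Newton-Raphson.
    Steps are limited to 1 logit to keep the iteration stable.
    """
    theta = 0.0
    for _ in range(max_iter):
        probs = [1 / (1 + math.exp(-a * (theta - d)))
                 for a, d in zip(discriminations, difficulties)]
        gradient = sum(a * (x - p)
                       for a, x, p in zip(discriminations, responses, probs))
        information = sum(a * a * p * (1 - p)
                          for a, p in zip(discriminations, probs))
        step = max(-1.0, min(1.0, gradient / information))
        theta += step
        if abs(step) < tol:
            break
    return theta

difficulties = [-1.0, 0.0, 1.0]       # hypothetical
discriminations = [0.5, 1.0, 2.0]     # hypothetical
pattern_a = [1, 1, 0]                 # raw score 2, weighted score 1.5
pattern_b = [0, 1, 1]                 # raw score 2, weighted score 3.0

print(two_pl_ability(pattern_a, difficulties, discriminations))
print(two_pl_ability(pattern_b, difficulties, discriminations))
```

With all discriminations set to the same value, the two estimates coincide again, which is the Rasch case.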

You wrote: "What do I do with the candidates that paid for the exam but fall out of the range of reasonable fit?"

Reply: This depends on the purpose for the test, and the policies of the Examination Board. For instance, on a driving test, severe misfit probably means failure by a generally competent driver on an easy item = fails the test.
On an MCQ test, severe misfit often means guessing. If the test is intended to place students into classrooms at their correct instructional level, then guessed responses are coded as missing data. We don't want a student to guess their way into a classroom where the instruction is too advanced for the student.

Fit ranges: "0.5 - 1.5" - these correspond to ranges based on measurement theory. Between mean-squares 0.5 and 1.5, the item is productive for measurement (>0.5), but not dominated by unmodeled behavior (<1.5).

Mokusmisi: Dear Mike, All perfectly clear now, and thank you very much indeed. You've been a great help to me. I don't know if the article I attached reached you at all, and also as I said it was in Hungarian. The point I'm trying to make is that in my country it's fiendishly difficult to make conversation like this for two reasons:
a) People, even those that label themselves measurement professionals, often know little or nothing about IRT (or CTT for that matter);
b) if they do know a little or a lot, they are rather unwilling to share their knowledge for fear that you might threaten their position in the profession in Hungary.
I know it's a sad story and it's personal, too: I just wanted to tell you why I'm so happy that finally I got some answers to questions that were rather blurred for me.
Thanks again.

MikeLinacre: Thank you for this background information, Mokusmisi. I wish you success :-)

Mike L.

Mokusmisi: Dear Mike, A new problem seems to have emerged and I'd highly appreciate it if you could verify my hunch.
The thing is that the marketing manager of the exam office I work for finds it difficult to accept the fact that percentages based on a candidate's total raw score will be different from logit-based percentages (after standard setting, of course). It's really just to the margin, but he firmly believes that the tests are always good meaning more or less the level required, the candidature is always the same ability because of their large numbers, and the problem that arises from the differences in percentages is rooted in some inherent mistake in the calculations or the theory or both.
Now, he's recently pointed it out to me that while in the majority of cases the difference between the raw score-based and the logit-based percentages can be explained in case a complaint should be filed, there are differences he couldn't explain. In his wording it's more than 10%.
Another thing is that we've recently found that as a certain testlet was more difficult than ideal, fewer points will suffice to get a pass. And this led us to the problem I'm trying to phrase: As I understand test difficulty is an average. In other words, a test or testlet may be difficult for some candidates, match the ability level of another, and be easy for a third. On average it may well be a difficult test, but there will always be testees for whom the test proved little challenge. Further, I'm hypothesising that except for extreme cases, the smallest absolute difference (by which I mean squaring and then taking the square root of the difference values) between the raw score-based and the logit-based calculations is going to be where candidate ability matched test difficulty. Is this correct?
In a graph it would look like there's a straight line for the raw score %, and a wavy line for ability % with a steep rise first, this comes to a gradual rise in the middle, and then at the higher ability levels it rises dramatically again. And it's where the two lines cross in the middle that I can find the average test difficulty.
As a further consequence, I don't think we punish high ability candidates if they get lower ability % than raw score % - In my opinion it's exactly that the test was easy for them even though on average it was a difficult one. (I understand that extreme high and low ability candidates should be disregarded in this respect because ability estimation is less precise at the two ends of the spectrum).
Thank you so much.

MikeLinacre: Mokusmisi, what are "logit-based" percentages?

The general relationship between logits and raw scores is a logistic ogive. If the data are complete, and we plot "raw scores" vs. "logit ability measures" for all the candidates, then we will see one logistic ogive. The raw scores have strong floor and ceiling effects. The logit estimates have much reduced floor and ceiling effects.
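
A minimal sketch of that ogive for complete dichotomous data, assuming the item difficulties are already known (the five values below are hypothetical). For each raw score it finds the ability at which the Rasch expected score equals that raw score, which is how the maximum-likelihood estimate for a raw score is defined. The function names are illustrative only.

```python
import math

def expected_raw_score(ability, item_difficulties):
    """Sum of Rasch success probabilities = expected raw score at this ability."""
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in item_difficulties)

def ability_for_raw_score(raw_score, item_difficulties, lo=-10.0, hi=10.0):
    """Bisection search for the ability whose expected score equals raw_score."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, item_difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

item_difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical, in logits

# Extreme scores (0 and 5) have no finite estimate, so they are left out.
score_to_logit = {
    raw: ability_for_raw_score(raw, item_difficulties)
    for raw in range(1, len(item_difficulties))
}

for raw, logit in score_to_logit.items():
    percent = 100.0 * raw / len(item_difficulties)
    print(f"raw {raw} ({percent:.0f}%) -> {logit:+.2f} logits")
```

One raw-score point is worth more logits near the ends of the score range than in the middle, which is the floor and ceiling effect showing up in the raw scores.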

If the data are incomplete in different ways for different candidates because they are administered different items or different testlets, then there will be a different raw-score-to-logit ogive for each different combination of items. The raw scores and the percentages based on them will not be directly comparable, but the logits will be directly comparable if all the candidate abilities are estimated together in one analysis.

Does this help?

Mokusmisi: Well, the thing is that this marketing manager has his own views about what is acceptable for him and what is not: And when he finds that there aren't enough people who pass, the calculations must be mistaken because the sample is large enough. He won't accept the fact that it's not all about how many candidates you have, but more about what ability levels they have.
And this is just the tip of the iceberg. Let me clarify this issue a little bit more so you can help me ;)
What he does is this: Take a test of 40 items. Sadly, four of these proved to be useless in the live exam, so they were simply eliminated from the calculations (whether this is good practice or not). Now, he converts these corrected total raw scores into percentages like this: n*100/36.
Logit-based ability levels are calculated using Facets. It's an incomplete design in the sense that all candidates on one administration take the same items, but their achievement levels are compared with earlier levels and items in an item bank through anchor items (actually anchor tasks). Therefore, comparing examination periods with one another we can say that the May examination was slightly more difficult than earlier testlets, or than ideal.
(In my language it basically means taking the standard in one session, say in March, in logits, and apply this value as your standard in May as long as the two sets are linked).
The concept of ideal is the following: Given that the pass mark (standard) is always set at 60% because of the internal regulations of the exam office, ideal is when the candidate that answered 60% of the items correctly just about passes. In our case I'll say one had to have 50% of the items to pass based on qualitative judgement as well as on a comparison based on anchoring.
Where it gets complicated is this: I think my manager uses norm-referenced concepts for criterion-referenced tests; and I'm pretty sure that the NR concept of test difficulty is a mean value. And so, he claims that each candidate's raw score percent should be weighted by 60/50, regardless of ability level.
By contrast, in criterion-based testing you always have to consider the standard, plus according to IRT test difficulty - to me - sounds like a local measure: The test is difficult for candidates at certain ability levels, but still easy for candidates beyond this ability level. Consequently, I don't see why he labels it punishing better candidates when they get lower percentages based on logits than they would based on their raw scores.
I'll get back to my original question now: I think that as long as the raw score-based percentages are based on a reliable standard that is consistently applied; and as long as the logits are converted into percentages based on consistently working anchor items and again a reliable standard that is the same as in the case of the raw score, then a graph with the raw scores on the x axis and % on the y axis would produce a straight line for the raw score % values and this ogive for the logits converted into % values AND the two would cross where test difficulty and candidate ability were equal. Beyond this point, raw score conversions will be higher; and below this point, logit conversions will be higher.
Does this make sense at all? In a nutshell, all I'm trying to do is explain why in some cases raw score based percentages are higher and in others logit based percentages are higher. I'm sure it sounds idiotic, but believe me: After I've got the definitive answer it'll be a million times more difficult to explain to an absolute lay audience who have no respect for even some of the fundamentals of statistics or psychometrics.
And once again, thanks for reading my lines and giving me very informative responses all the time.

MikeLinacre: Sorry, Mokusmisi, but this is outside my expertise. Perhaps one of the consultants at www.winsteps.com/consult.htm can help you.

Mokusmisi: Thanks anyway, you've helped me an awful lot; and quite frankly, if it is outside your expertise, it ought to be something a marketing manager should be ready to take without a challenge. I still think it's not so much about your inability to answer my question as my inability to put it clearly. Thanks again.

Mokusmisi: Just a quick note: You told me you probably didn't understand what I was going on about. Actually, this person I was telling you about just wrote me an e-mail, and apparently I don't understand it, either. He keeps repeating the word "achievement" without any reference whatsoever to a standard, and he clearly denies that the word "ability" is meaningful. Perhaps this is why my letters to you were so depressing.

MikeLinacre: What a struggle, Mokusmisi :-(

496. Using IRT for Pre/Post Control Group Design

Steve May 17th, 2011, 3:41am: Hello - I am relatively new to IRT. Other than completing about 70 pages of an IRT book I have only been exposed to a few high level overviews. I am working on running an experiment for my dissertation and was looking at using a pretest-posttest control group design and basically using an ANCOVA to distinguish between the control and treatment groups.

I am enjoying IRT thus far and see the applicability for my dissertation. I feel relatively well regarding scale construction and basics of Rasch model at this point. Before I abandon classical methods, I wanted to make sure that IRT is a good option for me to complete my scale construction currently and then run my experiment later. If IRT has advantages, could you provide me with a basic idea of how to do that and be able to determine significance between the groups. I would appreciate a more in-depth look if you have any references to suggest.

Thanks for your time :) - Steve

MikeLinacre: Steve, in your application the primary advantage of Rasch methodology over raw scores is that Rasch linearizes the raw scores into Rasch measures. Raw scores can have strong ceiling and floor effects. Rasch will also validate the functioning of your test(s).

If you can identify the members of the control and treatment groups, then the mean measures and S.D.s of each group on the pre- and post-tests can be computed. Then, from these 4 pairs of summary statistics, effect sizes etc. can be calculated. OK?
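
As a rough sketch of that last step, an effect size such as Cohen's d can be computed directly from the group means and S.D.s of the Rasch measures. The summary values below are hypothetical, and the pooled-S.D. form of d is just one common choice, not something prescribed in this thread.

```python
import math

def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_variance = (
        (n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2
    ) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_variance)

# Hypothetical (mean, S.D., count) of person measures in logits.
summary = {
    ("control", "pre"): (0.10, 1.00, 50),
    ("control", "post"): (0.30, 1.00, 50),
    ("treatment", "pre"): (0.10, 1.00, 50),
    ("treatment", "post"): (0.80, 1.00, 50),
}

# Treatment vs. control at each time point.
pre_effect = cohens_d(*summary[("treatment", "pre")], *summary[("control", "pre")])
post_effect = cohens_d(*summary[("treatment", "post")], *summary[("control", "post")])

print("pre-test effect size: ", round(pre_effect, 2))
print("post-test effect size:", round(post_effect, 2))
```

An ANCOVA on the individual person measures (post-test measure with pre-test measure as covariate) would use the same linear measures, just at the person level rather than the summary level.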

Steve: Yes, that makes sense. Thanks!

Steve: I am currently calibrating an instrument (have played around with Winsteps with the initial 120 participants) and should have a total sample size of just over 200. From another post that I read, I got the impression that I would just add the data from my experiment's participants' pre and post test to my existing file and rerun. From that, I would get the 4 pairs of summary statistics you had mentioned in your post?

Also, in my current practice calibration with the 120, I am trying to figure out how to best change my instrument such that I have a good instrument for my study. I am currently looking at INFIT and OUTFIT to keep close to 1 (assuming .7 to 1.3 is ok??) Then looking at Tables 10.4 and 10.5 to see if any questions are producing the problem. Are there other suggestions for optimizing?

Thanks again for your time!

MikeLinacre: Steve: summary statistics. If you include codes for participant-group in the person labels, you can get summary statistics by group code.

For your instrument, also take a look at dimensionality (Table 23) to verify that all the items are loading on the same dimension well enough for practical purposes.

Steve: Will do - I have checked out the "Introduction to Rasch Measurement" and have found that quite helpful.

497. Sampling with Winsteps

uve May 31st, 2011, 2:14am: Mike,

Is there a way to randomly sample a specified number of persons in Winsteps? I see the simulated file output option, but this alters the scores/measures. I looked at PSELECT but did not see anything there.

MikeLinacre: Uve, the sampling option allows for resampling from the person records. A simpler option is to take every nth person record. This is from: www.winsteps.com/winman/format.htm

Example 7: Pseudo-random data selection

To skip every other record, use (for most situations):
FORMAT=(500A, /) ; skips every second record of two
FORMAT=(/, 500A) ; skips every first record of two

You have a file with 1,000 person records. This time you want to analyze every 10th record, beginning with the 3rd person in the file, i.e., skip two records, analyze one record, skip seven records, and so on. The data records are 500 characters long.

FORMAT = (/,/,500A,/,/,/,/,/,/,/)


FORMAT = (/,/,100A2,300A1,/,/,/,/,/,/,/) ; 100 2-character responses, 300 other columns
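
For anyone who prefers to thin the data file before it reaches Winsteps, the same every-10th-record selection can be sketched outside Winsteps. This is only an illustration: the record labels are made up, and `select_every_nth` is a hypothetical helper, not a Winsteps feature.

```python
def select_every_nth(records, first, step):
    """Keep the record at 1-based position `first`, then every `step`-th after it."""
    return records[first - 1::step]

# Stand-in for 1,000 person records read from a data file.
records = [f"person{i:04d}" for i in range(1, 1001)]

# Same pattern as FORMAT = (/,/,500A,/,/,/,/,/,/,/):
# skip two records, keep one, skip seven, and repeat.
sample = select_every_nth(records, first=3, step=10)

print(len(sample), sample[:3])
```

Because the slice keeps whole records unchanged, the observed responses of the selected persons are exactly those in the original file.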

uve: Thanks Mike. The reason I asked was in response to an earlier question in which I noticed small MNSQ's were reported highly significant by ZSTD. You mentioned sampling a smaller subset of the persons.

But the list of items looks completely different after sampling. I've attached the comparison. I was hoping the smaller sample of persons would provide a more useful ZSTD for the same items, but apparently the sampling process is so dramatic, it's almost like looking at a completely different test. Did I do anything wrong here? Is there a better way to sample? Should I make multiple attempts until both analyses have close to the same items in the fit order?

MikeLinacre: Uve, neither Firefox nor Internet Explorer downloads the attachment correctly. Please .zip it, and attach it to a new post. Thank you.

498. item discrimination parameter

kulin May 31st, 2011, 7:05am: Dear forum members,

Eighth grade students from my country participated in two large scale assessments that measured science achievement. In both cases the students sample was representative.
Study A included 4220 students and 59 physics items, and study B included 1377 students and 64 physics items. Students from consecutive generations participated in both studies.


The first objective of my study has been to identify predictors of physics item difficulty. I accomplished this task by: 1) virtual test equating (thanks to Mike Linacre for providing me with the correspondent suggestion!) 2) running linear regression with item cognitive descriptors as independent variables and Rasch difficulty measure (obtained by virtual equating) as dependent variable.

Another objective of my study is to identify cognitive factors (competency differences) that had contributed to the observed achievement differences between high and low achievers (from the mentioned studies). For this purpose I decided to create a linear regression model of physics item discrimination power.
In order to include as many items as possible in my regression analysis, I created a database that included all 123 items. I described the items by using dichotomous variables which represented some cognitive constructs. Finally, I am going to run the regression analysis with items' cognitive descriptors as independent variables and item discrimination power as dependent variable.


a) Within CTT, discrimination index can be calculated as: "proportion of correct answers in the upper group (e.g. upper 27%)" - "proportion of correct answers in the lower group (e.g. lower 27%)" .
For each of the studies there are primary databases which contain ability measures (obtained within IRT framework) for each participant. By using these databases it would be easy to define the groups of high and low achievers, and to calculate the discrimination index for each item.
The interpretation of the results would be straightforward, too (I obtained very interesting results by using this methodology).
However, I suppose that the combination of CTT and IRT measures (as described above) is not allowed??? Probably, the discrimination indices from the two studies are not comparable (since they are CTT measures), too???
b) If the described combination of CTT and IRT measures is not allowed, could I proceed in the following manner (similar as has been done for item difficulty): 1) calculating the IRT discrimination parameters (I would have to use the 2 parameter IRT model) for each study separately 2) virtual equating of discrimination parameters
c) I suppose Differential item functioning analysis can't help me to explore cognitive differences between high and low achievers (groups which are to be compared should represent approximately equally able students???)
d) Rasch difficulty is measured in logits and its interpretation is straightforward: "one logit is the distance along the line of the variable that increases the odds of observing the event (e.g the correct answer) specified in the measurement model by a factor of 2.718"
What is the unit for the item discrimination parameter estimate (from the 2 parameter IRT model)??? For example, how can we interpret an increase of 0.3 in the item discrimination parameter (within CTT it would be straightforward)? Is there an interpretation similar to the one for Rasch difficulty?
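
The upper-lower discrimination index described in a) can be sketched as follows, with the high and low groups defined from ability measures rather than raw scores. The ten persons, their measures and their responses are hypothetical, and `discrimination_index` is an illustrative name.

```python
def discrimination_index(item_scores, abilities, fraction=0.27):
    """Proportion correct in the top group minus proportion correct in the bottom group.

    Groups are the top and bottom `fraction` of persons, ranked by ability measure.
    """
    order = sorted(range(len(abilities)), key=lambda i: abilities[i])
    group_size = max(1, round(fraction * len(abilities)))
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical data: ability measures (logits) and scored responses to one item.
abilities = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
item_scores = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

d_index = discrimination_index(item_scores, abilities)
print(round(d_index, 3))
```

With ten persons the 27% groups round to three persons each; with the sample sizes in the two studies the groups would of course be much larger.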

Finally, I want to thank forum members for previous assistance - forums like this one are for researchers from some undeveloped countries the only way to obtain answers to some research questions...

Sincerely yours,


MikeLinacre: Thank you for your questions, Kulin.

a) "suppose that the combination of CTT and IRT measures (as described above) is not allowed???"

Kulin, they are allowed, but your audience (the readers of your findings) may become confused. So you need to be really careful to explain what you have done, and why you did it.

"Probably, the discrimination indices from the two studies are not comparable (since they are CTT measures), too???"

The item discrimination indices usually tell the same story but in different ways. This is fundamental to good science. Two reasonable alternative methods of investigation should report the same finding. If the findings disagree, then we don't trust either set of findings until we have convinced ourselves about which method gives the correct findings.

b) Sorry, I don't understand this. Perhaps someone else does.

c) DIF would help with cognitive differences. DIF methodology stratifies the sub-samples into equal-ability strata for the purposes of computation. There is no need for the sub-samples to have equal mean abilities.

d) Item discrimination is usually an estimate of the slope of the ICC = the rate of change of the probability (or standardized frequency) of success relative to the latent variable. So a difference of 0.3 in item discrimination corresponds to a difference of 0.3 in the rate of change. (This is as far as I can go. Can anyone go further?)
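
One way to make the "rate of change" concrete, assuming a logistic 2-PL item with discrimination expressed per logit (no 1.7 scaling constant): the slope of the ICC at ability theta is a * P * (1 - P), which peaks at a / 4 where ability equals item difficulty. The sketch below uses hypothetical discriminations of 1.0 and 1.3; the function names are illustrative.

```python
import math

def icc_probability(theta, difficulty, discrimination):
    """2-PL probability of success."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def icc_slope(theta, difficulty, discrimination):
    """Rate of change of the success probability per logit of ability."""
    p = icc_probability(theta, difficulty, discrimination)
    return discrimination * p * (1.0 - p)

# Slope at the point where ability equals item difficulty (P = 0.5).
slope_low = icc_slope(0.0, 0.0, 1.0)
slope_high = icc_slope(0.0, 0.0, 1.3)

print(slope_low, slope_high, slope_high - slope_low)
```

Under these assumptions, raising the discrimination by 0.3 steepens the ICC at its midpoint by 0.3 / 4 = 0.075 probability units per logit; away from the midpoint the change in slope is smaller.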

kulin: Firstly, I want to thank you Mike - your answers were a great help for me! :)

Additionally, I will try to be more precise regarding the question b) :

Within the process of virtual test equating, cross plot of item difficulties (for similar items from two tests) is used in order to express item difficulties from one test within the framework of the other (more reliable) test. My question was: could we perform a similar procedure (cross plot of item discriminations) in order to express item discriminations from one test within the framework of the more reliable test (or is the equating procedure completely superfluous in the case of my study) ?

Best regards!!!

MikeLinacre: Kulin, the test discrimination is the average of the item discriminations, so, when we have equated tests we have also equated the average item discriminations. We could cross-plot the item discriminations to verify that the functioning of each item has not changed, relative to the average item discrimination. This would be equivalent to a comparison of item fit. The more deterministic (less unpredictable) the item, the higher the item discrimination.

499. ability, effectiveness, difficulty

uve May 25th, 2011, 4:57am: Mike,

I am beginning to do more and more professional development with teachers and learning coaches on the new measures I've developed for our tests using Winsteps. There seem to be three primary factors I have to address well to them concerning student test performance:

1) teacher effectiveness
2) the ability level of students
3) changes in the test construction due to new items replacing old

The key issue is our rationale for establishing proficiency cut points for each of our 115+ summative assessments. To establish this rationale, I looked at our state assessments. Proficiency on these had already been established using content area experts from all over our state. These scores range from 150-600 with 350 being proficient. I merely calculated into what percentile 350 fell for our students on the state tests. Say it was the 78th percentile. I then checked what score was the 78th percentile on a similar subject district test and that score became the cut point for proficiency.

The problem is that I'm having to adjust those cut scores because after I did the calibration work, content experts here locally decided that many of the questions had to be replaced. I used the anchoring feature in Winsteps to adjust for any difficulty changes due to this work.

So here is my dilemma: what if changes in student performance between the old and new versions of our tests have nothing really to do with item replacement but instead are a factor of teacher effectiveness or student ability? In other words, the true effect of the difference between old and new versions has nothing to do with items so anchoring is misleading us.

If the tests had remained exactly the same from year to year, I wouldn't change the proficiency cut level just because this new group of students is performing better or worse than last year's group. I wouldn't do any anchoring. I would assume that changes are a factor of student ability or teacher effectiveness or both. But the minute we have to account for new items entering the picture, then suddenly anchoring takes over. Put another way, if I discovered that 60% of the students met or exceeded the proficiency mark on a test and the next year it was 70%, I would attribute that to ability and teacher effectiveness. However, if I then discovered that this year's version had 50% of the items replaced, I would assume the change was due to the items. I would anchor the common items and the proficiency cut point would likely be raised to account for the difficulty change on this year's version.

It appears that new item introduction supersedes ability and teacher effects. Since items will likely be replaced every year, we will always be making adjustments to account for this, all the while never really knowing whether students have more ability or teachers are doing a better/worse job with the new group of kids. It seems the only way to truly know if we are doing a better job teaching kids is to allow tests to remain the same for a couple of years. I'm wondering how you would address this issue. Sorry for the long explanation.


MikeLinacre: Uve, thank you for this explanation and inquiry.

It seems that you are half-way between norm-referenced testing (i.e., cut-point is 78% of the sample) and criterion-referenced testing (i.e., cut-point is a certain level of proficiency on a set of items).

Let's assume that we want to maintain the criterion-referenced cut-point on Test A and apply it to Test B, but the items are changing from Test A to Test B.

1) If any items are the same (common) between Tests A and B (and they are not exposed or otherwise compromised), then if Test A and Test B are analyzed separately, a cross-plot of the common items should have a 45 degree best-fit line. We can then use the Test A item difficulties as anchor values in the Test B analysis, and bring forward the Test A cut-point measure and apply it to the Test B person measures. If the best-fit line departs noticeably from 45 degrees, then the test discriminations are different. We have a Fahrenheit-Celsius situation. The pairs of item difficulties are cross-plotted. The best-fit line provides the slope and intercept to equate Test B ("Fahrenheit") to Test A ("Celsius"). Then the Test A cut-point measure can be brought forward to Test B.

2) If no (or almost no) items are the same between Tests A and B, then we need to use "virtual equating". Pairs of items of roughly similar content and conceptual difficulty are identified: one item in Test A and one in Test B. The difficulties of the items are obtained from separate analyses, and then the pairs of item difficulties are cross-plotted. The best-fit line provides the slope and intercept to equate Test B ("Fahrenheit") to Test A ("Celsius"). Then the Test A cut-point measure can be brought forward to Test B.
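
The Fahrenheit-Celsius adjustment can be sketched like this. The paired item difficulties are hypothetical, and the best-fit line is taken as the line through the two means with slope equal to the ratio of the S.D.s, which is one common choice for this kind of cross-plot (an assumption here, other fitting rules are possible).

```python
import statistics

def equating_line(difficulties_a, difficulties_b):
    """Slope and intercept such that measure_on_A = intercept + slope * measure_on_B."""
    slope = statistics.stdev(difficulties_a) / statistics.stdev(difficulties_b)
    intercept = statistics.mean(difficulties_a) - slope * statistics.mean(difficulties_b)
    return slope, intercept

# Hypothetical difficulties (logits) of the same or paired items,
# taken from separate analyses of Test A and Test B.
test_a = [-1.0, 0.5, 2.0, 3.5]
test_b = [-1.0, 0.0, 1.0, 2.0]

slope, intercept = equating_line(test_a, test_b)

# Express a Test B person measure on the Test A scale.
person_b = 1.0
person_on_a_scale = intercept + slope * person_b

# Or express the Test A cut-point in Test B's own units.
cut_point_a = 1.2
cut_point_in_b_units = (cut_point_a - intercept) / slope

print(slope, intercept, person_on_a_scale, cut_point_in_b_units)
```

When the slope is close to 1, only the intercept matters and the adjustment reduces to a simple shift, which is what anchoring the common items achieves.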

uve: Each test has enough common items to anchor. When I crossplot the common items many times the lines depart from 45 degrees. I remove items until the line approaches 1 as close as possible. But this seems like we are forcing Test B to function like test A. It seems that we should crossplot the common items and let the chips fall where they may.

1) Should I not be removing common items that don't function the same in order to get the line to adjust or simply accept the line as it is?

2) If, say, 75% are common items and we encounter a Fahrenheit-Celsius situation why do we assume the test is functioning differently from an item perspective as opposed to a person ability or instructor effectiveness perspective?

MikeLinacre: Thank you, Uve.

If the line of commonality is not at 45 degrees, then the test discrimination has changed. It could have changed for many reasons including those you mention. A common reason for a change in test discrimination is "high-stakes" vs. "low-stakes" testing situations.

So, either we impose the test discrimination of one test on the other test (i.e., impose a 45 degree line of commonality based on one of the two tests) or we adjust for the change in test discrimination (Fahrenheit-Celsius). The decision about the correct analysis is yours (or the examination board's, etc.)

uve: So when I anchor the common items of the two versions and the line of my scatterplot is not 1 (45 degrees), then it is reasonable to investigate and unanchor those common items that will allow the line to adjust as close as possible to the ideal. Would that be one good course of action?

MikeLinacre: Uve, it is a good course of action if you believe that the two tests should be equally discriminating, and that any departure from equal discrimination is due to chance or unique changes in some of the common items (due, for instance, to a learning effect or to exposure).

uve: Mike, thank you for your patience and guidance with me on this issue. I am getting clearer on some concepts of the equating issue but others still elude me.

1) Can you provide me an example of common item changes that are not the result of chance or some learning effect?

2) Can you provide me an example of when I would not want the tests to discriminate equally? Please keep in mind that our multiple choice tests are usually given at the end of a quarter or trimester and are relatively high stakes used in grading and course placement.

3) Maybe my purpose is off the mark there, but ultimately I must ensure that the proficiency levels set for the tests are as reasonable, accurate and fair as possible. If we have done the best job possible in setting these levels and then later new items are introduced to the new versions, perhaps significantly changing the difficulty, wouldn't I need to account for this and anchor only those common items that still function the same as the previous version?

4) Put another way, if 75% is proficient on Version A of a test but the new replacement items have made the test much harder as Version B, wouldn't I need to lower the proficiency level of B so we maintain fairness and comparability?

5) Keeping in mind my objectives just mentioned, if this 75% on Version A was 1.2 logits and after anchoring properly 1.2 logits now equates to 70%, this then should be my new proficiency level for Version B. If I keep all common items regardless if some changed significantly (beyond the boundaries of the confidence interval bands) in their difficulty levels, then wouldn't I be getting inaccurate equating results for my purposes?

Sorry again for all the questions.

MikeLinacre: Uve:

1) Items at different locations in the test. For instance, the first and last items on one test may change difficulty if printed in the middle of another test.

2) It is essentially impossible for two tests to discriminate equally. A typical way of increasing test discrimination (i.e., to make the test more sensitive to changes in ability) is to use rating-scale or partial-credit items. Even inside a test, we may want the test to discriminate more highly near cut-points.

3) We need anchor items (or items in an item bank) to maintain their difficulties across administrations (in the criterion-level sense, not the p-value sense), regardless of the ability of the particular sample.

4) Yes, we need to maintain the criterion-level proficiency. So if the test is harder, then the proficiency pass-fail raw-score is lower.

5) First sentence, yes. Second sentence, common items that change significantly are no longer common items, they have become two different items. Example: in the 1950s, the counting item, "count downwards from 10", was difficult for children. In the 1960s, this became a very easy item. It became a new item!

uve: Mike,

As always, I can't thank you enough. It seems I'm on the right track here though I'm still nervous. We have some 800+ teachers counting on me to set the cut points accurately so they can grade properly. They operate under the traditional method of 90, 80, 70, 60, 50 percent with 80% being the proficiency mark for all tests regardless of difficulty or subject matter. Linking to state tests sometimes means that proficiency can be higher, but also much lower. When a teacher sees 65% they doubt my methods very much. Then when the new version comes out and this has changed to 70%, they think even less of it. I've got to make sure I've got the right rationale for equating one version to another.

On many of our tests, I've noticed that items have changed locations and am beginning to think this is causing many of our common items not to be common any longer. I'm amazed by how some tests which may have kept 80% of the original items have over half of those outside the CI bands, which means they are not common any longer and must be removed as anchor items, while others don't change at all.

Again, thanks for your help.

MikeLinacre: Uve, yes, the teachers are thinking in a norm-referenced world. As W.E. Deming demonstrated in an industrial setting, this leads to considerable drift of standards across time.

You wrote: "over half of those outside the CI bands" - please do not eliminate based on this criterion alone. CI is too sample-size dependent. Please set a substantive value for acceptable drift. Typical values are 0.5 logits for tightly-controlled tests, and 1.0 logits for classroom tests. In many situations, 1 logit = 1 year's academic growth.

Also remember that small random fluctuations in item difficulty tend to cancel out.

uve: Thanks for the correction.

I would wager to guess that our tests should come under the "tightly-controlled" category due to their use and high-stakes qualification. So using 0.5 logits as my gold standard for keeping common items anchored seems good to me. However, what then is the purpose of using the scatterplot? It seems we would no longer be concerned with the angle of the line any longer. If we have unanchored all the items of 0.5 logits and higher, we should be satisfied with the line as is and the scatterplot would be mere curiosity at that point.

MikeLinacre: Uve, we want both a substantive size (e.g., 0.5 logits) and sufficient improbability that the change is due to chance (e.g., .05 confidence bands). The scatterplot helps with these. We can see if the outliers are all on one side, suggesting that the initial best-fit line is not the final one. We can also encounter situations where there are two diagonal clusters of items, on adjacent parallel best-fit lines. One cluster of items represents items that have been exposed or "taught to" (accidentally or deliberately), and the other line represents neutral items. In this case, we need to identify which line is which, and make the decisions about unanchoring items accordingly. In short, it is much easier to select good anchor or truly common items from a picture than from tables of numbers.
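
A rough sketch of a two-part screen along those lines: an item is flagged only if its change in difficulty is both at least 0.5 logits and statistically unlikely to be chance. The item values are hypothetical, the difficulties are assumed to be already on a common scale, and the approximate t-statistic (drift divided by the joint standard error, compared with 1.96) is an assumed stand-in for the .05 confidence bands. It does not replace looking at the scatterplot.

```python
import math

def flag_drifting_items(items, max_drift=0.5, critical_t=1.96):
    """Return names of items whose drift is both large and improbable by chance.

    Each item is (name, difficulty_a, se_a, difficulty_b, se_b), in logits.
    """
    flagged = []
    for name, d_a, se_a, d_b, se_b in items:
        drift = d_b - d_a
        joint_se = math.sqrt(se_a ** 2 + se_b ** 2)
        if abs(drift) >= max_drift and abs(drift / joint_se) >= critical_t:
            flagged.append(name)
    return flagged

# Hypothetical common items from Version A and Version B.
common_items = [
    ("item1", 0.00, 0.10, 0.10, 0.10),    # small drift
    ("item2", -0.50, 0.10, 0.20, 0.10),   # large and significant
    ("item3", 1.00, 0.40, 1.60, 0.40),    # large but imprecise
    ("item4", 0.30, 0.03, 0.50, 0.03),    # significant but small
]

to_unanchor = flag_drifting_items(common_items)
print(to_unanchor)
```

Item 4 shows why confidence bands alone are too sample-size dependent: with very small standard errors, even a 0.2 logit shift looks significant.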

uve: Mike,

You mentioned earlier that: "In many situations, 1 logit = 1 year's academic growth."

How can I prove this if we're constantly adjusting the test due to equating? This is what I meant initially in my post. If proficiency was initially determined at, say, 1.2 logits and now it's at 2.2 and there were new replacement items, I'm assuming the test got easier due to those new items. But if there had been no replacement items on version B, then we would have assumed the 1 year growth instead.

That's been my dilemma: how we can separate the effects of new items from true academic growth. It seems like common item equating masks some of this possible growth.

MikeLinacre: Uve, when we equate tests through common items on the Grade 5 and Grade 6 tests, or anchor common items from the Grade 5 test to the Grade 6 test, then the logit (or whatever) measurement scale of the Grade 5 test is brought forward to the Grade 6 test. A result is that 1.2 logits on the Grade 5 test is on the same measurement scale as 2.2 logits on the Grade 6 test. This can be extended up all grade levels. We can definitely separate changes in test difficulty from changes in sample ability (which was Georg Rasch's original motivation for developing his models!)

500. Subtotals or DIF/DPF

uve May 19th, 2011, 6:36am: Mike,

Would I be correct in stating that when we want to see if one group performs significantly differently from another on a test, say male versus female, we should use the subtotal function or Table 28? However, if we want to see if an item or an item group performs differently for males versus females, then we should choose DIF or Table 30?

Do you have any links you could provide that would detail how DIF is calculated? Thanks again as always.


MikeLinacre: Yes, those statements are correct, Uve.

For the Winsteps DIF computations, see Winsteps Help: https://www.winsteps.com/winman/index.htm?difconcepts.htm - section "The Mathematics ..." and also https://www.winsteps.com/winman/index.htm?mantel_and_mantel_haenszel_dif.htm

uve: More DIF clarification: For a DIF of males and females on a 50-item test starting with item 1, would I be fairly close in stating that the measures for males on item 1 and females on item 1 are created while anchoring items 2-50 for both genders together and then repeats for each item?

MikeLinacre: Uve, yes, fairly close. In classical DIF analysis the persons are "anchored" at their raw scores. In Rasch DIF analysis the persons are anchored at their ability measures. When this anchoring is done, DIF for all items can be estimated simultaneously because each item's DIF computation is independent of the DIF computations for the other items.

uve: So in my example the DIF process will calculate an item difficulty measure for each item for males and for females while anchoring ability measures from the original run of all persons. Is that correct?

MikeLinacre: Correct, Uve.
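
As a rough illustration of that idea for a dichotomous item: with the person abilities held fixed, the item difficulty for one group is the value at which the group's expected score on the item equals its observed score. The sketch below solves this by bisection; it is not the Winsteps algorithm, and the function names and data are hypothetical.

```python
import math

def rasch_p(theta, d):
    """Probability of success for ability theta on an item of difficulty d."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

def group_item_difficulty(abilities, responses, lo=-10.0, hi=10.0, tol=1e-6):
    """Difficulty of one item for one group, with abilities anchored."""
    observed = sum(responses)
    if observed == 0 or observed == len(responses):
        raise ValueError("extreme score: no finite difficulty estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(rasch_p(t, mid) for t in abilities)
        if expected > observed:
            lo = mid  # item looks too easy at mid, so the difficulty is higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical anchored abilities and responses to item 1
male_abilities, male_item1 = [-1.0, 0.0, 1.0, 2.0], [0, 1, 1, 1]
female_abilities, female_item1 = [-1.0, 0.0, 1.0, 2.0], [0, 0, 1, 1]

d_male = group_item_difficulty(male_abilities, male_item1)
d_female = group_item_difficulty(female_abilities, female_item1)
dif_contrast = d_female - d_male  # positive: item is harder for females
print(round(d_male, 2), round(d_female, 2), round(dif_contrast, 2))
```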

501. Significance & Sample Size

uve May 24th, 2011, 3:45pm: Mike,

I typically deal with test populations of 1,500 or more students for each test. It's not uncommon that a MNSQ of only 1.20 could have a ZSTD of 4.0 or higher. A while back you had mentioned a method for dealing with outfit/infit statistics not very far from 1 reported as significant under ZSTD because of large sample sizes, but I can't quite remember how to do this.


MikeLinacre: Uve, your findings accord with https://www.rasch.org/rmt/rmt171n.htm

The conventional way to deal with this statistical "over-powering" problem is to resample from your sample, say 150 cases.
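
Drawing such a subsample is straightforward. A minimal sketch, assuming the cases are identified by entry numbers 1 to 1500; the function name and the seed are arbitrary choices for illustration.

```python
import random

def resample_cases(case_ids, n=150, seed=None):
    """Draw n distinct cases at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(case_ids, n))

all_cases = list(range(1, 1501))  # hypothetical person entry numbers
subset = resample_cases(all_cases, n=150, seed=2011)
print(subset[:10])
```

The fit statistics can then be recomputed on the selected cases only. Repeating the draw with different seeds shows how stable the ZSTD values are.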


pmaclean2011 May 17th, 2011, 7:52pm: For those with experience using WINSTEPS, I have created the control datafile that has 30 items and 3 individual level variables. The items are divided into 5 domains. I would like to analyze the data by domains and compare the results by individual variable (e.g., SEX). How do you implement these procedures in WINSTEPS? I can run the entire 30 items.

MikeLinacre: Pmaclean, not sure what you really want, but here is one approach:

1. Code each item with its domain, and code each person (case) with its sex.

2. Analyze all the data together to produce one set of item difficulty estimates together with one set of person ability estimates.

3. Do a "differential group functioning" (DGF) analysis of "Domain x SEX", etc.

pmaclean2011: I am importing the data from SPSS. I have the variable SEX in the dataset.
How do I code each item with its domain while in WINSTEPS? I know I can use a DIF variable for differential group functioning analysis. The problem is creating the Domain variable. Part of the control file is as follows:

ITEM1 = 1 ; Starting column of item responses

NI = 39 ; Number of items

NAME1 = 41 ; Starting column for person label in data record

NAMLEN = 6 ; Length of person label

XWIDE = 1 ; Matches the widest data value observed

; GROUPS = 0 ; Partial Credit model: in case items have different rating scales

CODES = 1234567 ; All data not in this code are treated as missing

CSV=Excel ; Put value in excel format in the output file

CURVES=010; Show Item Characteristic Curves

DISCRIM=YES ; REPORT item discrimination

ISgroups="" ; Estimate Andrich RSM see page 132 in the manual

CLFILE=* ; Label the response category and to specify the ARSM

1 Strongly disagree ; Names of the responses

2 Disagree

3 Disagree somehow

4 Neutral

5 Agree

7 Strongly agree

* ; End of the list of the response variable

MNSQ=N ;Show mean-square or standardized fit statistics

OUTFIT=YES; Sort infit and outfit using mean-square

PRCOMP=R ; Principal component analysis of residuals

MATRIX=Yes ; Produce correlation matrix

TOTALSCORE = Yes ; Include extreme responses in reported scores

PTBISERIAL=ALL ; Point-biserial correlation coefficient

; This Starts the Person Label variables: columns in label: columns in line

@COUNT = 1E1 ; $C41W1 ; Country

@SEX = 3E3 ; $C43W1 ; Speciality

@KNOW = 5E5 ; $C45W1 ; Knowledge on ICT

;differential group functioning analysis starts here


@DOMAIN ; How do I create this variable?



&END ; This end the control variable specifications

; Item labels follow: columns in label

;Domain 1

IE1 ; Institutional Environment Q1 ; Item 1 : 1-1

IE2 ; Institutional Environment Q2 ; Item 2 : 2-2

IE3 ; Institutional Environment Q2 ; Item 3 : 3-3

IE4 ; Institutional Environment Q2 ; Item 4 : 4-4

IE5 ; Institutional Environment Q2 ; Item 5 : 5-5

;Domain 2

TE1 ; Item 6 : 6-6

TE2 ; Item 7 : 7-7

TE3 ; Item 8 : 8-8

;Domain 3

SE1 ; Item 9 : 9-9

SE2 ; Item 10 : 10-10

SE3 ; Item 11 : 11-11

SE4 ; Item 12 : 12-12

;Domain 4

DI1 ; Item 13 : 13-13

DI2 ; Item 14 : 14-14

DI3 ; Item 15 : 15-15

DI4 ; Item 16 : 16-16

;Domain 5

TTE1 ; Item 17 : 17-17

TTE2 ; Item 18 : 18-18

TTE3 ; Item 19 : 19-19

;Domain 6

KM1 ; Item 20 : 20-20

KM2 ; Item 21 : 21-21

KM3 ; Item 22 : 22-22

KM4 ; Item 23 : 23-23

KM5 ; Item 24 : 24-24

KM6 ; Item 25 : 25-25

KM7 ; Item 26 : 26-26

KM8 ; Item 27 : 27-27

KM9 ; Item 28 : 28-28

KM10 ; Item 29 : 29-29

;Domain 7

ST1 ; Item 30 : 30-30

ST2 ; Item 31 : 31-31

ST3 ; Item 32 : 32-32

ST4 ; Item 33 : 33-33

ST5 ; Item 34 : 34-34

ST6 ; Item 35 : 35-35

ST7 ; Item 36 : 36-36

ST8 ; Item 37 : 37-37

ST9 ; Item 38 : 38-38

ST10 ; Item 39 : 39-39

MikeLinacre: Pmaclean, please try this:

ITEM1 = 1 ; Starting column of item responses
NI = 39 ; Number of items
NAME1 = 41 ; Starting column for person label in data record
NAMLEN = 6 ; Length of person label
XWIDE = 1 ; Matches the widest data value observed
; GROUPS = 0 ; Partial Credit model: in case items have different rating scales
CODES = 1234567 ; All data not in this code are treated as missing
CSV=Excel ; Put value in excel format in the output file
CURVES=010; Show Item Characteristic Curves
DISCRIM=YES ; REPORT item discrimination
ISgroups="" ; Estimate Andrich RSM see page 132 in the manual
CLFILE=* ; Label the response category and to specify the ARSM
1 Strongly disagree ; Names of the responses
2 Disagree
3 Disagree somehow
4 Neutral
5 Agree
7 Strongly agree
* ; End of the list of the response variable
MNSQ=N ;Show mean-square or standardized fit statistics
OUTFIT=YES; Sort infit and outfit using mean-square
PRCOMP=R ; Principal component analysis of residuals
MATRIX=Yes ; Produce correlation matrix
TOTALSCORE = Yes ; Include extreme responses in reported scores
PTBISERIAL=ALL ; Point-biserial correlation coefficient
; This Starts the Person Label variables: columns in label: columns in line
@COUNT = 1E1 ; $C41W1 ; Country
@SEX = 3E3 ; $C43W1 ; Speciality
@KNOW = 5E5 ; $C45W1 ; Knowledge on ICT

@DOMAIN = 1E2 ; Domain code in item label
;differential group functioning analysis starts here
TFILE=* ; list of tables to output automatically
33 ; automatically produce Table 33 - DGF
*

&END ; This end the control variable specifications
; Item labels follow: columns in label
; Domain 1
IE1 ; Institutional Environment Q1 ; Item 1 : 1-1
IE2 ; Institutional Environment Q2 ; Item 2 : 2-2
IE3 ; Institutional Environment Q2 ; Item 3 : 3-3
IE4 ; Institutional Environment Q2 ; Item 4 : 4-4
IE5 ; Institutional Environment Q2 ; Item 5 : 5-5
; Domain 2
TE1 ; Item 6 : 6-6
TE2 ; Item 7 : 7-7
TE3 ; Item 8 : 8-8
; Domain 3
SE1 ; Item 9 : 9-9
SE2 ; Item 10 : 10-10
SE3 ; Item 11 : 11-11
SE4 ; Item 12 : 12-12
; Domain 4
DI1 ; Item 13 : 13-13
DI2 ; Item 14 : 14-14
DI3 ; Item 15 : 15-15
DI4 ; Item 16 : 16-16
; Domain 5
TTE1 ; Item 17 : 17-17
TTE2 ; Item 18 : 18-18
TTE3 ; Item 19 : 19-19
; Domain 6
KM1 ; Item 20 : 20-20
KM2 ; Item 21 : 21-21
KM3 ; Item 22 : 22-22
KM4 ; Item 23 : 23-23
KM5 ; Item 24 : 24-24
KM6 ; Item 25 : 25-25
KM7 ; Item 26 : 26-26
KM8 ; Item 27 : 27-27
KM9 ; Item 28 : 28-28
KM10 ; Item 29 : 29-29
; Domain 7
ST1 ; Item 30 : 30-30
ST2 ; Item 31 : 31-31
ST3 ; Item 32 : 32-32
ST4 ; Item 33 : 33-33
ST5 ; Item 34 : 34-34
ST6 ; Item 35 : 35-35
ST7 ; Item 36 : 36-36
ST8 ; Item 37 : 37-37
ST9 ; Item 38 : 38-38
ST10 ; Item 39 : 39-39

503. Possible Bug?

uve May 17th, 2011, 10:39pm: Mike,

I want to be able to provide my coworkers with the various output files from Winsteps. In this case I was attempting to use the ASCII=DOC format so that they can view it without any issues. The problem I noticed was that when I attempt to do this with a test in which IWEIGHT=0 for one or more items, the format clips off the first item and the mean totals at the bottom of Table 13. I've attached an example of the output when IWEIGHT is and is not used. Let me know if I need to do something different. I am using Office 2010 and wonder perhaps if there is an issue there. Thanks.


MikeLinacre: Thanks for reporting this bug, Uve. The internal line buffer in Winsteps is too short. I am modifying the code. You can have the correct formatting by shortening the item labels.

uve: Will do, thanks!

MikeLinacre: Uve, have emailed you download instructions for an amended version of Winsteps. OK?

504. Creating an Item Characteristic Curve

SueW May 16th, 2011, 11:18pm: Hello
Please could somebody tell me how I would create an item characteristic curve. I want to use it as a diagram in my thesis (and don't want to use any others on the internet or other places because of copyright etc.) and I would rather create my own anyway.
I have SPSS and Winsteps.



MikeLinacre: Sue, Winsteps creates item characteristic curves for you on the Graphs menu. Analyze a data set (your own or a Winsteps example data file), then "Graphs" menu, and
also in Winsteps Help
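
For a figure drawn entirely from scratch, the model curve for a dichotomous Rasch item can also be computed directly from P = exp(B - D) / (1 + exp(B - D)). A minimal sketch, assuming an item difficulty of 0 logits and a plotting range of -4 to +4 logits; the function names are made up for this example.

```python
import math

def icc(theta, difficulty):
    """Rasch probability of success for ability theta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def icc_points(difficulty, lo=-4.0, hi=4.0, step=0.5):
    """(ability, probability) pairs along the curve, ready for plotting."""
    n = int(round((hi - lo) / step))
    return [(lo + i * step, icc(lo + i * step, difficulty)) for i in range(n + 1)]

points = icc_points(difficulty=0.0)
for theta, p in points:
    print(f"{theta:5.1f}  {p:.3f}")
```

The printed pairs can be pasted into SPSS, Excel, or any charting tool to draw the curve.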

505. mean of people vs mean of items

tmhill May 11th, 2011, 11:24pm: Hi, It's me again,
I am seriously a baby member... What does it mean if the mean of the items is two SDs above the mean of the people? I am taking it to mean that the instrument is more difficult than the people's ability to answer it. Is this correct? Like giving a 10th grade level math test to 5th graders? Could it be because the number of negatively worded questions outweighs the positively worded questions? With a sample size of over 800... I would think that population is varied enough... Is the response scale wrong? It is a frequency scale... maybe it should be dichotomous?

MikeLinacre: Thank you for your question, Tmhill.

You are using a 4-category rating scale. Looking at Table 12.5, we can see that the mean of the sample "M" aligns with an average score of 1.5 rating scale points on item "FeltSupported+"

The mean of the items "M" is slightly below an average score of 2.5 rating scale points on item "FeltSupported+"

So we can see that the mean of the person sample is targeted about 1 rating scale category below the mean of the item sample.

The meaning of this depends on the definition of the rating scale categories. For instance, if this is 1=Strongly Disagree, 2=Disagree, 3=Agree, 4 = Strongly Agree. Then the mean of the items is between "Disagree" and "Agree" for item "FeltSupported+", but the mean of the sample is between "Strongly Disagree" and "Disagree" "FeltSupported+".

Does this help you, Tmhill?

tmhill: No... I have no idea what any of this means. I follow your description. I get what this is saying for the individual items... But can we not make an interpretation about the instrument from 12.2? Generally speaking, can I say that the instrument is generally more difficult than the people's ability to answer it? Like giving a 10th grade math test to 5th grade students? or by saying that it is not addressing the full spectrum of items for this sample, ie. it is only surveying the most severe symptoms of depression as opposed to the mild dysphoric symptoms?

MikeLinacre: Tmhill, before we can make any inferences (like those you suggest) from Table 12, we need to know the definition of the rating scale categories.

For instance, since this is a depression scale, it could be that:
1 = happy, 2 = neutral, 3 = unhappy, 4 = miserable
4 = happy, 3 = neutral, 2 = unhappy, 1 = miserable

These alternatives would lead to opposite inferences about the depressed state of the sample.

tmhill: It is a 1-4 scale: 4=more frequency, 1=less frequency.
We reduced it from a 6 point to a 4 point and I believe the wording will be something like

MikeLinacre: Thank you, Tmhill, so now the direction (polarity) of the items is the next question. In which direction are the items scored?
1. How often are you happy?
2. How often are you sad?

Probably items are written both ways on the questionnaire. In which direction are they all scored (or rescored) for the analysis? 4 = happier or 4 = sadder?

tmhill: Yes, some of these items are written both ways...
princcriticized is 4=always
Felt competent (reverse scored)
for example.

When I take the response options down to a dichotomous scale the means of the people and means of the items align. However, the creator of the instrument does not want it reduced that far. I just want to be able to explain this phenomenon to her so it makes sense.

MikeLinacre: OK, Tmhill, "some of these items are written both ways... "

1) All the items need to be rescored to point in the same direction. This can be done in Winsteps with
IREFER=*
(item number of a reversed item) R ; R is the code of the rescoring group
(item number of next reversed item) R
*
IVALUER = 4321 ; rescore all in rescoring group R reversed
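
The same reversal can be checked by hand outside Winsteps: on a 1-4 scale, a reversed response is 5 minus the original. A small sketch, with hypothetical function names and item positions:

```python
def reverse_score(value, low=1, high=4):
    """Mirror a rating within its scale: 1<->4 and 2<->3 on a 1-4 scale."""
    if not low <= value <= high:
        raise ValueError(f"rating {value} is outside {low}-{high}")
    return low + high - value

def rescore(responses, reversed_items, low=1, high=4):
    """Reverse only the items whose 1-based positions are in reversed_items."""
    return [
        reverse_score(v, low, high) if i in reversed_items else v
        for i, v in enumerate(responses, start=1)
    ]

# Hypothetical person record: items 2 and 4 are reverse-worded
rescored = rescore([4, 1, 3, 2], reversed_items={2, 4})
print(rescored)
```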

Now, in the analysis, we expect the person mean and the item mean to be offset by the overall level of depression in person sample. We will be surprised if the average level of depression in the sample matches the middle of the rating scale of the average item.

tmhill: I did that.
Here is my control file.

TITLE = CARE EFA w/EFA eliminated items and eliminated people and one of three eliminated items
NAME1 = 1
ITEM1 = 5
NI = 36
CODES = 123456
;NEWSCORE = 123456
;IVALUEA = 123456
;IVALUEB = 654321
IVALUEA = 122334
IVALUEB = 433221
PDELETE = 778, 356, 354, 586, 635, 334, 153, 735, 717, 235, 513, 828, 003, 140, 150, 566, 525, 081, 618, 346, 304, 014, 670, 834, 011, 717, 184, 715, 605, 491, 153, 272, 522, 752, 170, 187, 135, 619, 010
2; EFA Eliminated items
26; given name on factor three...
;30: not given name - also on factor three neutral item at the top of the scale.

MikeLinacre: Looks fine to me, Tmhill. Does it do what you want?

tmhill: Yes, the problem is that I would like to give an explanation for why the mean of the people is 2 SDs below the mean of the items. Can I say that the instrument is more difficult to answer than the people were willing to answer it? Can I say that the instrument only looks at the most negative perceptions and not those that may be more mild or positive? Is there an alternate explanation? Is it bad that the means are so separate? Shouldn't they be closer together? What does it mean that they are so far apart and what does it mean that when I make the rating scale dichotomous the means align (even though that is not an option for this study)?

MikeLinacre: Tara, we need to compare 2 S.D.s with your rating scale. 2 S.D. probably indicates that the mean of the sample is between "never" and "occasionally" on the rating scale for an average item. If this is a depression scale, then this is good news! We want people to tell us that they rarely experience symptoms of depression.

This is a typical situation in studies that have both "normal" (asymptomatic) and "clinical" (symptomatic) groups. The analysis reports a skewed distribution with its mean toward the asymptomatic end of the latent variable.

If we want the mean of the persons and the mean of the items to align, then we need items that identify much fainter and more frequently experienced symptoms. Such as "Do you ever have feelings of regret about missed opportunities?"

506. valid codes

uve May 13th, 2011, 9:53pm: Mike,

When we pull responses to items out of our data system, it puts a dash in for any question skipped by the student. I'm assuming Winsteps treats all characters in response strings not listed next to Valid Codes as incorrect. Example:

Valid Codes: ABCD

Answer Key: AABCDD
Response: A-BCDD
Score: 101111

Would this be correct?


MikeLinacre: Uve, by default Winsteps treats undefined codes as not-administered items.
If you want undefined codes to be scored 0, then

uve: Mike,

So in my example above:

Missing -1 would be five out of five items correct.
Missing 0 would be 5 out of six correct.

Is that right?


MikeLinacre: Exactly right, Uve.
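
The two treatments of the dash can be mimicked with a few lines of scoring logic. This is only an illustration of the arithmetic in the example above, not of how Winsteps is implemented; the function and parameter names are hypothetical.

```python
def score(responses, key, valid_codes="ABCD", missing_as_wrong=False):
    """Return (number correct, number of items counted) for one person."""
    correct = 0
    counted = 0
    for r, k in zip(responses, key):
        if r in valid_codes:
            counted += 1
            correct += int(r == k)
        elif missing_as_wrong:
            counted += 1  # undefined code is counted and scored 0
        # otherwise the item is treated as not administered and skipped
    return correct, counted

not_administered = score("A-BCDD", "AABCDD")
scored_zero = score("A-BCDD", "AABCDD", missing_as_wrong=True)
print(not_administered, scored_zero)
```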

507. Can test Equating help here? Is there better?

KY.Chiew April 22nd, 2011, 5:04pm: Dear Rasch measurement forum

I’m relatively new to Rasch and hope for some ideas.

Hypothetically speaking, suppose we have a paper-based reading comprehension task consisting of six passages, considered to progress from ‘easy’ to ‘difficult’. Each passage has eight reading comprehension questions, scored ‘1’ (correct) or ‘0’ (incorrect).

In the hope of reducing testing time, some participants, for example, complete passages 1 and 2, others complete passages 2 and 3, or 2, 3, and 4, and others complete 4 and 5, or 5 and 6. Which passage the participant starts with is determined by the tester, and testing discontinues once the participant fails to answer a minimum required number of reading comprehension items.

Without data from participants that have completed all six passages (i.e., all items), is it possible to still establish the items’ Item difficulty estimates and the participants’ Person ability estimates on the task with these data from participants performing on these different passages?

The goal is to still attain an accurate and reliable Person reading comprehension ability with only a sample of passages tested on the task. In other words, regardless of which passages were tested, we are still able to attain the participant's Person reading comprehension ability.

I’m thinking maybe test Equating might be possible. Although the persons taking the different combinations of passages are different, there remains some overlap in the passages (hence at least a minimum of 8 test items that were tested). Unfortunately, this is beyond my current experience of Rasch.

If this is possible, what is the absolute minimum number of participants with each combination of passages completed that is needed for such an analysis before the results become untrustworthy? Please do teach me if I am completely off target.

If this can be done, I welcome any better solution to assess the participants reading comprehension ability and any weaknesses in this approach of attaining a participant's Person ability depending on the passages that were tested in the task.

[[Alternatively, a simple but undesirable solution would be to assume that participants who started with passage 4 had been accurate in all the test items in passage 1, 2, and 3. I hope to avoid such a solution for now]]


A Student

MikeLinacre: Thank you for your question, A Student.

Since the network of passages overlap across participants, all the participant responses can be analyzed as one linked dataset. This enables the abilities of all the participants to be located along one latent variable of "reading comprehension", regardless of which passages were administered to each participant.

A place to start is with an Excel spreadsheet. Each row is a participant. There are 6 passages x 8 items = 48 columns of responses. Enter 1 or 0 where a participant met an item. Leave the column blank otherwise.

The next step depends on what Rasch software you are using. For instance, with Winsteps, convert the Excel spreadsheet into a Winsteps control and data file by means of the Winsteps Excel/RSSST menu.
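
The spreadsheet layout, and the requirement that the passages overlap across participants, can be sketched as follows. This assumes the 6 x 8 design described above; the function names and the example patterns are hypothetical.

```python
ITEMS_PER_PASSAGE = 8
N_PASSAGES = 6

def make_row(scored):
    """scored: {passage number: list of 8 scores}. Blank = not administered."""
    row = [""] * (ITEMS_PER_PASSAGE * N_PASSAGES)
    for passage, scores in scored.items():
        if len(scores) != ITEMS_PER_PASSAGE:
            raise ValueError("each passage needs exactly 8 scores")
        start = (passage - 1) * ITEMS_PER_PASSAGE
        row[start:start + ITEMS_PER_PASSAGE] = [str(s) for s in scores]
    return row

def is_linked(design):
    """True if every passage is connected to every other through overlaps."""
    passages = set().union(*design)
    if not passages:
        return True
    reached = {next(iter(passages))}
    changed = True
    while changed:
        changed = False
        for group in design:
            if group & reached and not group <= reached:
                reached |= group
                changed = True
    return reached == passages

row = make_row({2: [1, 1, 1, 0, 1, 1, 0, 1], 3: [1, 0, 0, 1, 0, 0, 0, 0]})
design = [{1, 2}, {2, 3}, {2, 3, 4}, {4, 5}, {5, 6}]  # passage sets taken
print(is_linked(design))
```

If the check fails, the dataset splits into disconnected subsets whose measures cannot be compared on one scale.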

KY.Chiew: Thank you Mike.

Glad to know I’m not far off. I only have WINSTEPS.

We’ve set up the Excel file.

As these different groups of participants overlap on at least one passage (e.g., participants who completed passages 1 and 2 overlap with participants who completed passages 2 and 3 on Passage 2, hence 8 items), I’m thinking this fits “Common Item Equating”. There are at least 5 items that are shared (according to your WINSTEPS Help notes on Equating and Linking tests). I still have to check if these items spread across the continuum.

I’ll try to follow your WINSTEPS Help instructions on Common Item Equating.

My first concern now is that we currently do not have enough participants for such an analysis. Hopefully we can add more if it is practical. I’m hoping 30 participants in each group at least for an ‘exploratory’ study, but I feel this is naively optimistic - the boy in the match box.

Thanks. Much appreciated.

A Student

MikeLinacre: KY:
If you have the data in an Excel file in a reasonable format: one row for each participant, one column for each passage+item, then you can analyze these data immediately with Winsteps. Winsteps automatically does the equating for you.
Winsteps Excel/RSSST menu:
"Excel" to convert the Excel file into a Winsteps control and data file.
"Launch Winsteps"

KY.Chiew: Appreciate the feedback Mike. Apologies for the delay, I took a much needed Easter break.

I have since retrieved the data from my colleague, arranged the data accordingly and launched WINSTEPS.

Firstly, the Persons that have completed each passage.
11 Persons completed Passage 1,
22 completed Passage 2, with 10 of them also completed Passage 1
42 completed Passage 3, with 13 also completed Passage 2
48 completed Passage 4, with 35 also completed Passage 3
35 completed Passage 5, with all 35 also completed Passage 4
12 completed Passage 6, with all 12 also completed Passage 5
There are 64 Persons altogether in the dataset.

I’m writing this here to ask if I should be worried about the small number of participants in Passage 1 and Passage 6, for example. Will this affect the analysis?

I’ve also started with the basics. I hope you’ll correct me if I’m wrong or fill in anything that you think I should pay attention to. Firstly, ...

1) A) Item polarity
i) Check if all PT MEASURE are positive
ii) Check if the AVERAGE ABILITY advance monotonically with category, i.e., if SCORE VALUE ‘1’ should correspond to a higher AVERAGE ABILITY than SCORE VALUE ‘0’ in this case.

2) E) Item misfit
i) Identify items with OUTFIT MNSQ > 2.0 and > 1.5
ii) Identify items with OUTFIT MNSQ < .50

3) Dimensionality
i) Identify if the contrast yields eigenvalue greater than 2.0

I have also looked into

4) Local item dependence (LID)
I used the “Correlation file ICORFILE=”. I calculate the average of all the correlations, and assess whether each correlation is .20 above this calculated average.
Not unexpected I suppose that items on Reading comprehension tasks would show Local Item dependence.
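
That screening rule is easy to automate once the correlations are in hand. A minimal sketch, assuming the inter-item correlations have already been read into a dictionary keyed by item pair; the item labels and values are invented.

```python
def flag_dependent_pairs(correlations, margin=0.20):
    """Item pairs whose correlation exceeds the average by more than margin."""
    mean_r = sum(correlations.values()) / len(correlations)
    return sorted(pair for pair, r in correlations.items() if r > mean_r + margin)

# Hypothetical correlations between items from two passages
corr = {
    ("P1Q1", "P1Q2"): 0.45,
    ("P1Q1", "P2Q1"): -0.05,
    ("P1Q2", "P2Q1"): -0.10,
    ("P2Q1", "P2Q2"): 0.02,
}
flagged = flag_dependent_pairs(corr)
print(flagged)
```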

Do you have any advice on solving Local item dependence? How would you approach LID? I’ve read about creating Testlets, i.e., summing the scores from the items that are correlated together to form a ‘bigger’ item.

Much to learn. The help is much appreciated.


KY.Chiew: And WINSTEPS is super. It was a smooth process to set up the data for analysis. :)

508. Ability estimation

felipepca May 10th, 2011, 2:32am: Hello, I started to work with IRT on my tests, but I have a big doubt.

I calibrated a test (math) of 30 items on 500 students. I calibrated the test with 1PL and 2PL. So I already have the a, b parameters of each item and the ability estimates of the students.

Now, I need to apply the test to 100 more students... but how can I estimate the ability of these new subjects using the a, b parameters that I already have (without repeating the whole calibration process)?

In other words, how can you estimate the ability of the new people as a function of the parameters? (both cases... 1PL and 2PL)

Thanks a lot

Sorry for my English...

MikeLinacre: Felipepca, can you anchor (fix) the parameter estimates with the software you are using? If so, your estimates of a, b are the anchor values for an analysis of the 100 more students.
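
If the software cannot anchor, the ability of a new student can also be estimated by hand from the fixed item parameters. Below is a minimal maximum-likelihood sketch using Newton-Raphson, assuming the 2PL form P = 1 / (1 + exp(-a(theta - b))), with the 1PL as the special case where all a values are equal. The function name and the parameter values are hypothetical.

```python
import math

def estimate_theta(responses, a, b, theta=0.0, max_iter=50, tol=1e-6):
    """ML ability estimate with item parameters a (slopes), b (difficulties) fixed."""
    total = sum(responses)
    if total == 0 or total == len(responses):
        raise ValueError("all-wrong or all-correct pattern: no finite ML estimate")
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-ai * (theta - bi))) for ai, bi in zip(a, b)]
        gradient = sum(ai * (x - pi) for ai, x, pi in zip(a, responses, p))
        information = sum(ai * ai * pi * (1.0 - pi) for ai, pi in zip(a, p))
        step = max(-1.0, min(1.0, gradient / information))  # damped Newton step
        theta += step
        if abs(step) < tol:
            break
    return theta

b_params = [-1.0, 0.0, 1.0]                       # hypothetical difficulties
theta_1pl = estimate_theta([1, 1, 0], [1.0, 1.0, 1.0], b_params)
theta_2pl = estimate_theta([1, 1, 0], [0.8, 1.2, 1.5], b_params)
print(round(theta_1pl, 3), round(theta_2pl, 3))
```

Under the 1PL, only the raw score matters, so any pattern with the same score on the same items gives the same estimate. Under the 2PL, the pattern of responses matters as well.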

509. Presenting data

Juan March 10th, 2011, 7:52am: Hi Mike

One of the challenges of working with the Rasch model is presenting data so that it makes sense to a broad audience.

My dilemma relates to the two surveys that I have, a pre- and post-test situation. The two surveys have a commonality of about 70%.

On the one hand I want to present the functioning of the items. Here I compared the common items' estimated measures with each other on an Excel graph and I can show correlation values. This seems to be easy.

Secondly I would like to show change or development in the students who completed both surveys, which we expect to happen. I calibrated each survey separately for person measures and used the plots option to provide a visual representation of what is happening (DTF). My reservation is that not everyone, say at a conference, will be able to comprehend a scatterplot. Are there alternatives? I started with partial correlations and can use that, but they seem to be limited; for example, they only state that as the one increases the other one also increases, to a certain extent (alpha).

Please advise :-/


MikeLinacre: Juan-Claude: to compare students across both surveys, we need all the students in the same measurement frame-of-reference. The easiest way is to analyze all the data in one analysis, with every student in twice. Then you have two measures for each student (pre- and post-), and you can compare or show those two measures in the same way as you would show two weights for each student.

If the Pre- and Post- data files have the items in different orders, then MFORMS= is a convenient way to combine them.

Juan: Mike, are you referring to "stacking" the persons in a pre- and post-test? I did that and got the same measures as when I did the tests separately on the common items. I compared the person measures on a plot (STARS is Pre-test and FYES is post-test).

If my interpretation is correct, very few students showed any form of development - scoring higher on the planning construct. It seems as if students tend to have higher scores as they enter and lower scores toward the end of the year. Based on this I wonder if we are actually measuring development? Pearson correlations tend to show no relationship with STARS planning and a very low relationship with FYES planning.

Do you think such a presentation would work for a lay audience? I'm not too sure how to present the "weight" idea.

What is your take on the results?


MikeLinacre: Juan-Claude, it seems that your results are conclusive. A conventional statistician would do the equivalent of the "stacked" analysis, then report Winsteps Table 28, and the t-test of the difference between the means of the pre- and post- distributions.
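
For readers who want to reproduce such a comparison of means outside Winsteps, here is a rough sketch of a two-sample t statistic on the stacked person measures. The Welch (unequal-variance) form is an assumption made for this example, and the measures are invented. Because the same students appear twice, a paired test on the pre-post differences would be a reasonable alternative.

```python
import math
from statistics import mean, variance

def welch_t(pre, post):
    """t statistic and approximate d.f. for the difference between two means."""
    m1, m2 = mean(pre), mean(post)
    v1, v2 = variance(pre), variance(post)  # sample variances
    n1, n2 = len(pre), len(post)
    se = math.sqrt(v1 / n1 + v2 / n2)
    t = (m2 - m1) / se
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

# Hypothetical person measures in logits (pre-test and post-test)
pre_measures = [0.5, 1.0, 1.5, 2.0, 1.0]
post_measures = [0.4, 0.9, 1.2, 1.8, 0.7]
t_stat, dof = welch_t(pre_measures, post_measures)
print(round(t_stat, 2), round(dof, 1))
```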

Juan: Thanks for the advice Mike.

Juan: Hi Mike,

I have another question pertaining to the 'presentation' of data. This time as a Performance Indicator (PI).

Have you ever transformed measures into classical PIs? Once again I have a satisfaction questionnaire (5 point Likert) with five dimensions. Each of these dimensions has to be weighted and transformed into a ratio.

Can you advise me on this?


MikeLinacre: Juan-Claude, this is new to me. Here is my guess ....

If the data for the 5 dimensions are analyzed together (as 5 strands), then 5 measures can be obtained for each person using DPF (Winsteps Table 31).

To make the measures more meaningful, transform them into a 0-100 scale using UIMEAN= and USCALE= (Winsteps Table 20 gives the numbers).

Then the five 0-100 numbers can be transformed into ratios, perhaps (m/(100-m)), weighted and combined.
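
The arithmetic behind these steps is a linear rescaling followed by the ratio. A minimal sketch, assuming the item mean is at 0 logits so that the offset and multiplier below play the roles of UIMEAN= and USCALE=; in practice the values reported in Winsteps Table 20 should be used. The logit range and function names are hypothetical.

```python
def rescale_constants(min_logit, max_logit, new_min=0.0, new_max=100.0):
    """Offset and multiplier that map min_logit..max_logit onto new_min..new_max."""
    scale = (new_max - new_min) / (max_logit - min_logit)
    offset = new_min - min_logit * scale
    return offset, scale

def to_user_scale(logit, offset, scale):
    return offset + scale * logit

def ratio(m):
    """The m / (100 - m) transformation of a 0-100 measure."""
    if not 0 <= m < 100:
        raise ValueError("measure must be in the range 0 to just under 100")
    return m / (100.0 - m)

offset, scale = rescale_constants(-5.0, 5.0)  # hypothetical logit range
m = to_user_scale(1.0, offset, scale)
print(offset, scale, m, ratio(m))
```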

Juan: Thank you Mike. I will try this and let you know how it works.

Juan: Dear Mike

Below is a short description of the stats I did to come up with the Performance Indicator (PI) as was required. The client wanted the PI from an institutional perspective and I had to make some adjustments to your suggestions.

A five-point Likert scale was calibrated using a Rasch model (Item Response Theory). Raw scores were transformed into logits for each dimension separately. The logits are expressed as 'measures', which are estimates of a student's underlying satisfaction. Logits are not widely understood and were consequently transformed back into raw scores (weighted against the logits) to determine the raw score mean for each dimension. A reliability index, analogous to Cronbach's alpha, was constructed. The reliability coefficient of the dimensions is a function of the potential weight allocated to a dimension. Each dimension carries an equal weight of 20%, which is multiplied by the reliability coefficient and represents a weighted reliability (the weighted reliability was transformed from a total of 80% to 100%). To calculate the PI expressed as a percentage, the weighted average of the mean satisfaction level for each dimension, consisting of the raw score mean and the weighted reliability coefficient2, was divided by the sum of the weighted averages, consisting of the raw score total and the weighted reliability coefficient2, and multiplied by 100.

MikeLinacre: Juan-Claude, if your client was satisfied, then we are all happy :-)

510. Linear Logistic Test Model

Carlo_Di_Chiacchio May 3rd, 2011, 5:06pm: Dear all,
my name is Carlo, from Italy. I've been studying LLTM and trying to apply it to cognitive data. Together with a colleague we have run the model in R. As I usually work with ConQuest, I would like to ask if you know of any application of LLTM with ConQuest, or whether you could help me in constructing the design matrix to import into ConQuest.

Thanks very much for your help!


MikeLinacre: Carlo, I have not heard of anyone using ConQuest to estimate an LLTM model before. Suggest you contact Ray Adams, the ConQuest guru: http://www.acer.edu.au/staff/ray-adams/

Carlo_Di_Chiacchio: Dear Prof. Linacre,

thanks a lot for your advice!! Maybe Winsteps could run this kind of analysis?



MikeLinacre: Carlo, Winsteps estimates logit-linear models that look like:

Person ability - Item difficulty -> observation

LLTM models look like:

Person ability - item component 1 * weight 1 - item component 2 * weight 2 - ... ->observation

The "Facets" software can analyze models that look like:

Person ability - item component 1 - item component 2 ... ->observation

This is described in http://www.ncbi.nlm.nih.gov/pubmed/16385152
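
As a rough illustration of how these model forms differ, here is a minimal Python sketch. It is not code from Winsteps, Facets or ConQuest; the function names, component values and weights are hypothetical and chosen only to show the arithmetic.

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of success from (ability - difficulty) in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def lltm_difficulty(component_difficulties, weights):
    """LLTM-style item difficulty: a weighted sum of component difficulties."""
    return sum(w * c for w, c in zip(weights, component_difficulties))

# Hypothetical values: two item components and one item's weights on them.
components = [0.5, -0.2]
weights = [1, 2]

item_difficulty = lltm_difficulty(components, weights)
p = rasch_probability(1.0, item_difficulty)  # person ability of 1.0 logits (assumed)
print(round(item_difficulty, 2), round(p, 3))
```

With every weight set to 1, the same function gives the unweighted sum of components described above for Facets.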

Carlo_Di_Chiacchio: Dear Prof. Linacre,

thanks so much. ConQuest can also perform facet analysis, and I would like to try it. LinLog is also very interesting: I've seen several articles by Prof. Embretson in which this software is cited. I would like to know more about how to get LinLog.


MikeLinacre: Carlo, please Google for LinLog.

Carlo_Di_Chiacchio: Dear Prof. Linacre,

I've tried that. I think writing to the authors could be a good idea... :-)

Thanks for your help

511. SE for Fair Averages

drmattbarney April 28th, 2011, 12:18pm: Dear Mike and fellow Rasch enthusiasts

I need to create confidence intervals about my Fair Average to make it easier to report. I'm clear on the logits and SE, but Facets reports the fair average and I would like to know the easiest way to get Standard Error estimates in Fair Average units. Is there a simple way in Facets, or do I need to do this manually?



drmattbarney: I asked too quickly - Mike's already got an excellent example in the Manual, I just didn't find it right away. See page 223, section 16.14 for a perfect, simple example.

Never doubt Mike Linacre's foresight!

MikeLinacre: Thanks, Matt. But credit should go to the person who originally asked this question, that prompted the answer, that ended up in the program documentation .... :-)

512. 3PL & 4PL model analogy

uve April 25th, 2011, 10:52pm: Mike,

In an attempt to understand the benefits of 3 and 4PL models, if any, I thought of an analogy and was hoping to get your input in terms of its relevancy. I know that the Rasch model deals with lower and upper asymptotes through infit and so these extra parameters are not needed.

But I was wondering if using the 3 and 4PL models with dichotomous data is akin to using the Rasch partial credit model with polytomous data. That is, when I choose no groups for the items, I get more accurate data on the meaning and value of the categories for each item, but the categories could have different meanings for each item--give and take. If we use the rating scale model instead, then we possibly lose some accuracy but we gain better comparability because the categories mean the same thing for each item.

So for the 3 and 4PL models with dichotomous data, it seems I might get more accurate information about item difficulty, but the difficulty measures for each item have different meaning. So if one item was 1 logit and the second was 3 logits, I might be closer to "true" difficulty with each item, but it would be hard to say that the second item is 2 logits more difficult than the first. Or perhaps the issue is not of comparing difficulty but of comparing discrimination.

I know this is a strange comparison to make, so thanks for bearing with me.


MikeLinacre: There has been considerable discussion (and argument) about the benefits and deficits of parameterizing item guessability (lower asymptote). From a Rasch perspective, it makes more sense to trim out the situation where guessing is likely to occur (low-ability performers meeting high-difficulty items). In Winsteps, this can be done with CUTLO= https://www.winsteps.com/winman/index.htm?cutlo.htm

No one talks much about 4-PL (carelessness), but for a more general discussion about 3-PL from a Rasch perspective, see https://www.rasch.org/rmt/rmt61a.htm

uve: Mike,

Thanks again for all the clarification.


513. reverse coded items

tmhill April 26th, 2011, 7:34pm: Hello again...

Is it generally good practice to have all of the items and response options worded in the same direction (i.e., no negatively worded items or items that must be reverse coded)? For this particular instrument I believe we are considering frequencies. However, some of the items are good if they are endorsed with more frequency and some are good if they are endorsed with less frequency. Your thoughts are much appreciated.

MikeLinacre: Thank you for asking our advice, Tmhill. The usual reason for reverse-coding some items is to prevent response-sets, such as every response is "Agree".

However, the most important feature of an item is "communication". If it makes sense to the respondent for the item to be reverse-coded, then that is the way to go :-)
It is easy to change the responses to forward-coding for the analysis.

tmhill: thanks so much!

514. Targeting Off?

uve April 25th, 2011, 5:05am: Mike,

I've attached Table 10 of a recent exam for grade 7 level math. It's a 40-item multiple-choice test and the reliability result was .7. I can't seem to find out why students had such a hard time with this exam--average p-value was .37. Looking at Table 10, I see no negative pt measures, though several seem very close to zero. But when I examine the fit statistics, I don't see anything alarming.

Looking at question 27, the outfit of 1.28 is being considered significant but I don't feel that's a large enough departure from 1.0 to be of concern. It seems Q25 may have issues due to a greater influence of carelessness. I suspect the same for Q15, 40 and 10. The item measures range from 1.11 to -1.00, so this seems like a very narrow distribution of scores. Examining the score file, it ranges from -5.02 to 5.02. Person measures ranged from 1.16 to -2.30. Comparing these last two, I found that no one got more than 30 questions right on this test and the lowest score was 4.

The average person measure was -.60, which corresponds to the overall 37% probability. Perhaps students were being tested on material that had not yet been fully covered. This is a constant problem in the mile-wide, inch-deep curriculum we are supposed to follow. In summary, the items don't seem all that bad, yet this test is clearly too difficult for the students. Other than speculation about outside influences, I'm wondering if there isn't something in the measures I'm missing. I guess I'm just trying to figure out why our targeting is off on this one.


MikeLinacre: Your assessment of the situation looks correct to me, Uve. The test is too difficult for these students. There may be some lucky guessing (1's in bottom right of Table 10.5), but removing those would make the student performance worse! Overall, the statistics are consistent with low performance.

The reliability of .7 is reasonable.
Student range is 1.16 to -2.30 logits = 3.5 logits, so the observed S.D. is probably close to 0.6 logits
The test length is 40 items, so the average precision (S.E.) of the measures is close to 1/sqrt(0.37*(1-0.37)*40) = 0.3 logits.
Expected reliability = (true variance/observed variance) = (observed variance - error variance)/observed variance
= (0.6^2 - 0.3^2)/(0.6^2) = 0.7
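
The back-of-envelope calculation above can be sketched in Python as follows. The function name is hypothetical, and the observed S.D. of 0.6 logits is the rough guess made above, not a value computed from the data.

```python
import math

def expected_reliability(observed_sd, p_value, n_items):
    """Approximate S.E. and reliability for a dichotomous test.

    S.E. is approximated as 1 / sqrt(p * (1 - p) * L), as in the text above.
    Reliability = (observed variance - error variance) / observed variance.
    """
    se = 1.0 / math.sqrt(p_value * (1.0 - p_value) * n_items)
    true_variance = observed_sd ** 2 - se ** 2
    return se, true_variance / observed_sd ** 2

# Values from this thread: observed S.D. ~0.6 logits, average p-value .37, 40 items.
se, reliability = expected_reliability(0.6, 0.37, 40)
print(round(se, 2), round(reliability, 2))
```

Carrying the unrounded S.E. through the formula, rather than 0.3, still gives a reliability close to 0.7.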

uve: Mike,

Thanks again. You mentioned some lucky guessing and I noticed this too at the bottom of the table. However, I did not see any "substantial" lower asymptote values as mentioned briefly in the Help menu, so I did not pursue this further. I imagine if .10 or higher is substantial for the lower, then .90 or lower is substantial for the higher asymptote.


MikeLinacre: Yes, Uve, the item fit statistics also indicate no pervasive guessing.

515. Total Variance Always Low

uve April 21st, 2011, 3:00am: Mike,

I know the world is a noisy place so I don't expect there to always be a majority of variance explained by the model. However, I have to admit that I've become a bit discouraged by the low raw variance explained by the measures on virtually all 110+ of our multiple choice tests. I don't recall it ever being higher than 29% and seems to vary between that and down to 20%. Reliability on these tests is generally very high--above .8 on virtually all with many in the .9 to .93 range. When attempting to address critics I must admit that I would have a hard time defending the results. How can one feel confident in the measures when they explain so little of the variance?


MikeLinacre: Uve: we can predict the "variance explained" if we know:
1. The variance of the person ability measures
2. The variance of the item difficulties
3. The targeting (= average p-value) of the persons on the items.
The prediction is plotted at https://www.rasch.org/rmt/rmt221j.htm

And if the data are from a computer-adaptive test, with the items selected to be targeted exactly on the persons (p-value = 50%), then the data will look like coin-tosses, and the variance explained will be almost zero!!

BTW, the "variance explained" is similar for CTT and Rasch, but no one bothers to compute it for CTT :-)

uve: Mike,

Three additional questions then:

1) I know there are exceptions, but it seems that one goal in testing is to ensure that our tests target our persons as best they can. But if I understand what you are saying, the better we target, the less variance is explained. It seems I'm in a strange situation here in that the better targeted my test, the less I can rely on the measures. So how does the Rasch model help us in this situation?

2) The overall average expected and observed matches on many tests is usually around 70%. If we've matched that many questions, how can we have so little variance explained?

3) You mentioned that CTT provides similar results. Would this be the Pearson R squared?


MikeLinacre: Uve, let's clarify things:

1) "the better targeted my test, the less I can rely on the measures."
No! The more we can rely on the measures!! Think of the same situation in physical measurement. When the mark on the tape measure is exactly targeted on your height, then when an observer makes the comparison, there is a 50% chance that the observer will see the mark as taller, and a 50% chance that the observer will see you as taller. When this situation of maximum uncertainty (least variance explained) happens, we know we have exactly the correct height mark.

2) Suppose that everyone in the sample achieves 70% success on a set of dichotomous items. Then Explained variance = (70-50)^2 = 400. Unexplained variance = 70x30 = 2100. Explained variance / Total variance = 400/(400+2100) = 16%
If everyone achieves 50%, then the test looks like tossing coins. Explained variance = (50-50)^2 = 0. Unexplained variance = 50x50 = 2500. Explained variance / Total variance = 0%.

3) CTT: this would be a more exact computation, similar to 2), but no one does it!
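
The arithmetic in point 2) can be sketched as a small Python function. The function name is hypothetical, and it covers only the simplified case described there, in which every person has the same success rate.

```python
def variance_explained(percent_success):
    """Fraction of variance explained when everyone succeeds at the same rate.

    Explained variance   = (percent - 50)^2
    Unexplained variance = percent * (100 - percent)
    """
    explained = (percent_success - 50) ** 2
    unexplained = percent_success * (100 - percent_success)
    return explained / (explained + unexplained)

for pct in (50, 60, 70, 80, 90):
    print(pct, variance_explained(pct))
```

In this simplified setup the total (explained + unexplained) is always 2500, so the fraction explained depends only on how far the success rate is from 50%.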

uve: Mike,

Again, thanks for all the clarification. I am getting there slowly but surely. Unfortunately, more slowly than surely :)

I guess my problem has been that I always thought more variance explained would equate to more control and thus more accurate measures.


MikeLinacre: Uve, you write "more variance explained would equate to more control and thus more accurate measures."

Yes, there is a paradox here. The process of explaining is different from the process of measuring. If we want the measures to explain a lot of variance, then we need to measure things that are far apart. But, when we do the measuring itself, we use things that are as close together as possible.

516. Predictive Validity

uve April 23rd, 2011, 3:05am: Mike,

For predictive validity you've mentioned, "Are the people ordered as we would expect based on other information about them. Do the experienced people have higher measures? Do the healthier people have higher measures? Do the more educated people have higher measures?"

There seems to be some circular logic for me here. How do we know who the more educated/experienced/healthier examinees are unless we give them a test to get this "prior" information? How can we interpret the results from this prior test unless we already know who is more educated/experienced/healthier?

This is a tough one for me to wrap my head around.


MikeLinacre: Uve: yes, there is a circular logic. This has been the history of science. In 1600, when they started to measure "temperature", they did not know what temperature was. They had a rough idea that some things are "hotter" and some things are "colder". They began this hermeneutic cycle:

conceptualization (hot, cold) -> primitive measuring instrument -> better conceptualization -> better measurement -> yet better conceptualization -> yet better measurement -> ...

For temperature, this process lasted about 200 years (1600-1800), and there may yet be a fundamental reconceptualization of temperature to include temperatures below "absolute zero" (-273.15 C).

So, "predictive validity" and "construct validity" are verifications that the "measurements" match the "conceptualization" as we currently understand it. In fact, Rasch analysis also often identifies areas in which a conceptualization can be improved.

517. Reconciling Residuals

uve April 19th, 2011, 12:11am: Mike,

If Winsteps reports an eigenvalue of, say, 3.0 on the first contrast and a simulation run on the data reports that 2.1 is reasonable to expect by chance, and then I determine that the contrasting items are merely viable subcomponents of the test, I am still left with data suggesting degradation of measurement. In other words, regardless of my opinion, the Rasch model is having difficulty reconciling these residuals.

So it seems that it really doesn't matter whether I believe the contrasts reveal viable or non-viable subdimensions on the test. Either way, we seem to be left with something that reduces the accuracy of our measures and so these items should always be removed or separated. Yet I know that sounds extreme and inaccurate, and also is contrary to what you've stated.

So, if I feel the subdimensions are valid and the items contained in them should remain as part of the test, have I gone as far as I can go and we must deal with the measures as they are, or are there other options?


MikeLinacre: Uve, imagine the same situation in physical measurement. I am on a cruise liner at sea, and I decide to weigh myself. The estimate of my weight will be degraded by the movement of the ship. What to do? Surely the degradation is too small to matter to the inferences I will base on the estimate of my weight (that I am eating too much!)

In the Rasch situation, if the degradation of the person measures is smaller than their standard errors, then the degradation can definitely be ignored. To investigate this, place a code in the item labels to indicate their contrast, and then do a DPF (differential person functioning) analysis. This will report the logit size of the contrast for each person.

uve: Thanks!

518. Item Banking

uve April 19th, 2011, 1:07am: Mike,

We would like to be able to change tests with items that have been field tested from previous administrations. Our goal is to have a large item bank from which to pull replacement items, or perhaps simply to rotate out items.

Let's assume a test is built from this item bank and we have measures for each of these items from the field testing process. My apologies for all the questions, but this concept is completely new for me. Here are my questions:

1) Does an item measure of, say, -1.23 on a test that ranged from -5 to +5 logits mean the same as an item with exactly the same measure, for exactly the same subject, that came from a test with a range of, say, -7 to +6?

2) Should I make sure that all the measures average to 0 in order to frontload a better fit with the Rasch expectations?

3) How would I target the difficulty of the test if, using CTT terminology, I wanted it to have an average p-value of .65? In other words, I don't want to make the test too easy or hard, so how would I target a specific overall test difficulty?

4) With logit measures for banked items, is it possible to create a new test that discriminates persons in the manner in which I feel is needed?

I would greatly appreciate any additional resources you could provide that address how to best take advantage of Rasch measures to construct new versions of tests.


MikeLinacre: Thank you for these questions about item banking, Uve.

There are several ways to add new items to the bank. For instance,
1) Use common-item or common-person equating to align the difficulties of new items with bank items.
2) Pilot test the new items in tests based on the bank of items. Administer the new items in tests mixed in with the bank items. Give the new items a weight of zero, so they do not influence the person measures, but are reported with difficulty measures and fit statistics.

See https://www.rasch.org/memo43.htm for some advice, and there is much more on the web.

Q1: item difficulties are best estimated from a sample roughly targeted on the item (within 2 logits of its difficulty). If there are, say, 100 persons in each sample within 2 logits of -1.23, the item difficulty estimate is probably precise enough for practical purposes.

Q2: Rasch has no expectations about the average measures. The measurement is more precise if the sample is roughly targeted on the items.

Q3: If you have a rough idea of the average ability of your sample, then choose items roughly 1 logit (on average) less difficult than they are able. For a p-value of .65, the exact logit difference would be ln (.65/(1.00-.65)) = 0.6 logits less difficult than the persons are able.

Q4: Logit measures are ideal for item selection, because they are independent of any particular sample. In general, school grades are about 1 logit apart. So we can easily select items that form tests that are 1 logit (on average) more difficult for each grade.
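
The Q3 calculation can be sketched in Python. The function names are hypothetical, and the mean ability of -0.60 logits in the example is only an assumed value for illustration.

```python
import math

def logit_offset(target_p):
    """Logits by which persons must exceed item difficulty to reach target_p success."""
    return math.log(target_p / (1.0 - target_p))

def target_item_difficulty(mean_ability, target_p):
    """Average item difficulty that gives roughly target_p success for the sample."""
    return mean_ability - logit_offset(target_p)

offset = logit_offset(0.65)
print(round(offset, 2))

# Assumed example: a sample with mean ability of -0.60 logits.
print(round(target_item_difficulty(-0.60, 0.65), 2))
```

A target p-value of .50 gives an offset of zero, that is, items whose average difficulty equals the average ability.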

George Ingebo's "Probability in the Measure of Achievement" gives plenty of practical advice about item banks, based on the experience of the Portland, Ore. Public Schools. They have been using item banks for 30 years. www.rasch.org/ingebo.htm

uve: Mike,

Great information! Thanks as always.


519. 'Bivariate' Rating Scale model?

chaosM April 15th, 2011, 10:42pm: Hello there!
I have a problem for which I was hoping somebody could give some ideas or directions towards a solution.
For almost a year now, I have been trying to model, within the Rasch model framework, a group of items that are composed of two parts: the first asks the person to define the direction of a change (in life or learning style) with three options, and, linked to that, a second part asks the person to rate their feelings towards this change (on another 3-point scale: positive, neutral, negative). An example item is:
First part: "I have to do (1) more, (2) about the same amount of, (3) less, private study at school". Second part: "How do you feel about it? (1) Negative (2) neutral (3) positive."

There are 13 items of this nature, and the idea is that they measure an underlying construct which I am trying to model with the Rasch model. I have looked extensively into the literature and was not able to find anything similar.

I was wondering if anybody can suggest an appropriate model to run in such (probably bivariate) situations? That would help a lot.
What I have done so far in an attempt to solve this problem is to split each of the 3 possible responses of part 1 into 3 different items, rating the second part on each (hence for every response I get 2 missing responses for the other two items). I analyse this with the Rasch rating scale model and get a meaningful scale. However, I think I may be violating the assumption of local independence (for a start!).
I would appreciate any thoughts and comments on this as well.

Many thanks in advance

Best wishes

MikeLinacre: Maria, don't worry about "local independence" until you know you are measuring what you want to measure. Local dependence may slightly distort the measures, but it does not change the meaning of the measures.

So, first, please define the latent variable you want to measure. There seem to be two here: 1) the requirement and 2) the emotion.

Imagine the same situation in a weight-loss program. There could be a set of questions about why we need to lose weight, and another set of questions about our feelings about losing weight. For instance:
"I have to lose (1) a little weight, (2) a lot of weight, (3) a lot of weight quickly"
"How do I feel about this situation? (1) Negative, (2) Neutral, (3) Positive"
We would see immediately that there are two distinctly different variables. We would measure them separately and would expect to see a negative correlation between them (more weight loss = more negative feeling), but that would be a finding. We would not want to combine "need" with "feeling" in advance of our analysis. So this suggests two separate Rasch analyses. OK, Maria?

chaosM: Thanks for the quick response Mike!
I have actually tried this (measuring separately) and it gives meaningful measures.
However, I am still pursuing one measure for the combination of the responses.
Assuming this makes sense, is there any other extension of the Rating Scale model that can be used for such situations?

Once again thanks.
Best wishes

MikeLinacre: Maria, perhaps a "multidimensional Rasch model". This is implemented in the ConQuest software: www.rasch.org/software.htm

chaosM: Thanks Mike. I will give this a try then!

520. Secondary dimensions on a Rasch PCA

SueW April 15th, 2011, 7:20pm: I have statistics which suggest two secondary dimensions.
The total variance explained by the measure was 53.9% and the unexplained variance was therefore 46.1%, equating to 22 units (eigenvalues).
The eigenvalue of the first contrast is 3.1 (rounded to 3) which indicates it has the strength of about 3 items. This largest contrasting factor explains 6.6% of the total variance.
The eigenvalue of the second contrast is 2.8 (rounded to 3), which, like the first contrast, is more than 2 (so it too may be considered a dimension); it has the strength of about 3 items and explains 5.8% of the overall variance.

1. Are these two contrasts termed ‘dimensions’, ‘contrasts’ or ‘factors’, or all three?
2. To find these ‘dimensions/contrasts/factors’ I assume I look at the Rasch PCA plot and also produce a factor loadings table where there are two ‘sets’ of items with factor loadings. These items in the factor loadings table happen to coincide with the two ‘clusters/dimensions/factors/sets of items’ on the Rasch PCA map. So I assume then that these are the two secondary dimensions and not two clusters of one dimension?

Please could you confirm.



MikeLinacre: Thank you for your questions, Sue.

Your first contrast has the strength of about 3 items. It is something off the Rasch dimension, but what? To discover this, look at the content of items A,B,C in the first contrast plot, and also the content of items a,b,c. What is the difference between those items? Are they different dimensions, like "math" and "history"? Or are they different content strands, like "addition" and "subtraction"? Or are they different item types, like "true/false" and "multiple-choice"? Or what?

When we know why those items are different, then we can decide what action to take:
1) do nothing: the differences are the natural variation in the content area
or 2) omit the irrelevant items
or 3) split the items into two sets of items, one for each dimension
or 4) .....

SueW: Thanks Mike

Just to repeat what I said in my email to you Mike and to kind of let people know. Your answer was very helpful. What I get from this is that to interpret the eigenvalues of the two apparently contrasting dimensions, one has to, as you say, look at the contrast plot and the factor loadings table. So decisions are a little subjective. But what I found most helpful is the fact that you have essentially said that we are able to decide that the two contrasts in the PCA plot and in the factor loadings table could be two clusters of one of the dimensions or could be the two dimensions; it all depends on their content, loadings etc. That is the important thing I have learnt here; that there are no strict rules to say how one interprets the loadings and PCA plot; in my case they can be seen to be two parts of one dimension or two dimensions.


SueW: Hi Mike and all

I would like you and others to know, Mike, that since my last posting I have a better understanding of contrasting dimensions (aka secondary dimensions, contrasts, secondary factors) and wish to slightly amend my last post. I now realise more fully that they are called contrasting dimensions because they are literally in contrast to the measure; in contrast to the explained variance; they are assessing something 'other' than the main measure. So to reiterate what I said in my last post, I have described my newer understanding below. Please excuse the repetitive nature of some of my terminology but please know I have only done this in order to remain clear.

1. I have statistics which suggest two contrasting dimensions due to two sets of eigenvalues >2
2. These contrasting dimensions (in my data at least) each have two 'clusters' or 'sets' of items, one positively loaded to the factor and one negatively loaded. So really they are 'opposite poles' rather than just sets, within one contrasting dimension.
3. So, and I feel this is key, within each contrasting dimension I have two sets of items where each set is different to the other. In my case, they are polar opposites. When I posted my last post I did not fully understand this and I mistakenly thought these two sets of items were my two contrasting dimensions simply because they differed from each other. In fact these two sets of items are two clusters or sets within one contrasting dimension. However, I realise, Mike, as you suggest, you can sometimes see these 'sets' or 'clusters' or 'poles' also as dimensions; kind of like two opposing mini-dimensions (where each differs from the other) within one contrasting dimension. I am now wondering if we can all agree on one term which describes these 'sets', 'clusters', 'poles', 'mini-dimensions' to save confusion in the future.
4. To understand my two contrasting dimensions, I realise that, I need to look at my Rasch PCA output (Table 23). When I originally wanted to understand my first contrasting dimension, in my absent-mindedness, I had not scrolled down to see that there are also plots and loadings tables for all contrasting dimensions. Since my last post I have realised this and have found a plot and factor loading table for my second contrasting dimension.
5. In sum, I now realise that my contrasting dimensions each contain two groups of items where one group of items differs from the other group of items within the same contrasting dimension. Added to this is the fact that I have two contrasting dimensions (which initially confused me immensely).

I hope this helps clarify and maybe helps others. Please amend Mike if I have this wrong or as you see fit.


521. Better Fitting

uve April 14th, 2011, 5:15pm: Mike,

Sometimes in table 10.1 I see the term "Better Fitting Omitted." What does this mean?


MikeLinacre: Uve:

Winsteps Table 10 is intended to focus our attention on misfitting (under-fitting and over-fitting) items. If you want all the items to be displayed, put this line in your Winsteps control file:
FITI=0

uve: Mike,

What are the criteria for "Better Fitting"?


MikeLinacre: Uve, according to Winsteps Help:

"For Table 10, the table of item calibrations in fit order, an item is omitted only if the absolute values of both [INFIT and OUTFIT] t standardized fit statistics are less than FITI=, both mean-square statistics are closer to 1 than (FITI=)/10, and the item point-biserial correlation is positive."

The default value of FITI= is 2.
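
The omission rule quoted above can be read as the following Python sketch. This is only an interpretation of the quoted Help text, not Winsteps' own code; the function and argument names are hypothetical.

```python
def omitted_from_table_10(infit_t, outfit_t, infit_mnsq, outfit_mnsq,
                          point_biserial, fiti=2.0):
    """Return True if an item would be left out as "better fitting".

    An item is omitted only if ALL of these hold:
      - |INFIT t| and |OUTFIT t| are both less than FITI=
      - both mean-squares are closer to 1 than FITI= / 10
      - the point-biserial correlation is positive
    """
    small_t = abs(infit_t) < fiti and abs(outfit_t) < fiti
    mnsq_close = (abs(infit_mnsq - 1.0) < fiti / 10
                  and abs(outfit_mnsq - 1.0) < fiti / 10)
    return small_t and mnsq_close and point_biserial > 0

# Hypothetical items: one fits well, one has an outfit mean-square of 1.28.
print(omitted_from_table_10(0.5, -0.3, 1.05, 0.95, 0.40))
print(omitted_from_table_10(0.5, 1.9, 1.05, 1.28, 0.40))
```

Under this reading, setting fiti to 0 means no item can satisfy the conditions, so every item is displayed.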

522. writing journal articles

pjiman1 April 12th, 2011, 3:07am: I wanted to inquire about the experience of writing reports that use the Rasch model. For papers about instrument development, I’ve been using the framework by Smith (2004), Wolfe and Smith (2007) and the Medical Outcomes Trust Scientific Advisory Committee (1995) to demonstrate how the results of a Rasch analysis can be used to provide evidence for the different types of validity proposed by Messick (1995). I am finding that I am using many pages to explain the Rasch model, to justify its use, to describe the analysis procedure, and to present the results and tables, plus references to Rasch papers. I also find that the Rasch model produces a plethora of results, all of which provide good fodder for discussion about the quality of our instruments. So the results of the Rasch analysis provide plenty for me to write and reflect on and I want to include the findings and relevant discussions in the paper.

My problem is that journals have page limits, usually up to 35 pages, including references. I submit my work to journals that normally do not use Rasch, so I find that I take up many pages explaining the method. Because the Rasch output provides much information worth discussing, I find myself having to sacrifice discussion points to ensure I have enough pages to explain and describe Rasch analysis. I am wondering if there is any way that I can shorten, say, the explanations of Rasch and the analysis procedure, by making reference to other works? Or are there other ways that folks have written up their Rasch procedures and results in a few pages so that there are pages available for discussion?

MikeLinacre: Thank you for your questions, Pjiman1.

Here's an exemplar ... http://www.uky.edu/~kdbrad2/Rasch_Symposium.pdf

As the marketing experts say, "We must sell the sizzle!" - a brief overview of the method with suitable references (e.g., Bond & Fox, Applying the Rasch Model), and then a really sizzling substantive story. For instance, a powerful diagram of the item hierarchy rather than obscure tables of numbers; a few words of explanation and a reference rather than an algebraic formula.

Journals may accept 35 pages, but, as a paper reviewer, when the number of pages exceeds 25 then I already know that the paper should be condensed if there is to be any hope of maintaining the reader's interest.

pjiman1: thanks Mike, this is very helpful

523. Relative size of standard error on measurement

joliegenie April 12th, 2011, 4:17pm: Hello all,

we've been doing some Rasch analysis using Facets. The following table shows the summary statistics for our items once the data fit the model. Is the model S.E. value of 0.81 that we obtain too large? Does it mean we don't reach good fit? Or that we can't use our results for future analysis?

We know that we should usually try to reach better precision on our estimates, but we are analysing data in a context "out of the ordinary" and it is hard to evaluate if such a value may be a problem or not.

|   Total   Total   Obsvd  Fair-M|        Model | Infit      Outfit   |Estim.| Correlation |                   |
|   Score   Count Average  Avrage|Measure  S.E. | MnSq ZStd  MnSq ZStd|Discrm| PtMea PtExp | Num heure_mesure  |
|   172.2   198.9      .9     .90|    .00   .81 |  .77  -.2   .19  1.6|      |   .37       | Mean (Count: 108) |
|    66.3    57.3      .3     .27|   6.60   .41 |  .34   .7   .30   .9|      |   .20       | S.D. (Population) |
|    66.6    57.5      .3     .27|   6.63   .41 |  .34   .7   .31   .9|      |   .20       | S.D. (Sample)     |
Model, Populn: RMSE .90 Adj (True) S.D. 6.54 Separation 7.22 Strata 9.96 Reliability .98
Model, Sample: RMSE .90 Adj (True) S.D. 6.57 Separation 7.26 Strata 10.01 Reliability .98
Model, Fixed (all same) chi-square: 19712.1 d.f.: 107 significance (probability): .00
Model, Random (normal) chi-square: 107.5 d.f.: 106 significance (probability): .44

To give you all an idea, our "items" are in fact times of measurement, one every five minutes. The "answers" to those items are mean physical activity levels. Moreover, as opposed to real questionnaires where respondents usually answer all items, here the responses vary according to the schedule of each person. Some arrive early in the morning and leave in the middle of the afternoon, while others arrive later (end of morning) and leave around 5 or 6 pm. As a result, the number of "answers" per "item" may vary a lot for the different moments of the day. In addition, to reach goodness of fit, many unexpected data had to be removed from the sample. So, is the S.E. value large because the number of answers is not sufficient to obtain precise estimates?

Also, the usual logit scale goes from -3 to +3. Ours goes from about -15 to +15... Since the range of estimates is quite large, is the 0.81 value still to be considered large, or is it reasonable? On what basis can we judge whether the standard error is acceptable or too large?

Thanking you all in advance for any help you can give! :)


MikeLinacre: Thank you for your question, Joliegenie.

The extremely high reliability of 0.98 indicates that the precision of measurement (= S.E.) is much less than the spread of the measures (=S.D.). The size of the S.E. is definitely not a problem.
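The summary lines in the table above are tied together by the usual Rasch separation formulas, which is why an S.E. near 0.8 is small against an S.D. above 6. A minimal Python sketch, assuming the standard definitions (separation = true S.D. / RMSE, reliability = separation^2 / (1 + separation^2), strata = (4 x separation + 1) / 3); the function names are illustrative and are not part of Facets or Winsteps:

```python
def separation(true_sd, rmse):
    """Spread of the measures expressed in units of average measurement error."""
    return true_sd / rmse

def reliability(sep):
    """Share of observed variance that is 'true' variance."""
    return sep ** 2 / (1 + sep ** 2)

def strata(sep):
    """Number of statistically distinct levels the measures can support."""
    return (4 * sep + 1) / 3

# Rounded values from the "Model, Populn" line: RMSE .90, Adj (True) S.D. 6.54.
# This gives about 7.27; the table reports 7.22, presumably from unrounded inputs.
print(round(separation(6.54, 0.90), 2))

# Using the reported separation of 7.22:
print(round(reliability(7.22), 2))  # 0.98
print(round(strata(7.22), 2))       # 9.96
```

The last two values reproduce the Reliability .98 and Strata 9.96 shown in the table.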

"Model S.E." is effectively independent of the fit of the data to the model. It is dominated by 1) the number of observations for each element, 2) the targeting of the items on the respondents, 3) the number of categories in the rating scale (2 for dichotomies).

524. PCA Summary

uve April 11th, 2011, 7:10am: Mike,

I would be most grateful to get your input to my following comments on PCA:

1) Traditional PCA using observed scores relies on high loading items on the component to provide some degree of measure of influence of that component on our testing instrument.

2) Winsteps PCAR provides a measure of potential off dimension variance from the perspective of the substantive nature of items that load high on the component in contrast with those items that load low on that same component.

3) If we determine that the contrast is not significant because the high and low loading groups of items seem to be close to the construct, I imagine that we would find more often than not that the loadings are closer together in positive/negative loading values and represented so on the scree plots.


MikeLinacre: Uve:

1. and 2. - Yes.

3. - I don't know. My attempts to generalize about PCAR have been largely unsuccessful. Perhaps you will be more successful :-)

525. Why no rotation?

uve April 11th, 2011, 6:43am: Mike,

Why is rotation, orthogonal or otherwise, not done for PCAR in Winsteps?


MikeLinacre: Uve, our intention is not to explain the residuals, but to refute the hypothesis that the residuals can be explained. Accordingly, we model all the inter-item variance as explainable (PCA not Common Factor Analysis), and organize the components (contrasts) to explain as much variance as possible in descending sequence (orthogonal, unrotated).

If you would like to perform other types of factor analysis based on the Rasch residuals, then please use the ICORFILE= item correlation matrix as input to your factor analysis.

526. guidelines for writing items

pjiman1 April 12th, 2011, 3:08am: what are some guidelines that can help us write good items for the purpose of measuring a psychological (or any) construct?

Based on my experience thus far with instrument development and studying measurement, I am convinced that the quality of our ability to measure psychological constructs rests with developing good items. We use items on instruments to communicate what we know about the varying levels of a psychological construct. The only way we know if a person has a higher quantity of a latent variable is to determine if that person will do well on items compared to a person with a lower quantity of a latent variable. So we have to develop good items to differentiate between those two persons. We develop questions of varying difficulty for use in standardized achievement tests. By the same token, our ability to measure psychological constructs rests with having good items.

I believe that the art of writing good items is often overlooked when developing instruments that measure psychological constructs. Often, it seems that instruments are interested in identifying categories, that is, how many sub-constructs make up a construct. I don’t think items are written to detect the variation of high or low person abilities for a construct; rather I think items are written to identify a construct.

I’m wondering if there are guiding questions that have been helpful when writing items. I am reminded of one guiding question by Myford - “are these items the most important things we want to ask about what we want to measure?” I am wondering if there are other guiding questions that can be helpful when writing items.

I’ve consulted the following sources:
Constructing Measures: An Item Response Modeling Approach, Mark Wilson
Linacre, J.M. (2000) Redundant Items, Overfit and Measure Bias. RMT 14(3) p.755.
Also, book chapters by Smith and Smith (2004; 2006) are helpful.

With so much resting on the instruments, and that instruments are essentially no better than the items that comprise that instrument, it is important that we write good items. So I’m wondering if there are helpful guiding questions that researchers use when writing items. Also I am interested if there are journal articles that discuss this issue.
Thanks for your input.

527. Item selection

addicted2stats April 5th, 2011, 6:47pm: I am trying to run a Rasch model to reduce 50 items. I am running into an issue because there is no one who endorsed all 50 (my highest score is 49). The individual with 49 did not respond to item 41, so when I try to run my model, it will not converge unless I exclude item 41 initially. I don't want to exclude it because ~18% of the sample did acknowledge this item.

For example, let's assume I have the following 3 scores on a 15 item test:

111111111011111 for a score of 14
111111111110100 for a score of 12
111111111100000 for a score of 10

I would have to drop item 10 because my high scorer didn't get it even though two others did (reducing their scores to 11 & 9 respectively). Is this a principle in Rasch, or would a different program (e.g., Winsteps) allow me to include item 10 (I'm using Stata)? Stata forces me to remove item 10 in order to compute tests and converge appropriately.

Thanks in advance.

MikeLinacre: a2s, special-purpose Rasch software, such as Winsteps, allows missing observations.

However, why not omit the person who scored 49? Since that is an extreme score (on 49 items), it is uninformative for item reduction.

addicted2stats: The problem isn't the person who scored 49. It's that the model won't converge since the person has 1 item that wasn't acknowledged and that 1 item is left in the model. If I drop the person, my next highest person has 44 yes'. In order to get this model to converge I have to drop the 6 items they didn't acknowledge.

I just ran a subsample in ministeps and it looks like it converges fine even when this occurs, so perhaps it's an issue with stata. Regardless, I'm having our secretary submit a purchase order for winsteps.

MikeLinacre: Thanks, a2s. In a project last year, Winsteps converged with 99% missing data: https://www.rasch.org/rmt/rmt233d.htm

addicted2stats: I got Winsteps, and am working on reducing our questionnaire. I have a couple questions I was hoping someone could help me with. We're developing a questionnaire of marijuana problem severity. There are a lot of people (~20% of the sample) with 0 problems. I'm wondering if it is appropriate to use the statistics for non-extreme sample in this case. If not, under what circumstances would you use this? Also, does a person separation index of 1.5 indicate that I am not adequately assessing people in the lower domain of problems? Thanks for the help

MikeLinacre: Thank you for your email, A2S.

"There are a lot of people (~20% of the sample) with 0 problems." - These people are too problem-free to be measured by this questionnaire. So their measures and standard errors have been estimated using a Bayesian adjustment.

Imagine the same situation in physical science. We have a 6-foot (or 2-meter) ruler and we are measuring the heights of a sample of people. We discover that 20% of them are taller than the ruler. For these people, we guess a height of 6ft 3 inches and a matching standard error of +- 3 inches. Do we include or exclude these "too tall" people from our analysis of the sample? If we want to estimate the precision of our ruler, then we exclude them. If we want to estimate the average height of our sample, then we include them.

So, in your situation, if we want to estimate the effectiveness of the questionnaire for people in our sample for whom the test is relevant, we exclude extreme people. If we want to estimate the effectiveness of the test for samples of people like our sample, then we include the extreme people.
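The include-or-exclude decision can be made concrete by separating extreme from non-extreme persons before summarizing. A minimal Python sketch; the raw scores below are hypothetical and `split_extreme` is an illustrative helper, not a Winsteps function:

```python
def split_extreme(raw_scores, max_score, min_score=0):
    """Separate persons with minimum or maximum possible scores from the rest."""
    non_extreme = [s for s in raw_scores if min_score < s < max_score]
    extreme = [s for s in raw_scores if s <= min_score or s >= max_score]
    return non_extreme, extreme

# Hypothetical raw scores on a 50-item yes/no questionnaire (0 = no problems reported).
scores = [0, 0, 3, 7, 12, 0, 25, 49, 5, 1]
measurable, off_scale = split_extreme(scores, max_score=50)

print(len(measurable), "persons for judging how well the questionnaire measures")
print(len(off_scale), "extreme persons, kept only when describing the whole sample")
```

Statistics for the first group answer "how precise is the ruler?"; statistics for both groups together answer "what does a sample like this one look like?".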

Separation of 1.5 indicates that the questionnaire has the statistical power to discriminate between high and low scorers on the questionnaire in samples like this sample, but that is all. If you need greater discrimination, then you need more questions, and since 20% have no problems, some of these questions need to probe more trivial problems.

But obviously this contradicts your goal of reducing the questionnaire. Statistically the questionnaire is already at its minimum size. The questionnaire can only be reduced if the number of categories in the rating scale of each item can be increased in a meaningful way.

addicted2stats: This helps tremendously! Thank you very much!

528. Table 23.99

SueW March 29th, 2011, 5:59pm: Hi Mike,

Firstly, can I thank you for the latest Winsteps guide (version 3.71) and especially for Section 18.23 which explains dimensionality and Rasch PCA. This has really spelt it out for me what Rasch PCA is about. Thank you also for the explicit example on p.459 of a Rasch PCA. This had helped me so much. However I just have one or two questions below which I need a little help with if you would be so kind. For greater clarity I have written them below.

When I look at my Rasch contrast plot along with item factor loadings I see that items 8, 9, 6 and 1 are all off-dimension (positive loadings > .4). Similarly, items 12, 19, 22, and 5 are off-dimension with negative loadings of -.4 or more. Having explored the wording of these items I can interpret what the two clusters of this secondary factor mean; it makes sense.

However, I want to know just how off-dimension these items are and I want to understand table 23.99 so I have a couple questions below.

Question 1. In my table 23.99 below, an off-dimension item correlates with an on-dimension item; Item 9 (off) correlated with Item 13 (on) with a corr of .42. Can I conclude from this that Item 9 is not independent (is not so off-dimension simply because it correlates with an item on the dimension)?

Question 2. Similarly, if two off-dimension items (Items 8 and 9) from the same cluster correlate with a low correlation of .35, does this question the integrity of the secondary dimension - (although .35 is not completely insignificant)?

| .42 | 9 RDI09 | 13 RDI13 |
| .37 | 11 RDI11 | 12 RDI12 |
| .35 | 21 RDI21 | 23 RDI23 |
| .35 | 8 RDI08 | 9 RDI09 |
| .34 | 10 RDI10 | 16 RDI16 |
| .33 | 1 RDI01 | 8 RDI08 |
| -.40 | 1 RDI01 | 10 RDI10 |
| -.37 | 8 RDI08 | 12 RDI12 |
| -.35 | 16 RDI16 | 23 RDI23 |
| -.32 | 1 RDI01 | 16 RDI16 |



MikeLinacre: Sue: none of these correlations look large to me. Remember that "common variance = correlation^2", so items 9 and 13 only share .42*.42 = 18% of the variance in their residuals in common. 82% of each of their residual variances differ.

In this Table we are usually only interested in correlations that approach 1.0 or -1.0, because that may indicate that the pairs of items are duplicative or are dominated by a shared factor.
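The "common variance = correlation^2" arithmetic can be applied to every pair in the table. A short Python sketch using three of the correlations quoted above; `shared_variance` is just an illustrative name:

```python
def shared_variance(r):
    """Proportion of residual variance two items have in common."""
    return r * r

# Residual correlations copied from the Table 23.99 excerpt in this thread.
pairs = {
    ("RDI09", "RDI13"): 0.42,
    ("RDI08", "RDI09"): 0.35,
    ("RDI01", "RDI10"): -0.40,
}

for (item_a, item_b), r in pairs.items():
    print(f"{item_a}-{item_b}: r = {r:+.2f}, shared variance = {shared_variance(r):.0%}")
```

Even the largest correlation here leaves more than 80% of the residual variance unshared, which is why none of these pairs looks duplicative.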

SueW: Thanks Mike

So pretty much all these items in the table are locally independent in that none of them share sufficient variance with another item. So if an off-dimension item has a lowish correlation (.42) with an on-dimension item it gives an indication that the off-dimension item is more off than on.

However, if I had found that many off-dimension items correlated with many on-dimension items then this would be evidence of unidimensionality.


MikeLinacre: SueW: Oops! We are looking at correlations of residuals, not of observations.

We expect correlations of residuals to be zero! Two on-dimensional items will correlate zero. An on-dimension and off-dimension item are expected to correlate zero. Only two off-dimensional items (on the same off-dimension) are expected to have a high correlation.

SueW: Thanks Mike - sorry I did not get back earlier - I messed my email notifier up.

Just to adjust what I said earlier I have an on-dimensional item (13) and an off-dimensional item (9) correlating at .42.
My interpretation is that the largish amount of residual variance that exists in the off-dimensional item is correlating with the slightly smaller amount of residual variance in the on-dimensional item. So their residuals have some small but significant association.
I interpret this to mean that the on-dimensional item is actually close to being off-dimension. When I check this out with the factor loadings of residuals this is confirmed because the on-dimensional item is close to being off; its loading is .29 and it is borderline off.
I also checked the Rasch PCA plot to find why these two sets of residuals might have a similar pattern; both items are located at the same point on the measure but at a different point on the contrasts, but not a huge difference on the contrast (a difference of 2 logits which is not huge given that the biggest contrasting items have a difference of 7 logits). So this is why these items' residuals correlate. Their patterns of residuals happen to be similar and they are at the same place on the measure (making their residuals correlate) but their contrast is different enough to make one on-dimension and one off-dimension, yet not so different that their residuals have no association at all, which explains why the correlation between these items' residuals is lowish (.42).

MikeLinacre: Thank you, SueW. Glad to see that you have investigated the situation thoroughly and carefully :-)

SueW: Thanks Mike,

May I also say this is the first time I think I have appreciated more fully what Rasch PCA does. Nothing in the classical approach comes close to analyzing data like this.

Here's to Rasch PCA!


529. Factor versus PCAR

uve March 30th, 2011, 11:11pm: Mike,

I had Winsteps export the item residual correlation matrix to SPSS so I could compare its output to that of Table 23. I set the rotation in SPSS to varimax and have combined its output with Table 23 in the attached PDF.

The SPSS data are virtually identical as Table 23. For example: on page 1 of the PDF, for the first contrast notice how item 67 loads at .76 and item 2 at -.43. On page 4 you'll see this is the same as the data given in Table 23. Likewise, for the second contrast item 60 loads at .65 and item 47 at -.58 which is also virtually identical to Table 23 on page 6.

In Winsteps Help you state, "Please do not interpret Rasch-residual-based Principal Components Analysis (PCAR) as a usual factor analysis. These components show contrasts between opposing factors, not loadings on one factor."

The SPSS output does not seem to have the data on opposing factors but on the same factor for component 1 and the other for component 2. And since its output is essentially the same as Table 23, I would mistakenly interpret the data in that manner.

Why can't we interpret the data in this manner?


MikeLinacre: Uve, the choice of analytical tools and the interpretation depend on our purposes. Please interpret the Winsteps output in whatever way is most meaningful to you.

uve: Yes, interpretation can be the hardest part. I guess I was just hoping to get more information from you on why in Winsteps the loadings for a contrast are to be interpreted as being on opposing factors instead of loadings on a single factor, especially in light of the fact that SPSS seems to be reporting the factor loadings in just that latter manner.

The reason I ran the matrix through SPSS was because I was expecting to see the negative loading items on a different factor or component and wanted to know what it was. I was surprised to see them on the same component/factor and am just scratching my head at the moment.

MikeLinacre: OK, Uve. We are analyzing residuals. The Rasch dimension has been removed. We are looking at other possible dimensions. On each of these other dimensions, one of its poles is causing residual correlations and the other of its poles is causing contrasting correlations. We identify the substantive meaning of the dimension by looking at the items at each pole.

For instance, on a math test. There is the "math ability" Rasch dimension, but there is also the "abstract" -> "concrete" dimension. When we look at the items, we may see this as "algebra" at one pole and "word problems" at the other pole. We can conceptualize this as two dimensions: "algebra" ( -> neutral) and "word problems" ( -> neutral). But we don't know what aspects of "algebra" and "word problems" are causing the dimensionality until we look at the contrasting pole.

In a subject we understand well, such as "math", we might say: "All this is obvious". But in a new area, we might say: "What is it about items A,B,C that makes them off-dimensional?" We have to look at the contrasting items, a,b,c, to discover the answer.
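The two-pole idea can be seen in a toy example: the first unrotated principal component of a residual correlation matrix gives positive loadings to one cluster of items and negative loadings to the contrasting cluster. A dependency-free Python sketch, assuming a hypothetical 4-item residual correlation matrix; power iteration is used here only to keep the example short and is not a description of how Winsteps or SPSS compute their output:

```python
import math

def first_component(matrix, iterations=100):
    """Largest eigenvalue and loadings of a symmetric matrix by power iteration."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit-length vector gives the eigenvalue.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    loadings = [x * math.sqrt(eigenvalue) for x in v]
    return eigenvalue, loadings

# Hypothetical residual correlations: items A, B ("algebra") against c, d ("word problems").
residual_correlations = [
    [1.0, 0.5, -0.5, -0.5],
    [0.5, 1.0, -0.5, -0.5],
    [-0.5, -0.5, 1.0, 0.5],
    [-0.5, -0.5, 0.5, 1.0],
]

eigenvalue, loadings = first_component(residual_correlations)
print(round(eigenvalue, 2))               # size of the contrast in item units
print([round(x, 2) for x in loadings])    # A, B at one pole; c, d at the other
```

All four items load on the same component, but with opposite signs, so the component is read as a contrast between the two clusters rather than as a single factor shared by all four.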

uve: Mike,

Thank you for your insight and patience with me. Each time I get closer to a better understanding of the concepts thanks to your help.


530. Factor sensitivity ratio

SueW March 29th, 2011, 6:57pm: I have two contrasting factors; one is 3.1 units and the other is 2.8 (the rest are all less than 1.7). The total unexplained variance is 22 units (items) and the total explained variance is 25.7.
Question, when calculating the factor sensitivity ratio should I combine the two contrasting factors (which = 5.9) then calculate the ratio of 5.9 (the contrast) to 25.7 (the explained variance). Or should I do them one at a time?



MikeLinacre: Sue, the "factor sensitivity ratio" (Bond & Fox, 2007, p. 256) is defined as (variance explained by this component / variance explained by the Rasch model).

Imagine the same situation in a math test. We discover that the variance explained by the children's math ability is 25.7 units. The residual variance explained by their heights is 3.1 units. The residual variance explained by their weights is 2.8 units.

If we want the "height factor sensitivity ratio" then 3.1/25.7, but if we want the "height+weight factor sensitivity ratio" then ( 3.1+2.8 )/25.7.

Sue, please look at the substantive meaning of each contrast (the difference in meaning between the items at the top of the contrast plots and the bottom), and then decide what numbers will best communicate your message to your audience.
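For reference, the two candidate ratios can be computed directly from the numbers in the question. A small Python sketch; `factor_sensitivity_ratio` is an illustrative name for the ratio as defined above:

```python
def factor_sensitivity_ratio(contrast_units, rasch_explained_units):
    """Variance explained by the component(s) relative to variance explained by the Rasch model."""
    return contrast_units / rasch_explained_units

explained_by_measures = 25.7   # units explained by the Rasch measures
first_contrast = 3.1
second_contrast = 2.8

print(round(factor_sensitivity_ratio(first_contrast, explained_by_measures), 3))
print(round(factor_sensitivity_ratio(first_contrast + second_contrast, explained_by_measures), 3))
```

The first contrast alone is about 12% of the explained variance; the two contrasts together are about 23%. Which figure to report depends on whether the two contrasts share a substantive meaning.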

SueW: Thank you kindly Mike

That explains it perfectly. What you are saying is the numbers are one thing but they have to make sense in a meaningful way. Whether I decide to combine the two contrasts will depend on the qualitative meaning of them both in relation to each other and the measure.

Yes that makes sense. I get caught up with the numbers sometimes and don't always look at the meaning.

Thank you for your time and commitment


531. confusing with kidmap gen by winsteps

NathalieQIU March 29th, 2011, 4:49am: Hello, Mike. I would like to use Winsteps (3.71.0,1) to produce the kidmap. It is a really great update. But it confused me when I tried to generate the kidmap for polytomous items with Partial Credit. The questions are: (1) How are the threshold difficulties calculated? Are they Thurstone thresholds or Rasch step thresholds in Winsteps? (2) How to classify the thresholds measurement into four categories, I mean, 'Easy items answered correctly', 'Easy items answered incorrectly', and so on. What is the logic? Thank you very much. PS. I think the Kidmap should be described in more detail in the Winsteps Manual.

MikeLinacre: Nathalie, thank you for your comments about the Kidmaps. Based on comments like yours, in the next release of Winsteps (very soon now), the polytomous kidmaps will be restructured.

1) Winsteps uses the Rasch-Full-Point thresholds on the Kidmaps. The full-point-threshold is the point at which the category has the maximum probability of being observed, and it is also the point at which the expected score on the item is the category value. For extreme categories, the threshold locations are based on (lowest category + 0.25) and (highest category-0.25).

In the next Winsteps update, this may switch to the Rasch-Thurstone thresholds. For an observation of 3 on a 1-6 rating scale, this is the point at which the person has a 50% chance of being observed in the categories 3,4,5,6 vs. 50% chance of being observed in categories 1,2.
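The Rasch-Thurstone definition can be checked numerically: find the measure at which the probability of being observed in the category or above is exactly 50%. A Python sketch under stated assumptions: the Rasch-Andrich thresholds below are hypothetical, the item difficulty is set to 0, and the bisection search is just one convenient way to locate the point, not a description of the Winsteps algorithm:

```python
import math

def category_probabilities(theta, item_difficulty, andrich_thresholds):
    """Rasch rating-scale / partial-credit probabilities for each category."""
    log_numerators = [0.0]
    for f in andrich_thresholds:
        log_numerators.append(log_numerators[-1] + (theta - item_difficulty - f))
    top = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def thurstone_threshold(category, item_difficulty, andrich_thresholds,
                        lowest_category=1, lo=-20.0, hi=20.0):
    """Measure at which P(observation >= category) = 0.5, found by bisection."""
    index = category - lowest_category

    def prob_at_or_above(theta):
        return sum(category_probabilities(theta, item_difficulty, andrich_thresholds)[index:])

    for _ in range(60):
        mid = (lo + hi) / 2
        if prob_at_or_above(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical Andrich thresholds for a 1-6 rating scale (five thresholds).
thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Point where categories 3,4,5,6 together reach 50% against categories 1,2.
print(round(thurstone_threshold(3, 0.0, thresholds), 2))
```

With these symmetric thresholds the threshold for category 4 falls at the item difficulty itself, because categories 1-3 and 4-6 are then equally probable.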

2) "How to classify the thresholds measurement into four categories?" - Yes, that is something we are trying to resolve. So far there has been no consensus.

Here is the dilemma. Perhaps you have a solution! In the dichotomous Kidmaps, each item appears only once at its difficulty. It is an expected success, unexpected success, expected failure or unexpected failure.

In polytomous Kidmaps, we could show each item only once, based on the observed rating and the full-point threshold. This is probably what you see on your map. If the full-point threshold is below the person's ability, it is an expected low-rating. If the full-point threshold is above the person's ability, it is an unexpected high-rating.

Some researchers prefer to show each item twice on the polytomous Kidmap: once for the observed category and again for the observed category + 1. But observations in the bottom category are not shown, only bottom category+1. Also observations in the top category are shown, but not top category + 1.

What layout would you like, Nathalie?

532. Low Mean Square Interpretation

uve March 27th, 2011, 8:56pm: Mike,

When dealing with fit, high mean square makes intuitive sense to me, but I'm having a problem with interpreting low mean square. You suggest that interpretation can be that one item can answer another or be correlated with another variable. Please correct me if I'm wrong but it is my understanding that fit statistics aren't comparing one question to another or with another variable. My question is: how does low mean square for a particular item capture this redundancy/correlation?


MikeLinacre: Uve, low-mean-squares mean that the responses to this item (or by this person) are more predictable than the Rasch model expects. In Classical Test Theory these would be high-discriminating items and thought of as the best items. In Rasch Theory, these items are locally dependent, summarizing other items. However, they rarely distort measurement. Their main problem is that they are less efficient as measurement devices. This is also recognized in CTT as the "attenuation paradox".

So, if you are constructing a new test, items with low mean-squares tend to be summary items or items that correlate with other items in clusters. The test would be more efficient if those items are replaced with independent items.

In an existing test, low mean-squares warn us that the test is not measuring as effectively as the statistics (such as reliabilities) may lead us to believe, but the low-mean-square items do not lead to incorrect inferences about the meaning of the measures.
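To see how a single item's mean-square can be low without any explicit comparison to other items, note that the statistic only compares the observed responses with what the person measures predict. A Python sketch of the outfit mean-square for one dichotomous item (the mean of the squared standardized residuals); the person measures and response strings are hypothetical:

```python
import math

def outfit_mean_square(responses, abilities, difficulty):
    """Mean of squared standardized residuals for one dichotomous item."""
    total = 0.0
    for x, b in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(b - difficulty)))  # Rasch expected score
        total += (x - p) ** 2 / (p * (1.0 - p))        # squared standardized residual
    return total / len(responses)

abilities = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # hypothetical person measures (logits)

too_predictable = [0, 0, 0, 1, 1, 1]   # every low person fails, every high person succeeds
contrary = [1, 1, 1, 0, 0, 0]          # the reverse pattern

print(round(outfit_mean_square(too_predictable, abilities, 0.0), 2))  # well below 1
print(round(outfit_mean_square(contrary, abilities, 0.0), 2))         # well above 1
```

The person measures come from all the items together, so an item whose responses follow those measures too closely is, in effect, repeating what the other items already say. That is where the redundancy shows up.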

533. real or model RMSE

pjiman1 March 24th, 2011, 4:25pm: on the summary statistics table 3.1, which summary statistics for person reliability and separation do I report? the real or model RMSE values?

MikeLinacre: Pjiman1, real or model? This depends where in your analysis you are.

At the start of a Rasch analysis, we need to be cautious about what we are seeing. Use the "real" reliability.

As you clean up the data, the "real" and the "model" reliability should become closer. If you get to the point in your analysis that you can say "All the unexpectedness in the data is the randomness predicted by the Rasch model", then use the "model" value.

If in doubt, report the "real" values. Then you will not deceive your audience (and yourself) into thinking that you have measured your sample better than you really have.

pjiman1: much appreciated, thank you.

534. Item and Step anchors

lovepenn March 24th, 2011, 6:49am: I was reading an article by Wright on Time 1 to Time 2 Comparison and Equating (https://www.rasch.org/rmt/rmt101f.htm).
In stage 3, he explained how to obtain Time 2 person measures in Time 1 frame of reference.
He anchored Time 2 item difficulties to Time 1 item difficulties produced from a separate analysis of Time 1 data, but for step anchors, he used the threshold measures obtained by a combined analysis of Time 1 and Time 2 data (stacking), rather than those from a separate analysis of Time 1 data.
My guess is that the threshold measures produced by stacking would be an average of Time 1 threshold measures and Time 2 threshold measures. (please correct me if I'm wrong)
My question is: if I want to use this kind of approach (using Time 1 item difficulties and Time 1&2 step measures as anchor values for Time 2 analysis), do I need some special rationale for this? Can I still say that I am measuring Time 2 person measures in Time 1 frame of reference?

MikeLinacre: Lovepenn, Time 1 vs. Time 2 measurement depends on the situation. Sometimes we want to measure Time 1 in a Time 2 (only) frame-of-reference. Sometimes Time 2 in a Time 1 (only) frame-of-reference, sometimes Times 1 and 2 in a Time 1+2 frame-of-reference.

In Ben Wright's example, the lower categories of the rating scale are observed at Time 1. The higher categories at Time 2. So, in order to obtain the full range of the rating scale, we need to combine Time 1 + Time 2 data. But otherwise, it is Time 1 that is decisive. So he used Time 1 item difficulties with the Time 1 + Time 2 rating scale.

How about this? Perform the Time 1 and Time 2 analyses separately. Compare the rating scale structures. If they are approximately the same, then use the Time 1 rating-scale and the Time 1 item difficulties. But if the rating scales differ, there is no general answer. You must decide what provides the most useful measures.

535. Problem with SAFILE=

lovepenn March 22nd, 2011, 8:47pm: Dear Mike,

I am still using the WINSTEPS version since I missed the chance to download my free update and my free update eligibility is now expired. Before purchasing more years of updates, I have a question about the older version of Winsteps that I'm currently using.

While analyzing my data, I encountered the problem with SAFILE=. I tried to anchor step measures using SAFILE=, but the measures it produced were not anchored values.
For a rating scale model, it worked fine, but for partial credit models and grouped response models, it didn't work. Did I do something wrong? or Is this a program bug that is corrected in a newer version?

Thanks for your response, in advance.

MikeLinacre: Lovepenn, my apologies for this bug. Here is a work around. In the SAFILE= format,
1. No spaces or tabs before the first number on a line.
2. Only one space or tab before each of the next number(s) on the line.

1 10 0.56
1 10 0.56

lovepenn: Mike, I tried it again, but it was not successful. I couldn't figure out why. Any other reasons except two things you mentioned above? Below is the text that I put in my control file, in the order of item number, category number, and Rasch-Andrich Threshold measure. And as you mentioned, it didn't have any space before the first column, and only one space between numbers. Thanks for your help again,

1 0 0.00
1 1 -0.58
1 2 0.58
2 0 0.00
2 1 -0.66
2 2 0.66
3 0 0.00
3 1 -1.52
3 2 1.52
4 0 0.00
4 1 -0.58
4 2 0.58
5 0 0.00
5 1 -1.52
5 2 1.52
6 0 0.00
6 1 -0.64
6 2 0.64

MikeLinacre: Lovepenn, please email me directly at: update ~ winsteps.com . The SAFILE= bug is repaired in the current version of Winsteps.

536. Relative to Item or Measure

uve March 20th, 2011, 10:16pm: Mike,

I have been given data from a 77 item survey administered to about 140 students with a 6 point Likert scale going from strongly disagree to strongly agree. This is very new territory for me as I deal primarily with dichotomous data. While looking at the category probability curves, I switch between relative to item difficulty and relative to latent variable.

As I recall, relative to latent variable shows the categories for the item positioned on the scale determined by its overall difficulty level. Relative to item difficulty seems to center the difficulty of the item at zero and show the categories relative to this.

So we have two investigative tools here, but I'm not sure when I would use one over the other. Can you possibly provide an example when I would use one versus the other?


MikeLinacre: Thank you for your question, Uve.

When you conceptualize the rating-scale for an item as being relative to the item difficulty, then choose "relative". When you conceptualize it as being relative to the person distribution or latent variable, then choose "absolute".

An Andrich-model rating scale applies to all the items, so it is often convenient to depict it relative to item difficulty. But when we are asking "Where does Fred's ability place him on the rating-scale for item 6?" then we want the rating scale relative to the latent variable.

uve: I think I'm beginning to understand. Thanks again.

537. Do Measures Really Matter?

uve March 21st, 2011, 12:53am: Mike,

In most of my interactions with coworkers who have created surveys and are not familiar with Rasch, they are primarily interested in frequency of responses per item and not much more beyond that. They then correlate these frequencies to other data in order to perform regression studies and other analyses.

I am currently reviewing one such survey and am noticing many issues. I can explain that certain categories for an item(s) may need to be collapsed because of disordering, that some items have high misfit, that some of the respondents are reducing the predictive ability of the test, that some items reduce construct validity because they are not ordered in difficulty as hoped, that using a partial credit model may be more appropriate, etc., but I am likely to get only wide-eyed silence as a response from the surveyor. If I'm lucky, I may get a response more along these lines: "How will better calibration and measures affect the frequencies in my data? My respondents answered how they answered, and a Rasch interpretation isn't going to alter the data to mean something fundamentally different."

In the public K-12 school system, surveys are constantly being given without any thought as to the quality of the questions or even the construct being measured. Frequencies are tallied, averages reported and it typically ends there. I have the challenge of trying to communicate the fallacy of such processes to an audience that is not well versed in statistical concepts.

Do you have any suggestions about responses and/or publications well suited to such an audience and situations? I realize this is a very open-ended question, but I do appreciate any advice on the matter.


dachengruoque: Uve, I cannot agree with you more on that! Surveys or questionnaires are frequently applied in educational settings, in primary, secondary and tertiary contexts. As is most often found, most studies touch on the matter of construct validity by running factor analysis, presenting the results, then done.

uve: Yes, it is frustrating. I'm at a loss as to how to properly convey the need to look at data in this manner. You could say I'm trying to find a way to sell them on this process. It would be interesting to find an example of the interpretations of a survey before using Rasch concepts and what changed significantly after using them. Do you know of any such report?

MikeLinacre: Think of the same situation in measuring height. We can measure height by eye (= typical survey numbers) or we can measure height with a tape measure (= Rasch). We don't expect the two sets of measures to be very different. For many purposes the exactness of a tape measure gives us no additional benefit. But sometimes the tape measure makes important corrections to the eye-measures. We usually don't know until we have measured by tape.

Typical situations include negatively worded items that have not been reverse-coded, or that have been double-reverse coded. Rating scales with ambiguous, duplicative, or semantically disordered categories. Hugely off-dimensional items that are degrading the meaning of the scores.

Typical survey reporting is a set of 2x2 cross-tabs, and the researchers focus on the most statistically significant of the cross-tabs. This is an atomized approach to understanding the survey results. This approach leads decision-makers to try to optimize pieces of a system (often at the expense of other pieces).

Rasch constructs an overview, so that the meaning of each item, and each person's responses can be seen in the context of a "big picture". This leads decision-makers to try to optimize the entire system.

We can see a parallel situation in engineering between the methodology of Genichi Taguchi and that of the Detroit statisticians. See "Taguchi and Rasch" - https://www.rasch.org/rmt/rmt72j.htm

uve: thanks again!

538. Questions: Tables 2.x, and DIF on Person Labels

brownb March 17th, 2011, 8:38am: Hello Mike:

I have two questions on the functioning of Winsteps. I am analyzing an old data set of 56 items (CODES=12345) by 198 persons. To reduce the number of response categories from five to three, I used

CODES = 12345
NEWSCORE = 12233

however some of Table 2.x (Tables 2.1-2.7) use 124, while the subtables use numbers 123. Should I recode differently?

My second question deals with DIF analysis, and Excel presentation. In the same data set, I had used the number 9 to indicate missing data. In the same analysis I used CODES=12345, (NEWSCORE 12233), excluding 9 from the codes command. I imported data from SPSS, and identified 13 person label items. When I conduct a DIF analysis of a two-category item (DIF=@GENDER), the Excel graph has four lines identified as 125 and ".", and not lines for 12 and ".". I don't know where the 5 came from. And, then for a DIF analysis of another two-category item (DIF=@HOSP_EXP), the Excel graph has four lines (i.e., 129 and "."). I am assuming that the DIF analysis is also graphing 9 for the missing data. Is this correct? Many thanks.


MikeLinacre: Thank you for your questions, Barry.

Question 1: "however some of Table 2.x (Tables 2.1-2.7) use 124, while the subtables use numbers 123. Should I recode differently?"

Answer: The coding is correct. Some subtables show the rescored values, and other subtables show an example of the original codes, and some sub-tables show both!

In Table 2.1-2.7 it is an example of the original code
In Table 2.11-2.17 it is the rescored value
Please use the Tables that make most sense to you (and your audience).

Question 2: "I am assuming that the DIF analysis is also graphing 9 for the missing data. Is this correct?"

Reply: Yes. "9" in the response codes is "missing data". "9" in the demographic codes is another demographic group. There must also be "." and "5" codes in the demographics in the person labels.

1) In Excel, delete the lines you don't want to see
2) In Winsteps, use the "Specification" dialog box to select only demographic codes you want in the DIF report:
PSELECT=?????{12} ; if the demographic code is in position 6 in the person label

brownb: Hi Mike:

Thank you very much. Your response is greatly appreciated.


539. winsteps 3.69 default values

bcaapi March 18th, 2011, 2:25am: Good evening all. I've got a rather important question and I'm hoping for a quick response since this affects my work in an immediate way. I need to find the default values for the control variables in WINSTEPS 3.69.
All I can find in the documentation is that the values depend on the version we run, but no direct reference to what the values actually are. If anyone has any information on the topic please feel free to contact me by email or as a response here. Thanks in advance (hopefully).

MikeLinacre: Bcaapi, if you can run Winsteps 3.69, then "Output Files" menu, "Control Variables" lists all the values of all the control variables.

Most default values are also listed in the titles of the Winsteps Help pages. For instance,
LCONV= logit change at convergence = .005
.005 is the default value.

540. Table 3.2

mbe March 11th, 2011, 6:10pm: Hi, I have a question about partial credit items. Does this table give me the difficulty of each one of the categories of a partial credit item? What information does "Structure measure" give me? I have read the Help but I can't understand it.


MikeLinacre: Mbe, each category is an interval on the latent variable. So the challenge here is to define "the difficulty of each one of the categories". It is a range, but often summarized as one number. There are numerous different definitions of that one "category difficulty" number in the literature. Many are reported in the ISFILE= output of Winsteps. https://www.winsteps.com/winman/index.htm?isfile.htm in Winsteps Help

Structure measure reports the transition point from one category to the next (the Andrich threshold) according to one parameterization of the Partial Credit Model.

mbe: Thanks, one more question... the thresholds are disordered, but when I look at Table 13.3, there aren't problems. Subjects who answer correctly have, on average, more ability than those who answer partially correctly. What can be the reason?

MikeLinacre: Mbe, do you mean "the Andrich thresholds are disordered" or "the categories are disordered"?

If you mean "the Andrich thresholds are disordered", then the threshold estimates are dominated by the frequencies of the categories.
Cat 1: 80
Cat 2: 40
Cat 3: 10
Then, first approximation for
Cat1-2 threshold is ln(80/40) = 0.69
Cat2-3 threshold is ln(40/10) = 1.39 so thresholds are ordered
Cat 1: 10
Cat 2: 20
Cat 3: 80
Then, first approximation for
Cat1-2 threshold is ln(10/20) = -0.69
Cat2-3 threshold is ln(20/80) = -1.39 so thresholds are disordered.

The average ability of the respondents in each category augments this, so that,
Second approximation for a threshold =
log(frequency of lower category/frequency of upper category) + average ability in upper category - average ability in lower category

The expected frequency of observation of a category depends on the distribution of the person sample and the width of the category on the latent variable.

So the reason for disordered thresholds is usually a narrow category on the latent variable.
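
The two approximations above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic in this post, not how Winsteps estimates thresholds; the function names and the average-ability values in the last line are hypothetical.

```python
import math

def first_approx_thresholds(freqs):
    """ln(lower-category frequency / upper-category frequency) for each adjacent pair."""
    return [math.log(lower / upper) for lower, upper in zip(freqs, freqs[1:])]

def second_approx_thresholds(freqs, avg_abilities):
    """First approximation plus (average ability in upper category
    minus average ability in lower category), as stated in the post."""
    first = first_approx_thresholds(freqs)
    shifts = [upper - lower for lower, upper in zip(avg_abilities, avg_abilities[1:])]
    return [t + s for t, s in zip(first, shifts)]

def is_ordered(thresholds):
    """True when every threshold is higher than the one before it."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

# The two frequency patterns from the post.
for freqs in ([80, 40, 10], [10, 20, 80]):
    taus = first_approx_thresholds(freqs)
    label = "ordered" if is_ordered(taus) else "disordered"
    print(freqs, [round(t, 2) for t in taus], label)

# Hypothetical average abilities (logits) for categories 1, 2, 3.
print(second_approx_thresholds([10, 20, 80], [-1.0, 0.0, 1.0]))
```

The category frequencies alone can disorder the thresholds even when the average abilities advance with the categories, which is the situation described in the question.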

mbe: Thanks Mike!

541. Contrasts or T-Tests?

uve March 5th, 2011, 7:15pm: Mike,

Since Winsteps provides us the ability to choose t-tests to see if items vary significantly from each other, it's hard for me to see the advantage of using contrasts, which seem to essentially get at the same thing but are much harder to interpret, in my opinion.


MikeLinacre: Uve, Winsteps often provides several ways of investigating the same effect in the data. Please use those ways that make the most sense to you.

Winsteps users (like SAS users, SPSS users, R users, etc.) are constantly requesting the inclusion of more and more statistical tests and manipulations in the software. Software developers are continually adding new capabilities. No one uses all of the software's capabilities. We choose what makes sense to us and ignore everything else. OK, Uve?

uve: Mike,

I think both of these tools are very useful, it's just that contrasting is tough for me to get a handle on at the moment. I know this is an impossibly broad open-ended question, but I should have asked: for what investigative course of action is each best suited? I was just noticing that when I did find what might be a significant contrast, the t-tests showed significant results as well. I don't want to fall back on what I know well just because it's easier and I'm familiar with the process. I want to learn and grow and am willing to put the time in to get better. It's just difficult to do when you can't see why you would use one tool versus the other. I hope that makes sense.

MikeLinacre: Uve, are we talking about the DIF contrasts? Then these indicate the (logit) size of the effect. The t-test indicates its significance (probability). We are usually concerned about effects that are big enough to have a substantive impact (at least 0.5 logits) and improbable enough not to be caused by random accidents in the data (p<.05).

But we may require much bigger sizes (if there are many items) and much smaller probabilities (especially if we are doing multiple t-tests) before we decide that an effect is big enough to seriously impact decisions that will be based on these data.

uve: Mike,

I was referring to dimensionality. I've noticed that when you have a significant 2nd dimension that is, say, represented by 5 items, those 5 items will vary significantly from the other items when running the item subtotal or table 27. It seems both tables 23 and 27 tell the same story in a way. So I'm wondering what circumstances better fit using one versus the other.


MikeLinacre: Uve, apologies for the misunderstanding.

The PCA analysis was added to Winsteps because researchers noticed that approaches such as Table 27 (item subtotals) could be insensitive to pervasive multidimensionality. But if Table 27 works for you, then great :-) Table 27 is indeed much easier to understand and to explain to your audience.

uve: Could you provide an example of a situation in which Table 27 would not capture the possibility of a 2nd dimension in the data? By the way, I think I understand dimensionality conceptually. I just have a hard time looking at Table 23 and coming to definitive conclusions sometimes.

MikeLinacre: Uve, Table 27 (Item subtotals) requires us to know which items would load on 2nd dimensions. Also if there are two balanced subdimensions in the data, so that there are about the same number and same difficulty of items loading on each subdimension, then the Table 27 reports for the two subdimensions will look the same.

In Table 23.2, the two subdimensions will cause opposite contrasts for the two sets of items.

Does this help, Uve?

uve: Yes. Many thanks as always.

542. analyzing rating scales, 1 rater per target set

pjiman1 March 1st, 2011, 7:50pm: Greetings,
I appreciate your past assistance.

I was wondering if I could ask your assistance with a problem I am having with teacher rating data. I am working with a group to analyze rating scale data. The group collected rating scale data on a 35 question instrument designed to measure children’s Social and Emotional Competencies. There were 1500+ students in grades PreK through 5th grade, 257 teachers, and each teacher rated only a maximum of 6 students. The data are non-independent, students are nested within teacher and the number of students per teacher varies. Some teachers rated only 1 student, others rated 6 students. The student ratings are not crossed or linked with other teachers. In other words, no other teacher rated the same student.

These data are problematic because we do not know how much of the variance in the ratings is due to the items, students, or teachers. Without the crossed design, I cannot tell if teachers are more lenient/severe when making ratings, if teachers are differentiating between items, and how much variance in the ratings is between students, which is the variance that we are most interested in. I’ve used MPLUS with a cluster factor to conduct a confirmatory factor analysis, but based on my readings, all that does is adjust the standard error of the estimates. It did not partition the variance. I tried setting up a model in HLM, but due to my limited program ability, I could not get the analysis to run properly. I tried using SPSS to run a variance components analysis, but here too my limited knowledge of that statistical analysis prevented me from conducting the analysis. I’ve been reading about the correlated trait-correlated method (CT-CM) structural equation model as cited in Konold & Pianta, 2007 but in those models, there are at least 2 raters per target. What is the analysis method if there is only 1 rater per set of targets?

My goal is to determine how much variance in the ratings is due to teacher and student and whether the teachers are differentiating between the items. Ideally, we want to see the variance between students to be large.

However, we cannot be the only group with this problem. In many studies that examine the impact of a school-based classroom curriculum intervention on students, teachers deliver the program curriculum to the students and then rate how well the students in the classroom responded to the program. Often there are 20 to 40 students in a classroom that were rated by a single teacher. Most, if not all, studies ignore the dependence of the data and the clustering issue and assume the data are independent ratings. In other settings, I am sure that say in the medical profession, the supervisor is the only one who rates a set of 10 medical interns, or there is one employee supervisor rating a group of 15 employees. In other words, not every setting has the benefit of having two raters per target. But I am having trouble locating articles that have dealt with this problem of nested/dependent rating data when there is only one rater for a set of targets, at the data collection design and at the statistical analysis level.

My questions are:
1) Is my description clear and am I raising the right issues?
2) Is there any statistical analysis that can address this issue of determining the amount of variance in the ratings associated with teacher and student sources, and whether the teachers are differentiating these items? I’ve tried using FACETS Rasch software, but without a crossed design, I cannot estimate the rater effects. I was thinking about using HLM but was not sure how to set up the equations for level 1 and 2.
3) Is there any data collection design factor that can be used to address this problem? In elementary schools, it would be ideal if there were 2 teachers to rate each student within the same classroom setting. However, this situation rarely occurs. Usually, it would be 2 teachers (regular classroom teacher and say a gym or art teacher) in 2 different contexts that could rate the same student. But then context effects would impact the variance in the ratings. Is there any other design factor in school settings that might be useful in addressing the problem of determining the impact of rater variance on ratings of students?

I appreciate your consideration and time.

MikeLinacre: pjiman1, unfortunately you have a common situation. The ratees are nested within the raters. No matter what statistical technique is used, an assumption of (random) equivalence must be made.

We could say that the raters have equal leniency. Or we could say that the groups of ratees have equal mean ability. Or we could say that the classrooms are equal. Or we could say that the schools are equal. Or ...

1) Yes.

2) In the standard Facets analysis, set all the raters to have the same severity (anchor them at zero). This will give as meaningful results as any other assumption, except for decisions regarding the relative leniency of the raters and the relative ability of different groups of students.

3) Unfortunately nothing practical without a lot of expense and effort.

So, the analytical process is:
a) decide which one effect you want to estimate now.
b) set up the analysis so that the assumptions impact this effect as little as possible
c) perform the analysis and report the effect
d) decide which one effect you want to estimate next.
e) back to b)


pjiman1: Thanks Mike,

I appreciate your prompt response. Your response confirms what I had suspected and I need to convince my group that if they want to analyze rating scale data, they need to plan ahead of time rather than rush into data collection.

follow-up comment: based on this and other articles I've read, it would appear that we should be very skeptical of the validity of the rating scale data when there is only 1 rater per set of targets and when there is no rater crossing plans and/or no covariates to describe the qualities of the rater. Basically it is not just difficult, but virtually impossible to determine if the results we are seeing are actually due to the variations in the latent trait that are attributable to the target or due to rater effects or other construct irrelevant variance. Any future study that I see that uses only 1 teacher to rate 30 students in a pre-post intervention design and use the results as evidence of impact should be treated with much skepticism.

no need to respond Mike, I know you are very busy, just wanted to air out that comment.

Thank you for your valuable consultation.

Much appreciated.

MikeLinacre: Exactly right, Pjiman1. With only one rater and no crossing, there is little opportunity for quality control. We can only hope that everything was done by everyone in the way we intended.

An example was the awarding of "Nurse of the Year" based on ratings by the Nursing Supervisors. The nurses were all about equally competent, and all the supervisors were assumed to be rating to the same standard, so the nurse with the most lenient supervisors won!

pjiman1: thanks mike, much appreciated. while I'm glad my suspicions are confirmed, there is a lot of work to do to fix this problem. thanks!

543. interpreting high negative item infit and outfit

pjiman1 March 2nd, 2011, 4:49pm: Hi Mike,
sorry for another question.

I ran Rasch analysis on set of items. I have two items with high negative infit and outfit values. According to diagnosing misfit, these values would indicate item redundancy and responses are too predictable.

I am looking at items 24 and 28.

However when I look at table 10.1, the item difficulty measures for those two items are different (.70 and .96). I looked at the PCACOMP table 23.99 and could not find correlated residuals between those two items. I looked at the PCA table 23.2 and could not find the two items clustering together.

what do the high negative infit and outfit values for items 24 and 28 mean and if I can't find the source of the redundancy, what output should I be referring to when locating the source of why the responses to the items are too predictable?

output attached.

MikeLinacre: Pjiman1, the high negative significance statistics indicate that the misfit is unlikely to be caused by chance. The mean-squares are about .7 indicating that 70% of the variance in the responses (away from their Rasch expectation) is due to chance, and 30% is predictable from other responses.

Now we need to investigate why that 30% is predictable. The first place to look is the content of the items: "resolves disputes", "takes responsibility". Do these comments somewhat summarize clusters of other items? If so, then the over-predictability is explained.

Another possibility for over-predictability is range restriction on the rating scale (most ratings in only one or two categories for an item).

But we see that the observed correlations 0.90 are bigger than the expected correlations 0.87. These correlations suggest that these items are acting somewhat like switches between "high performers" and "low performers". Perhaps in the minds of the raters, these are summary items that capture the difference between "good" and "bad" performers.

pjiman1: thanks mike, much appreciated.

544. out of range t-value

geethan March 3rd, 2011, 6:31am: I'm using Winsteps to analyze my data from a 50-item MCQ test. My items are spread from t=-4 to +6. How do I interpret this, so that I can modify/change my items to be within the -2 to +2 range?

MikeLinacre: Angie, does "t" mean "standardized fit statistic"?
Then t=-4 to +6 indicates that you have many observations of each item. This can be reduced to t = -2 to +2 by sampling from your data.

First eliminate any grossly misfitting items (mean-squares >2.0). Then please look at your mean-squares. Use the plot in the middle of https://www.winsteps.com/winman/index.htm?diagnosingmisfit.htm (in Winsteps Help). This indicates the number of observations of each item (= d.f.) to produce the desired t-statistics corresponding to the observed mean-square values. Probably sampling 20% of the responses to each item will produce the results you want.
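
To see why sampling shrinks the t-statistics, here is a rough Python sketch. It assumes the Wilson-Hilferty cube-root transformation from mean-square to standardized statistic, and it assumes the model S.D. of the mean-square is sqrt(2/d.f.). Winsteps computes its own value from the model distribution of the responses, so its numbers will differ somewhat; the function name is hypothetical.

```python
import math

def zstd_from_mean_square(mean_square, df):
    """Approximate standardized fit statistic for a mean-square with df observations.

    Assumption: q, the S.D. of the mean-square, is taken as sqrt(2/df).
    """
    q = math.sqrt(2.0 / df)
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

# The same mean-squares become less "significant" as observations are sampled down.
for df in (1000, 200, 50):
    over = zstd_from_mean_square(0.8, df)   # overfit item
    under = zstd_from_mean_square(1.2, df)  # underfit item
    print(df, round(over, 2), round(under, 2))
```

With this approximation, a mean-square of 1.2 is well beyond t = +2 at 1000 observations but inside the -2 to +2 range at 200, which is the effect of sampling about 20% of the responses.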

geethan: thanks for the feedback. will try it out.

545. Probability Problems

uve February 28th, 2011, 3:23am: Mike,

How do you suggest I tackle the problem of the multiplicative issues of probability as displayed in the output of the Person Subtotal table t-tests when encountering more than two groups? For example, I have included ethnicity codes in my control files for each person. There are 8 possible categories. If I run a subtotal for ethnicity, Winsteps compares each group with one another. This is fine for initial investigations of which two groups, if any, are significantly different from one another. However, this increases the probability of a Type I error. Let's say there are only three ethnicity groups. In t-tests that use a .05 significance level, the probability of no Type I errors is .95. However, since we have three groups, this would be .95*.95*.95=.857. So in reality, the probability of making a Type I error goes from .05 to .143, far beyond the acceptable .05 level. After all, this is why we have ANOVA tests for more than two groups and why it is generally considered bad practice to use multiple t-tests. If an ANOVA comes back with a significant difference among the groups, orthogonal contrasts can be undertaken to see which groups differ from each other without violating the multiplicative effect of probability. Post hoc tests can also be used, but these can violate the effect also.

So my question is how to deal with this issue when interpreting the t-tests in Winsteps when more than two groups are being compared.


MikeLinacre: Uve, we need to state our null hypothesis precisely.
If the null hypothesis is:
"There is no difference between group A and group B", then use the Winsteps t-test value.
If the null hypothesis is:
"There is no difference between any pairs of groups", then we need to make a Bonferroni (or similar) correction.
So, for Bonferroni, instead of p<.05, the critical value becomes p<.05/(number of pairings)

I considered incorporating this into Winsteps, but it became too complicated. There are many possible null hypotheses. For instance, another reasonable null hypothesis: "There is no difference between any minority group and the majority group".

Statistical authors point out that the Bonferroni correction may be misleading. They suggest Benjamini and Hochberg (1995), see https://www.winsteps.com/winman/index.htm?bonferroni.htm in Winsteps Help
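
The corrections discussed here are done outside Winsteps. A minimal Python sketch follows, assuming the pairwise tests are treated as independent (as in the .95*.95*.95 calculation above). The group labels and p-values are made up for illustration, and the Benjamini-Hochberg function is a plain implementation of the step-up procedure, not Winsteps output.

```python
from itertools import combinations

def familywise_error(alpha, n_tests):
    """Probability of at least one Type I error across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test critical p-value under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected

groups = ["A", "B", "C"]                  # hypothetical group codes
pairs = list(combinations(groups, 2))     # 3 pairings
n_tests = len(pairs)

print(familywise_error(0.05, n_tests))    # 1 - .95^3, the .143 in the question
print(bonferroni_alpha(0.05, n_tests))

p_values = [0.010, 0.030, 0.200]          # hypothetical pairwise t-test p-values
print([p < bonferroni_alpha(0.05, n_tests) for p in p_values])
print(benjamini_hochberg(p_values))
```

With 8 ethnicity codes there are 28 pairings, so the Bonferroni critical value becomes .05/28. In the made-up example, Bonferroni flags only the first pairing while Benjamini-Hochberg flags the first two, which illustrates why the choice of null hypothesis and correction matters.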

uve: Many thanks

546. Rating Scale Model or Partial Credit Model

lovepenn February 28th, 2011, 7:32pm: Dear Mike,

I have read the questions and answers posted in this forum that are relevant to the choice of RSM vs PCM. But I still have a difficulty choosing which model I should apply to my data and would like to ask for your advice.

My data:
(1) sample size of about 7000 people
(2) 7 items, all with 3 response categories (0, 1, and 2)

It seems to me that 7 items share the same substantive rating scale, so I thought that RSM would be an appropriate model for my data. But I first conducted both RSM and PCM to see if there are any meaningful differences between the results obtained from these two models. I found some differences in the hierarchy of item difficulty (i.e. the location of two items were reversed). And also, threshold measures obtained from PCM (structure calibration in Table 3.2) were quite varied across 7 items: ranging from -.23 and .23 for item # 5 to -1.45 and 1.45 for item # 7. The threshold measures from RSM are -.85 and .85.
Somewhere in this forum, you said, “If the thresholds of the items are definitely not the same, then RSM is not a suitable model. RSM is only a suitable model if the thresholds may be the same across all items.”
Then, would it be more appropriate to use PCM for my data?
Thank you, in advance, for your time. Your comments are always most helpful.

MikeLinacre: Lovepenn, your Andrich thresholds certainly differ across items, but they can be misleading to our eyes. Compare the model ICCs (Winsteps Graphs menu, multiple ICCs). You will see that the ICC for the .23 item is steep, and the ICC for the 1.45 item is flat. The rating scales for the items discriminate noticeably differently. This was the motivation for developing PCM.

In general, RSM vs. PCM can be a complex decision. See www.rasch.org/rmt/rmt143k.htm
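
The steep-versus-flat comparison can also be checked numerically. Below is a minimal Python sketch of the Rasch-Andrich category probabilities for a 3-category item, using the threshold values quoted in the question. The item difficulty is set to 0 purely for illustration, and the function names are hypothetical. The slope of the model ICC (expected score curve) equals the model variance of the rating at that point.

```python
import math

def category_probabilities(theta, difficulty, thresholds):
    """Probabilities of categories 0..m for a person at theta."""
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - difficulty - tau))
    peak = max(logits)                      # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, difficulty, thresholds):
    """Model ICC: expected rating at theta."""
    probs = category_probabilities(theta, difficulty, thresholds)
    return sum(k * p for k, p in enumerate(probs))

def icc_slope(theta, difficulty, thresholds):
    """Slope of the model ICC at theta = model variance of the rating."""
    probs = category_probabilities(theta, difficulty, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

for label, taus in (("PCM item 5", [-0.23, 0.23]),
                    ("RSM, all items", [-0.85, 0.85]),
                    ("PCM item 7", [-1.45, 1.45])):
    print(label, round(icc_slope(0.0, 0.0, taus), 3))
```

The closer together the thresholds, the steeper the ICC at the item's center, so item 5 discriminates more sharply than item 7, and the single RSM scale falls between them.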

547. Item Subtotals

uve February 28th, 2011, 3:01am: Mike,

Though not as important as person subtotals, I am interested in whether or not there is significant difference between categories of dichotomously scored items. For example, is there a significant difference between five items categorized under "Number Sense" and another 5 under "Measurement & Geometry" for one of our math tests. I'm aware of the subtotal coding, i.e. $S1W3, etc., but I was hoping to create item "grouping" labels that could be selected much like person labels that appear as choices in the dropdown menu.


MikeLinacre: Uve, the procedure is the same for person and item subtotals. You need a code in the item labels for each item category. Suppose it is in column 3 of the item labels, then, for instance,
27 ; item subtotals

548. Table 32

uve February 27th, 2011, 9:56pm: Mike,

I was attempting to look at the control variable list in Winsteps by selecting Table 32, but nothing happened. I then attempted to run a variety of other tables but none of them would appear either. I had to close out Winsteps and start again. This is not that critical, but I'm just wondering if I need to do something different.


MikeLinacre: Apologies, Uve. This sounds like a bug in Winsteps. What is the version number, ?
The control variable list is also the top item on the "Output Files" menu.

uve: Yes,

MikeLinacre: Uve, I can't reproduce this bug. Table 32 outputs correctly for me every time with Winsteps
Have you changed the value of ASCII= or are you using a different text editor, not NotePad?

549. Information Clarification

uve February 23rd, 2011, 8:57pm: Mike,

In the Winsteps Help menu, Infit and Outfit calculations mention "information." Is this the same information mentioned in determining standard error, or is this variation? If it is variance, is this variance of the item logit measures from the test mean, or is this the variance of the probability from the mean probability of the test? I'm curious because the Bond/Fox book explains that for the Infit, the standardized residuals for items are weighted by their variance.


MikeLinacre: Uve, the Fisher information in a binomial observation is the inverse of the model variance of the observation around its expectation - see https://www.rasch.org/rmt/rmt203a.htm

If X is the observation, E is its expected value (according to the Rasch model), V is the model variance of X around E (according to the Rasch model), then

Fisher information in the observation, I = 1/V
Residual R = (X-E)
Standardized residual, S = R / sqrt(V)
Infit of a set of responses = sum (S^2 * V) / Sum(V) = Sum (S^2 / I) / Sum (1 / I)

And the S.E. of a measure estimate is S.E.(Rasch measure) = 1 / sqrt(Sum(V)) = 1 / sqrt(Sum(1/I))
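
These quantities can be worked through for one person on a few dichotomous items. The Python sketch below is illustrative only: the ability, item difficulties and responses are made-up values, the function names are hypothetical, the outfit line (the unweighted mean of S^2) is added for comparison, and the standard error is taken as 1 / sqrt(Sum(V)).

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of success, in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def person_fit_and_se(responses, ability, difficulties):
    """Return (infit mean-square, outfit mean-square, S.E. of the ability measure)."""
    expected = [rasch_probability(ability, d) for d in difficulties]    # E
    variance = [p * (1.0 - p) for p in expected]                        # V = 1/I
    sq_resid = [(x - e) ** 2 for x, e in zip(responses, expected)]      # R^2
    # Infit: sum(S^2 * V) / sum(V), which simplifies to sum(R^2) / sum(V)
    infit = sum(sq_resid) / sum(variance)
    # Outfit: mean of the squared standardized residuals S^2 = R^2 / V
    outfit = sum(r2 / v for r2, v in zip(sq_resid, variance)) / len(responses)
    se = 1.0 / math.sqrt(sum(variance))
    return infit, outfit, se

# Hypothetical person: ability 0.5 logits, five items, one unexpected failure on the easiest item.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [0, 1, 1, 1, 0]
print(person_fit_and_se(responses, 0.5, difficulties))
```

Because infit weights each squared standardized residual by V, responses to well-targeted items dominate it, while outfit is more sensitive to surprises on items far from the person's ability.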

OK, Uve?

uve: Mike,

As always, thanks for your help and patience with me. Yes, this all makes sense.


551. Weighty matter

C_DiStefano February 1st, 2011, 5:56pm: Hello,
I have a question - I have some physical activity items that I want to analyze. We are interested in seeing how well a set of items fall out along a latent construct of 'Physical Activity' --rating items from low to high in terms of activity level.

Currently children are selecting if they are engaging in different activities (Y/N or 1/0 right now) over the past 5 days; however, some activities are more vigorous than others (e.g., walking vs. swimming). Right now, the construct that is being measured seems to be "frequency of activity" without regard to the type or intensity of the activity.

I would like to build this in and have a rating for each activity (from 1.5 to 7), where a higher number means more intense. So, walking would get a 2.5 where swimming laps may get a 5.

I was exploring the command for using weighting (IWEIGHT), but the WINSTEPS help describes this as having the item "count" more times in the dataset.
Instead, would it be accurate to replace the Y with the item intensity rating and use something more along the lines of a Rating Scale/Partial Credit Scale? This would in effect create a rating for each activity along a continuum from 1.5 to 7 (higher = more activity) and children would get a rating of 0 if they don't participate (N) and the rating level if they do participate in the stated activity(Y).

Any advice is appreciated.
thanks again

MikeLinacre: cd, if your ratings are from 1.5 to 7, please multiply them by 2, so that they are integers in the range 3 to 14. Rasch works on counts.

Weighting vs. Rating Scale categories. These are different psychometrically. If the additional weight is intended to indicate a higher level of performance, then use a rating scale. If the additional weight is intended to indicate replications of the same level of performance, then use item weighting.

C_DiStefano: Thanks for the suggestion. I think that the rating scale makes the most sense. I ran a preliminary run ordering the 20 activities according to the 'level' of the activity, where a higher weight number = more strenuous activity.

If the values are multiplied by 2 and the scale goes from 3 to 14 (with many 0's for not participating in an activity), will WINSTEPS have a problem analyzing a rating scale that has so many points? I say this because I ran some preliminary activities and found some strange ordering when I looked at the Item map (e.g., reading a book higher on the map than jump rope but lower than soccer -- where jump rope should be the most strenuous activity).

thanks for your input

MikeLinacre: CD, the item difficulty is the point of equal probability of highest and lowest (observed) categories. This can be over-ridden with SAFILE=. You can instruct Winsteps how to process unobserved categories.

STKEEP=YES is the default. This says "keep unobserved intermediate categories as sampling zeroes". They are maintained as unobserved, but substantive, performance levels. Unobserved extreme categories are omitted.

STKEEP=NO. This says "unobserved intermediate categories are structural zeroes, so they do not represent performance levels." Winsteps renumbers the categories using only observed categories. Unobserved extreme categories are omitted.

There is a third option in the most recent Winsteps release, 3.71.0. This says, "model the rating scale with a smooth polynomial function across the specified range of categories." Unobserved intermediate and extreme categories within the specified range are kept.
1 3 14 ; item 1 (and all items in its item-group) has category range from 3 to 14
SFUNCTION = 4 ; model the rating-scale thresholds with a 4th degree polynomial (= mean + variance + skewness + kurtosis)

C_DiStefano: I'll give this a try - thank you.
Now I'm back to questioning if it really is a rating scale. The data look like this (least to most strenuous activity)
Read 0 (didn't do it in past 5 days) 1 (did it, but a low weight)
Walk 0 3 (bit higher weight)
Swim 0 5 (higher weight - because a more intense activity)

But, any one item can only take two values 0 (N) or a number (Y) where the number reflects intensity. Across the set, the range will go from 3 to 14.

So, I think I'm a bit puzzled. Rating scales that I have dealt with in the past have used a Likert scale, for example. And, for an item, subjects across the dataset used the whole scale. Here, it is basically just dichotomous data, where scores will either be 0 or a number.

What is ultimately wanted is that a child who Swims only may get a higher measure score than one who Walks & Reads based on the intensity. (Instead of using these as just counts of number activities participated in where participating in 1 activity is "lower" than 2)

Will the methods that were discussed account for something like this?

Thanks - sorry, the more I learn, the more complex it gets!
thanks again,

MikeLinacre: CD, is this the problem in another setting: a criminal is worse to have murdered once, but never stolen, than to have stolen ten times, but never murdered? If so, we need to use a different data design.

This "criminality" problem was a topic of George Karabatsos' Ph.D. dissertation. As I recall, his solution is to code unobserved lower-severity crimes as missing data, rather than 0. The logic was: "this criminal clearly had the mindset that could have done this lesser crime, if the circumstances had been different."
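One possible reading of that recoding, sketched in Python. The function name and the item ordering (least to most intense) are assumptions for illustration, not taken from the dissertation: every 0 on an item below the person's highest endorsed item becomes missing, while zeros above it stay as 0.

```python
def recode_lower_zeros_as_missing(responses, missing=None):
    """Recode zeros below the highest endorsed item as missing.

    `responses` is one person's data, ordered from the least to the most
    intense (or severe) item. A value of 0 means "not done"; any non-zero
    value means "done". This ordering is an assumption of the sketch.
    """
    # Index of the most intense item the person endorsed (-1 if none).
    highest = max((i for i, r in enumerate(responses) if r), default=-1)
    return [
        missing if (r == 0 and i < highest) else r
        for i, r in enumerate(responses)
    ]


# Items in the thread's order: Read, Walk, Swim (weights 1, 3, 5).
print(recode_lower_zeros_as_missing([0, 0, 5]))  # child who only swims
print(recode_lower_zeros_as_missing([1, 3, 0]))  # child who reads and walks
```

With this coding, the child who only swims is no longer penalized for the unobserved easier activities, which is the behaviour asked for in the question.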

C_DiStefano: Thank you for the responses - (sorry for the late feedback, been sick lately).
I'm still puzzling on this one and will try the suggestions that you posted. But, you are right - it is similar to the criminality problem that you stated, but here it is "better" to jump rope than walk...the weighting may help to treat this 'intensity' and I'll try the missing values instead of 0.

On another note, when running the RSM (constraining item mean to 0 w/a 4 category Likert scale, for example), the summary profile information provides average measure.
I understand that in a binary case, this may be interpreted as the average measure score for people selecting a 1...but, --Given that there are many categories, how is this average person measure interpreted?

Thanks - I am enjoying learning!

MikeLinacre: CD, for rating scales, the "average measure" is "the average measure of people selecting the category".

552. Probability Output

uve February 12th, 2011, 6:30pm: Mike,

Does Winsteps provide a way to output the exact probability of answering a question correctly at a given ability level for an item on a test? Or can Winsteps produce the equation of the ICC so that I can plug in the ability level and get the probability output?


MikeLinacre: Uve, Winsteps outputs the ICC on the Graphs screen and also in the GRFILE=.
For dichotomous items, it is probably easiest to use Excel:
Probability = 1 / (1 + exp(difficulty - ability)) in logits
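The same formula as a minimal Python sketch, for readers not using Excel. The function name is made up for illustration; ability and difficulty are both in logits.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success on a dichotomous Rasch item (logits)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


# A person whose ability equals the item difficulty has a 50% chance.
print(rasch_probability(ability=1.2, difficulty=1.2))
```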

uve: Mike,

I copied the probability curve data into Excel, but it does the opposite of what I need. That is, it seems the range of probability is kept the same and the logit abilities adjust depending on the item. I would like the ability levels to remain the same and the probability levels to change. I also attempted to use the GRFILE, but for some reason it only outputs one item. Am I doing something wrong?

I am attempting to replicate an interesting calculation used by our state in order to check whether common items have drifted too much between forms. They call it weighted root mean square difference. They calculate the difference in probabilities of an item between forms at a specific range of abilities from -3 to 3. Each difference is squared. These squares are then added and you calculate the square root of the total, if I read the equation correctly. If the value is greater than .125, the item is removed from the common item set.

I'm trying to find something in Winsteps that will produce the probabilities for each item in the range of -3 to 3. If it goes beyond this range, that won't be a problem as I'll just trim off the extremes. As always, thanks so much for all your help!


MikeLinacre: Uve,

GRFILE= reports the probability for an item of difficulty 0 logits. So you would need to look up the probability for an ability of (b-d).

The probability-difference computation appears to be biased against items with difficulty close to 0. They can have big differences in probability. Items with difficulties far from 0 logits cannot have big probability differences in the range -3 to +3 logits.

But the computation sounds like it needs an Excel spreadsheet. Put the common item difficulties in the rows, and the target abilities in the columns, and the formula 1/(1+exp(d-b)) in all the cells.
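The spreadsheet layout can also be sketched in Python. Everything specific here is an assumption for illustration: the 0.5-logit ability step, the item difficulties, and the uniform weights (the state's actual weights are not given in this thread). The difference statistic is computed as a weighted root mean square, which may not match the state's formula exactly.

```python
import math


def rasch_probability(ability, difficulty):
    """Dichotomous Rasch probability of success; both arguments in logits."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def probability_grid(difficulties, abilities):
    """Rows = item difficulties, columns = target abilities."""
    return [[rasch_probability(b, d) for b in abilities] for d in difficulties]


def weighted_rms_difference(d_form1, d_form2, abilities, weights=None):
    """Weighted root-mean-square difference in probability between two forms.

    Uniform weights are assumed when none are supplied.
    """
    if weights is None:
        weights = [1.0] * len(abilities)
    squared = sum(
        w * (rasch_probability(b, d_form1) - rasch_probability(b, d_form2)) ** 2
        for b, w in zip(abilities, weights)
    )
    return math.sqrt(squared / sum(weights))


# Hypothetical ability grid from -3 to +3 logits in steps of 0.5.
abilities = [-3.0 + 0.5 * k for k in range(13)]

# Hypothetical common-item difficulties on each form.
form1 = {"item_A": 0.0, "item_B": 2.5}
form2 = {"item_A": 0.4, "item_B": 2.9}

for name in form1:
    value = weighted_rms_difference(form1[name], form2[name], abilities)
    print(name, round(value, 3))
```

Each printed value would then be compared with the .125 criterion mentioned above.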

uve: Mike,

I may be getting sidetracked with the GRFILE issue because its output looks exactly like what I need. I've attached a copy of the output of a 50 item multiple choice test. There should be 50 items here but only 1 item is produced by the GRFILE option under the Output Files menu. When I ran it for a 23 item Likert scale test, all 23 items appeared. Seems odd. As far as doing this manually, I know I can take persons with abilities from -3 to 3 on the item in question and plug them into the equation. However, I don't know the intervals I need to use. For example, after -3 do I choose -2.8, or -2.85 or -2.7, etc. for the next ability level? I suppose I could just look at the PFILE and use all measures from -3 to 3, but the GRFILE appears to do this all for me, so it's a bit disappointing that it only gives me one item.

Equation clarification: I thought probability was (exp(b-d))/(1+exp(b-d)). Yours appears different, or do we get to the same place using either one?
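For reference, the two forms are algebraically identical. Multiplying the numerator and denominator by exp(-(b-d)), with b the person ability and d the item difficulty in logits:

```latex
\[
P \;=\; \frac{e^{\,b-d}}{1 + e^{\,b-d}}
  \;=\; \frac{e^{\,b-d}\, e^{-(b-d)}}{\left(1 + e^{\,b-d}\right) e^{-(b-d)}}
  \;=\; \frac{1}{e^{\,d-b} + 1}
  \;=\; \frac{1}{1 + e^{\,d-b}}
\]
```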

MikeLinacre: Uve, unfortunately GRFILE= is not doing what you want. It reports as though the item difficulty is zero. The item number is only for classification, not for its difficulty.

Winsteps Graphs menu
Multiple ICCs
Select all model ICCs
Display them (if they overlay, then click on "Absolute x-axis")
Paste into an Excel spreadsheet.

Does this work for you, Uve ?

uve: I'll give this a try as well. There was a weighting piece I left out of the calculation. I contacted the state and spoke to Dr. Tim Gaffney. He's been providing me a lot of assistance over the last year and was very gracious with his time, as you are with yours. I'll send you the correct file with his permission.


553. Sampling Affecting model fit?

harmony February 13th, 2011, 11:21am: This question actually has 2 parts. The first has to do with what the best sampling method is for a particular purpose. The second has to do with whether or not one of those methods might cause the data to misfit the model.

A placement test has say 1000 examinees. Of those 500 are placed in level 1, 200 are in level 2, 100 are in level 3, 50 are in level 4, and 50 are in level 5.

In performing test analysis, is it safe to take samples of 30 from each group in an effort to see how well the test performs for the different levels of placement?

Or should the sample be representative (ie: 10% of each- 50 Level 1, 20 level 2, 10 level 3, and 5 each of levels 4 & 5).

Also, is there some minimum number of students from a level that need to be represented in the sample in order to have reliable data on that level? Is 13 too small when you have a sample of 240?

Secondly, would the first sampling technique affect the measures in any negative way? In particular might it lead to model misfit?

MikeLinacre: Harmony, generally we want as many observations as possible, and at least 30 observations of anything (person or item) for robust measurement: https://www.rasch.org/rmt/rmt74m.htm

If the misfit in the data to the Rasch model varies across levels, then varying the sample sizes at different levels will vary the fit of the data to the model.

You probably want the final set of Rasch measures to be as independent of level as possible, so an equal sample size for each level will prevent level 1 from dominating the estimation.

If the level is coded into the examinee identification label, then you can report sub-totals by level. This will indicate how the misfit is stratified across levels.

Suggestion: analyze each level separately to verify that the measures for the level are valid. Then combine the data into one analysis. Do a DIF analysis of items x level to identify items that change their meaning across levels.
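Drawing an equal-sized random sample from each level can be sketched in Python as below. The examinee ids, the record layout, and the seed are hypothetical; the level counts and the sample size of 30 are the ones quoted in this thread.

```python
import random


def equal_sample_per_level(examinees, per_level=30, seed=0):
    """Draw the same number of examinees at random from each level.

    `examinees` is a list of (examinee_id, level) pairs. If a level has
    fewer than `per_level` examinees, all of them are taken.
    """
    rng = random.Random(seed)
    by_level = {}
    for examinee_id, level in examinees:
        by_level.setdefault(level, []).append(examinee_id)
    return {
        level: rng.sample(ids, min(per_level, len(ids)))
        for level, ids in sorted(by_level.items())
    }


# Level counts from the question: 500, 200, 100, 50, 50.
counts = {1: 500, 2: 200, 3: 100, 4: 50, 5: 50}
examinees = [
    (f"L{level}-{n:03d}", level)
    for level, count in counts.items()
    for n in range(count)
]

sample = equal_sample_per_level(examinees, per_level=30)
for level, ids in sample.items():
    print(level, len(ids))
```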

harmony: As always, thanks a million Mike.

I was reading about DIF in the manual recently and have been wanting a good excuse to try it out. Now I have one. Thanks for your reply and advice.

MikeLinacre: Have fun, Harmony :-)

554. test calibration

renato February 8th, 2011, 4:41pm: Hi.

I am a beginner in Rasch and Winsteps. I made a digital version of a traditional test for evaluating reasoning, with 43 items. I want to understand whether it needs more or fewer items for a good calibration of the test, but I don't see how to do this directly with Winsteps.

How many items require my test for a good efficiency? Can Winsteps provide analysis to help me with calibration?



MikeLinacre: Thank you for attaching your Winsteps control and data file, Renato.

Overall, your analysis looks good. Its person "test" reliability is 0.88 (Winsteps Table 3.1). This is close to the generally recognized standard of 0.90.

The person-item map (Table 1.0) suggests that two or so more items with difficulty "0" (near items I28 and I25) would be beneficial. They would increase the reliability, and also help discriminate between the persons around 0.4 logits. Items I1 and I2 are really easy and could be omitted. Only 3 people are down at their level. OK, Renato?

TABLE 1.0 reasoning ZOU142WS.TXT Feb 8 17:17 2011

4 +
A |
3 A A +
A |
A T|
A A A | I41
| I39 I43
2 A A A A +
A A A A A A A | I29 I38 I42
A A A A A A |S I33
A A A A |
A A A A A A A A A A A A A | I31 I32 I36
A A A A A A A A S| I22 I30 I40
1 A A A A A A A A A + I23 I35
A A A A A A A A A | I26
A A A A A A A A A A A A A A | I20 I21 I24 I27
A A A A A A A A A A A A | I34 I37
A A A A A A A A A A A A A A A A A A A A A A A A A A | I28
A A A A A A A A A A A A A A A A A A |
0 A A A A A A A A A A A A A A A A A M+M I25
A A A A A A A A A A A | I14
A A A A A A A A A A A A | I15 I16 I17
A A A A A A A A A A A A A A A | I13 I18
A A A A A A A A A A A A A A A A A A |
A A A A A A A A A A A A |
-1 A A A A A A A A A +
A A A A A A | I8
A A A A A A A A S| I19
A A A A A A A | I11
A A A A A A A |S I5
A A A A A A | I4 I6
-2 +
A A A A | I12 I3 I9
A A |
A A A |
A A A A A | I7
-3 +
A A |T
| I2
A A |
-4 +
A | I1
-5 +

renato: Thanks a lot for your attention, Mike.

Ok, I know that I can use visual inspections, outfit, infit, reliability, point-biserial correlations, PCA, then remove the suspect items and run Winsteps once again, etc. Winsteps is an excellent tool to support research by providing good indicators of misfit in the data for custom analysis. But my real question is whether there is an automatic analysis tool in Winsteps that suggests what to do to improve the test efficiency.

For instance:
- Some tests may have more items than necessary. Can I remove some items and maintain (or improve) the same reliability? What are these items?

- Some items are located at the same position on the continuum of the latent variable. Are they redundant items?

- Certain items can degrade the performance of the overall test. What are they?


MikeLinacre: Thank you, Renato.

Generally, items with mean-squares > 2.0 degrade test functioning and lower reliability.

"Redundant items": If we want to measure with equal precision over a long range, then we need a uniform distribution of items. If we want to measure with high precision over a short range (e.g., near a pass-fail point) then we need many items near that pass-fail point.

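The trade-off between a uniform spread of items and items clustered near a cut-point can be illustrated with the standard error of a dichotomous Rasch measure, 1 / sqrt(sum of p(1-p)) over the items. This is a minimal sketch with hypothetical item difficulties, not output from Winsteps.

```python
import math


def standard_error(ability, difficulties):
    """Model standard error of a person measure on dichotomous Rasch items."""
    information = 0.0
    for d in difficulties:
        p = 1.0 / (1.0 + math.exp(d - ability))
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)


# Two hypothetical 13-item tests.
uniform_items = [-3.0 + 0.5 * k for k in range(13)]  # spread from -3 to +3
targeted_items = [0.0] * 13                           # all at a cut-point of 0

for ability in (-3.0, 0.0, 3.0):
    print(
        ability,
        round(standard_error(ability, uniform_items), 2),
        round(standard_error(ability, targeted_items), 2),
    )
```

The targeted test is more precise at the cut-point and less precise at the extremes, while the uniform test gives more even precision along the range.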
renato: Thank you very much, Mike.

I'm excited about the possibilities of using Winsteps in education and psychology and I'll keep learning.


555. Formatting

uve February 4th, 2011, 5:41pm: Mike,

What is the best way to copy Winsteps output tables from notepad to Word and still preserve the formatting?


MikeLinacre: Several options, Uve. www.winsteps.com/winman/ascii.htm

Copy-and-paste from NotePad into Word: use a fixed-space font, such as Courier New in Word.

ASCII=webpage displays in your web-browser. Then copy-and-paste into Word.

uve: Thanks!

556. Linear partial credit model

resmig February 1st, 2011, 3:37pm: Hello ,
I have 15 items with different ordered categories for each one. These are measured over time ( 4 time points) . I was thinking of analyzing with linear partial credit model. My questions are - (1) Is there any other model option , and (2) What software should be used ?
Thanks very much,
Resmi Gupta

MikeLinacre: Thank you for your questions, resmig.

The Partial Credit Model (PCM) is appropriate. There are many Rasch and non-Rasch model options (generalized PCM, multidimensional PCM, etc.), but my recommendation would be to start with PCM.

Obviously I recommend Winsteps, but there are many Rasch software alternatives www.rasch.org/software.htm

For thoughts about time series and Rasch, please see www.rasch.org/rmt/rmt223b.htm

resmig: Thanks very much Dr. Linacre.

557. Contrast Loadings

uve January 31st, 2011, 3:55am: Mike,

I'm getting much closer to a more solid conceptual grasp of dimensionality, but there is something still eluding me when attempting to read Table 23. By contrast, I'm assuming you mean the difference between how an item or items "correlate" with a possible sub-dimension and how they correlate with the Rasch dimension. If that is true, are the positive loadings the possible sub-dimension and the negative loadings the Rasch dimension? I've attached an output file of a 77 item multiple-choice test of English Language Arts given to 2nd graders over a day or two midway through the year. It covers the typical clusters of standards: Reading Comprehension, Word Analysis, Writing Strategies, Written Conventions and Literary Response & Analysis. The Decoding items fall under the Word Analysis cluster, and most of the other items fall under the other clusters.

uve: Mike,

Here's the attachment.

MikeLinacre: Uve, you are an explorer traversing rough country!

We are analyzing the residuals, so the Rasch dimension is the .0 line. The Rasch dimension does not correlate with any component in the residuals.

The contrast is between the correlations (loadings) of items A,B,C, ... (focusing on "word meaning") and the correlations (loadings) of items a,b,c, ... (focusing on "word construction") with a latent component that is orthogonal to the Rasch dimension. In conventional factor analysis, we would think of these opposite loadings as two factors: "word meaning" and "word construction".

The convention in factor analysis (principal component or common factor) is to set the biggest loading as positive. But this is an arbitrary choice. We could equally well decide to set the biggest loading as negative.

uve: Mike,

Yes, rough country indeed but I have 4-wheel drive. I just wish I knew where I was going!

Factor analysis is new to me, but in my readings so far the examples usually show certain questions/variables that load high on one factor with how they load low on the other. But contrast seems to be slightly different.

Correct me if I’m wrong, but it seems from what you’re saying that the high loadings are simply the questions that load strongly on one factor (ignoring how they load on the other factor) and the negative loadings are questions that load strongly on the other factor (ignoring how they load on the other factor). So in my data the question 2 loads the highest on the first factor and question 39 the highest on the other. But the more important issue is determining what the construct of the factors are. Is that correct?

If these two factors are orthogonal to one another, then they are therefore presumed to be independent. If the loadings are not strong, say below .4, then could we assume these two factors are not as independent as we think?

The reason I ask this comes from the output of a scatterplot of 100 persons on the groups of questions for each factor. It was items 2-10 and 12 for the high-loading factor and items 39, 46, 52, 67, 70, 71, and 74 for the negative factor. Even though the eigenvalue for the 1st contrast was 3.4, the graph doesn’t seem to suggest this is distorting our measures. So, I might conclude that perhaps we are not measuring truly independent factors but perhaps just the “legitimate” subcomponents of the test.

MikeLinacre: Uve, this can be confusing. Residuals are balanced, so if someone is high on some items, that person must be low on other items. The PCA of residuals reports that there is a component (factor) with "word meaning" at one end and "word construction" at the other. This component is orthogonal to the Rasch dimension, and also explains the most variance in the residuals. From the perspective of the person sample, after adjusting for their overall ability, there is a tendency for some people to be high on "word meaning" but low on "word construction", but for other people to be low on "word meaning" but high on "word construction". This contrast between the two groups of persons (or two types of items) explains the most variance in the residuals.

uve: Mike,

Thanks again as always. Based on my scatterplot, it seems the factor is not very strong. Would you agree?


MikeLinacre: Uve, a contrast size of 3.4 with 77 items is weak, but noticeable. If this is a pilot version of an instrument, we might try to reduce it. If this is the real data, then we would accept it.

558. sample size of two groups are different

hm7523 January 29th, 2011, 6:13am: Hi,
I have two groups and want to compare whether their performance on a set of rating scale items is similar. The first group has 300 people and the second group has 3000 people. Is it okay to do a DIF analysis? Thanks!

MikeLinacre: Hm7523, your situation is typical in DIF studies of ethnicity. There is a big "reference" group and a smaller "focus" group. Your sample sizes look fine to me.

559. winsteps code randomly select persons

pjiman1 January 25th, 2011, 8:59pm: Hello,

what is the winsteps code to randomly select persons from the person data file for a rasch analysis? I'm looking for a code similar to that in SPSS, where I can select 50% of the sample for analysis. Is there a similar code in winsteps? and can this code be implemented after the PDfile code? so that I am only randomly selecting persons after the misfitting persons have been deleted? I am doing this because I have a large sample size and want to compare the results from two randomly selected samples from my larger sample.

thanks in advance.

MikeLinacre: Thank you pjiman1.

Winsteps does not have random person selection, but it can select every other data record from the input file (before PDFILE= is actioned).

https://www.winsteps.com/winman/index.htm?format.htm - Example 7
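A random split can also be made outside Winsteps before the analysis. The sketch below assumes one data record per person; the record contents and the seed are hypothetical. Each half would then be saved as its own data file.

```python
import random


def split_records(records, fraction=0.5, seed=1):
    """Randomly split data records into two non-overlapping samples.

    The original record order is preserved within each sample.
    """
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    chosen = set(indices[: round(len(records) * fraction)])
    first = [r for i, r in enumerate(records) if i in chosen]
    second = [r for i, r in enumerate(records) if i not in chosen]
    return first, second


# Hypothetical person records: label followed by responses.
records = [f"P{n:03d} 1011010110" for n in range(10)]
sample_a, sample_b = split_records(records)
print(len(sample_a), len(sample_b))
```

If misfitting persons are to be dropped first, their records would need to be removed from the list before the split.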

560. Scalogram pattern matched with infit/outfit

uve January 18th, 2011, 12:49am: Dear Mike,

Our multiple choice tests are scored using A-D. I would like to be able to attach the Guttman scalogram of responses output in the first half of Table 22.1 on to the end of Table 6. My intent is to be able to look at the infit and outfit values and also see the pattern of responses for the matching respondents in one table. Is this possible?


MikeLinacre: Uve, yes, you can do this with rectangular copies (NotePad++, TextPad, Word, etc.)

The approach depends on how big is your sample, how long is your test, and how many persons you want to view.

Let's assume everything is big, and you want to view everyone (the worst case).

1. Perform your standard Winsteps analysis.
2. Output Table 22, with a LINELENGTH= big enough for all observations to be on one line.
3. Save the Table 22 file to your Desktop (or wherever).
4. Open the Table 22 file in software that can do a rectangular copy (NotePad++, TextPad, Word, etc.)
5. Open your Winsteps data file using the same rectangular-copy software.
6. Rectangular copy the Scalogram immediately after the person labels.
7. Adjust Winsteps control NAMELENGTH= etc for the new data file format.
8. Save the control and data file(s)
9. Perform your revised Winsteps analysis.
10. Table 6 should now display the Scalogram as part of the person label.

OK, Uve?

uve: It usually takes me a day or two for things to click. I'll give it a try and get back to you. Thanks!


uve: When I ran Table 22, each response set was on one line only in Notepad, so that part worked out fine. But I am using Microsoft Office 2003, so I'm assuming this version does not accommodate rectangular copy. I say this because I highlighted the data and pasted it into the control file, but it inserted the data instead. I tried the same with WordPad but it did the same. Is rectangular copy a special command? You mentioned Linelength. Is this a command you provide in Winsteps? I'm not sure I need it, but if the output Scalogram contains additional output I don't want, it would be nice to crop it out.


MikeLinacre: Microsoft Word supports rectangular copy (Alt+Mouse). Set the font to Courier New. Rectangular copy only works within the software, so please also import the Winsteps control file into Word.

NotePad++ (freeware: http://notepad-plus-plus.org/ ) is good for rectangular copies.

LINELENGTH= is a Winsteps command that sets the character-length of the output lines. In some Winsteps Tables it controls continuation lines, and vertical splitting of Tables.

uve: Thanks! I'll give it a try

561. Effect Size

uve January 10th, 2011, 2:35am: Mike,

Most of the data I work with will always consist of over 1,000 participants. If I recall correctly, the effect size for a t-test is: SQRT(t^2/(t^2+df)). I have two questions:

1. Can this formula be applied to Welch t values in Table 30.1 provided the df column is not INF as well as the t values given in Tables 30.2-4?

2. Is there an equivalent version that could be used for the ZSTD values in Table 14?

I guess where I'm going with this is that I know I will likely get very large ZSTD values for some items given the large populations I work with, so if an effect size calculation could be done I could easily tell whether the effect is significant or not.


MikeLinacre: Uve, the Welch t-values are Student t-statistics, but using a more accurate computation than Student's. So t-theory applies.

ZSTD statistics have already been transformed into unit-normal deviates. Computing the d.f. is complex, but it usually approximates the COUNT column. For 1,000 participants, you could probably approximate:

effect size = SQRT (ZSTD^2 /(ZSTD^2 + df))

but the Winsteps ZSTD computation cuts off at 9.9 because the error in the computation of more extreme values is large.

So the biggest reported value would be SQRT(10^2 / (10^2 + 1000)) = SQRT(100 / 1100) = 0.3

But this effect-size computation is new to me. Do you have a reference for it, Uve?
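The arithmetic above can be checked with a few lines of Python. The function name is made up for illustration; the inputs are the capped ZSTD of about 10 and a d.f. of roughly 1,000 taken from this post.

```python
import math


def effect_size_r(statistic, df):
    """Correlation-type effect size: sqrt(x^2 / (x^2 + df))."""
    return math.sqrt(statistic ** 2 / (statistic ** 2 + df))


# Largest reported ZSTD (about 10) with roughly 1,000 participants.
print(round(effect_size_r(10, 1000), 2))  # 0.3
```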

dachengruoque: http://www.amazon.com/Essential-Guide-Effect-Sizes-Interpretation/dp/0521142466/ref=sr_1_1?ie=UTF8&qid=1294638250&sr=8-1
The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results
This could be one of the latest books on effect size.

MikeLinacre: Dachengruoque, do you know if "The Essential Guide" talks about the SQRT(t^2/(t^2+df)) computation? My own references on effect sizes do not.

dachengruoque: I am a green hand at effect size issue as well first of all. However, as far as I know there are different computations formula for t-tests family since there are different types of t-test.

dachengruoque: Calculation of r Family of Effect Sizes for t-test

The t-test effect size is most often calculated with Cohen's d. In the case of
a t-test, r will be the point-biserial correlation. To convert d to r if sample
sizes are equal: rpb = d/SQRT(d^2 + 4) (Volker, 2006)

MikeLinacre: Yes, that effect-size is in Ellis:
eta^2 = t^2 / (t^2 + N - 1)
This is a "correlation measure of effect-size" - http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Lehre/StatIIKrim/EffectSizeBecker.pdf

dachengruoque: Thanks a lot, Dr Linacre!

uve: My apologies for not replying sooner. So many projects! My source for the equation was "Discovering Statistics Using SPSS" Andy Field, 2nd Edition, 2005. Page 294 has the formula in question. There are many more. One intriguing version is for Z in the Mann-Whitney test: r=Z/SQRT(N), page 532. Much of our data is not normally distributed.

Again, I am wondering if it would be helpful for, say, Winsteps Table 14 to have an additional column(s) providing the effect size or significance value (probability) of the Infit and Outfit ZSTD values, as well as the same for the t-values in Tables 30.1-4. When you're working with such large populations as I am, with over 100 tests to examine, it would be great to have something that helps point the way.

Or should I be using a different tactic? Thanks again for all the help.


MikeLinacre: Thank you for your suggestion, Uve. Winsteps is bulging with more and more numbers!

For large sample sizes, my advice is "Ignore the significances, look at the mean-square or logit sizes." And here is a quick summary of mean-squares from Winsteps Help:

Interpretation of parameter-level mean-square fit statistics:

>2.0 Distorts or degrades the measurement system.

1.5 - 2.0 Unproductive for construction of measurement, but not degrading.

0.5 - 1.5 Productive for measurement.

<0.5 Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.

562. Error in Item Output (Polarity Table)

mlearning.usm January 11th, 2011, 5:55am: We would like to know the reason why the installation password is no longer valid. Besides, there is no expiration date stated, and it is assumed this software can be used at any time. To be frank, the Winsteps software also is not working very well. Item entered is not same as the item measured, only half of the item entered is represented in the item polarity table. Please explain this situation since it causes inconvenience to the user.

Thank you.

MikeLinacre: Thank you for this question. I have replied in detail to two of your email addresses. In summary:

You asked: "We would like to ask the reason the installation password is no longer valid."

Reply: The Winsteps software license is perpetual. It does not expire. The installation password is valid for the version of Winsteps you purchased. It is not valid for the current version of Winsteps. Your free update period (one year) has expired. Your update eligibility can be renewed, then you will receive the password for the current version. Please email me for the update procedure.

You wrote: "To be frank, the Winsteps software also is not working very well."

Reply: Our apologies. We are constantly working to improve the Winsteps software.

You wrote: "Item entered is not same as the item measured".

Reply: There are many possible reasons for this. Please look at Winsteps Table 14 for the full list of items in entry order. Does this match what you expect?

You wrote: "only half of the item entered is represented in the item polarity table"

Reply: The three most common reasons for this are:
1. The data file contains "tab" characters. Please replace all tabs with blanks.
2. ITEM1= must be the column number of the first item-column in the data file.
3. NI= must be the number of items
Other reasons include:
4. XWIDE= must be the number of columns for each response, usually = 1
5. CODES= must include all valid response codes
and so many more :-( but they are usually easy to remedy.

Please email me your Winsteps control and data file(s). I will diagnose the problem for you.

Mike Linacre
bugs -/at\- winsteps.com

563. Linear transformation of the logit scale

anairc January 10th, 2011, 3:51pm: Dear Prof. Linacre,

I'm a newbie in the measurement process and Rasch analysis. :) My objective is to measure pain perception and stimulus intensity using a rating scale [0-10]. I applied a rating scale model to analyze the data and obtained the Rasch scores for both stimuli and responders. I was thinking of applying a linear transformation to the Rasch scores to obtain a more 'friendly' 0-10 scale, and have already seen this in some papers, but I was not able to fully understand how exactly the extremes were obtained (infinity?). Is there any reference that approaches this issue in detail?
Thank you so much for your help.

Best regards,

MikeLinacre: Ana, in Rasch theory, extreme scores (minimum possible and maximum possible) have infinite measures. But these are impractical. So we substitute the measures for almost-extreme scores, such as (minimum possible + 0.3 score-points) and (maximum possible - 0.3 score-points).

This corresponds to option 1 in https://www.rasch.org/rmt/rmt122h.htm
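Once the two almost-extreme logit measures are known, the linear rescaling is simple arithmetic. This sketch uses hypothetical logit values; in Winsteps the multiplier would presumably be supplied through a user-scaling setting such as USCALE=, which should be confirmed against the manual.

```python
def linear_rescale(low_logit, high_logit, new_min=0.0, new_max=10.0):
    """Slope and intercept mapping [low_logit, high_logit] to [new_min, new_max]."""
    slope = (new_max - new_min) / (high_logit - low_logit)
    intercept = new_min - slope * low_logit
    return slope, intercept


# Hypothetical measures for (minimum + 0.3) and (maximum - 0.3) score-points.
low, high = -6.2, 5.8
slope, intercept = linear_rescale(low, high)

for logit in (low, 0.0, high):
    print(logit, round(slope * logit + intercept, 2))
```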

564. comparing item performance across forms

dmt1 January 3rd, 2011, 7:21pm: I want to compare the same items on two separate forms of an examination to determine whether a statistically significant difference exists in the performance of the items (looking at b) across the two forms. N is different on each form. Is there a way I can do this in one analysis in Winsteps?

MikeLinacre: dmt1, yes, we can do this in one analysis.

We need to put the two datasets into one analysis, with each student identification including a form number. Then do a DIF analysis of item x form.

Here is one approach.
1. Make up a master list of the items, identified by sequence number. This can match one of the forms.
2. Enter the data for each form in its own sequence.
3. Set up an MFORMS= instruction
3.1. Match each item on each form to its master-list number
3.2. Enter a constant in each student id: the form number.
4. Do a Winsteps analysis using the MFORMS= instruction.
5. Perform a DIF analysis: Winsteps Table 30.

dmt1: Thank you! I was able to do the analysis I needed.

565. a rather simple question

barbresnick January 9th, 2011, 3:25pm: I am trying to get used to the new version of Winsteps, but in an older text file with the syntax I want to do an analysis using just a few of the items from a string. Specifically, the measure has 10 items and I want to analyze just items 2, 5, 6, 8 and 9. Is there a way to do this? Would it be...


any help is greatly appreciated.

Barb Resnick

MikeLinacre: Barb, several ways, easiest is:


This means "delete all the items, and reinstate 2,5,6,8,9"

barbresnick: Thanks Mike that worked great for that group of items (negative subscale). When I tried it for the positive subscale items 1,3,4,7 and 10 it only read 3 of the items? is there something wrong with how I typed it?

1 strongly agree
2 agree
3 disagree
4 strongly disagree
;??NO file I don't want to delete anything
satisfied with self
I am no good
I have good qualities
Do things well
Not proud
Feel useless
Worthy person
more respect for self
a failure
positive attitude to self

MikeLinacre: Barb, oops?


NI is the number of items. Should this be NI=10 ??

barbresnick: exactly....and the good news is I figured that out! thank you again so much. Barb

566. Standard Error of Estimate: Model or Real?

rblack January 9th, 2011, 4:21pm: Hi again!

When constructing confidence intervals around the estimated item difficulties, is it general practice to use the "Model" standard errors or the "Real" Standard Errors?



MikeLinacre: Ryan, this depends where you are in your analysis.
If you are starting, use the "Real" (worst case) so you don't fool yourself that you are doing better than you really are.
If you are ending, and are reasonably certain that the randomness in the data is the randomness predicted by the Rasch model, then use the "Model" (best case).
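Either standard error goes into the interval in the same way. A minimal sketch, assuming a normal approximation with a 95% multiplier of 1.96 and hypothetical values for one item:

```python
def confidence_interval(measure, standard_error, z=1.96):
    """Approximate confidence interval: measure +/- z * S.E."""
    half_width = z * standard_error
    return measure - half_width, measure + half_width


# Hypothetical item: difficulty 0.50 logits, Model S.E. 0.10, Real S.E. 0.12.
for label, se in (("Model", 0.10), ("Real", 0.12)):
    low, high = confidence_interval(0.50, se)
    print(label, round(low, 3), round(high, 3))
```

The "Real" interval is always at least as wide as the "Model" interval, which is why it is the more cautious choice early in an analysis.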

567. # of Decimal Places in Winsteps Output Table

rblack January 9th, 2011, 1:11am: Hi,

The Output Table, "Item: Measure", provides values to the 2nd decimal place for the "Measures" and "Model S.E." columns. Is there a way to obtain values under those columns to the 4th decimal place? Ultimately, I want to output this table (with 4 decimal places) to an excel spreadsheet for manipulation.

Any help would be appreciated.


rblack: Hi all-

I figured it out. UDECIMALS=4.


MikeLinacre: Well discovered, Ryan! Another option is to multiply the measures and S.E.s by 100 with USCALE=100, then adjust back in Excel.

rblack: Thanks, Mike!

568. DIF analyses interpretation

nike2357 January 8th, 2011, 10:02am: Dear Mike,

I'm not sure I'm interpreting the results from my DIF-analyses correctly so I'd be grateful if you could clarify a few things.
The groups are men and women, and I have data from a personality inventory with a five-point Likert scale. I'm using ETS' classification system, so DIF exists if the DIF contrast is >.43. But to interpret the direction of the DIF, I have to look at the DIF measures, right? If the DIF measure is larger for men than for women, does that mean the item is more difficult for men? And does that then mean that the item favors women, since they endorse a higher response category than men of the same latent trait level? And with the DIF measures, do I only evaluate the absolute value or the sign as well? I.e., would .34 be more difficult than -.43 or the other way around?

Thanks very much!

MikeLinacre: Nike, there are two groups (men and women), so the DIF report will show that one group has scored higher than expected and one group has scored lower than expected. So please look at the "observed-expected average", which is the "DIF SCORE" in Winsteps Table 30.2.

If the group scores higher than expected on an item, then either they have greater ability on that item, or that item is easier for that group. The DIF report cannot tell us which alternative is correct. Conventionally we think of the group ability as constant, so that the item is easier for the group = negative DIF.

If the group scores lower than expected on an item, then either they have lower ability on that item, or that item is harder for that group. The DIF report cannot tell us which alternative is correct. Conventionally we think of the group ability as constant, so that the item is harder for the group = positive DIF.

When there are only two groups, we usually compare them directly with each other. This is reported in Winsteps Table 30.1 as the "DIF contrast".
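
The sign convention above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the example values reuse the .34 and -.43 from the question, and the size bands (0.43 and 0.64 logits) are the commonly quoted ETS thresholds, applied here by size alone without the significance tests that the full ETS classification also requires.

```python
def dif_contrast(dif_measure_a, dif_measure_b):
    """DIF measure of group A minus DIF measure of group B, in logits.
    Positive: the item is harder for group A than for group B."""
    return dif_measure_a - dif_measure_b

def ets_size(contrast):
    """Size-only ETS band (assumed thresholds; significance is ignored)."""
    size = abs(contrast)
    if size >= 0.64:
        return "C (moderate to large)"
    if size >= 0.43:
        return "B (slight to moderate)"
    return "A (negligible)"

# Hypothetical DIF measures for one item. The sign matters:
# 0.34 is a harder (higher) measure than -0.43.
men, women = 0.34, -0.43
contrast = dif_contrast(men, women)
harder_for = "men" if contrast > 0 else "women"
print("contrast = %.2f, harder for %s, size %s" % (contrast, harder_for, ets_size(contrast)))
```

So the sign gives the direction (which group finds the item harder) and the absolute value gives the size.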

569. Scoring using LOFT (linear on-the-fly testing)

oosta January 7th, 2011, 3:08pm: I am considering using linear on-the-fly testing to deliver a multiple-choice test. That is, when each examinee takes the exam, the 100 items will be randomly selected from a large pool (2,000 items) of items. The testing engine supports LOFT but not CAT.

I can store the equated item difficulty parameter of each item in the test administration system.

So here's my question. Is the algorithm or formula for computing the total (Rasch) score (using the difficulty parameter values) fairly simple? What is the algorithm/formula? By the way, unanswered items will be scored as incorrect.

MikeLinacre: Oosta, yes, the estimation of the Rasch measures (scores) is straightforward: https://www.rasch.org/rmt/rmt102t.htm
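
For readers who want to see the shape of such a computation, here is a minimal sketch of maximum-likelihood estimation of one examinee's measure from known dichotomous item difficulties, using Newton-Raphson iteration. It is not necessarily the exact procedure at the link above: the function name, starting value, step limit, and convergence tolerance are assumptions. Unanswered items (None) are scored as incorrect, as the question specifies. Extreme scores (all correct or all incorrect) have no finite estimate, so this sketch simply refuses them.

```python
import math

def rasch_ability(responses, difficulties, tol=1e-6, max_iter=100):
    """Estimate ability (logits) and its standard error.
    responses: 1 = correct, 0 = incorrect, None = unanswered (scored 0).
    difficulties: equated item difficulties in logits, same order."""
    if len(responses) != len(difficulties):
        raise ValueError("responses and difficulties must have equal length")
    scored = [1 if r == 1 else 0 for r in responses]
    n_items = len(scored)
    raw = sum(scored)
    if raw == 0 or raw == n_items:
        raise ValueError("extreme score: no finite estimate")

    # Starting value: mean item difficulty plus the log-odds of success.
    theta = sum(difficulties) / n_items + math.log(raw / (n_items - raw))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        expected = sum(probs)                       # expected raw score
        variance = sum(p * (1.0 - p) for p in probs)
        step = (raw - expected) / variance          # Newton-Raphson step
        step = max(-1.0, min(1.0, step))            # limit step for stability
        theta += step
        if abs(step) < tol:
            break

    # Standard error from the information at the final estimate.
    probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
    se = 1.0 / math.sqrt(sum(p * (1.0 - p) for p in probs))
    return theta, se

# Hypothetical 4-item example; the third item was left unanswered.
theta, se = rasch_ability([1, 1, None, 0], [-1.5, -0.5, 0.5, 1.5])
print("ability = %.3f logits, S.E. = %.3f" % (theta, se))
```

At convergence the expected raw score equals the observed raw score, which gives a simple check on any implementation.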
