OSU_Public_Health November 14th, 2006, 9:21pm:
We have longitudinal data, beginning in 1st grade and ending at 6th. At the baseline survey (1st grade) we had a total of 18 items. As the survey was administered each successive year, additional items were added to the original 18. Further, during the first two years of the survey, response options were on a 3pt scale. However, for the remaining years, the response scale changed to 4pt.
I am attempting to determine which Test Equating method would work best for this scenario.
It seems we have several issues:
a) common items across surveys, with additional items at each successive administration
b) response scale change from 3pt to 4pt
Would this be something best dealt with in a two step process; either equating the scales with the common items and then analyzing the entire set of items or vice versa?
Does anyone have any suggestions as to best deal with this situation?
You have two things going on here, OSUPH. Unfortunately you can't meaningfully analyze all the data together.
For the first two years you were measuring in "Celsius". You could put these years together using common-item equating. (Analyze each year separately and cross-plot the item difficulties to verify that the common items really are "common").
Then for the other years on the "Fahrenheit" scale, do the same thing ....
Putting the two sets of measures together is a Celsius-Fahrenheit conversion. Take the two sets of item difficulties and cross-plot the common items. The best-fit line gives the Celsius=>Fahrenheit conversion to apply to all the measures for the first 2 years to put them on the scale of the later years for reporting purposes.
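The conversion described above can be sketched in a few lines of Python. This is only an illustration: it assumes the best-fit line has slope SD(y)/SD(x) and passes through the means of both sets of common-item difficulties, and every number and function name below is hypothetical.

```python
from statistics import mean, pstdev


def best_fit_line(x, y):
    """Return (slope, intercept) of the line y = slope * x + intercept.

    Assumption: slope = SD(y) / SD(x), and the line passes through the
    point (mean(x), mean(y)). The slope is taken to be positive.
    """
    slope = pstdev(y) / pstdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


def convert(measures, slope, intercept):
    """Put measures from the early scale onto the later scale."""
    return [slope * m + intercept for m in measures]


# Hypothetical difficulties (logits) of the same common items,
# estimated separately on the 3-point and the 4-point data.
early_items = [-1.2, -0.4, 0.3, 1.1]
later_items = [-1.9, -0.7, 0.4, 1.6]

slope, intercept = best_fit_line(early_items, later_items)

# Hypothetical person measures from the first two years.
early_person_measures = [-0.5, 0.0, 2.0]
print(convert(early_person_measures, slope, intercept))
```

Cross-plotting the common items first, as suggested above, is still the step that shows whether the items are stable enough for the line to be trusted.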
Hello, it was with great interest I read this post and was wondering if the same advice would apply in this situation.
In year 1, a survey was administered with a 5-point response scale (1 = SD, 2 = D, 3 = sometimes agree, sometimes disagree, 4 = A, 5 = SA). After the data were collected, the survey group decided that it might be best to drop the middle response option ("sometimes agree, sometimes disagree") to improve interpretation of scale mean scores based on raw scores, in the classical test theory tradition.
In year 2, they want to change the response scale to a 4-point response scale (1 = SD, 2 = D, 3 = A, 4 = SA).
Is there a way to equate scores from year 1 to year 2 even if the response options change? Can we assume that respondents interpret the options in a continuous, linear fashion?
My initial plan is this:
First, I would take a look at the year 1 data and look at the response categories and how they are functioning. Frequency counts can determine if any one category is not being used. I would also examine the item response curves to see if any category could be collapsed. In my experience, for a 5-point Likert scale, categories 1, 2, and 3 could be collapsed without any meaningful change to the item measure. If that is the case, I might be able to get away with collapsing the categories for the year 1 data. Then I would have an equivalent measure with the second round.
Second, I would obtain the item measures for year 1 using a 5-point scale. Then, I would collapse the 5-point scale into a 4-point scale and obtain the item measures again. I would save the item measures for both versions, then cross-plot them. I would review the scatterplot to see whether there is a one-to-one correspondence along the whole range of the axes. If so, I have some evidence for equivalence. If they wildly diverge, that means we are getting different item measures if we use a 4-point versus a 5-point scale. My guess is that I am going to find that some items work better as a 3-point collapsed scale and others as a 4-point collapsed scale.
Third, if there is a best-fit line where the 4-point and 5-point scale versions demonstrate equivalence, then the best-fit line is an equation to convert 5-point to 4-point. As for what the equation is and how to extract it, I think there should be a table from Winsteps that could produce that equation. But the point is to use the equation to put the 4-point response option (Celsius) on the 5-point response option scale (Fahrenheit).
Is that a good plan?
That sounds like a good plan, Pjiman1.
You may want to cross-plot the survey groups measures at each stage of the collapsing process to verify that the process does not distort those measures in a way that could alter your findings.
If you are satisfied with the best-fit line that Winsteps shows in the Excel scatterplot, then its formula is in the Excel worksheet.
thanks for your quick reply as always. most helpful.
Seanswf August 22nd, 2006, 12:43pm: I am analyzing an attitude survey with 29 items. My goal is to create a data file that contains the survey respondents' raw scores transformed into logit scores. Can I do this with Facets?
"Logit scores" are measures on an interval scale constructed using Rasch analysis.
Yes, Facets constructs "Logit scores".
I have some questions for you.
If I want to do the analysis using Rasch measures (logit scores) in SPSS, how can I transfer the Rasch measures (logit scores) into SPSS? Is there any way to do it? Thanks.
MikeLinacre: Najhar, in SPSS, Rasch measures (logits) are exactly like any other additive variable, such as height or weight. For Rasch measures we also have their precision (S.E.s) which we don't usually have for height and weight. But most SPSS routines are not configured to take advantage of S.E.s.
Raschmad October 25th, 2006, 8:09am:
I appreciate any comments.
Why are Rasch measures linear?
I know that the Rasch measures are expressed as the exponent of the base e.
I also know that equal ratios on the raw score scale are expressed with equal intervals on the logarithmic scale. The logs of the numbers 2, 10, 50, 250 to the base 10 are 0.30, 1.00, 1.70, 2.40 respectively. The ratio on the raw score scale is 5:1. The logarithmic scale is interval and increases by 0.70.
Consider this scenario that students A, B, C and D have raw scores of 2, 10, 50, and 250 respectively. If student C complained and said I have scored 50, i.e., 40 scores more than B and B has only 8 scores more than A, but the difference between B and A is just equal to the difference between me and B, what would we answer? And of course D is at a much better position to complain.
Or have I misunderstood the whole issue of linearity?
Good question, Raschmad!
Rasch measures are linear because they are constructed to be linear, in the same way that distances in meters are linear because they are constructed to be linear. In contrast, numbers on the Richter Scale for earthquakes are NOT constructed to be linear, and test raw scores are NOT constructed to be linear either.
There are many ways of establishing that Rasch measures are linear. I like the one based on concatenation because that is what physicist Norman Campbell said was impossible for psychological variables. See https://www.rasch.org/rmt/rmt21b.htm
The paper is way beyond my measurement background. I know that the parameters are estimated iteratively by means of this equation:
R = Sigma(i=1..L) exp(Bn-Di)/(1+exp(Bn-Di))
R being the raw score of person n and L the number of items. Could you please explain in layman's terms why the estimates of B and D for different persons and items are linearly scaled?
What do you mean by “they are constructed to be linear”?
Certainly, Raschmad. In practical terms, "linear" means "a numerical increment of 1 implies the same amount extra no matter how much there already is of the variable". We can see that raw scores are not linear because there cannot be an increment of 1 beyond the maximum raw score.
"Constructed to be linear" means that we start from this (or any other) definition of linearity and deduce the algebraic model that has this property. The algebraic model is the Rasch model. This is done apart from any actual data. It is similar to deriving Pythagoras Theorem from the property that a triangle has a right-angle. So the Rasch Model and Pythagoras Theorem have the same standing. Both are deduced from an ideal. The more closely the data approximate either the Rasch model or Pythagoras Theorem, the closer those data are to having the ideal properties.
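A small numerical sketch may help with the point about raw scores. It uses the equation quoted earlier in the thread, R = Sigma exp(Bn-Di)/(1+exp(Bn-Di)), to compute the expected raw score at several ability levels. The five item difficulties are made up for illustration, and the function names are not from any Rasch software.

```python
import math


def success_probability(b, d):
    """Rasch probability of success for ability b on an item of difficulty d."""
    return math.exp(b - d) / (1.0 + math.exp(b - d))


def expected_raw_score(b, difficulties):
    """Expected raw score: the sum of success probabilities over all items."""
    return sum(success_probability(b, d) for d in difficulties)


# Hypothetical item difficulties in logits.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Equal one-logit steps in ability...
for b in [0.0, 1.0, 2.0, 3.0, 4.0]:
    # ...give raw-score steps that shrink as the maximum score (5) is approached.
    print(b, round(expected_raw_score(b, difficulties), 2))
```

The ability steps are equal by construction, while the raw-score steps get smaller near the ceiling, which is the sense in which raw scores are not linear.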
I think this paper makes it clear why Rasch measures are linear.
I'm reviving an old thread here, but it'd be silly to start a new one.
MikeLinacre wrote: "There are many ways of establishing that Rasch measures are linear. I like the one based on concatenation because that is what physicist Norman Campbell said was impossible for psychological variables. See https://www.rasch.org/rmt/rmt21b.htm"
In that article, the concatenation of two persons working together on one item is based on the fact that both persons agree. What happens if they disagree on the answer, but each is absolutely convinced he's right? Also, the expression given for the concatenation of P_X and P_Y assumes that each person's probability to succeed is unaffected by the cooperation with the other person, which need not be true. Finally, one can imagine scenarios where two persons working together can succeed whereas each individual cannot. A far-fetched example might be the task to calculate 2*3+4*5 being tackled by John, who knows how to add but not how to multiply, and Sue, who knows how to multiply but not how to add.
But nitpicking aside, I'm willing to accept that Rasch measures do in fact concatenate nicely, and I'm sure that the scale defined by p = exp(b-d)/(1+exp(b-d)) can be proven to be linear.
What I'd be interested to know is if it also works the other way round: does *requiring* linearity inevitably lead to the Rasch scale? In other words, is the Rasch scale the only conceivable scale that is linear?
Thank you for your questions, Garmt. You describe various situations. Measurement models can be constructed for these. See, for instance, "Teams, Packs and Chains" www.rasch.org/rmt/rmt92j.htm.
You ask is the "Rasch scale the only conceivable scale that is linear?" No, the measurement theoreticians, such as Joel Michell, have formulated the general rules that a linear scale must obey. But Rasch is the only methodology that successfully produces linear measures from probabilistic data (that I know of). It will be exciting when another methodology is developed.
garmt: Thanks for the pointer, Mike. I found an interesting article by Joel Michell (Stevens's Theory of Scales of Measurement and Its Place in Modern Psychology), who complains about unsubstantiated claims on tests giving interval scale measurements. Also, some nice work by George Karabatsos, who is developing a Bayesian approach.
Seanswf December 7th, 2006, 1:20pm:
This is a basic question on the differences between CTT and IRT. I have a number of tests to analyze and want to obtain a reliability measure for each. We use an item banking approach so it is rare that any two test takers would receive the exact same item set. Because of this I cannot calculate the CTT Alpha coefficient. What is the analogous statistic in Rasch and what are some good resources on this topic?
Standard reliability indexes (such as Cronbach alpha) estimate the ratio Reliability = "true variance" / "observed variance", where observed variance = true variance + error variance.
Rasch does not require complete data. The "observed variance" is the variance of the Rasch measures. The error variance is estimated from standard errors of the Rasch measures. So the Rasch reliability estimates "true measure variance" / "observed variance".
There are many webpages about this on www.rasch.org, such as https://www.rasch.org/rmt/rmt132i.htm
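As a rough sketch of the calculation described above: the observed variance is the variance of the person measures, and the error variance is estimated from their standard errors. The code below assumes the error variance is the mean of the squared standard errors and uses the population variance; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
from statistics import mean, pvariance


def rasch_reliability(measures, standard_errors):
    """Estimate reliability = true variance / observed variance.

    Assumptions: observed variance is the population variance of the
    measures; error variance is the mean squared standard error.
    """
    observed_variance = pvariance(measures)
    error_variance = mean([se ** 2 for se in standard_errors])
    # True variance cannot be negative, so floor it at zero.
    true_variance = max(observed_variance - error_variance, 0.0)
    return true_variance / observed_variance


# Hypothetical person measures (logits) and their standard errors.
# Persons may have answered different item sets; only the measures
# and standard errors are needed here.
measures = [-1.4, -0.3, 0.2, 0.9, 1.8]
standard_errors = [0.45, 0.38, 0.37, 0.40, 0.52]
print(round(rasch_reliability(measures, standard_errors), 2))
```

Because the calculation needs only a measure and a standard error per person, it does not require every test taker to have received the same items.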
SMUH: What is the difference between Rasch dimension and Rasch reliability?
MSUH, thank you for your questions.
Rasch dimension: the latent variable underlying the responses that is identified by Rasch analysis and marked out in equal-interval units (logits). Along this dimension, the persons are positioned by their ability measures, and the items are positioned by their difficulty measures.
Rasch reliability: reliability is the ratio of "true variance" to "observed variance". Rasch reliability is when the reliability is computed using Rasch measures.
Raschmad December 11th, 2006, 8:58am:
Bond and Fox, chapter 5, talk about cross plotting measures from 2 tests with their errors and the 95% quality control lines to find out whether the two tests measure the same construct. What’s the advantage of this method over simple correlation?
They quote Masters and Beswick (1986) “...points well within the control lines could help to justify the assertion that two tests were measuring the same construct even though the correlation coefficient was disastrously low” (p.59).
How can the correlation between 2 tests be “disastrously low” but the points fall well within the control lines?
Are there any sources on the superiority of this method over correlating CTT raw scores?
Can't correlations and cross-plots both be done?
My problem with correlations is that they overlook the existence of two parallel trend lines. A prime example was our work on the Functional Independence Measure around 1989. A cross-plot of item calibrations at admission and discharge revealed vital information about the structure of the instrument. www.rasch.org/memo50.htm Figure 3.
Kaseh_Abu_Bakar: I have another unresolved question about this cross-plot stuff. Perhaps this is due to my ignorance of the concept and computation behind cross-plotting. When I cross-plot common items from two tests of different difficulty levels, I find some items that don't fall within the confidence intervals. About three of these items point in the direction that I want, i.e. their estimated measures in one test are harder than in another, as expected. However, they show a larger difference in estimates than the valid link items within the confidence interval. Thus, I am not sure whether to drop them from the link items and label them as different items, or to include them as link items in the concurrent analysis. Can anybody share his or her experience/knowledge/reference in this matter?
MikeLinacre: Kaseh: The confidence intervals are highly influenced by the sample sizes. So the first question is: "Are the differences in difficulties big enough to make a difference?". For many applications, differences of less than 0.5 logits don't matter.
1) Thanks, Mike. But in my case, where two tests of different difficulties are involved, wouldn't I expect some unidirectional differences between the two sets of measures? I expect the items to measure the same ability, but they do not have the same difficulty measures (not positions), mainly because they were administered to examinees of different abilities. So if this is the case, can I expect to see 95% of the plotted measures within the control lines?
2) What does the average of the measure differences between common items (D = Measure easy - Measure hard) tell me, and what does it say about the validity of my linking items?
I have cross plotted two sets of measures. The measures correlate at 1.
When I use the plot empirical line, the dots fall on the straight line with a slope of unity. When I click on the plot identity line, the measures diverge from the 45° line and some of the test-takers at the two ends of the line go outside the control lines. This second one is a bit strange given the perfect correlation between the measures. What is the difference between the 2 plots and which one should I take?
Raschmad, thank you for asking.
The two plots draw different lines, but based on the same points. Please look more closely at the empirical plot: the slope of the line is SD(y)/SD(x). A reported correlation of 1 would also happen between Celsius and Fahrenheit temperatures, but their common slope (conversion factor) is far from 1.
Our theory is usually that the slope should be 1, so we only act on the empirical slope if it is noticeably far from 1.
There is no problem with the empirical line since it’s exactly the 45° line which matches the correlation of 1. Isn’t the empirical line the familiar regression or correlation line?
The identity line is the one that diverges from the slope of 1. Therefore, as far as the empirical line (which is more important) is concerned, the test is doing OK. Then why do many points fall outside the control lines for the identity line? What's the slope of the identity line?
In a previous query of mine about these two lines you wrote: “The choice of lines depends on your hypothesis. If your hypothesis is that the two samples should produce identical measures, then use the identity line. If your hypothesis is that your samples should produce correlated measures, then use the empirical line”. From your example about the correlation of 1 between Celsius and Fahrenheit temperatures, I got the impression that we are not looking for correlated measures, we want identical measures and therefore the identity line should be more important.
Raschmad, something is wrong somewhere - perhaps with Winsteps. You write: "The identity line is the one that diverges from the slope of 1." But the identity line has a slope of 1 by definition. So, if it is being shown with a different slope, there is a plotting failure. Perhaps the identity and empirical plots have been switched around.
A correlation of 1 says nothing about the slope of the line. It says that all points fall on the line. For instance, Celsius and Fahrenheit temperatures have a correlation of 1, but the slope of the line is 1.80.
The best-fit line is a joint-regression line for predicting x from y and also y from x.
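The Celsius and Fahrenheit example can be checked numerically. The sketch below computes the correlation and the empirical slope SD(y)/SD(x) for a handful of temperatures; the helper names are made up for illustration.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation computed from population moments."""
    mx, my = mean(x), mean(y)
    covariance = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return covariance / (pstdev(x) * pstdev(y))


def empirical_slope(x, y):
    """Slope of the empirical (best-fit) line, taken as SD(y) / SD(x)."""
    return pstdev(y) / pstdev(x)


celsius = [0.0, 25.0, 50.0, 75.0, 100.0]
fahrenheit = [1.8 * c + 32.0 for c in celsius]

# All points fall exactly on one line, so the correlation is 1,
# yet that line is not the identity line: its slope is 1.8.
print(pearson_r(celsius, fahrenheit))
print(empirical_slope(celsius, fahrenheit))
```

A perfect correlation therefore says nothing about whether the points lie on the identity line; that has to be read from the slope and intercept.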
MikeLinacre: The "identity line" problem is solved. The Winsteps documentation for those plots is/was misleading. I'm uploading a revised Help page to www.winsteps.com/winman/comparestatistics.htm. My apologies for causing this confusion.
Monica October 27th, 2006, 6:46am:
Some of the software packages report Cronbach's alpha and a separation index. I am unsure about the difference. I understand that the Cronbach's alpha equation incorporates the variance of the scores of the persons on the test and the variance of the scores of the persons on each item.
The person separation index is calculated from the variance of the ability estimates and the standard errors of measurement for each person. Which statistic should be reviewed? Overall they are similar, as one looks at total score and the other at person ability.
Thanks for asking, Monica. Reliability can be confusing. Reliability is the proportion of "true variance" in the sample distribution to "observed variance". Since we can never know the "truth", several methods of estimating it have been derived. Cronbach Alpha uses a variance decomposition of the raw scores. It generally produces the same number as KR-20, which is based on averaging point-biserial correlations (or so I recall. It's a long time since I looked at their paper.) Rasch reliability is usually estimated from the person measures and their standard errors. In general, raw-score based reliabilities overstate the "true" reliability and Rasch reliabilities understate the "true" reliability.
But once you have the reliability, R, estimated by any method, then the corresponding "separation index" S = square-root ( R / (1-R))
or, if you have the separation index, then R = S^2 / ( 1 + S^2)
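The two formulas above are inverses of each other, which a short Python sketch can confirm. The function names are only illustrative.

```python
import math


def separation_from_reliability(r):
    """S = sqrt(R / (1 - R)), for a reliability R with 0 <= R < 1."""
    return math.sqrt(r / (1.0 - r))


def reliability_from_separation(s):
    """R = S^2 / (1 + S^2)."""
    return s ** 2 / (1.0 + s ** 2)


# Converting a reliability to a separation and back again
# should return the starting value.
for r in [0.5, 0.8, 0.9]:
    s = separation_from_reliability(r)
    print(r, round(s, 2), round(reliability_from_separation(s), 2))
```

For example, a reliability of 0.8 corresponds to a separation of 2, and a reliability of 0.9 to a separation of 3.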
Thank you very much for the question, Monica. As Mike said, reliability is a very confusing concept in measurement theory in the social and behavioral sciences. Fortunately, for most purposes, there is no point in calculating the so-called reliability coefficient. When CTT theorists developed the various estimators of the coefficient, their aim was to calculate/estimate the standard error of measurement (see, for example, Kuder and Richardson's classic 1937 paper in Psychometrika). With the birth of both generalizability theory and item response theory, the coefficient is unnecessary, for well before the coefficient is calculated, the standard error of measurement is known. But this has not been realized by most researchers and practitioners alike. Cronbach emphasized this point only in his posthumous 2004 publication in the Journal of Educational Measurement (My current thought ...). It seems to me that both Cronbach alpha and Wright's separation index should be dropped.
As for the difference between the two, I am afraid that a good knowledge of generalizability theory is needed before the difference can be appreciated. In brief, the separation index is the same as the index of dependability in generalizability theory, but not the generalizability coefficient, to which Cronbach alpha belongs. The greatest difference between the CTT reliability coefficient and the IRT separation index is that the former is about the measure of apparent performance, whereas the latter is about the measure of latent ability. Were the relation between the performance and the latent ability linear, the reliability coefficient would be identical to the separation index. When the relation is nonlinear, they are different.
Omara December 29th, 2006, 1:32am:
I am an Egyptian researcher interested in IRT and DIF.
I need information about data simulation for a comparison study between DIF detection methods. I would very much appreciate your help and cooperation.
Ain Shams University
Faculty of Education
MikeLinacre: Do you need information about how to simulate data, or about how to quantify DIF?
Thank you for your response.
I am asking about both of the elements that you mentioned (simulation and quantifying DIF)
because I need to make a comparison between some DIF detection methods, and I have a SIBTEST program which is able to simulate data. But I am a beginner in DIF studies, so I need help. I would very much appreciate your help and cooperation. If you have any information on DIF studies, please help me.
Thanks for your response. I think you have good experience in IRT and DIF studies. Can I ask you questions about the difficulties which I face in these areas? Here in Egypt there is no one who can help me in this field of study.
Can you take me as a friend in Egypt?
merry christmas and happy new year
Eliab, thank you for responding.
Do you have the SIBTEST documentation? That should help you.
I'm not familiar with SIBTEST.
Can anyone else on the Forum help Eliab?
Monica December 20th, 2006, 7:49pm:
Why is ANOVA of residuals used to determine if there is a main effect? I am running DIF on grade level at school. Should the groups be of nearly equal size, as is usually required for ANOVA?
When examining the ICC and looking at the expected and actual values, what is the minimum number of persons that should be in a class interval? (I'm using RUMM) :)
MikeLinacre: A RUMM expert has told me he will reply ....
Raschmad December 15th, 2006, 9:15am:
I ran PCAR for some locally dependent dichotomous items. The variance explained by measures was 55%. Then I ran PCAR for the same data but this time the dependent dichotomous items were grouped in polytomies. This time the variance explained by measures was 65%.
Can one argue that (65%-55% =) 10% is the effect of local dependence? That is, for these data local dependence introduces a dimension which brings 10% noise to the measurement system and results in smaller variance explained? Can this be used as a method of detecting the strength of local dependence?
Thanks in advance
Raschmad, you have good ideas here.
Yes, dependent items share something in common. For instance, if the same item is included in a test twice, then those two items share the "dimension" of being the same item.
The change in "variance explained" is obviously an effect of combining the items. But the "effect of local dependence" arithmetic is almost certainly non-linear, because the number of observations has also been changed by combining them together into polytomies. The impact of the local dependence is probably much less than 10%. A simulation study would provide an estimate.
Thanks a lot Mike.
Very helpful, as always...
pixbuf December 13th, 2006, 10:40pm:
I'm analyzing a timed test, where unreached items are coded as missing or incorrect. In the original data, these item responses are coded as missing. When I write out the Bigsteps/Winsteps data file, these item responses are coded as 9.
I originally included CODES=0123456 (omitting 9) so unreached were treated as missing. The analysis ran as expected.
Then I changed it to CODES=01234569, with RESCOR=2 and NEWSCR=01234560. When I treat the unreached responses as wrong like this, I get a few items that no one reached, which come out in the item file like this:
23 INESTIMABLE: HIGH | |
(In Bigsteps they show 0 count and 0 score, and a status of -2).
On the other hand, I have other items that no one reached but that were originally coded as incorrect, and they appear like this in the item file:
21 0 1300 7.71 1.84| MAXIMUM ESTIMATED MEASURE
(Notice this line shows 1300 count. And in Bigsteps it has a status of 0. Hey Mike, what happened to the status column in the item files?)
Now, why shouldn't the rescoring make the all-missing items wrong and come out with a non-zero count, and estimated difficulties?
Pixbuf, are you using a partial credit model, such as GROUPS=0? If so, "INESTIMABLE HIGH" means "only one category of the rating scale for this item has been observed, and it looks like the difficulty of this item is high".
A good approach is to debug your data using a standard rating scale model (omit GROUPS=0), and, when you are sure the data coding is correct, then switch over to Partial Credit. Look at Table 14.3 in Winsteps to see the response summaries for each item.
To see item "status" numerically, write out an IFILE= from the Winsteps Output Files menu. It is the third column of numbers.
pixbuf: Yes, it is a partial credit model. Now I see what the problem is. Thanks, Mike.
SSampson December 5th, 2006, 6:45pm: I am interested in looking at growth in learning based on a pre and post test. I have read about rack and stack and it looks like one or the other (or both) will be helpful in doing this. I have read a piece on FIMM from RMT, but I am still trying to get my mind around how to show growth with this method. Does anyone have suggestions on where to find material about rack and stack (or another way of using Rasch analysis to measure change over time?)
Have you administered the same instrument pre- and post- test to your subjects?
If so, you can obtain two measures for each subject, the pre-test measure and the post-test measure. You do this by "stacking" all your data in one long rectangular file in which each subject becomes two cases, the pre- and the post. You can then compare pre- and post-measures on an individual or group basis.
If you are interested in investigating "how much change has there been on which items?", then you "rack" the data. Each subject becomes one case, but with two sets of responses the pre- and the post-. You can then compare subject or group gains on individual items.
A more traditional Rasch analysis is simply to analyze the pre- and post-data as two separate analyses and then to compare the results for the items and the subjects.
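The two layouts can be sketched as simple data reshaping. In this illustration the pre- and post-test responses are held as Python dictionaries keyed by subject; the data, the case labels, and the function names are all hypothetical, and a real analysis would write the resulting rows to the data file of the Rasch program.

```python
def stack(pre, post):
    """Stacked layout: each subject becomes two cases (rows), same items."""
    rows = []
    for person in pre:
        rows.append((f"{person}_pre", pre[person]))
        rows.append((f"{person}_post", post[person]))
    return rows


def rack(pre, post):
    """Racked layout: each subject is one case with two sets of responses."""
    return [(person, pre[person] + post[person]) for person in pre]


# Hypothetical scored responses to the same three items.
pre = {"A": [1, 0, 0], "B": [0, 0, 0]}
post = {"A": [1, 1, 0], "B": [1, 0, 1]}

for row in stack(pre, post):
    print(row)  # 4 cases x 3 items: compare pre- and post- person measures
for row in rack(pre, post):
    print(row)  # 2 cases x 6 items: compare pre- and post- item difficulties
```

Stacking keeps one set of item difficulties and yields two measures per subject; racking keeps one measure per subject and yields two difficulties per item.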
Monica November 13th, 2006, 11:58pm:
I am looking at 2 DIF reports, (a) grade 3/4 and (b) grade 4/5 comparisons, for a mathematics test. 1. I would like to graph the item difficulties for each grade, 3, 4 and 5, onto a common scale along with the item difficulties for the entire sample. What is the best way of achieving this? The program I am using produces deltas and adjusted deltas for each grade, the difference between adjusted deltas, and the standardised difference by grade within each DIF analysis. 2. How are the adjusted deltas and standardised difference calculated?
Regarding your 1st question if I understand it correctly you want to cross plot item difficulties from two samples (grade 3 vs grade 4) on x and y axes. If this is the case you can easily use Excel.
Click on chart wizard. Choose XY scatter. Click Next. Click in the Data Range box and then select the two columns of data that you want to cross plot. Next. Next. As new sheet. Finish.
The 2nd question I don't know.
Monica: I would like to plot the item difficulties for grade 3, 4, 5 and overall using the entire sample on one graph, but I have 2 different sets of DIF analysis information to combine.
Monica, thank you for asking. You have the full facilities of Excel available when you are producing DIF plots. So you copy the data from one Excel worksheet to another and you can plot as many lines as you like on one scatterplot.
Excel is somewhat of a challenge to start with, but it is very powerful.
Thanks for your responses. I may not be explaining myself...
I have 2 DIF reports, 1 for grades 3/4 the other for grades 4/5 and the item difficulties for the entire sample. I would like to graph all of them on a single graph, items on the x axis and difficulty on the y.
For each DIF report, each item has the following statistics for each grade level (3/4 or 4/5): deltas, adjusted deltas, difference between deltas and standardised difference.
I was expecting the deltas for grade 4 to be the same over both reports. They agree to within one decimal place. For example, item 1 has a difficulty of -2.17 in the 3/4 report whereas it has a difficulty of -2.15 in the 4/5 report. Is this a problem? Why is there a difference? I plan to take grade 4 as the reference group and will plot the differences 3-4 and 4-5. Can I place these difficulties on the same axis as the difficulties for the entire sample?
How are the adjusted delta and standardised difference calculated? If it's any help, these statistics are produced by Quest when requesting a DIF analysis.
MikeLinacre: Monica, you need a Quest user to help with this - Any Quest users out there?
ismail13dz: Hi, I'm new here. Can you help me?
What do you need?
Monica November 27th, 2006, 11:20am:
I have calculated person estimates for my mathematics assessment used to measure students' conceptual understanding of fractions. The person ability estimates range from -3.03 to around 4.24. What is the best method of scaling so as to present this information back to teachers in a more standard form, say within a range somewhere between 0 and 100?
If you are using WINSTEPS, use the following formulas to change the unit and origin of the scale.
USCALE= (wanted range) / (current range)
UMEAN= (wanted low) - (current low * USCALE)
In your case, USCALE = 100 / (4.24 - (-3.03)) = 100 / 7.27 = 13.755 and UMEAN = 0 - (-3.03 * 13.755) = 41.68.
Put these two commands in your control file and rerun the analysis.
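The arithmetic behind the two values can be checked with a short Python sketch. It assumes that a rescaled measure equals (logit measure * USCALE) + UMEAN, which is how the formulas above are built; the function names are only illustrative and are not part of Winsteps.

```python
def rescaling_values(current_low, current_high, wanted_low, wanted_high):
    """Return (USCALE, UMEAN) for mapping the current range to the wanted range."""
    uscale = (wanted_high - wanted_low) / (current_high - current_low)
    umean = wanted_low - current_low * uscale
    return uscale, umean


def rescale(measure, uscale, umean):
    """Assumed mapping: rescaled measure = logit measure * USCALE + UMEAN."""
    return measure * uscale + umean


# Range of person estimates reported in the question, mapped onto 0-100.
uscale, umean = rescaling_values(-3.03, 4.24, 0.0, 100.0)
print(round(uscale, 3), round(umean, 2))

# The lowest and highest estimates should land on 0 and 100.
print(round(rescale(-3.03, uscale, umean), 2))
print(round(rescale(4.24, uscale, umean), 2))
```

If later samples might score outside the current range, choosing a wanted range narrower than 0 to 100 (for instance 10 to 90) leaves room at both ends.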
SUSANA_GARCIA November 16th, 2006, 12:13pm:
I AM NEW IN THIS FORUM.
MY QUESTION IS ABOUT HOW TO SCORE A RASCH DERIVED QUESTIONNAIRE.
WE HAVE REDUCED ONE QUESTIONNAIRE USING RASCH. THE REDUCED FORM HAS TWO MAIN FACTORS, EACH COMPOSED OF 10 QUESTIONS WITH 0 TO 5 RESPONSE OPTIONS.
EACH FACTOR SCORE (DIRECT SUM) RANGES FROM 0 TO 50. BUT THIS WOULD BE A DIRECT SUM.
THE WINSTEPS PROGRAM GIVES A SCORE TABLE THAT TRANSFORMS THE DIRECT SUMS INTO UNIT MEASURES AND PERCENTILES.
I AM A BIT CONFUSED IN THIS RESPECT. HOW DO I USE THAT TABLE?
HOW ARE RASCH DERIVED QUESTIONNAIRES SCORED?
THANK YOU FOR YOUR HELP!!
Thank you for your question, Susana. What you do next depends on the purpose for your questionnaire. You have used Rasch analysis to discover that it has two factors (dimensions, latent variables). You have raw scores and Rasch measures on each factor.
For many purposes, the raw scores are enough, particularly if every one responded to every question. The Rasch analysis has confirmed that the scores mean what they say. A higher score means more of that factor (latent variable).
But the Rasch measures can take you further. They enable you to draw pictures, "maps", of what is going on (like Winsteps Table 1). They are also the numbers to use in subsequent analyses, such as regression or ANOVA.
Rasch measures are used in the same way that raw scores are used. Except that raw scores produce skewed pictures and findings, but Rasch measures produce linear, "straight", pictures and findings.
Does this help?
DEAR MICHAEL LINACRE
THANK YOU SO MUCH FOR YOUR HELP AND PROMPT REPLY.
I FOUND YOUR RESPONSE VERY USEFUL. I HAVE DECIDED THEN TO REPORT THE DIRECT SUM AS THE TOTAL SCORE IN EACH SUBSCALE. I WILL ALSO ADD WHAT YOU MENTIONED THAT NOW THESE SCORES HAVE LINEAR PROPERTIES, SO A HIGHER SCORE INDICATES MORE OF THE CONSTRUCT.
ari November 6th, 2006, 7:59am:
Hi, I'm Ari. I'm new to Rasch. ;)
I'm at the first stage of analyzing my data. However, I'm not so clear on how to interpret the PCA output. What does it mean by "Factor 1 extracts ??? units out of ??? units of TAP residual variance noise." What is an acceptable number of units extracted?
How to read the "Table of STANDARDIZED RESIDUAL variance (in Eigenvalue units)". Please help.
MikeLinacre: Thank you for asking for our advice, Ari. There is some explanation at www.winsteps.com/winman/principalcomponents.htm which is also in the Winsteps Help file. Does this start to answer your questions?
Thanks for your prompt reply...I'll work on it.
Raschmad November 14th, 2006, 10:30am:
In WINSTEPS DIF and DPF plots there are local measures, relative measures and t-values. What are these and which should be considered?
When crossplotting measures using WINSTEPS, there is "plot identity line" and "plot empirical line". What is the difference and which should be considered?
MikeLinacre: Thank you for the question, Raschmad. Please take a look at the Winsteps Help, also at www.winsteps.com/winman/table30dif.htm and related webpages - if you have questions not answered there, please ask them here. Then I can respond to you and also update Help to include that information.
Raschmad November 8th, 2006, 9:35am:
I have 4 polytomous items each with different categories between 40-61.
I have used Masters' partial credit model to analyse them.
These items are independent texts with blanks to be filled by test-takers, and each text is treated as one polytomous item. The texts are totally independent of each other.
One of the texts (items) has very low infit and outfit MNSQ which are significant too.
I'm trying to come up with a justification. Why Overfit?
The items are not dependent,i.e. response to one cannot clue response to another.
What do you folks think?
Dear Raschmad, the average mean-square is usually near 1. Is the average mean-square of your items near 1? If so, the analysis has proceeded correctly.
From your description, each item is the score on a text. So the dependence would be across persons, not between texts. We saw this in a reading comprehension test in the USA. The content of the text affected the children's reading styles, and so affected the item fit statistics.
The average mean square is near 1 and as you mentioned each item is the score on a text.
When you say that the dependence is among persons does it mean that they have cheated?
Has your similar experience with the reading test been published?
Psychometrically "dependent" people have similar backgrounds, educational specialties, interests, etc. For instance, a text relating to "cowboys" will usually be of great interest to boys, but little interest to girls, of the same literacy level. Similarly, for girls with a text about clothes and fashions.
Another source of dependency is the curriculum. For instance, different teaching methods produce strengths (and so psychometric dependencies) in different areas. This is one of the challenges in choosing a text for a reading comprehension test.
ERMEstudent November 6th, 2006, 6:19pm:
I was wondering if anyone knows of any articles or reports discussing Rasch application in a classroom setting. I am looking to write a paper concerning this topic, but have found little info. If anyone could provide any help, it would be greatly appreciated.
MikeLinacre: A good person to ask would be William Boone at Miami of Ohio University: http://netapps.muohio.edu/phpapps/directory/?query_type=simple&query_operator=equals&query_string=boone&show=boonewj&start=1
Jade October 24th, 2006, 4:34pm:
I am examining 10 raters' behavior on an English speaking test using FACETS. There are 10 items in the test. The raters give 10 item scores and an overall impression score. I am new to FACETS. From the few studies I have read, it seems to me most people only use item scores in their analysis. I ran inter-rater reliability in FACETS but I have been wondering what to do with the overall impression score. In my case, the raters give the overall score after they rate all items. The overall score, therefore, also reflects their decisions. What do researchers normally do when they have an overall score in addition to item scores in FACETS analysis? Can I treat it as another item score? What might it affect if I do so? :-/
I appreciate your advice.
Jade, "overall impression" appears to be an 11th item on your instrument. This is a very common situation on customer satisfaction and similar surveys.
First treat it as a separate item. We expect that it summarizes the other 10 items so should considerably overfit (mean-square much less than 1.0). Compare the size of its mean-square to the size of the mean-squares of the average of the other items. This will tell you how much new information the "overall impression" contributes, relative to its expectation of 1.0.
If the mean-square is very small, then ditch the overall impression item. It is merely acting as a substitute for the total score on the other items.
If its mean-square is noticeable, then leave in the "overall impression" item. It is somewhat redundant, so inflates test reliability and lowers person standard errors, but it does no harm to inferences drawn from the person measures. Usually, it will slightly inflate inter-rater reliability indices.
Jade: Thanks. I like this forum. Very helpful for beginners like me!
Raschmad October 9th, 2006, 10:16am:
Some observations about the appropriateness of the Rasch model for locally dependent items:
Some people say that Rasch is not an appropriate methodology for detecting dimensionality when the items in a test are locally dependent. However, I don’t think that this is true. Locally dependent items increase predictability and make the data set more Guttman-like.
This may result in overfit (small MNSQ < 1), and overfit doesn’t have anything to do with dimensionality; underfit, that is large MNSQ (> 1), is the indicator of multidimensionality.
Local dependency does not push MNSQ above 1; it can push it below 1, which means lack of randomness and not multidimensionality.
However, local dependency may slightly affect person measures.
So the bottom line is that as long as the purpose of the study is to detect dimensionality Rasch can be used for locally dependent items.
What do you think?
Thanks for asking ....
If the local dependence covaries positively with the Rasch dimension (e.g., math and reading), then the data become locally too predictable and the mean-squares go below one. If the local dependence covaries negatively with the Rasch dimension, then the data become less predictable from the Rasch perspective, and the mean-squares go over 1.0. Examples of this are the "rats" and "cans" items in the Liking for Science data. Both forms of local dependence can be detected by analysis of Rasch residuals.
So, I was mistaken.
In case the Rasch dimension covaries even reasonably positively with the local dependence, the data become somewhat predictable and this brings the infit mean squares down, i.e., near 1 and sometimes below 1. Therefore, it gives a fake picture of unidimensionality.
If the Rasch dimension covaries negatively with the local dependence then we will have high infit mean squares and consequently a fake picture of multidimensionality.
In any case, local dependence masks the true dimensionality of the data.
MikeLinacre: Good point! "Dimensionality" is difficult to distinguish from "local dependence". For instance, "arithmetic word problems" display local dependence caused by their shared "reading" dimension. A secondary dimension always causes some type of local dependence among the items that load on that secondary dimension.
Clarissa October 17th, 2006, 9:12am:
I'm Clarissa and new to Rasch measurement. I have just started analysing my M-C reading test with WINSTEPS.
I read in WINSTEPS Help that "high outfit mean squares are the result of random responses by low performers" (I think they could also be the result of careless missing of easy items by good performers).
"High infit mean squares indicate that the items are mis-performing for the people on whom the items are targeted".
What do low outfit mean squares mean?
What do low infit mean squares mean?
I couldn't find these in the HELP.
Low mean-squares (less than 1.0) indicate that the responses are too predictable. But do not worry about low mean-squares until you have examined the high mean-squares. This is because the average of the mean-squares is usually forced to be near 1.0. Here is the general rule-of-thumb:
Interpretation of parameter-level mean-square fit statistics:
>2.0 Distorts or degrades the measurement system.
1.5 - 2.0 Unproductive for construction of measurement, but not degrading.
0.5 - 1.5 Productive for measurement.
<0.5 Less productive for measurement, but not degrading. May produce misleadingly good reliabilities and separations.
In the Winsteps Help, low mean-squares are called "muted". See the Table at the bottom of "Misfit diagnosis" - also at https://www.winsteps.com/winman/diagnosingmisfit.htm
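The rule-of-thumb table above can be turned into a small lookup for screening a list of fit statistics. This is only a sketch in Python; the function name is hypothetical, and because the table lists 1.5 in two rows, the handling of the exact boundary values (1.5 and 2.0) is an assumption made here.

```python
def interpret_mean_square(mnsq):
    """Map a mean-square fit statistic to the rule-of-thumb interpretation."""
    if mnsq > 2.0:
        return "distorts or degrades the measurement system"
    if mnsq > 1.5:  # assumption: exactly 1.5 is counted as productive
        return "unproductive for construction of measurement, but not degrading"
    if mnsq >= 0.5:
        return "productive for measurement"
    return "less productive for measurement, but not degrading"


# Made-up mean-squares, screened from high to low as the reply recommends.
for value in sorted([0.42, 0.97, 1.62, 2.35], reverse=True):
    print(value, "->", interpret_mean_square(value))
```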
Monica: Is it better to examine MNSQ or the infit/Outfit t statistic?
The MNSQ reports how large is the distortion in the measures. The t-statistic reports how unlikely is that size of distortion if the data fit the Rasch model.
There is a direct numerical relationship between the two, based on the number of observations (degrees of freedom), so they both contain the same information, expressed in different ways.
Rasch experts are divided as to which takes precedence. Those experts concerned about data-model fit tend to focus on the t-statistics. Those concerned about measure utility tend to focus on the mean-squares. (As you can detect, I'm a utilitarian.)
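One common way to express the numerical relationship between a mean-square and its standardized value is the Wilson-Hilferty cube-root approximation. The sketch below, in Python, assumes the mean-square behaves like a chi-square divided by its degrees of freedom, so that its model standard deviation is q = sqrt(2 / df). Rasch software computes the model variance from the data, so reported values will generally differ from this simplification; the function name is hypothetical.

```python
import math


def zstd_from_mnsq(mnsq, df):
    """Approximate standardized fit from a mean-square (Wilson-Hilferty).

    Assumption: the model standard deviation of the mean-square is
    sqrt(2 / df), as for a chi-square statistic divided by df.
    """
    q = math.sqrt(2.0 / df)
    return (mnsq ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0


# The same size of distortion becomes more "significant" as observations increase.
small_sample = zstd_from_mnsq(1.3, 30)
large_sample = zstd_from_mnsq(1.3, 300)
print(round(small_sample, 2), round(large_sample, 2))
```

This illustrates why the two statistics carry the same information only once the number of observations is known: a mean-square of 1.3 is unremarkable with 30 observations but highly unlikely under the model with 300.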
Monica October 17th, 2006, 1:06am:
I am analysing data from a test used to examine students' understanding of fractions. All the questions are open-ended rather than multiple choice, as the test is a pilot. How are student ability and item difficulty affected if the answers are coded: correct, incorrect, missing (should the missing be coded as incorrect)? Are there any papers that examine this?
Thank you for joining us. Missing responses are usually of two types:
1. Missing because not administered or irrelevant. In estimating abilities, etc., these responses are usually skipped over, so that the person is evaluated on a shorter test which omits those responses.
2. Missing because the person does not know the answer, does not have the time, etc. These are usually coded as incorrect and included in the data and so used to estimate the person's ability.
If missing items are skipped over, the person's ability is based on the remaining responses. If they are coded as wrong, the person's ability will be estimated lower. If they are coded as right, the person's ability estimate will be estimated higher.
There are papers examining the impact of coding "missing because not reached" items as not-administered vs. incorrect. The conclusion is that coding as incorrect penalizes the slow, careful worker, and encourages fast, sloppy work as well as guessing. There is a cultural aspect also. Some cultures discourage guessing on the remaining items as time runs out, other cultures encourage it.
Thanks for your response. It is very helpful. I have administered the same instrument to students from Grades 3 - 6. Just to complicate things further, there has been a change in the syllabus in the 2002 year and the fraction content has increased, although the schools could roll-in the syllabus from 2003-2005. This creates a number of problems: (a) There are questions for example, that Grade 3 students are unable to complete as they have not been taught the particular aspect of fractions, (b) Some of the Grade 5, 6 students have also not been taught particular aspects due to the change in syllabus due to school based decisions or teacher knowledge. Hence I coded the omissions as non-responses and the student abilities calculated accordingly. Do you know of specific references or could you point me in the right direction.
Cheers, Monica :)
Not sure what you are looking for, Monica. Do you need references relating to ability estimation when data are missing, or relating to dropping irrelevant or "bad" items or responses during the estimation process, or what?
In practice, item calibration and person measurement are often asymmetric. For instance, when calibrating an item bank to be used in a CAT environment, it is advisable to drop out the responses of students who were obviously "goofing off", even though those responses must be included when reporting those students' "ability" estimates. Similarly, when calibrating item difficulties for a timed test, it is advisable to treat as "not administered" the lack of responses by students who did not reach an item because time ran out - even though those responses will be treated as "wrong" when reporting the student abilities.
I am after some academic references on coding data as no response rather than incorrect.
From your comments, am I correct in understanding that the item difficulty can be calculated with those items coded as missing. However, when calculating student ability, these items are recoded as incorrect?
If your purpose is to obtain valid estimates of item difficulty, then you may want to code irrelevant responses (lucky guesses, carelessness, response sets, etc.) as "not administered".
If your purpose is to report person ability estimates, then you are usually forced to include every response, no matter what its motivation.
So there can be a 2-stage process:
1. with the clean data, estimate the item difficulties - then anchor them
2. with the complete data and anchored item difficulties, estimate the person measures.
Hope this helps ...
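The second stage of the process above, and the effect of coding a missing response as "skipped" versus "wrong", can be illustrated with a small sketch. This is Python, not Winsteps; it assumes dichotomous items, the item difficulties are hypothetical anchor values standing in for a stage-1 calibration, and the function name is made up. Persons are estimated by maximum likelihood with the item difficulties held fixed.

```python
import math


def person_measure(responses, difficulties, max_iter=50, tol=1e-6):
    """Maximum-likelihood ability (logits) with item difficulties anchored.

    responses: 1 = right, 0 = wrong, None = not administered (skipped over).
    """
    pairs = [(x, d) for x, d in zip(responses, difficulties) if x is not None]
    raw = sum(x for x, _ in pairs)
    if raw == 0 or raw == len(pairs):
        raise ValueError("extreme score: no finite maximum-likelihood estimate")
    theta = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - theta)) for _, d in pairs]
        expected = sum(probs)                     # expected raw score at theta
        info = sum(p * (1.0 - p) for p in probs)  # test information at theta
        step = (raw - expected) / info            # Newton-Raphson step
        step = max(-1.0, min(1.0, step))          # damp large jumps
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical anchored difficulties standing in for stage 1.
anchored = [-1.0, -0.5, 0.0, 0.5, 1.0]
skipped = person_measure([1, 1, 0, None, None], anchored)  # last two not administered
as_wrong = person_measure([1, 1, 0, 0, 0], anchored)       # last two coded incorrect
print(round(skipped, 2), round(as_wrong, 2))
```

The same three observed responses give a lower ability estimate when the two unanswered items are scored as wrong, which is the asymmetry described in the replies.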
This has helped a lot.
I am pleased I have stumbled onto this forum. It's great.
Thanks muchly, Monica
Christine October 14th, 2006, 3:18am:
My dissertation co-chair left our department, and I was counting on him to help me perform Rasch analyses on my dissertation data. I'm feeling lost, and at this point I'm not even sure when to finish collecting data. Could you help me with this quick question?
I'm planning to perform rating category analysis, find the person and item reliability and separation, and construct a person-item map. Approximately how many administrations of a 30-item measure (5 possible responses from Never to Always) would be needed to do a good Rasch analysis?
Thank you so much for any help you can provide,
Linacre (1994) states that a sample size of 27-61 is appropriate, depending on the targeting. That is, if the test is well targeted on the students, a sample of 27 is enough; if the test is off-target, 61 is OK.
For polytomies like yours, he says the sample could be even smaller because there is more information in polytomies. But you need at least 10 observations per category.
Read this useful paper in the original.
Hope this helps.
Christine: thanks- that's very helpful!
Xaverie October 11th, 2006, 9:35pm:
I want to transform the raw scores from my Likert scale survey into a linear variable and use it in multivariable analysis. My scale has 3 possible responses (0 = Never, 1 = Sometimes, 3 = Often), and I have a sample size of 500. The problem is that my person reliability value is 0.65 and separation = 1.4, which I believe indicates that my scale can only accurately differentiate between 2 levels of the trait being measured.
So I'm not sure what I should do, given the low overall person reliability and separation.
I'm considering the following options:
1. Make the scale dichotomous by collapsing the top 2 categories together (sometimes, often) and then transform the scale to be linear. I'm considering doing this because the person reliability index indicates that the scale can only differentiate between two categories of the trait being measured, however I will be losing information about people who responded 'often' - approximately 8 - 15% of test-takers on each survey item. So I'm considering the other option...
2. Keep all 3 categories and transform.
When I keep the data as is, it fits the Rasch model extremely well. The Rasch-Andrich threshold curves show the categories are well ordered, the infit and outfit statistics are all between 0.8 and 1.2 MNSQ, unidimensionality is present, everything looks good, EXCEPT for the overall low person reliability value. Is the low person reliability a deal-breaker?
3. My third choice is to not even try to transform the scale to be linear, but rather just categorize test-takers into meaningful groups based upon their raw test score. I'd rather not do this as I lose valuable information about the test-takers, but I'd rather have a valid, if crude, measure of the underlying trait.
Any advice would be greatly appreciated.
I don't think that your instrument is problematic. Reliability is not associated with the instrument; it is a characteristic of the test scores or person measures for the sample you are testing. It is, as Mike has put it, "person reliability", i.e., the reproducibility of the person ordering if the test is administered again. The question is how likely it is to get the same person rank order if the test is given again.
"Low values indicate a narrow range of person measures, or a small number of items. To increase person reliability, test persons with more extreme abilities (high and low), lengthen the test. Improving the test targeting may help slightly".
WINSTEPS Help from www.winsteps.com
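The person reliability of 0.65 and separation of 1.4 quoted in this thread are two expressions of the same quantity. A minimal sketch of the usual conversions, in Python, follows; the function names are hypothetical, and the strata formula (4G + 1) / 3 is the conventional one for counting statistically distinct levels, which is an addition here rather than something stated in the posts.

```python
import math


def separation_from_reliability(reliability):
    """Separation G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation):
    """Reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


def strata(separation):
    """Number of statistically distinct levels, (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


g_from_r = separation_from_reliability(0.65)
r_from_g = reliability_from_separation(1.4)
levels = strata(1.4)
print(round(g_from_r, 2), round(r_from_g, 2), round(levels, 1))
```

Under these formulas a separation of 1.4 corresponds to roughly two distinguishable levels, which matches the poster's reading of the output.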
Raschmad October 4th, 2006, 8:25am:
Thanks for your comments on this issue,
Disordering of categories or thresholds in the case of rating scales can be interpreted as narrow definitions for categories, or the introduction of too many options that people cannot distinguish among.
I’m just wondering in case of polytomous EDUCATIONAL test items such as maths and biology how can category and threshold disordering be interpreted.
One way to attempt to get more useful information out of multiple-choice items is to code the distractors, A, B, C, D as a rating scale, e.g., 3=Correct answer, 2=almost correct answer, 1=somewhat correct answer, 0=clearly wrong answer.
Category disordering indicates that your theory about the scoring of the distractors does not agree with the behavior of your sample.
Threshold disordering indicates that one or more of your intermediate distractors is relatively rarely chosen by the sample. This is typically observed with MCQ questions.
beth October 3rd, 2006, 10:52am:
I have an extremely basic question. I was hoping someone could tell me if Rasch analysis and WINSTEPS is suitable for my data before I book onto a course. I am trying to explore a pool of 'disability' items - I am trying to create a unidimensional scale that covers the construct. The items are Likert-like mainly 1-5 but some 1-3 and many with different response structures. I have fitted Samejima's graded response model in MULTILOG but am I right that WINSTEPS would give me more item fit info and allow me to detect if I need extra items to cover the construct?
Any advice much appreciated,
MikeLinacre: Rasch analysis would be ideal for your data. In Winsteps, give each different response structure its own identifying code, and then specify which response structure belongs to which item with the ISGROUPS=. The item maps will tell you what part of the latent "disability" variable each category of each item targets. You can then identify where the coverage is thin. Statistically this is done by examining the test information function (person measure standard errors).
Many thanks for your reply - I have now booked onto your online course and looking forward to getting to grips with Rasch analysis.
Raschmad October 2nd, 2006, 8:51am:
Thanks for your comments.
What defines the origin of a scale?
Why do two different sets of item calibrations or person measures have different zeros?
What does it mean?
Wright and Stone (1979), when they talk about crossplotting two sets of measures, refer to a "single translation to establish an origin common to both sets of items" (p. 95). What is this single translation?
The usual convention in Rasch analysis is set the local origin (zero-point) at the average difficulty of the current set of items. So each set of items defines its own zero point.
Think of temperature scales: Celsius, Fahrenheit and Kelvin have different 0 points. When you compare measures from different tests with similar characteristics, it is like comparing Celsius and Kelvin temperature scales. You need a "translation constant" to convert numbers on one scale into numbers on the other.
The process of determining the value of the "translation constant" or "equating constant" is called "test equating".
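For two tests that share some items and have the same unit (logits from tests with similar characteristics), a simple estimate of the translation constant is the mean difference in difficulty of the common items. The sketch below is in Python with made-up item names and difficulties; the function name is hypothetical, and it assumes the common items have already been checked on a cross-plot and follow a line of slope 1.

```python
def translation_constant(form_a, form_b):
    """Mean (B - A) difficulty difference over the items common to both forms."""
    common = sorted(set(form_a) & set(form_b))
    if not common:
        raise ValueError("no common items: the two forms cannot be linked this way")
    diffs = [form_b[item] - form_a[item] for item in common]
    return sum(diffs) / len(diffs)


# Hypothetical item difficulties (logits), each set centered by its own analysis.
form_a = {"i1": -1.0, "i2": 0.0, "i3": 1.0, "a4": 0.5}
form_b = {"i1": -0.4, "i2": 0.5, "i3": 1.6, "b4": -1.0}

constant = translation_constant(form_a, form_b)
# Adding the constant to any Form A measure expresses it on the Form B scale.
a4_on_b_scale = form_a["a4"] + constant
print(round(constant, 3), round(a4_on_b_scale, 3))
```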
seunghee September 22nd, 2006, 1:29am:
Dear Dr. Linacre,
Thanks for your kind reply. I have another question. My data showed that the person reliability and separation index are low across factors. So I identified the misfitting persons, eliminated them, and then reanalyzed the data. However, the person reliability and separation index did not improve much (e.g., real person reliability .54 -> .56, separation 1.09 -> 1.14 in the case of factor 1). I guess this reflects sample (healthy) and response (social desirability) characteristics. If there might be other reasons, please let me know. Have a good day. seunghee
* RSM fit statistics (1st analysis)
               Person     Item
Infit MNSQ       1.00     1.08
Outfit MNSQ       .94      .94
Reliability       .54     1.00
Separation       1.09    15.71
Infit MNSQ        .99     1.01
Outfit MNSQ       .99      .99
Reliability       .59      .99
Separation       1.20     9.58
Infit MNSQ       1.01     1.00
Outfit MNSQ      1.00     1.00
Reliability       .40      .95
Separation        .82     4.44
Infit MNSQ        .97     1.04
Outfit MNSQ       .99      .99
Reliability       .45     1.00
Separation        .91    19.55
The chief influences on "test" (sample) person reliability are:
1. Sample "true" standard deviation
2. Test length
3. Number of categories per item
4. Sample - test targeting.
Person misfit usually has little influence.
Your item reliability is high (reported as 1.0) so your sample size is large enough.
Your person reliability is low, so your sample has a narrow spread and/or your test has few items.
seunghee September 21st, 2006, 9:14am:
Dear Dr. Linacre,
A questionnaire with 23 items, using 5 point Likert scale was developed to assess QoL for the clinical sample and suggested to be available for healthy population as well. However, the items of my study showed ceiling effect for healthy sample (N=1425). I know that the average measures and thresholds should monotonically increase. The result showed no categories being disordered, but the logits of average measure and threshold seem to have some problem. Please let me know how to interpret my result. Thank you for your time. Seunghee, South Korea
* A part of the result
Category label       Average measure   Infit   Outfit   Step measure
0 (never)                 -29.26        1.01    1.01       None
1 (almost never)          -16.10         .97     .69      -12.66
2 (sometimes)              -9.31        1.04    1.02       -8.91
3 (often)                  -1.81        1.10    1.21        7.61
4 (almost always)           1.99        1.53    1.59       13.97
Thank you, Seunghee. Here's my view on your numbers, but others of you are welcome to post your views!
"Average measure" is a description of the use of the rating scale by your sample. It indicates that, on average, higher measure persons selected higher categories. Good ... this is what we all want.
Infit and Outfit: most are close to 1.00 or a little below: good! The people who chose each category accord with the people we would expect to choose those categories. Somewhat problematic is the Outfit = 1.59 for category 4. This indicates that persons with low measures unexpectedly selected this high category. You might want to check the data file to be sure that those respondents are behaving as intended (data entry errors, response sets, misunderstanding of reversed items, etc.)
Step measures: these indicate the structure of the category probability curves in as sample-independent a manner as possible. They are advancing, and show a structure of a "range of hills". Good!
Conclusion: These results are almost as good as it gets ....
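The two ordering checks described in this reply (average measures advance with category, and step measures advance) are easy to automate when there are many items or rating scales to screen. A minimal sketch in Python, using the numbers posted above; the helper name is hypothetical.

```python
def is_strictly_increasing(values):
    """True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Values from the posted table, categories 0 to 4.
average_measures = [-29.26, -16.10, -9.31, -1.81, 1.99]
# The bottom category has no step measure, so there are only four.
step_measures = [-12.66, -8.91, 7.61, 13.97]

averages_ordered = is_strictly_increasing(average_measures)
steps_ordered = is_strictly_increasing(step_measures)
print(averages_ordered, steps_ordered)
```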
van September 20th, 2006, 5:46pm:
I'm thinking of using a 5-point or 9-point Likert scale to survey individual responses to a questionnaire. I would like to turn each ordinal point of the scale into a quantitative measure or score and stumbled on this topic of Rasch model.
Is this the right approach? Can I use the Rasch model (e.g. MiniStep) to input individual responses and use the model to obtain quantitative scores? If so, how do I interpret the output from the model?
MikeLinacre: Welcome, van! "Right approach" ? Yes. "Can I use ..." ? Yes. "How do I interpret ..." ? The best introductory book is "Applying the Rasch Model" by Bond & Fox. Chapter 6 talks about Likert scales. "5-point or 9-point" ? Can your respondents discriminate 9 levels of agreement (or whatever) ? Providing too many response categories provokes idiosyncratic category selection, so increasing the amount of irrelevant noise in the responses. Try out the questionnaire on a couple of your friends first, and discuss with them the wording of the questions and how they went about choosing the response category .... Wishing you all success ... !
Raschmad September 18th, 2006, 11:36am:
When I analysed some items for the high-ability group, all the items showed good fit. When the same items are analysed for the low-ability group, one item misfits.
When some more items are added to the test the fit picture changes. All the items are fit for the high ability group but some misfit for the low ability group.
The problem is that the item which was misfit in the short test for the low-ability group is now fit for this group in the longer test and an item which was fit in the short test for this group is misfit for this group in the longer test!!!
When I do DIF analysis (high and low ability groups as focal and reference groups) 80% of the items show DIF.
Also the order of the difficulty of the items change in the two groups (for the longer test only).
These findings are disturbing. Could you please tell what they mean about the test or maybe the construct.
MikeLinacre: Raschmad, please give us more information. The situation may not be as bad as it appears. What statistics are you using to indicate misfit, DIF, etc.? And what criteria are you using to decide what items exhibit misfit, DIF, etc.? With large sample sizes, very small discrepancies can be reported as significant. So the first step is to limit your misfit or DIF investigation to those items where the amount of misfit, DIF, etc. is big enough to make a substantive difference to your reported results. You mention a change in item difficulty order. This is the most alarming aspect of what you report, because this threatens construct validity. When you examine the content of the items, does one order make more sense than the other?
I'm using the infit and outfit stats reported by Winsteps.
My criteria for mean squares are 0.75-1.30 (McNamara, 1996) and -2 < Zstd < 2.
One of the misfitting items has an infit mean square of 0.69 and Zstd of -2.1 and outfit MNSQ of 0.72.
The second one has an outfit MNSQ of 1.32 and the third an outfit MNSQ of 1.83 and outfit Zstd of 2.1. The rest of the stats. are OK.
DIF analysis was performed by Winsteps. All the contrasts in the case of DIF items are larger than 1 logit. For deciding whether DIF exists or not I looked at the Prob. Column table 30. The ones smaller than 0.05 were taken as significant.
The sizes of the groups are not really big: one is 77 and the other 83. And many items for the low-ability group have negative PTMEA CORR.!!
The question is why only for the low ability group?
When the two groups are combined the situation is good. The item that was misfit for the low ability group in the short test is still misfit: INFIT (MNSQ=1.55, ZSTD=4.1) OUTFIT (MNSQ=1.50, ZSTD=3.9). And there is only one item with INFIT (MNSQ=0.87, ZSTD=-2.5) and OUTFIT (MNSQ=0.84, ZSTD=-2.5). The rest are OK. And PTMEA CORRs. are all noticeably positive.
The test has 18 items. Four polytomous with 25 categories per item (blanks to be filled) and two reading passages, 14 dichotomous reading items which could be highly locally dependent but were treated as independent items.
Thanks for the further information. Here are some considerations ....
First, before being concerned about low mean-squares (negative Zstd) eliminate those high mean-squares that you plan to remove. This is because the average mean-square is forced to be close to 1.
Second, low mean-squares (negative Zstd) are rarely a threat to the validity of the measures. They merely indicate redundancy in the data. So, unless your purpose is to shorten the test, you can usually ignore these.
Third, the description of the test raises further questions about the analysis. For instance, outfit mean-square of 1.83 with z-std of 2.1 suggests a sample size around 30, but the sample was 77+83 = 160. Why the discrepancy?
Perhaps you would like to email me your control and data file for some specific comments ....
Raschmad September 11th, 2006, 10:21am:
First of all I want to thank those who set up this forum. (it must be Mike). The Rasch mailing list is a place for advanced experts. This place seems to be for novices. Now we can feel free to ask dumb questions!!
In my achievement test I have identified 10 items which show DIF. However, not all the items are biased in one direction, 4 in favour of boys and 6 in favour of girls. I want to adjust person measures for DIF. But I don’t know in which direction.
Four items bring down the girls’ measures and six the boys’ measures.
What should be done in such situations?
Dear Raschmad, thank you for your question. We are all novices in most aspects of Rasch measurement ... but some of us have learned to disguise that fact!
When there are two groups of about the same size, boys and girls, then the overall DIF must approximately average out. So it is a matter of item by item investigation. Is there an "Item x" that is truly a different item for the boys and for the girls? Would it have made more sense to have entered Item x into the data file as "Item x boy" with responses for the boys and missing data for the girls, and "Item x girl" with responses for the girls and missing data for the boys? If so, DIF correction for that item is indicated, otherwise the reported DIF is probably merely statistical accident. Of course, the special-interest-groups will assert that DIF against their group is all real, but DIF in favor of their group is all accidental!
But DIF is a contentious issue ... Does anyone have other perspectives .... ?
What command should I write in Winsteps to delete the biased items?
I want to indicate items 1, 3, 7, 11, 14, 22 as missing (not administered) for group 1 (boys) and items 23, 31, 32, 10 as missing for group 2 (girls). At the end of each data line there is a group indicator (1 & 2) for each person.
There is also a group 3. These are the people who haven't ticked off their sex in the test booklets. How can I get rid of them in the analysis?
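One way to prepare the data described in this question before it reaches the Rasch software is sketched below. This is a Python preprocessing sketch, not a Winsteps command; the record layout (a list of item responses plus a group code) and the function name are assumptions made for illustration. The item lists per group are the ones given in the question.

```python
# Items (1-based positions) to treat as not administered, per group.
DROP_FOR_GROUP = {
    1: {1, 3, 7, 11, 14, 22},   # boys
    2: {23, 31, 32, 10},        # girls
}


def mask_dif_items(records, drop_for_group, keep_groups=(1, 2)):
    """Blank out the listed items per group and omit persons in other groups."""
    cleaned = []
    for responses, group in records:
        if group not in keep_groups:
            continue  # e.g. group 3: sex not reported
        drop = drop_for_group.get(group, set())
        masked = [None if position in drop else response
                  for position, response in enumerate(responses, start=1)]
        cleaned.append((masked, group))
    return cleaned


# Three made-up persons with 32 items each.
records = [([1] * 32, 1), ([0] * 32, 2), ([1] * 32, 3)]
cleaned = mask_dif_items(records, DROP_FOR_GROUP)
print(len(cleaned), cleaned[0][0][:3])
```

Here `None` stands for "not administered"; when writing the data file, it would be replaced by whatever code the analysis program treats as missing.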
Differential Test Functioning
Differential Test Functioning occurs when the DIF among the items is unbalanced. For example, one could examine DIF between adults and adolescents in terms of their substance use disorders. In this case, if the DIF for adults, on balance, were greater than the DIF for adolescents, the overall test would tend to be biased. For example, the adults might appear to have less severe substance use disorders than they really had. If DIF is balanced, the test results are not affected, but the DIF information may still be important theoretically. If DIF is unbalanced, and we still want to measure and compare adolescents and adults on the same scale, then three steps can be taken.
First, anchor the item and rating scale steps for the non-DIF items using the pooled calibrations. The anchoring creates a stable yardstick that assures comparability of the two groups even though some unanchored items will be allowed to “float” or achieve their group-specific calibration.
Second, rerun the analyses separately for adolescents and adults. This provides the group-specific calibrations for the unanchored items. Because the non-DIF items will have been anchored, the adjusted results will be comparable to the non-adjusted results, and adult measures will be comparable to adolescent measures. The result is adult-specific and adolescent-specific calibrations that can be compared to each other because they are anchored on a ruler made from the pooled sample using the DIF-free items as the anchors.
Third, recalculate the person measures since they should have changed because of the group-specific recalibrations. These will be the “truer” age-specific calculations with the bias due to DIF removed.
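As a rough illustration of the third step, here is a sketch of re-estimating one person's measure once the item calibrations are fixed. It assumes dichotomous (0/1) items for simplicity, whereas the example above involves rating scales, and it uses a plain Newton-Raphson maximum-likelihood search. The function names and all calibration values are hypothetical, not output from any Rasch program.

```python
import math
from typing import List


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_person_measure(
    responses: List[int],
    difficulties: List[float],
    tolerance: float = 1e-8,
    max_iterations: int = 100,
) -> float:
    """Maximum-likelihood person measure (logits) for fixed item difficulties."""
    if len(responses) != len(difficulties):
        raise ValueError("responses and difficulties must have the same length")
    raw_score = sum(responses)
    if raw_score == 0 or raw_score == len(responses):
        raise ValueError("extreme scores have no finite maximum-likelihood measure")
    ability = 0.0
    for _ in range(max_iterations):
        probs = [rasch_probability(ability, d) for d in difficulties]
        expected = sum(probs)
        variance = sum(p * (1.0 - p) for p in probs)
        step = (raw_score - expected) / variance
        step = max(-1.0, min(1.0, step))  # damp large Newton steps
        ability += step
        if abs(step) < tolerance:
            return ability
    raise RuntimeError("estimation did not converge")


# Three anchored non-DIF items plus one DIF item with group-specific
# calibrations (all values hypothetical).
anchored = [-1.0, 0.0, 1.0]
dif_item = {"adolescent": 0.5, "adult": -0.5}
responses = [1, 1, 0, 1]

adolescent_measure = estimate_person_measure(responses, anchored + [dif_item["adolescent"]])
adult_measure = estimate_person_measure(responses, anchored + [dif_item["adult"]])
print(adolescent_measure, adult_measure)
```

With the same responses, the person measured against the harder group-specific calibration receives the higher measure, which is the adjustment the third step describes.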
m.langelaan August 29th, 2006, 8:41am:
When should I collapse categories in case of disordered thresholds? Is it necessary that all items have ordered categories?
M.langelaan, the rule is meaning first, statistics second!
What are the definitions of your categories? Does it make sense to collapse them? If the categories are "On a scale from 1 to 10, how much do you like ..." Then collapsing makes sense, because it is unlikely your respondents can discriminate 10 levels of likability.
But if the categories are "Republican, Undecided, Democrat", then collapsing makes no sense, because it is the "Undecided" that are probably the focus of the research.
All categories must be ordered in the sense that "higher category corresponds to more of the latent variable" (and vice-versa). But disordered categories are not the same thing as disordered thresholds. Please see www.rasch.org/rmt/rmt131a.htm
MikeLinacre August 7th, 2006, 1:03am: Welcome ... and please raise any Rasch-related issue .... ;D
chl: Hi, I'm a Chinese student who is interested in designing CAT tests by applying the Rasch model. I read some books written by Grant Henning. I cannot understand his explanations about how to calculate item fit. According to him, the squared standardized residual is calculated with "exp[(2x-1)(b-d)]", in which x is the observed item response, b is person ability, and d is item difficulty. I don't know what the "exp" refers to. I hope you can give me a helping hand and tell me the complete formula for calculating the squared standardized residual. Thank you very much.
x is the observation, 0 or 1
b is the person ability in logits
d is the item difficulty in logits
exp is the exponential function
from the Rasch model:
p = exp(b-d) / (1+exp(b-d))
the residual = x-p
the model variance = p * (1-p)
the squared standardized residual = (x-p)**2 / (p *(1-p))
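The steps listed above can be written out as a short Python sketch. The function name is made up for illustration; the arithmetic follows the four lines above exactly. With these definitions of b and d, the ratio also simplifies algebraically to exp((1 - 2x)(b - d)); whether that matches the form quoted from the book depends on the sign convention used there, which is not checked here.

```python
import math


def squared_standardized_residual(x: int, b: float, d: float) -> float:
    """Squared standardized residual for one dichotomous response.

    x: observation, 0 or 1
    b: person ability in logits
    d: item difficulty in logits
    """
    p = math.exp(b - d) / (1.0 + math.exp(b - d))  # Rasch model probability
    residual = x - p
    model_variance = p * (1.0 - p)
    return residual ** 2 / model_variance


# Hypothetical person (1.0 logits) meeting a hypothetical item (0.5 logits)
for x in (0, 1):
    z_squared = squared_standardized_residual(x, b=1.0, d=0.5)
    closed_form = math.exp((1 - 2 * x) * (1.0 - 0.5))
    print(x, z_squared, closed_form)
```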
chl: Thank you very much. Exp(x)=e**x. Is that right? You solved a big problem for me -- a beginner in CAT design. If possible, perhaps we can cooperate in the field of Rasch measurement.
Yes, exp(x) = exponential (x) = e to the power of x = e**x = e^x
chl: Thank you!
migfdez August 24th, 2006, 2:52am:
When running Tables 7.1.1, 7.2.1 and 7.3.1 in FACETS, at the bottom of these tables I get 2 different separation numbers, 2 reliabilities, 2 chi-squares and 2 significances. Could you please tell me what the difference is between the 2 numbers for each category, and which numbers I need to look at?
Thank you very much,
You will probably find helpful the explanation at www.winsteps.com/winman/reliability.htm
In your first analyses of a data set, the "real" values are the more important ones. After you have assured yourself that all is working as you intend, the "model" value is the number to report.
Separations, reliabilities, chi-squares and significances are all based on the same statistical information. Report the ones that best communicate your findings to your audience. For audiences familiar with conventional reporting of educational and psychological instruments, that is probably a reliability index.
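To illustrate how separation and reliability carry the same information, here is a small sketch of the usual conversion between them, reliability = G^2 / (1 + G^2), where G is the separation. That relationship is assumed from the standard Rasch definitions rather than taken from this post, and the function names are made up for illustration.

```python
import math


def reliability_from_separation(separation: float) -> float:
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    if separation < 0:
        raise ValueError("separation must be non-negative")
    return separation ** 2 / (1.0 + separation ** 2)


def separation_from_reliability(reliability: float) -> float:
    """Separation index implied by a reliability R: sqrt(R / (1 - R))."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in the range [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


print(reliability_from_separation(2.0))
print(separation_from_reliability(0.9))
```

Applying the same conversion to the "real" and to the "model" separation gives the corresponding "real" and "model" reliabilities.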
Seanswf August 18th, 2006, 5:09pm:
I am new to Facets and would like to obtain category statistics and probability curves for each survey item individually. I have obtained the aforementioned data for the total survey. I attempted to enter only one item at a time into the model but received an error. How can I obtain the individual item data?
Thank you for your question, Seanswf.
You write "category statistics and probability curves for each survey item individually" - for this you need to specify that each item has its own rating scale structure. In your Models= statement, put # instead of ? in the position for your item facet.
Then the numbers you want will be in Table 8 and the Graphs menu will have a set of graphs for each item.
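What "each item has its own rating scale structure" means can be sketched numerically. The Python below is not Facets code: it computes category probabilities from a set of thresholds, once with a single set shared by all items (the "?" case) and once with a separate set per item (the "#" case). All threshold values and item names are hypothetical.

```python
import math
from typing import Dict, List


def category_probabilities(
    ability: float, difficulty: float, thresholds: List[float]
) -> List[float]:
    """Probabilities of categories 0..m given thresholds tau_1..tau_m."""
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (ability - difficulty - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


# "?" in the item position: every item shares one set of thresholds.
shared_thresholds = [-1.0, 1.0]

# "#" in the item position: each item has its own set of thresholds.
item_thresholds: Dict[str, List[float]] = {
    "item_A": [-1.0, 1.0],
    "item_B": [-0.2, 0.2],
}

probs_shared = category_probabilities(0.0, 0.0, shared_thresholds)
probs_by_item = {
    name: category_probabilities(0.0, 0.0, taus)
    for name, taus in item_thresholds.items()
}
print(probs_shared)
print(probs_by_item)
```

With item-specific thresholds, the same person-item difference can give quite different category probabilities from item to item, which is why the category statistics and curves are then reported item by item.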
Seanswf: Thanks for the reply Mike. That was the information I needed!
Facet 1 = people
Facet 2 = survey items
I am not sure what the difference is between these two conditions:
(1) Model = ?,#,R5
(2) Model = ,#,R5
When I run Facets, the rating scale statistics in the output are different for the two models. I realize that they should be different. However, I am not sure how to interpret that difference or which model is giving me the information I need. My goal is to "optimize" information gleaned from each survey item by assigning a "new" rating scale which works best for each item.
Thanks for your help!
(1) Model = ?,#,R5
Any person interacts with each item (modeled to have its own rating scale) with the highest rating scale category numbered "5" (or less)
(2) Model = ,#,R5
Ignore the person facet. The data are observations of each item (modeled to have its own rating scale) with the highest rating scale category numbered "5" (or less)
(1) is almost certainly what you want ....
Seanswf: OK thanks again!
migfdez August 16th, 2006, 5:09pm:
I am starting to use FACETS now, so I am not very familiar with it. That is why my question might be very simple.
I have been working on a test in which 15 raters rated 30 students' texts using a rating scale with 4 categories (15x4). However, this is not a very realistic testing situation, since we never have 15 raters rating all the tests.
I have been told there is some kind of matrix in which raters are distributed in such a way that they rate only a few texts, and then their inputs are combined so that we can know how reliable the testing method is (the rating scale), how severe or lenient the raters are and how proficient the test takers are.
Could you help me with this? How do I have to set the specifications file?
Thank you very much,
MikeLinacre: Thank you for asking .... Facets expects incomplete designs. The only requirement is that there be a linked chain connecting every rater and every student. There are some examples at www.rasch.org/rn3.htm
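The "linked chain" requirement can be checked before running an analysis. The sketch below is plain Python, not a Facets feature: it treats raters and students as nodes of a graph, each rating as an edge, and tests whether everything falls into one connected group. The function name and the example designs are hypothetical, and this checks only the chain described above, not any further conditions a program may impose.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Node = Tuple[str, str]  # ("rater", id) or ("student", id)


def is_linked(ratings: Iterable[Tuple[str, str]]) -> bool:
    """True if every rater and student is joined by a chain of ratings."""
    graph: Dict[Node, Set[Node]] = defaultdict(set)
    for rater, student in ratings:
        rater_node = ("rater", rater)
        student_node = ("student", student)
        graph[rater_node].add(student_node)
        graph[student_node].add(rater_node)
    if not graph:
        return False
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary node
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(graph)


# Each rater rates two texts, and the raters overlap on students.
linked_design = [
    ("R1", "S1"), ("R1", "S2"),
    ("R2", "S2"), ("R2", "S3"),
    ("R3", "S3"), ("R3", "S1"),
]
# No overlap at all: two separate islands.
split_design = [("R1", "S1"), ("R2", "S2")]

print(is_linked(linked_design))
print(is_linked(split_design))
```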
migfdez: Thank you very much. This is just the information I was looking for!!!