miet1602 January 20th, 2014, 4:52pm:
I have had to group anchor the item facet in a three-facet design (markers, items, pupils) as one of the items ended up being in a separate subset. The point of the study for which I am doing this facets analysis is to compare the outcomes of using two different mark schemes (one more holistic and one more analytic) on the same items and pupils by the same markers. I have run separate analyses for each mark scheme, getting two sets of parameters to compare.
My question: even though the item facet was group anchored, does it still make sense to compare the parameters for the item facet between the two mark schemes? And if so, does this only make sense for fit statistics, or can e.g. the rank order of item measures also be compared (it is slightly different for the two mark schemes, despite group anchoring)? What about item separation and reliability?
Miet1602, group-anchoring produces comparable fit statistics, but not comparable measures (item difficulties). Measures can only be compared within groups. Facet-level item statistics such as separation and reliability have values that are somewhat arbitrary.
If the item groupings are the same for the two analyses, then item measures for equivalent groups can be compared.
Instead of group-anchoring the items, it may be more useful to apply constraints to the elements of the other facets that are causing the nesting. For instance, if the problem is that different groups of pupils were administered different items, it may make more sense to group-anchor the pupils as being "randomly equivalent". Similarly for the markers. If so, then all the items can be compared, and facet statistics for the items will also be comparable.
Thanks for that, Mike.
Would it be ok to do separate analyses, group anchoring first the items and then the pupils, and compare the measures for the unanchored ones between the two mark schemes? It would be good to be able to compare the items between the two mark schemes, but the pupils also, in terms of range of measures etc.
Nettosky January 22nd, 2014, 11:04am:
Dear prof. Linacre,
I'm interested in applying Rasch analysis to neuropsychological tests in order to assess their dimensionality and targeting. One of these tests is a sequential timed association of symbols to numbers. In this test you have to make as many associations as you can in a fixed amount of time. Every symbol is considered an item: a correct association with a given number is scored 1, and an association with the wrong number is scored 0. I'm worried that the application of Rasch analysis could be problematic (or at least correct) for this kind of test, since all the subjects answer the first 10 items correctly and almost no one responds to the last 10 due to the time limit.
harmony January 20th, 2014, 11:02am:
I'm equating different test forms via common item linking.
IAFILE = *
CONVERGE = L
LCONV = 0.005
When I run the analysis after adding the anchors, I get a subset warning message, and all items in the Item misfit table that are not anchored are labeled "subset 2".
Adding two dummy examinees does not help. (There are connections between the items despite an EDFILE anyway.)
This is what table 0.4 says:
SUBSET 2 OF 44 Item AND 56 Person
Item: 1-5 7-12 14-16 18-32 35 37-50
Unfortunately a known bug in Winsteps 3.80 .... :-(
If your analysis reports no subsets without anchoring, then anchoring cannot cause subsets.
harmony: I suspected that. Thanks for your reply Mike.
ian.blackman January 14th, 2014, 11:49pm:
I have a scale with 1,2,3,4,5... indicating strength of consensus (4 thresholds). The problem is that 3 = neutral category. I use the partial credit model to explore the threshold variations; however, I am concerned that the neutral category diminishes this capacity, as the neutral category is really measuring consensus strength compared to the rest of the scale.
How can I best omit the neutral category for Winsteps analysis, and how would this influence the remaining threshold values?
thanks for any assistance
Thank you, Ian.
In Winsteps, the easiest way is to omit 3 from CODES=
STKEEP=NO ; tells Winsteps to automatically rescore 1245 into 1234
Rasch estimation assumes that the categories form an ordered hierarchy along the latent variable. Then fit statistics and other indicators tell us whether this assumption is correct.
The effect on threshold values depends on the relative frequency of "neutral". After removing "neutral", the other threshold values will probably be somewhat closer together. The item difficulty hierarchy will probably only show minor changes.
Thank you for this excellent response. Being a novice to Winsteps, where do I add the instruction (STKEEP=NO)?
Ian, in Winsteps, the control file is a text file. It can be edited with any text editor (such as NotePad) or word processor. If you use a Word Processor, such as Microsoft Word, be sure to save the file as a "DOS text" file.
STKEEP=NO can be placed anywhere before &END. The order of instructions is not important. I usually place it after CODES=
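To see what the rescoring amounts to, here is a minimal Python sketch of the mapping described above: with 3 omitted from CODES= (that is, CODES=1245) and STKEEP=NO, the retained codes 1, 2, 4, 5 are renumbered 1, 2, 3, 4. The function name is hypothetical, and treating the dropped code as missing (`None`) is an assumption about how codes outside CODES= are handled; this illustrates the recoding only, not Winsteps internals.

```python
def rescore_without_neutral(responses, valid_codes="1245"):
    """Renumber the retained codes consecutively; any other code becomes None (missing)."""
    ordered = sorted(valid_codes)
    # '1'->1, '2'->2, '4'->3, '5'->4 for the default valid_codes
    recode = {code: rank for rank, code in enumerate(ordered, start=1)}
    return [recode.get(r) for r in responses]


# One person's response string, one character per item
print(rescore_without_neutral("1234552"))
```

The neutral responses drop out of estimation rather than being merged into a neighbouring category, which is why the remaining thresholds can move.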
rag January 14th, 2014, 5:47am:
I have been a little confused by a certain aspect of the Rasch model for a while. The sufficient statistics to get person and item parameters are total scores (of both items and persons). All persons (items) with the same raw number of items (persons) correct have the same Rasch score.
Here is where I get confused. Say I've got 6 items in my test, 1,2,3,4,5 and 6. Here are some real item parameters from my test:
Now let's say I've got two respondents, A and B, who both got 1 item correct, and have a Rasch score of 1.12. Person A got item 1 correct, and person B got item 2 correct. Because item 1 is more difficult than item 2, shouldn't that mean that person A's Rasch score should be higher, even though they both got the same number of items correct?
I understand that if person A was really "smarter" than person B, she would have gotten both items 1 and 2 correct. Still, intuitively I feel like the difficulty of an item should have an impact on the person score. Maybe it's just the case where my intuition is wrong.
Thank you for your question, rag.
Let's use your reasoning the other way round. If you get the easy item wrong, doesn't that make you less smart than someone who gets it right? In Rasch, the plus for getting a harder item correct = the minus for getting an easier item incorrect. http://www.rasch.org/rmt/rmt154k.htm
In practical situations, often success on easier items (for instance, knowing how to press the brake pedal when driving a car) is much more important than success on more difficult items (for instance, knowing how to change gear with a manual gear box). In tests for operators of nuclear power stations or hospital nurses, we would be alarmed about operators/nurses who failed on easier items, regardless of their success on more difficult items.
However, if we think that careless mistakes and/or lucky guesses are skewing the ability estimates, then we can use a technique such as http://www.rasch.org/rmt/rmt62a.htm
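The point about raw scores can be checked numerically. The sketch below uses hypothetical item difficulties (the original values are not shown in the thread) and the dichotomous Rasch model. It shows that the log-likelihoods of "only item 1 correct" and "only item 2 correct" differ by a constant that does not depend on ability, so both response patterns are maximized at the same ability estimate.

```python
import math

# Hypothetical item difficulties in logits, hardest first (not the poster's values)
DIFFICULTIES = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]


def p_correct(theta, b):
    """Dichotomous Rasch probability of success."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_likelihood(theta, responses, difficulties=DIFFICULTIES):
    total = 0.0
    for x, b in zip(responses, difficulties):
        p = p_correct(theta, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


def ml_ability(responses, difficulties=DIFFICULTIES, iterations=100):
    """Newton-Raphson ML estimate. The responses enter only through their sum."""
    raw = sum(responses)
    theta = 0.0
    for _ in range(iterations):
        probs = [p_correct(theta, b) for b in difficulties]
        step = (raw - sum(probs)) / sum(p * (1.0 - p) for p in probs)
        theta += max(-1.0, min(1.0, step))  # limit step size for stability
    return theta


person_a = [1, 0, 0, 0, 0, 0]  # succeeded only on the hardest item
person_b = [0, 1, 0, 0, 0, 0]  # succeeded only on the second-hardest item

for theta in (-3.0, -1.0, 0.0, 2.0):
    gap = log_likelihood(theta, person_a) - log_likelihood(theta, person_b)
    print(f"theta={theta:+.1f}  logL(A) - logL(B) = {gap:.4f}")

print("ability A:", round(ml_ability(person_a), 3))
print("ability B:", round(ml_ability(person_b), 3))
```

The gap equals the difference between the two item difficulties at every ability level. Person A's pattern is less probable than person B's, which is what the fit statistics pick up, but the ability estimate is the same.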
okanbulut January 7th, 2014, 4:49pm:
Dear Dr. Linacre,
I am using the partial credit model to calibrate and score polytomous test items. I have been using the plot of the expected and empirical ICCs for detecting misfitting items. Now I would like to replicate the same plot in R using the output given by Winsteps. Although it was easy to draw a model ICC using the item parameters, I was not able to figure out how Winsteps computes the empirical values and the confidence interval around the model ICC. Could you please elaborate on the computation of those two components? I would appreciate your help.
Certainly, Okan Bulut.
For an empirical ICC, we divide the x-axis (latent variable) into equal intervals = stratify the x-axis. Then in each interval, we average the x-axis values (person measures) and the y-axis values (scored observations). This computation gives us the points on the empirical ICC.
For the confidence intervals, these are drawn around the model ICC at the x-axis value of the empirical point.
1) count the number of empirical observations averaged for the empirical point
2) compute expected observation at the empirical value on the x-axis.
3) the expected observation is a point on the model ICC.
4) compute the model variance corresponding to each observation included in the empirical point.
5) sum the model variances
6) divide by the count of empirical observations = average model variance
7) square-root the average model variance
8) the confidence band is drawn with points at the x-axis value of the empirical point, and y-axis points of expected observation in (2) ± value in (7), constrained to be within observable range.
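The steps above can be sketched in Python as follows. This is a simplified illustration, not the Winsteps code: it assumes a dichotomous item (so the model variance of an observation is P(1-P)), equal-width strata between hypothetical limits `lo` and `hi`, and a known item difficulty. It returns the half-width exactly as listed in steps (5)-(7), and also, as a labeled assumption, the standard error of the stratum mean (average variance divided by the count, then square-rooted), which shrinks as the count grows.

```python
import math


def rasch_p(theta, b):
    """Dichotomous Rasch expected score (assumption: 0/1 item)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def empirical_icc(measures, observations, difficulty, lo=-4.0, hi=4.0, n_strata=8):
    """Stratify person measures into equal intervals and summarize each stratum."""
    width = (hi - lo) / n_strata
    strata = {}
    for theta, x in zip(measures, observations):
        k = int((theta - lo) // width)
        k = min(max(k, 0), n_strata - 1)  # out-of-range measures join the end strata
        strata.setdefault(k, []).append((theta, x))

    points = []
    for k in sorted(strata):
        members = strata[k]
        n = len(members)                                  # step 1
        mean_theta = sum(t for t, _ in members) / n       # x of the empirical point
        mean_obs = sum(x for _, x in members) / n         # y of the empirical point
        expected = rasch_p(mean_theta, difficulty)        # steps 2-3
        variances = [rasch_p(t, difficulty) * (1.0 - rasch_p(t, difficulty))
                     for t, _ in members]                 # step 4
        avg_var = sum(variances) / n                      # steps 5-6
        half_width = math.sqrt(avg_var)                   # step 7
        se_mean = math.sqrt(avg_var / n)                  # assumption: SE of a mean
        points.append({
            "n": n,
            "theta": mean_theta,
            "observed": mean_obs,
            "expected": expected,
            "half_width": half_width,
            "se_mean": se_mean,
            "lower": max(0.0, expected - half_width),     # step 8, kept in 0-1
            "upper": min(1.0, expected + half_width),
        })
    return points


# Tiny made-up data set: five persons, one item of difficulty 0 logits
pts = empirical_icc([-1.0, -0.9, 1.0, 1.1, 1.2], [0, 1, 1, 1, 0], difficulty=0.0)
for p in pts:
    print(p)
```

Trying several values of `n_strata` is a quick way to find a stratification that looks meaningful.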
Thank you very much for your quick response. Your explanation is really helpful. I have one last question regarding stratifying the x-axis. Since the x-axis for the model ICC typically goes from -4 to 4 by 0.5 increments, should I follow a similar way to divide the number of equal intervals for the empirical values as well? Is there a particular way to determine how the x-axis should be stratified into equal intervals?
Mike.Linacre: Okan, the only rule is that we want the empirical ICC to look meaningful. This usually means "not too many and not too few strata". Please try different stratifications until you find one that looks good.
okanbulut: I see it now. I will follow your suggestion and try different numbers of strata to adjust the plot then. Thanks again for your response.
You said that I need to compute the model variance corresponding to each observation included in the empirical point. Do we get the model variance for the estimated measures in the PFILE? If I need to compute it manually, could you describe how I can compute it?
To show the difference between the two programs, I am attaching the plots here. It seems that the confidence interval in Winsteps is weighted, and so it is not parallel to the model ICC.
To compute model variance for each observed theta, I computed the expected response (E) using the average item difficulty and then multiplied it by (1-E).
Mike.Linacre: Okan, I am investigating ....
okanbulut: I also created equal intervals based on the latent variable. From the smallest theta to the largest one, I created 23 equal intervals on the theta scale. So the number of observations in each interval is quite different actually but when I plotted the values, it goes parallel to the model ICC.
Mike.Linacre: Okan, the plotted point is the average of the observed points in the interval. This is a mean. So its standard error (which is used to draw the confidence bands) depends on the number of observations and the precision of those observations. In general, the more observations summarized by the mean, the smaller its standard error and the closer together its confidence bands. OK?
Thanks for your explanation. As you described, I was expecting to see standard error changing depending on the precision of the observations in each interval and the number of observations. I think the way I computed standard errors might not be correct. I will check it again.
Thanks again for your help,
I am sorry for bothering you again with the same question. I was able to replicate the same ICC plot that Winsteps produces. However, on the upper and lower tails of the plot, confidence interval is way larger in my plot since there is only one person with an extreme theta (e.g., -4.11 in my example) and its standard error is very high. So I wonder if Winsteps does any adjustment on those particular cases. As you can see on the plot I sent in my previous message, one of the theta intervals is between -3 and -2.5 and there is only one examinee in that interval. I was expecting to see higher error and larger confidence intervals for that person in the Winsteps plot as well.
Okan, your computation may be more exact than the Winsteps computation. The Winsteps computations include various limits on values in order to prevent divisions by zeroes or exceedingly small numbers, square-roots of negative numbers, numbers out of range, etc., in boundary conditions.
Looking at your plots above, it appears that a dichotomous variance, E * (1-E), has been applied to a polytomous 0-3 rating scale. Each response on a 0-3 rating scale contains roughly 3 times the information of a dichotomous response. For the exact computation, see "Variance of xni" in http://www.rasch.org/rmt/rmt34e.htm
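For comparison with E * (1-E), a polytomous model variance can be computed from the category probabilities as the sum of k² P_k minus the squared expected score. The sketch below assumes the usual partial credit parameterization, in which the log-odds of category k versus k-1 is theta minus the k-th step difficulty. The step difficulties used are hypothetical.

```python
import math


def category_probabilities(theta, step_difficulties):
    """Partial credit model probabilities for categories 0..m."""
    log_numerators = [0.0]
    for delta in step_difficulties:
        log_numerators.append(log_numerators[-1] + (theta - delta))
    top = max(log_numerators)  # subtract the maximum to avoid overflow
    numerators = [math.exp(v - top) for v in log_numerators]
    total = sum(numerators)
    return [v / total for v in numerators]


def expected_and_variance(theta, step_difficulties):
    """Model expected score and model variance of one observation."""
    probs = category_probabilities(theta, step_difficulties)
    expected = sum(k * p for k, p in enumerate(probs))
    variance = sum(k * k * p for k, p in enumerate(probs)) - expected ** 2
    return expected, variance


# Hypothetical 0-3 item with step difficulties -1, 0, +1 logits
e_poly, v_poly = expected_and_variance(0.0, [-1.0, 0.0, 1.0])
# A dichotomous item is the special case with a single step
e_dich, v_dich = expected_and_variance(0.0, [0.0])

print(f"0-3 item: expected={e_poly:.3f} variance={v_poly:.3f}")
print(f"0-1 item: expected={e_dich:.3f} variance={v_dich:.3f}")
```

With a single step the function reduces to E * (1-E), so the same code serves both item types when replicating the confidence bands.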
Imogene January 13th, 2014, 1:08am:
I'd like to link 3 tests to get Rasch Measure values from the 3 that are all in the same frame of reference (for item banking).
There is a subset of items common to all 3, and then 3 subsets of items that have 2 tests in common, then the items used only in one of each of the tests.
I've been using MFORMS to link 2 tests, and I can see how I could just use the items common to all tests in the procedure, ignoring the fact that some items were used in any 2 exams only.
However is there a more thorough way I could be doing this to include information from the estimations when an item has been used twice?
Thank you for your question, Imogene.
First, analyze all 3 tests separately to verify they function correctly.
Then MFORMS= can be used to include 3 tests in one analysis. A DIF analysis of item-by-test tells us if any of the common items have changed difficulty.
Is this what you need, Imogene, or are you thinking of something else?
njdonovan January 11th, 2014, 8:52pm: I am a researcher at Louisiana State University and am looking for someone I could study with to increase my expertise in Rasch analysis during a sabbatical I plan to take within the coming year. I have been working with Rasch for several years and have publications, but I need more in-depth training than I've received from online courses or pre-conference intro courses.
njd, glad to read about your interest in Rasch. There are several options depending on your area of interest (educational, medical, sports, ...) and whether you want to study through the internet or travel within the USA or Internationally. For example, if you traveled internationally, then participating in David Andrich's activities in Perth, Australia, would be a great option.
Suggestion: look at "Coming Events" at www.rasch.org and contact the organizers/presenters of those events to discover if they have other activities that meet your needs.
njdonovan: Michael, many thanks. I will do some research and see what is available. I am hoping to find someone who might take me on in person, as I am not a great internet learner. I am prepared to travel to the place I can obtain the best training, Perth would not be out of the question. I am a rehabilitation scientist primarily involved in developing treatment outcome measures. Thanks again for the information. njd
Mike.Linacre: njd, Alan Tennant in England is a person to contact. He is listed in "Coming Events".
roozbeh January 10th, 2014, 10:46am:
If the reliability and variance of a test are 0/36 and 4, we can be 95% sure that 15 as a score lies between :
How can we answer these kinds of questions?
Roozbeh, this looks like a question about Classical Test Theory.
The important relationships are Spearman (1910):
Reliability = true variance /observed variance
True variance = observed variance - error variance
error variance = observed variance * (1 - Reliability)
Roozbeh, you know these numbers, so you know the error variance.
From the error variance, you can compute the confidence intervals around a score of 15.
roozbeh: I've chosen "a" as a correct answer. However, I have doubt about my calculation. Would you please share your answers with me!
Roozbeh, what is your value for the error variance?
What is your computation for the 95% confidence interval?
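The relationships above translate into a few lines of arithmetic. The sketch below makes three assumptions that are not stated in the thread: that "0/36" means a reliability of 0.36, that measurement errors are normally distributed, and that the interval is the observed score ± 1.96 standard errors of measurement.

```python
import math

reliability = 0.36        # assumption: "0/36" read as 0.36
observed_variance = 4.0
score = 15.0
z_95 = 1.96               # assumption: normal errors, two-sided 95%

error_variance = observed_variance * (1.0 - reliability)
sem = math.sqrt(error_variance)          # standard error of measurement
lower = score - z_95 * sem
upper = score + z_95 * sem

print(f"error variance = {error_variance:.2f}")      # 2.56
print(f"SEM = {sem:.2f}")                            # 1.60
print(f"95% interval: {lower:.2f} to {upper:.2f}")   # 11.86 to 18.14
```

If the course materials use a multiplier of 2 instead of 1.96, the interval becomes 11.8 to 18.2.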
uve January 9th, 2014, 7:16pm:
At my institution, we are investigating the possibility of purchasing a CAT system and have been talking to several vendors. One of them uses the Birnbaum model instead. I sent them this link: http://www.rasch.org/rmt/rmt61a.htm and received an interesting response in the form of a white paper from Dr. Bergan.
I thought I'd post this here because I personally find it very interesting to hear from the "other side" from time to time--those who don't use the Rasch model and have problems with it. Some may find the arguments very intriguing, simply ludicrous, or anything between. I for one find it very illuminating, though it has not converted me. Feel free to reply with your opinions and include any materials/resources you feel are valid counter responses.
Hamdollah January 5th, 2014, 6:42am:
Dear Dr Linacre
In a gender DIF study of different sections of an English proficiency test with about 18000 subjects, we found that although some DIF contrasts were statistically significant, the DIF effect (sum of DIF effects/number of items) was not substantive. What seems anomalous is that when we cross-plotted the item difficulties obtained from the two groups about half of the items fell outside the control lines, suggesting lack of invariance of item difficulty estimates.
How can we interpret the conflict?
Thank you for your question, Hamdollah.
1) If there are two groups of approximately equal size, then we expect the average DIF in a test to be approximately zero. This is because the average item difficulty is the overall item difficulty. If one group is much smaller than the other group, then the larger group's item difficulty is the overall item difficulty, and the average DIF can be far from zero.
2) The sample size is so large that small variations in item difficulty across groups are statistically significant (too much power). These variations will be too small to have any substantive effect on the subjects' measures. This is a usual problem in DIF studies with large samples. One solution is to do the DIF study on randomly-selected sub-samples with, for instance, a sub-sample size of 300-500. (Google for references.)
Dear Dr Linacre
Thank you for your informative answer!
I randomly selected a sample of 400. The DIF contrast for one of the items in the grammar section and two of the items in the vocabulary section was significant.
You know this DIF analysis is part of a larger construct validation study within the framework of Messick's six aspects of validity. To investigate validity, I have brought evidence obtained from the 18000 subjects. How do you think I should report the DIF results? Based on the whole sample, none of the items showed DIF, whereas with a subsample of 400, three items showed DIF.
Mike.Linacre: Hamdollah, you are fortunate. You can confirm the findings of the subsample by analyzing multiple non-overlapping subsamples. If the findings of the subsample repeat, then you have confirmed them. If they do not repeat, then they are random accidents in the data.
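One way to draw multiple non-overlapping subsamples is to shuffle the person identifiers once and cut the shuffled list into consecutive blocks. The sketch below is a generic illustration: the function name, the subsample size of 400, the count of 5 and the seed are all hypothetical choices, and each block would then be analyzed for DIF separately.

```python
import random


def non_overlapping_subsamples(person_ids, size=400, count=5, seed=2014):
    """Return `count` disjoint random subsamples, each holding `size` person ids."""
    ids = list(person_ids)
    if size * count > len(ids):
        raise ValueError("not enough persons for the requested subsamples")
    rng = random.Random(seed)   # fixed seed so the draw can be reproduced
    rng.shuffle(ids)
    return [ids[i * size:(i + 1) * size] for i in range(count)]


# Person ids 0..17999 stand in for the roughly 18000 test-takers
subsamples = non_overlapping_subsamples(range(18000))
print([len(s) for s in subsamples])
```

An item that is flagged in most or all of the subsamples is a replicated finding. An item flagged in only one or two is more likely a random accident.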
I drew 5 samples. The same item was flagged for DIF in three of the samples in the grammar section, 2 of the items were flagged two times in reading, and 2 were also flagged two times in the vocab. section. My understanding is that the results are random accidents.
Two points are worthy of note in the person-item maps for different sections of the test, that might be related to DIF detection
1) In the English proficiency test persons and items were dispersed along short spans of the ability continuum. Standard deviation of the person measures for grammar, vocabulary, and reading section of the test were 0.77, 0.88, 0.95 and the S.D.s of the item measures were 0.64, 0.99, 0.81, respectively.
2) The bulk of items did not match the bulk of the persons. In the grammar, vocabulary, and reading sections, the mean ability of the subjects is located at 1, 2, and 2 standard deviations below the mean item measures, respectively.
My speculation is that since the majority of the items were difficult for the majority of the persons (no matter male or female) they were functioning equally for the two groups. That's why none of the items in the three sections were flagged for DIF.
Can the difficulty level of the test obscure differential functioning of its items across males and females?
Mike.Linacre: Hamdollah, Mantel-Haenszel DIF computations compare ability strata at the same level for males and females, regardless of the overall ability distributions of males and females. However, if the items are much too hard for the test-takers, then DIF effects are probably smaller than misbehavior effects due to guessing, skipping, giving up early, response sets, etc.
Hamdollah January 8th, 2014, 2:39pm:
Dear Professor Linacre
I've got two questions regarding construct validation within the framework of Messick's six aspects of validity using the Rasch model.
1) How do you prioritize the six aspects of validity? In the Rasch analysis of a high-stakes proficiency test, for example, I have found that generalizability is problematic but substantive is OK.
2) Within each aspect, how should different evidence be prioritized? How do I interpret the following evidence as supporting the content aspect of validity? For example, in my case person-item maps are problematic (the bulk of the items are much higher than the bulk of the persons and there are noticeable gaps along the line) and item strata are below 2, whereas mean-square statistics and point-measure correlations are acceptable. I mean, is lack of item fit to the model a more serious threat to the content aspect of validity than mismatch of the location of items and persons and gaps along the line?
Hamdollah, let's prioritize validity:
1) Content validity: is the content of the items what it is supposed to be? For instance, if this is an arithmetic test, is the content "addition, subtraction, ..." and not "geography, history, ..."?
2) Construct validity: is the difficulty hierarchy of the items what it is supposed to be? For instance, are the easier items "addition", then more difficult "subtraction", then "multiplication", then the hardest items are "division"?
3) Predictive validity: are the students that are expected to be more proficient performing better than less proficient students on the test? For instance, for an arithmetic test, we expect a strong correlation between student performance and student grade-level. In fact, we should be able to predict a student's grade-level from that student's performance on the arithmetic test.
4) Targeting. Is this test intended to be easy or hard for a typical student sample? In many tests, the expected score on the test is 70%-90%. This would be an easy test for most students, so the bulk of students would be above the bulk of items. In some tests aimed at challenging high-proficiency students, the expected score might be around 40%. A test with an expected score for a typical student of less than 40% is likely to provoke unwanted behavior, such as "guessing", "skipping" and "giving up". These can cause bad fit statistics for items that would function well for on-target student samples.
5) Gaps along the line usually indicate missing content or a poorly conceptualized construct. For instance, a gap along the line of an arithmetic test between "addition" and "subtraction" indicates the need for more difficult "addition" items and/or easier "subtraction" items.
6) Item and person fit: these are used to screen out badly misbehaving students and badly written, off-dimension or duplicative items. They usually have little impact on the overall outcomes of the test.
Imogene January 5th, 2014, 9:49pm:
I've had a subsetting issue with a judge facet in a FACETS estimation and have resolved it using Group anchoring, however I thought this facet was already anchored to a mean of 0 by default as only the examinee facet was specified as Noncenter=.
The subsetting is due to disconnectedness as examiners are nested within sites, and I understand that this becomes a problem with identifiability however I am not sure what group anchoring this facet is doing above and beyond what the default Center=0 is doing in the specifications...?
Mike.Linacre: Imogene, center= is not enough where there are disconnected subsets. Suppose there are two equal-sized subsets of examiners (no group-anchoring). We can add one logit to all the examiners in one subset, and subtract one logit from all the examiners in the other subset. The mean of all the examiners will not change (Center=), but the measures of the examinees will change to match each of their examiners. OK?
Imogene: Thanks so much, yes, I understand.
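The argument can be verified with a small numerical sketch. All measures and names below are hypothetical, and a dichotomous model (success probability depends on examinee ability minus examiner severity) stands in for the rating-scale model for brevity. Shifting one subset up by a logit and the other down by a logit leaves both the examiner mean and every model probability unchanged, so the data cannot distinguish the two sets of measures.

```python
import math


def prob(ability, severity):
    """Model probability of success for one examinee rated by one examiner."""
    return 1.0 / (1.0 + math.exp(-(ability - severity)))


# Hypothetical measures in logits. Examiners at site A rate only site A
# examinees, and likewise for site B, so the two subsets are disconnected.
examiners = {"E1": 0.5, "E2": -0.5, "E3": 0.3, "E4": -0.3}
examiner_site = {"E1": "A", "E2": "A", "E3": "B", "E4": "B"}
examinees = {"P1": 1.0, "P2": -0.2, "P3": 0.6, "P4": -1.1}
examinee_site = {"P1": "A", "P2": "A", "P3": "B", "P4": "B"}

# Add one logit to everything in subset A, subtract one from subset B
shift = {"A": 1.0, "B": -1.0}
shifted_examiners = {e: m + shift[examiner_site[e]] for e, m in examiners.items()}
shifted_examinees = {p: m + shift[examinee_site[p]] for p, m in examinees.items()}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def model_probs(examinee_measures, examiner_measures):
    """Probabilities for the pairings that actually occur (same site only)."""
    return {
        (p, e): prob(examinee_measures[p], examiner_measures[e])
        for p in examinee_measures
        for e in examiner_measures
        if examinee_site[p] == examiner_site[e]
    }


before = model_probs(examinees, examiners)
after = model_probs(shifted_examinees, shifted_examiners)

print("examiner mean before:", round(mean(examiners.values()), 6))
print("examiner mean after: ", round(mean(shifted_examiners.values()), 6))
print("largest change in any model probability:",
      max(abs(before[k] - after[k]) for k in before))
```

Group-anchoring removes this indeterminacy by fixing the mean of each subset, not just the mean of the whole facet.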
miet1602 January 7th, 2014, 3:53pm:
I have been trying to run a 2-facet model in Facets and, for some reason, in the last couple of runs Facets is refusing to check for subset connection even though I have tried specifying subsets=yes or subsets=report in the spec file. Could you advise how to make it check for subsets? I attach my spec file.
That is strange, miet1602.
Which version of Facets are you using?
How many observations are there in your data file?
Is the data complete (every script marked by every marker)?
Is each script marked by only one marker?
The version is 3.68.1, and this only occurred after I had run quite a few different analyses on the same or similar data. There are 1895 observations. There is quite a bit of missing data - 1897 null observations, but each script is seen by more than one marker.
I attach a spec file which i ran on the same data, with the same number of observations, but with 3 facets (adding a dummy facet for Trainer, i.e. training session) to investigate possible interactions. This one appeared to run fine and subset connection was fine.
I also previously ran the two-facet model, but with more data and better linking as there were 3 extra expert markers in there who have all marked almost all the scripts. The subset connection was checked and fine in that case too.
I would really appreciate any help with this. Thanks!
Mike.Linacre: Miet1602, this may be a bug in Facets 3.68 - please contact me for the current version. mike \~/ winsteps.com
Ok, thanks Mike. My manager or myself will be in touch shortly.
windy January 6th, 2014, 1:32pm:
Hi Dr. Linacre,
I am doing a distractor analysis for multiple-choice items using the empirical option curves in Winsteps. When I select the absolute measure for the x-axis, I've noticed that some items have a quite different range than others. For example, some items only display measures with a negative range, and others are only positive. Can you help me understand what is happening? It seems that the x-axis range should be comparable for all of the items, because all of the students answered each question - unless, perhaps, the x-axis is the measure calculated without the item of interest or something like that.
Stefanie, do you see this happening with Winsteps example data Exam5.txt ?
What version of Winsteps are you using?
Yes, I see it with exam5.txt. I am using Winsteps version 18.104.22.168.
Mike.Linacre: Apologies, Stefanie. This must have been a bug in 3.71 which is now corrected. Please email me directly for a corrected version of Winsteps: mike \~/ winsteps.com