Abstract
Many large-scale national and international testing programs use the Rasch
model to govern the construction of measurement scales that can be used to
monitor standards of performance and to track performance over time. A
significant issue that arises in such programs is that once a decision has been
made to use the model, it cannot be reversed if the data turn out not to fit the
model. Two levels of question result from this situation. The first concerns
misfit to the model: how robust is the model to violations of fit between the
data and the model?
The second emerges from the premise that fit to the model is a relative matter:
ultimately, it is for users to decide whether the data fit the model well
enough to suit their purposes. Once this decision has been made, as in
large-scale testing programs like the ones referred to above, the question
reverts to one focused on the applications of the Rasch model. More
specifically, this study examines the consequences of variability of fit to the
Rasch model for the measures of student performance obtained from two different
equating procedures.
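The two equating procedures at issue can be illustrated with a minimal simulation sketch. The crude joint maximum-likelihood (JML) calibration below, the common-item design, and the mean-shift linking step are illustrative assumptions, not the study's actual procedures:

```python
import numpy as np

rng = np.random.default_rng(1)

def rasch_jml(x, n_iter=100):
    """Crude joint maximum-likelihood calibration of the Rasch simple
    logistic model; np.nan marks items a person was not administered."""
    mask = ~np.isnan(x)
    xm = np.where(mask, x, 0.0)
    theta = np.zeros(x.shape[0])   # person abilities
    b = np.zeros(x.shape[1])       # item difficulties
    for _ in range(n_iter):
        # Newton step for abilities, then for difficulties
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        w = np.where(mask, p * (1 - p), 0.0)
        r = np.where(mask, xm - p, 0.0)
        theta = np.clip(theta + r.sum(1) / np.maximum(w.sum(1), 1e-9), -6, 6)
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        w = np.where(mask, p * (1 - p), 0.0)
        r = np.where(mask, xm - p, 0.0)
        b = np.clip(b - r.sum(0) / np.maximum(w.sum(0), 1e-9), -6, 6)
        b -= b.mean()              # identification: centre difficulties
    return theta, b

# Two test forms sharing five common (link) items: items 10-14
b_true = np.linspace(-2, 2, 25)
idx_a, idx_b = np.arange(15), np.arange(10, 25)

def simulate(item_idx, n=1000):
    theta = rng.normal(0, 1, n)
    p = 1 / (1 + np.exp(-(theta[:, None] - b_true[item_idx][None, :])))
    return (rng.random(p.shape) < p).astype(float)

xa, xb = simulate(idx_a), simulate(idx_b)

# Separate equating: calibrate each form alone, then shift form B onto
# form A's scale via the mean difficulty difference on the common items.
_, ba = rasch_jml(xa)
_, bb = rasch_jml(xb)
bb_linked = bb + (ba[10:15].mean() - bb[:5].mean())

# Concurrent equating: calibrate both forms in one joint data matrix,
# with the items a person never saw treated as missing by design.
joint = np.full((2000, 25), np.nan)
joint[:1000, idx_a] = xa
joint[1000:, idx_b] = xb
_, b_conc = rasch_jml(joint)
```

With well-fitting data such as these, the item difficulties recovered by the two procedures agree closely; the question pursued in the simulations is when that agreement breaks down as fit degrades.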
Two related simulation studies have been conducted to compare the results
obtained from using two different equating procedures (namely separate and
concurrent equating) with the Rasch Simple Logistic model, as data-model fit
gets progressively worse. The results indicate that when data-model fit ranges
from good fit to average fit (MNSQ ≤ 1.60), there is little or no difference
between the results obtained from the different equating procedures. However,
when data-model fit ranges from relatively poor fit to poor fit (MNSQ > 1.60), the
results from using different equating procedures prove less comparable.
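The MNSQ statistic behind the 1.60 cut-off can be sketched for well-fitting data. The infit (information-weighted) formulation and the use of the true generating parameters, rather than calibrated estimates, are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_persons, n_items = 2000, 20
theta = rng.normal(0, 1, n_persons)     # person abilities
b = np.linspace(-2, 2, n_items)         # item difficulties

# Rasch simple logistic model: P(x=1) = exp(theta - b) / (1 + exp(theta - b))
p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random((n_persons, n_items)) < p).astype(float)

# Infit mean-square (MNSQ) per item: sum of squared residuals divided by the
# sum of binomial variances. Its expected value is 1.0 under good data-model
# fit; values above ~1.6 would fall in the study's poor-fit range.
w = p * (1 - p)
infit_mnsq = ((x - p) ** 2).sum(axis=0) / w.sum(axis=0)
```

Because these responses are generated from the model itself, every item's MNSQ lands near 1.0; the simulation studies perturb the generating process to push items progressively past the 1.60 threshold.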
When the results of these two simulation studies are translated to a situation
such as Australia's, where different states use different equating procedures
to generate a single comparable score, and these scores are then used to
compare performances amongst students and against predetermined standards or
benchmarks, significant equity issues arise. In essence, it means that in the
latter situation, some students are deemed to be either above or below the
standards purely as a consequence of the equating procedure selected. For
example, students could be deemed to be above a benchmark if separate
equating were used to produce the scale, yet the same students could be
deemed to fall below it if concurrent equating were used. The actual
consequences of this decision will vary from situation to situation. For example, if
the same equating procedure were used each year to equate the data to form a
single scale, then it could be argued that it does not matter if the results
vary from occasion to occasion, because the procedure is consistent for each
cohort of students from year to year. However, if other states or countries,
for example, use a different
equating procedure and the results are compared, then there is an equity
problem. The extent of the problem is dependent upon the robustness of the
model to varying degrees of misfit.