Date of Award

May 2015

Degree Type


Degree Name

Doctor of Philosophy


Department

Educational Psychology

First Advisor

Cindy M. Walker

Committee Members

Bo Zhang, Razia R. Azen, Catherine S. Taylor, Kathryn R. Zalewski


Keywords

Differential Item Functioning, Inter-rater Reliability

Abstract
Tamara B. Miller

The University of Wisconsin-Milwaukee, 2015

Under the Supervision of Professor Cindy M. Walker

This study used empirical and simulated data to compare traditional measures of inter-rater reliability with a novel measure of inter-rater scoring differences for constructed response items. The purpose of this research was to investigate an alternative measure of inter-rater differences, grounded in modern test theory, that does not require a fully crossed design for its calculation. The proposed measure utilizes methods typically used to examine differential item functioning (DIF). The traditional inter-rater reliability measures and the proposed measure were calculated for each item under three simulated sample sizes (N = 1,000, 500, and 200) crossed with four degrees of rater variability: (1) no rater differences, (2) minimal rater differences, (3) moderate rater differences, and (4) severe rater differences. The empirical data consisted of 177 examinees scored on 17 constructed response items by two raters. For each of the twelve simulated conditions, plus the empirical data set, four measures of inter-rater differences were associated with each item: (1) an intraclass correlation (ICC); (2) Cohen's kappa; (3) a DIF statistic computed when examinees were fully crossed with the two raters, with Rater 1's scores comprising the reference group and Rater 2's scores comprising the focal group; and (4) a DIF statistic computed when members of the focal and reference groups were mutually exclusive, each nested within one rater. All indices were interpreted first using a significance-level criterion, to calculate the Type I error rate and power of each index, and second using a cut-value criterion, to determine a false positive rate and a true positive rate for each index. Comparison of the findings from the two criteria revealed that the DIF statistic derived from the nested design had a large false positive rate; it was therefore determined that this index was best interpreted using a significance-level criterion.
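As a minimal illustration of one of the agreement indices named above (this is a sketch, not the study's actual code, and the rater scores below are hypothetical), unweighted Cohen's kappa for two raters can be computed directly from observed and chance-expected agreement:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters' categorical scores."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed agreement: proportion of responses scored identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement under independence, from each rater's marginal counts.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in categories) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical scores from two raters on ten constructed responses (0-2 scale)
r1 = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
r2 = [0, 1, 2, 0, 0, 2, 1, 2, 0, 2]
print(round(cohens_kappa(r1, r2), 3))  # → 0.706
```

Kappa corrects raw percent agreement (here 0.8) for the agreement expected by chance from the raters' marginal score distributions, which is why it is a stricter index than simple agreement.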
Additional simulation results showed that, across the three sample sizes, the ICC and Cohen's kappa both had Type I error rates of 0%, the DIF statistic from the fully crossed design had Type I error rates ranging from 2% to 12%, and the DIF statistic from the nested design had Type I error rates ranging from 2% to 10%. For moderate and severe rating differences, modeled as constant, pervasive rater severity, the ICC and Cohen's kappa both had uniform power of 0% across all three sample sizes, while the fully crossed DIF statistic had power ranging from 89% to 100% and the nested-design DIF statistic had power ranging from 44% to 100%. This combination of adequate power and low Type I error for the nested-design DIF statistic is notable not only because it demonstrates an ability to detect poor rater agreement, but because it does so with a design that does not require both raters to score all of the items. Finally, the empirical study, with a sample size of 177 examinees, provided an example of applying the nested-design DIF statistic alongside the other three inter-rater reliability indices, all of which required a fully crossed design. In conclusion, the results of this investigation indicate that, under the conditions investigated, the proposed measure may provide a more achievable and more powerful indicator of inter-rater reliability for constructed response items than traditional measures currently provide.
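The abstract does not name the specific DIF procedure underlying the nested-design statistic; as one common choice for dichotomous items, a Mantel-Haenszel DIF statistic under a nested rater design could be sketched as follows. All data, names, and the choice of method here are illustrative assumptions, not the dissertation's actual implementation:

```python
from collections import defaultdict

def mh_chi2(group, score, match):
    """Mantel-Haenszel chi-square (with continuity correction) for DIF on a
    dichotomous item. group: 1 = reference, 0 = focal; score: 1/0 item score;
    match: matching variable such as a rest score or total test score."""
    cells = defaultdict(lambda: [0, 0, 0, 0])  # per stratum: [a, b, c, d]
    for g, s, m in zip(group, score, match):
        # a = reference right, b = reference wrong, c = focal right, d = focal wrong
        idx = (0 if s else 1) if g else (2 if s else 3)
        cells[m][idx] += 1
    a_sum = e_sum = v_sum = 0.0
    for a, b, c, d in cells.values():
        t = a + b + c + d
        if t < 2:
            continue  # the variance term needs at least two observations
        n_r, n_f = a + b, c + d   # group margins
        m1, m0 = a + c, b + d     # right/wrong margins
        a_sum += a
        e_sum += n_r * m1 / t     # expected count of reference-right under no DIF
        v_sum += n_r * n_f * m1 * m0 / (t * t * (t - 1))
    if v_sum == 0.0:
        return 0.0
    return max(0.0, abs(a_sum - e_sum) - 0.5) ** 2 / v_sum

# Hypothetical nested design: the first 8 examinees are scored only by Rater 1
# (reference group), the last 8 only by Rater 2 (focal group); two match strata.
group = [1] * 8 + [0] * 8
match = [0, 0, 0, 0, 1, 1, 1, 1] * 2
same = [1, 1, 0, 0, 1, 1, 1, 0] * 2          # both raters score alike
harsh = [1, 1, 1, 1] * 2 + [0, 0, 0, 0] * 2  # Rater 2 far more severe
print(round(mh_chi2(group, same, match), 2),
      round(mh_chi2(group, harsh, match), 2))  # → 0.0 10.72
```

Because examinees are matched on ability before rater groups are compared, mutually exclusive groups suffice: neither rater needs to score every response, which is the practical advantage the nested design claims over the fully crossed indices.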