Correcting Inter-Rater Reliability for Chance Agreement: Why?

Posted: Monday, July 5, 2010

In this post, I would like to address the issue as to whether agreement coefficients should or should not be adjusted (or corrected) for the possibility of agreements occurring by pure chance between two raters. A natural and crude way to quantify the extent of agreement between two raters is to compute the relative number of times they both agree about the classification of a number of subjects. I generally refer to that relative number as the overall agreement probability, and denote it by p_a. Like many others (see Cohen (1960)), I believe that the need to adjust for chance agreement is difficult to question; it is the right way to make that adjustment that should be discussed. Nevertheless, some researchers are not yet convinced of the need for such an adjustment (see for example "The Myth of Chance-Corrected Agreement").

Cohen (1960), who introduced the Kappa coefficient, indicated that *"A certain amount of agreement is to be expected by chance, which is readily determined by finding the joint probabilities of the marginals; ..."* If p_k and q_k are respectively the relative numbers of subjects that raters A and B classified into category k, then the products p_k × q_k for the various categories k are what Cohen refers to as the joint probabilities of the marginals. While I totally agree with Cohen's assessment that a certain amount of agreement is to be expected by chance, I disagree with his claim that it *"is readily determined by finding the joint probabilities of the marginals."* In other words, I agree with the need to correct for chance agreement that he expressed, but not with the method he proposed for doing it. I discussed this correction issue extensively in Gwet (2008a). The purpose of this comment is not to review the methods of correcting agreement coefficients for chance agreement. Instead, I would like to present an argument in support of the need for a correction.
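As a concrete sketch, Cohen's chance-agreement term and the resulting Kappa can be computed in a few lines of Python. The ratings below are invented for illustration only; they are not data from this post:

```python
import numpy as np

# Hypothetical ratings of 10 subjects by raters A and B (categories 1 and 2);
# these values are made up purely for illustration.
ratings_a = np.array([1, 1, 2, 1, 2, 2, 1, 1, 2, 1])
ratings_b = np.array([1, 2, 2, 1, 2, 1, 1, 1, 2, 2])

# Overall agreement probability p_a: relative number of agreements.
p_a = np.mean(ratings_a == ratings_b)

# Cohen's chance-agreement term: sum over the categories of the products
# p_k * q_k of the two raters' marginal classification proportions.
p_e = sum(np.mean(ratings_a == k) * np.mean(ratings_b == k) for k in (1, 2))

# Cohen's Kappa rescales p_a so that pure-chance agreement maps to 0.
kappa = (p_a - p_e) / (1 - p_e)
print(p_a, p_e, kappa)  # here p_a ≈ 0.7, p_e ≈ 0.5, kappa ≈ 0.4
```

With these made-up ratings, the raters agree on 7 of 10 subjects (p_a = 0.7), but half of that agreement is attributed to chance by Cohen's formula, leaving a Kappa of only 0.4.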

I decided to simulate on a computer many inter-rater reliability experiments with varying numbers of subjects, where two raters A and B must classify each of the subjects into two nominal categories, 1 and 2. The classification of subjects in these simulated experiments is done in a **purely random manner** by both raters. This is a situation where all observed agreements are achieved by pure chance. With these experiments, I wanted to see how the different agreement coefficients behave when all agreements occur by chance. The 5 agreement coefficients investigated are Cohen's Kappa, Gwet's AC1, Fleiss' Kappa, Brennan-Prediger's coefficient, and the overall agreement probability p_a. The experiment was conducted for different values of the number n of subjects, varying from 2 through 30, and for n = 35, 40, 45, 50, 55 and 60. For each value of n, the experiment was performed 100,000 times, except when the total number of different ways the raters could categorize the subjects was smaller than 100,000. In the latter case, all the possible scenarios were enumerated and evaluated.
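A simplified version of this simulation can be sketched in Python. This is my reconstruction rather than the original simulation code, and for brevity it computes only three of the five coefficients (p_a, Cohen's Kappa, and Brennan-Prediger) with fewer replications:

```python
import numpy as np

rng = np.random.default_rng(2010)  # seed chosen arbitrarily for reproducibility

def agreement_coefficients(a, b):
    """p_a, Cohen's Kappa, and Brennan-Prediger for two raters and 2 categories."""
    p_a = np.mean(a == b)
    # Cohen's chance term: product of the raters' marginal proportions.
    p_e = sum(np.mean(a == k) * np.mean(b == k) for k in (1, 2))
    kappa = (p_a - p_e) / (1 - p_e) if p_e < 1 else np.nan
    # Brennan-Prediger: chance term fixed at 1/q with q = 2 categories.
    bp = (p_a - 0.5) / (1 - 0.5)
    return p_a, kappa, bp

def simulate(n_subjects, n_trials=10_000):
    """Both raters classify every subject purely at random (categories 1 and 2)."""
    results = []
    for _ in range(n_trials):
        a = rng.integers(1, 3, size=n_subjects)
        b = rng.integers(1, 3, size=n_subjects)
        results.append(agreement_coefficients(a, b))
    return np.array(results)

res = simulate(n_subjects=20)
medians = np.nanmedian(res, axis=0)        # 50th percentile of each coefficient
pct95 = np.nanpercentile(res, 95, axis=0)  # 95th percentile of each coefficient
print(medians, pct95)
```

Repeating `simulate()` over the grid of n values and plotting the medians (or 95th percentiles) against n reproduces the general shape of the figures below: the median of p_a hovers around 0.5 under purely random rating, while the chance-corrected coefficients stay near 0.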

Figure 1 represents the median (50th percentile) of the distribution of the 5 agreement coefficients investigated. It follows from this figure that when both raters classify the subjects in a random manner, the researcher can expect the overall agreement probability to exceed 0.5 about 50% of the time. We are talking here about a situation where there is no "real" or intrinsic agreement, which is a situation where any researcher would want the metric used for quantifying agreement to have a low value close to 0. The other agreement coefficients, which are all corrected for chance agreement, have a median that gets closer to 0 as the number of subjects increases (a very good property). Kappa's median consistently stays close to 0 even when the number of subjects is as small as 2, and therefore better reflects what is expected under these circumstances. But this strength of Kappa is also its biggest weakness: the chance-agreement probability associated with Kappa assumes that all ratings are performed randomly. Although that is the case here, it is almost never the case in practice, where only an unknown portion of the ratings are performed randomly.

**Figure 1.** Median of the Distribution of Various Agreement Coefficients by Number of Subjects

Figure 2 below shows the 95th percentile of the distribution of the various agreement coefficients, including the straight overall agreement probability. This figure indicates that when the ratings are carried out in a purely random manner by the raters, the overall agreement probability is expected to exceed 0.6 about 5% of the time as the number of subjects increases; and when the number of subjects is smaller than 15, it would exceed 0.7 more than 5% of the time. This is clearly a major problem, which calls for an adjustment for chance agreement of some sort. As for the other agreement coefficients, most of them will not exceed 0.3 very often unless the sample size is very small. Note that if the study is based on 10 subjects or fewer, all of the agreement coefficients may exceed 0.6 about 5% of the time. That is, the number of subjects and a few other things must be taken into consideration when interpreting the magnitude of agreement coefficients. I discuss this benchmarking issue extensively in chapter 6 of my book "Handbook of Inter-Rater Reliability (2nd Edition)" (see Gwet, K.L. (2010)).

**Figure 2.** 95th Percentile of the Distribution of Various Agreement Coefficients by Number of Subjects

References:

Cohen, J. (1960). "A coefficient of agreement for nominal scales." *Educational and Psychological Measurement*, 20, 37-46.

Gwet, K.L. (2008a). "Computing inter-rater reliability and its variance in the presence of high agreement." *British Journal of Mathematical and Statistical Psychology*, 61, 29-48.

Gwet, K.L. (2010). *Handbook of Inter-Rater Reliability (2nd Edition)*, Advanced Analytics, LLC.
