Fleiss' Kappa [3+ Annotators]

Description

Fleiss' kappa is a statistical measure that assesses the reliability of agreement among a fixed number of raters who classify items into a set of categories.

It is a chance-corrected measure, meaning it accounts for agreement that would occur simply by random chance. The value ranges from -1 to +1, where +1 is perfect agreement, 0 is agreement no better than chance, and negative values indicate worse-than-chance agreement.
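Concretely, the chance correction compares the mean observed agreement across subjects with the agreement expected from the overall category proportions. The following is a minimal from-scratch sketch of that calculation (the function name fleiss_kappa_manual and the use of NumPy are illustrative choices, not part of any particular library); it assumes the same subjects-by-categories count matrix used in the Example below.

import numpy as np

def fleiss_kappa_manual(table):
    # `table` is a (subjects x categories) matrix of counts;
    # each row must sum to the number of raters.
    table = np.asarray(table, dtype=float)
    n_subjects, _ = table.shape
    n_raters = table[0].sum()

    # Mean observed agreement: proportion of agreeing rater pairs per subject.
    p_i = (np.sum(table ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Agreement expected by chance, from the overall category proportions.
    p_j = table.sum(axis=0) / (n_subjects * n_raters)
    p_e = np.sum(p_j ** 2)

    # Chance-corrected agreement.
    return (p_bar - p_e) / (1 - p_e)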

Key features:

  • Number of raters: Fleiss' kappa is used when there are three or more raters. For two raters, Cohen's kappa is more appropriate (see the sketch after this list).
  • Data type: It is used for categorical (nominal) data. If the data is ordinal, other measures like Kendall's W or Krippendorff's alpha may be better suited.
  • Application: It measures inter-rater reliability, or how consistently different raters classify the same set of items. It can also be used for intra-rater reliability if the same rater classifies the same items on different occasions.
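
For the two-rater case mentioned in the first point, a small sketch is shown below. It relies on scikit-learn's cohen_kappa_score (assumed to be installed), and the labels are invented purely for illustration.

from sklearn.metrics import cohen_kappa_score

# Two raters labeling the same five items.
rater_a = ["cat", "dog", "dog", "cat", "bird"]
rater_b = ["cat", "dog", "cat", "cat", "bird"]

print(cohen_kappa_score(rater_a, rater_b))  # about 0.69 for these toy labels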

Warning

Some studies have identified a paradoxical behavior of Fleiss' kappa: when the marginal distribution of categories is highly imbalanced (one category dominates), it can yield very low or even negative values despite high observed agreement among the raters.
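
The sketch below reproduces this effect with an intentionally skewed, made-up table: the raters agree on roughly 93% of rater pairs, yet the returned kappa should come out slightly negative, around -0.03.

import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# 10 subjects, 2 categories, 3 raters; almost every rating lands in category 1.
skewed = np.array([[3, 0]] * 9 + [[2, 1]])

print(fleiss_kappa(skewed, method="fleiss"))  # close to -0.03 despite ~93% raw agreement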

Example

import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Example data: 10 subjects, 3 categories, 3 raters for each subject
# The values in the matrix represent the number of raters who assigned the subject to that category.
# Each row must sum to the number of raters (3 in this case).
ratings = np.array([
    [3, 0, 0],  # Subject 1: All 3 raters agree on category 1
    [1, 2, 0],  # Subject 2: 1 rater for cat 1, 2 for cat 2
    [0, 3, 0],  # Subject 3: All 3 raters agree on category 2
    [0, 1, 2],  # Subject 4: 1 rater for cat 2, 2 for cat 3
    [1, 1, 1],  # Subject 5: Each rater chose a different category
    [3, 0, 0],
    [0, 0, 3],
    [1, 0, 2],
    [2, 1, 0],
    [1, 1, 1]
])

kappa = fleiss_kappa(ratings, method="fleiss")
print(kappa)  # approximately 0.2929
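
In practice, annotations often arrive as one label per rater per subject rather than as the count matrix above. statsmodels provides aggregate_raters in the same module to build the count table; a brief sketch with invented, integer-coded labels (check the return format against your statsmodels version):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Raw labels: 4 subjects (rows) x 3 raters (columns), categories coded as 0, 1, 2.
raw = np.array([
    [0, 0, 0],
    [0, 1, 1],
    [2, 2, 1],
    [1, 1, 1],
])

table, categories = aggregate_raters(raw)  # counts per subject and category
print(fleiss_kappa(table, method="fleiss"))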