There are a lot of great articles about measuring how well data annotators agree on labels, like this one: https://towardsdatascience.com/the-definite-guide-for-creating-an-academic-level-dataset-with-industry-requirements-and-6db446a26cb2
I see mentions in a lot of places of Cohen’s kappa, Fleiss’ kappa, Krippendorff’s alpha, comparing to a predefined ground truth, etc.
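For anyone who hasn't worked with these metrics, here is a minimal sketch of Cohen’s kappa for two annotators in plain Python. The function name `cohen_kappa` and the toy labels are made up for illustration and don't come from any particular library; in practice you would probably reach for an existing implementation.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Hypothetical helper written for this post, not a library API.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy data (invented): two annotators, four items.
annotator_1 = ["cat", "cat", "dog", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement you would expect by chance, so 1 means perfect agreement and 0 means no better than chance. It only covers two annotators; Fleiss’ kappa and Krippendorff’s alpha handle more.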
If you’re managing an annotation process in your organization, how do you evaluate your annotators, and what challenges have you faced in the process?
As a side note, is anyone using programmatic labeling in a real dataset? Thoughts?