The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

B Plank - arXiv preprint arXiv:2211.02570, 2022 - arxiv.org
Human variation in labeling is often considered noise. Annotation projects for machine
learning (ML) aim at minimizing human label variation, with the assumption to maximize …

Learning from disagreement: A survey

AN Uma, T Fornaciari, D Hovy, S Paun, B Plank… - Journal of Artificial …, 2021 - jair.org
Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer
evidence that humans disagree, from objective tasks such as part-of-speech tagging to more …

ArMIS-the Arabic misogyny and sexism corpus with annotator subjective disagreements

D Almanea, M Poesio - Proceedings of the Thirteenth Language …, 2022 - aclanthology.org
The use of misogynistic and sexist language has increased in recent years in social media,
and is increasing in the Arabic world in reaction to reforms attempting to remove restrictions …

Stop measuring calibration when humans disagree

J Baan, W Aziz, B Plank, R Fernandez - arXiv preprint arXiv:2210.16133, 2022 - arxiv.org
Calibration is a popular framework to evaluate whether a classifier knows when it does not
know, i.e., its predictive probabilities are a good indication of how likely a prediction is to be …
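
As a point of reference for the calibration framework this paper critiques, the sketch below computes a standard expected calibration error against a single gold label (the binning scheme and variable names are illustrative assumptions, not the authors' code); the paper's argument is that this setup becomes misleading once human labels themselves disagree.

import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    # probs: (N, C) predicted probabilities; labels: (N,) hard gold label ids
    conf = probs.max(axis=1)                  # model confidence per example
    pred = probs.argmax(axis=1)               # predicted class per example
    correct = (pred == labels).astype(float)  # 1 if the prediction matches the gold label
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # gap between mean confidence and accuracy, weighted by bin size
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece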

Annobert: Effectively representing multiple annotators' label choices to improve hate speech detection

W Yin, V Agarwal, A Jiang, A Zubiaga… - Proceedings of the …, 2023 - ojs.aaai.org
Supervised machine learning approaches often rely on a "ground truth" label. However,
obtaining one label through majority voting ignores the important subjectivity information in …
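
For context, the snippet below contrasts the majority-vote aggregation criticized here with keeping the full annotator label distribution; it is a minimal illustration with made-up labels, not the AnnoBERT architecture itself.

from collections import Counter

annotations = ["hate", "hate", "not_hate", "hate", "not_hate"]  # hypothetical annotator labels for one item

majority_label = Counter(annotations).most_common(1)[0][0]  # -> "hate", the single "ground truth" label

soft_label = {lab: n / len(annotations) for lab, n in Counter(annotations).items()}
# -> {"hate": 0.6, "not_hate": 0.4}; the disagreement signal that majority voting discards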

Scaling and disagreements: Bias, noise, and ambiguity

A Uma, D Almanea, M Poesio - Frontiers in Artificial Intelligence, 2022 - frontiersin.org
Crowdsourced data are often rife with disagreement, either because of genuine item
ambiguity, overlapping labels, subjectivity, or annotator error. Hence, a variety of methods …

Superlim: A Swedish language understanding evaluation benchmark

A Berdičevskis, G Bouma, R Kurtz… - Proceedings of the …, 2023 - aclanthology.org
We present Superlim, a multi-task NLP benchmark and analysis platform for evaluating
Swedish language models, a counterpart to the English-language (Super)GLUE suite. We …

A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks

A Braylan, M Marabella, O Alonso, M Lease - Journal of Artificial Intelligence …, 2023 - jair.org
Human annotations are vital to supervised learning, yet annotators often disagree on the
correct label, especially as annotation tasks increase in complexity. A common strategy to …

Aggregating crowdsourced and automatic judgments to scale up a corpus of anaphoric reference for fiction and Wikipedia texts

J Yu, S Paun, M Camilleri, PC Garcia… - arXiv preprint arXiv …, 2022 - arxiv.org
Although several datasets annotated for anaphoric reference/coreference exist, even the
largest such datasets have limitations in terms of size, range of domains, coverage of …

Measuring annotator agreement generally across complex structured, multi-object, and free-text annotation tasks

A Braylan, O Alonso, M Lease - … of the ACM Web Conference 2022, 2022 - dl.acm.org
When annotators label data, a key metric for quality assurance is inter-annotator agreement
(IAA): the extent to which annotators agree on their labels. Though many IAA measures exist …
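
To make the baseline concrete, the sketch below computes Cohen's kappa for two annotators over simple categorical labels, one familiar IAA measure; variable names and data are illustrative assumptions, and the paper's contribution is precisely that such measures do not transfer directly to complex structured, multi-object, and free-text annotations.

from collections import Counter

def cohens_kappa(a, b):
    # a, b: equal-length lists of categorical labels from two annotators
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # chance agreement: probability both annotators pick the same label independently
    p_chance = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    return (p_observed - p_chance) / (1 - p_chance)

print(cohens_kappa(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"]))  # -> 0.5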