GPTEval: A survey on assessments of ChatGPT and GPT-4

R Mao, G Chen, X Zhang, F Guerin… - arXiv preprint arXiv …, 2023 - arxiv.org
The emergence of ChatGPT has generated much speculation in the press about its potential
to disrupt social and economic systems. Its astonishing language ability has aroused strong …

KoLA: Carefully benchmarking world knowledge of large language models

J Yu, X Wang, S Tu, S Cao, D Zhang-Li, X Lv… - arXiv preprint arXiv …, 2023 - arxiv.org
The unprecedented performance of large language models (LLMs) necessitates
improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we …

Evaluating human-language model interaction

M Lee, M Srivastava, A Hardy, J Thickstun… - arXiv preprint arXiv …, 2022 - arxiv.org
Many real-world applications of language models (LMs), such as writing assistance and
code autocomplete, involve human-LM interaction. However, most benchmarks are non …

The 2024 ReproNLP shared task on reproducibility of evaluations in NLP: Overview and results

A Belz, C Thomson - Proceedings of the Fourth Workshop on …, 2024 - aclanthology.org
This paper presents an overview of, and the results from, the 2024 Shared Task on
Reproducibility of Evaluations in NLP (ReproNLP'24), following on from three previous …

Are experts needed? On human evaluation of counselling reflection generation

Z Wu, S Balloccu, E Reiter, R Helaoui… - Proceedings of the …, 2023 - aclanthology.org
Reflection is a crucial counselling skill where the therapist conveys to the client their
interpretation of what the client said. Language models have recently been used to generate …

Non-repeatable experiments and non-reproducible results: The reproducibility crisis in human evaluation in NLP

A Belz, C Thomson, E Reiter, S Mille - Findings of the Association …, 2023 - aclanthology.org
Human evaluation is widely regarded as the litmus test of quality in NLP. A basic
requirement of all evaluations, but in particular where they are used for meta-evaluation, is …

Temporal and second language influence on intra-annotator agreement and stability in hate speech labelling

G Abercrombie, D Hovy… - 17th Linguistic …, 2023 - researchportal.hw.ac.uk
Much work in natural language processing (NLP) relies on human annotation. The majority
of this implicitly assumes that annotators' labels are temporally stable, although the reality is …

Common Flaws in Running Human Evaluation Experiments in NLP

C Thomson, E Reiter, A Belz - Computational Linguistics, 2024 - direct.mit.edu
While conducting a coordinated set of repeat runs of human evaluation experiments in NLP,
we discovered flaws in every single experiment we selected for inclusion via a systematic …

With a little help from the authors: Reproducing human evaluation of an MT error detector

O Plátek, M Lango, O Dušek - arXiv preprint arXiv:2308.06527, 2023 - arxiv.org
This work presents our efforts to reproduce the results of the human evaluation experiment
presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic …

Understanding counterspeech for online harm mitigation

YL Chung, G Abercrombie, F Enock, J Bright… - arXiv preprint arXiv …, 2023 - arxiv.org
Counterspeech offers direct rebuttals to hateful speech by challenging perpetrators of hate
and showing support to targets of abuse. It provides a promising alternative to more …