The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we …
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non …
A Belz, C Thomson - Proceedings of the Fourth Workshop on …, 2024 - aclanthology.org
This paper presents an overview of, and the results from, the 2024 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP'24), following on from three previous …
Reflection is a crucial counselling skill where the therapist conveys to the client their interpretation of what the client said. Language models have recently been used to generate …
Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is …
Much work in natural language processing (NLP) relies on human annotation. The majority of this implicitly assumes that annotators' labels are temporally stable, although the reality is …
While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic …
This work presents our efforts to reproduce the results of the human evaluation experiment presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic …
Counterspeech offers direct rebuttals to hateful speech by challenging perpetrators of hate and showing support to targets of abuse. It provides a promising alternative to more …