GPTEval: A survey on assessments of ChatGPT and GPT-4

R Mao, G Chen, X Zhang, F Guerin… - arXiv preprint arXiv …, 2023 - arxiv.org
The emergence of ChatGPT has generated much speculation in the press about its potential
to disrupt social and economic systems. Its astonishing language ability has aroused strong …

KoLA: Carefully benchmarking world knowledge of large language models

J Yu, X Wang, S Tu, S Cao, D Zhang-Li, X Lv… - arXiv preprint arXiv …, 2023 - arxiv.org
The unprecedented performance of large language models (LLMs) necessitates
improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we …

Evaluating human-language model interaction

M Lee, M Srivastava, A Hardy, J Thickstun… - arXiv preprint arXiv …, 2022 - arxiv.org
Many real-world applications of language models (LMs), such as writing assistance and
code autocomplete, involve human-LM interaction. However, most benchmarks are non …

The 2024 ReproNLP shared task on reproducibility of evaluations in NLP: Overview and results

A Belz, C Thomson - Proceedings of the Fourth Workshop on …, 2024 - aclanthology.org
This paper presents an overview of, and the results from, the 2024 Shared Task on
Reproducibility of Evaluations in NLP (ReproNLP'24), following on from three previous …

Are experts needed? On human evaluation of counselling reflection generation

Z Wu, S Balloccu, E Reiter, R Helaoui… - Proceedings of the …, 2023 - aclanthology.org
Reflection is a crucial counselling skill where the therapist conveys to the client their
interpretation of what the client said. Language models have recently been used to generate …

Non-repeatable experiments and non-reproducible results: The reproducibility crisis in human evaluation in NLP

A Belz, C Thomson, E Reiter, S Mille - Findings of the Association …, 2023 - aclanthology.org
Human evaluation is widely regarded as the litmus test of quality in NLP. A basic
requirement of all evaluations, but in particular where they are used for meta-evaluation, is …

Temporal and second language influence on intra-annotator agreement and stability in hate speech labelling

G Abercrombie, D Hovy… - 17th Linguistic …, 2023 - researchportal.hw.ac.uk
Much work in natural language processing (NLP) relies on human annotation. The majority
of this implicitly assumes that annotators' labels are temporally stable, although the reality is …

Common Flaws in Running Human Evaluation Experiments in NLP

C Thomson, E Reiter, A Belz - Computational Linguistics, 2024 - direct.mit.edu
While conducting a coordinated set of repeat runs of human evaluation experiments in NLP,
we discovered flaws in every single experiment we selected for inclusion via a systematic …

With a little help from the authors: Reproducing human evaluation of an MT error detector

O Plátek, M Lango, O Dušek - arXiv preprint arXiv:2308.06527, 2023 - arxiv.org
This work presents our efforts to reproduce the results of the human evaluation experiment
presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic …

Understanding counterspeech for online harm mitigation

YL Chung, G Abercrombie, F Enock, J Bright… - arXiv preprint arXiv …, 2023 - arxiv.org
Counterspeech offers direct rebuttals to hateful speech by challenging perpetrators of hate
and showing support to targets of abuse. It provides a promising alternative to more …