Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Human evaluation of automatically generated text: Current trends and best practice guidelines

C van der Lee, A Gatt, E van Miltenburg… - Computer Speech & …, 2021 - Elsevier
Currently, there is little agreement as to how Natural Language Generation (NLG) systems
should be evaluated, with a particularly high degree of variation in the way that human …

Can large language models be an alternative to human evaluations?

CH Chiang, H Lee - arXiv preprint arXiv:2305.01937, 2023 - arxiv.org
Human evaluation is indispensable and inevitable for assessing the quality of texts
generated by machine learning models or written by humans. However, human evaluation is …

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …

The (un)suitability of automatic evaluation metrics for text simplification

F Alva-Manchego, C Scarton, L Specia - Computational Linguistics, 2021 - direct.mit.edu
In order to simplify sentences, several rewriting operations can be performed, such as
replacing complex words with simpler synonyms, deleting unnecessary information, and …

Analyzing dataset annotation quality management in the wild

JC Klie, RE Castilho, I Gurevych - Computational Linguistics, 2024 - direct.mit.edu
Data quality is crucial for training accurate, unbiased, and trustworthy machine learning
models as well as for their correct evaluation. Recent work, however, has shown that even …

Upvotes? Downvotes? No Votes? Understanding the relationship between reaction mechanisms and political discourse on Reddit

O Papakyriakopoulos, S Engelmann… - Proceedings of the 2023 …, 2023 - dl.acm.org
A significant share of political discourse occurs online on social media platforms.
Policymakers and researchers try to understand the role of social media design in shaping …

Spot the bot: A robust and efficient framework for the evaluation of conversational dialogue systems

J Deriu, D Tuggener, P Von Däniken… - arXiv preprint arXiv …, 2020 - arxiv.org
The lack of time-efficient and reliable evaluation methods hampers the development of
conversational dialogue systems (chatbots). Evaluations requiring humans to converse with …

Hark: A deep learning system for navigating privacy feedback at scale

H Harkous, ST Peddinti, R Khandelwal… - … IEEE Symposium on …, 2022 - ieeexplore.ieee.org
Integrating user feedback is one of the pillars for building successful products. However, this
feedback is generally collected in an unstructured free-text form, which is challenging to …

Establishing annotation quality in multi-label annotations

M Marchal, M Scholman, F Yung… - Proceedings of the 29th …, 2022 - aclanthology.org
In many linguistic fields requiring annotated data, multiple interpretations of a single item are
possible. Multi-label annotations more accurately reflect this possibility. However, allowing …