Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Human evaluation of automatically generated text: Current trends and best practice guidelines

C van der Lee, A Gatt, E van Miltenburg… - Computer Speech & …, 2021 - Elsevier
Currently, there is little agreement as to how Natural Language Generation (NLG) systems
should be evaluated, with a particularly high degree of variation in the way that human …

Can large language models be an alternative to human evaluations?

CH Chiang, H Lee - arXiv preprint arXiv:2305.01937, 2023 - arxiv.org
Human evaluation is indispensable and inevitable for assessing the quality of texts
generated by machine learning models or written by humans. However, human evaluation is …

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …

The (un)suitability of automatic evaluation metrics for text simplification

F Alva-Manchego, C Scarton, L Specia - Computational Linguistics, 2021 - direct.mit.edu
In order to simplify sentences, several rewriting operations can be performed, such as
replacing complex words with simpler synonyms, deleting unnecessary information, and …

Analyzing dataset annotation quality management in the wild

JC Klie, RE Castilho, I Gurevych - Computational Linguistics, 2024 - direct.mit.edu
Data quality is crucial for training accurate, unbiased, and trustworthy machine learning
models as well as for their correct evaluation. Recent work, however, has shown that even …

Upvotes? Downvotes? No Votes? Understanding the relationship between reaction mechanisms and political discourse on Reddit

O Papakyriakopoulos, S Engelmann… - Proceedings of the 2023 …, 2023 - dl.acm.org
A significant share of political discourse occurs online on social media platforms.
Policymakers and researchers try to understand the role of social media design in shaping …

Spot the bot: A robust and efficient framework for the evaluation of conversational dialogue systems

J Deriu, D Tuggener, P Von Däniken… - arXiv preprint arXiv …, 2020 - arxiv.org
The lack of time-efficient and reliable evaluation methods hampers the development of
conversational dialogue systems (chatbots). Evaluations requiring humans to converse with …

Hark: A deep learning system for navigating privacy feedback at scale

H Harkous, ST Peddinti, R Khandelwal… - … IEEE Symposium on …, 2022 - ieeexplore.ieee.org
Integrating user feedback is one of the pillars for building successful products. However, this
feedback is generally collected in an unstructured free-text form, which is challenging to …

Establishing annotation quality in multi-label annotations

M Marchal, M Scholman, F Yung… - Proceedings of the 29th …, 2022 - aclanthology.org
In many linguistic fields requiring annotated data, multiple interpretations of a single item are
possible. Multi-label annotations more accurately reflect this possibility. However, allowing …