It takes two to tango: Navigating conceptualizations of NLP tasks and measurements of performance

I Solaiman, Z Talat, W Agnew, L Ahmad… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative AI systems across modalities, ranging from text, image, audio, and video, have
broad social impacts, but there exists no official standard for means of evaluating those …

Cited by 109 · Related articles · All 2 versions

An archival perspective on pretraining data

MA Desai, IV Pasquetto, AZ Jacobs, D Card - Patterns, 2024 - cell.com

Alongside an explosion in research and development related to large language models,
there has been a concomitant rise in the creation of pretraining datasets—massive …

Cited by 10 · Related articles · All 8 versions

Open problems in technical AI governance

A Reuel, B Bucknall, S Casper, T Fist, L Soder… - arXiv preprint arXiv …, 2024 - arxiv.org

AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …

Cited by 19 · Related articles · All 4 versions

Position: measure dataset diversity, don't just claim it

D Zhao, JTA Andrews, O Papakyriakopoulos… - arXiv preprint arXiv …, 2024 - arxiv.org

Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract
and disputed social constructs. Dataset curators frequently employ value-laden terms such …

Cited by 7 · Related articles

Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding

B Van Dijk, T Kouwenhoven, MR Spruit… - arXiv preprint arXiv …, 2023 - arxiv.org

Current Large Language Models (LLMs) are unparalleled in their ability to generate
grammatically correct, fluent text. LLMs are appearing rapidly, and debates on LLM …

Cited by 14 · Related articles · All 5 versions

Scaling laws do not scale

F Diaz, M Madaio - Proceedings of the AAAI/ACM Conference on AI …, 2024 - ojs.aaai.org

Recent work has advocated for training AI models on ever-larger datasets, arguing that as
the size of a dataset increases, the performance of a model trained on that dataset will …

Cited by 7 · Related articles · All 2 versions

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

A Reuel, A Hardy, C Smith, M Lamparth… - arXiv preprint arXiv …, 2024 - arxiv.org

AI models are increasingly prevalent in high-stakes environments, necessitating thorough
assessment of their capabilities and risks. Benchmarks are popular for measuring these …

Cited by 2 · Related articles · All 3 versions

Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

A Braggaar, C Liebrecht, E van Miltenburg… - arXiv preprint arXiv …, 2023 - arxiv.org

This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …

Cited by 5 · Related articles · All 2 versions

Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers

L Pacchiardi, M Tesic, LG Cheke… - arXiv preprint arXiv …, 2024 - arxiv.org

The integrity of AI benchmarks is fundamental to accurately assess the capabilities of AI
systems. The internal validity of these benchmarks, i.e., making sure they are free from …

Cited by 1 · Related articles · All 2 versions

On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions

W Wang, T Fang, H Shi, B Xu, W Ding, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

Entity- and event-level conceptualization, as fundamental elements of human cognition,
plays a pivotal role in generalizable reasoning. This process involves abstracting specific …

Cited by 5 · Related articles
