Evaluating the social impact of generative AI systems in systems and society

I Solaiman, Z Talat, W Agnew, L Ahmad… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative AI systems across modalities, ranging from text, image, audio, and video, have
broad social impacts, but there exists no official standard for means of evaluating those …

An archival perspective on pretraining data

MA Desai, IV Pasquetto, AZ Jacobs, D Card - Patterns, 2024 - cell.com
Alongside an explosion in research and development related to large language models,
there has been a concomitant rise in the creation of pretraining datasets—massive …

Open problems in technical AI governance

A Reuel, B Bucknall, S Casper, T Fist, L Soder… - arXiv preprint arXiv …, 2024 - arxiv.org
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …

Position: measure dataset diversity, don't just claim it

D Zhao, JTA Andrews, O Papakyriakopoulos… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract
and disputed social constructs. Dataset curators frequently employ value-laden terms such …

Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding

B Van Dijk, T Kouwenhoven, MR Spruit… - arXiv preprint arXiv …, 2023 - arxiv.org
Current Large Language Models (LLMs) are unparalleled in their ability to generate
grammatically correct, fluent text. LLMs are appearing rapidly, and debates on LLM …

Scaling laws do not scale

F Diaz, M Madaio - Proceedings of the AAAI/ACM Conference on AI …, 2024 - ojs.aaai.org
Recent work has advocated for training AI models on ever-larger datasets, arguing that as
the size of a dataset increases, the performance of a model trained on that dataset will …

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

A Reuel, A Hardy, C Smith, M Lamparth… - arXiv preprint arXiv …, 2024 - arxiv.org
AI models are increasingly prevalent in high-stakes environments, necessitating thorough
assessment of their capabilities and risks. Benchmarks are popular for measuring these …

Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

A Braggaar, C Liebrecht, E van Miltenburg… - arXiv preprint arXiv …, 2023 - arxiv.org
This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …

Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers

L Pacchiardi, M Tesic, LG Cheke… - arXiv preprint arXiv …, 2024 - arxiv.org
The integrity of AI benchmarks is fundamental to accurately assess the capabilities of AI
systems. The internal validity of these benchmarks, i.e., making sure they are free from …

On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions

W Wang, T Fang, H Shi, B Xu, W Ding, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Entity- and event-level conceptualization, as fundamental elements of human cognition,
plays a pivotal role in generalizable reasoning. This process involves abstracting specific …