BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

TY Zhuo, MC Vu, J Chim, H Hu, W Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated software engineering has been greatly empowered by the recent advances in
Large Language Models (LLMs) for programming. While current benchmarks have shown …

Show, Don't Tell: Aligning Language Models with Demonstrated Feedback

O Shaikh, M Lam, J Hejna, Y Shao, M Bernstein… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models are aligned to emulate the collective voice of many, resulting in outputs
that align with no one in particular. Steering LLMs away from generic output is possible …

LLark: A Multimodal Instruction-Following Language Model for Music

JP Gardner, S Durand, D Stoller… - Forty-first International …, 2023 - openreview.net
Music has a unique and complex structure which is challenging for both expert humans and
existing AI systems to understand, and presents unique challenges relative to other forms of …

Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models

AA Ivanova, A Sathe, B Lipkin, U Kumar… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability to build and leverage world models is essential for a general-purpose AI agent.
Testing such capabilities is hard, in part because the building blocks of world models are ill …

A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick

N Balepur, M Shu, A Hoyle, A Robey, S Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Keyword mnemonics are memorable explanations that link new terms to simpler keywords.
Prior works generate mnemonics for students, but they do not guide models toward …

LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

E Stengel-Eskin, P Hase, M Bansal - arXiv preprint arXiv:2405.21028, 2024 - arxiv.org
When answering questions, LLMs can convey not only an answer, but a level of confidence
about the answer being correct. This includes explicit confidence markers (e.g., giving a …

DOLOMITES: Domain-Specific Long-Form Methodical Tasks

C Malaviya, P Agrawal, K Ganchev… - arXiv preprint arXiv …, 2024 - arxiv.org
Experts in various fields routinely perform methodical writing tasks to plan, organize, and
report their work. From a clinician writing a differential diagnosis for a patient, to a teacher …

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

J Cheng, Y Lu, X Gu, P Ke, X Liu, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Although Large Language Models (LLMs) are becoming increasingly powerful, they still
exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding …

Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments

H Zhou, X Wan, Y Liu, N Collier, I Vulić… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have shown promising abilities as cost-effective and
reference-free evaluators for assessing language generation quality. In particular, pairwise …

Compare without Despair: Reliable Preference Evaluation with Generation Separability

S Ghosh, T Srinivasan, S Swayamdipta - arXiv preprint arXiv:2407.01878, 2024 - arxiv.org
Human evaluation of generated language through pairwise preference judgments is
pervasive. However, under common scenarios, such as when generations from a model pair …