Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

C Burns, P Izmailov, JH Kirchner, B Baker… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …

Weak-to-strong reasoning

Y Yang, Y Ma, P Liu - arXiv preprint arXiv:2407.13647, 2024 - arxiv.org
When large language models (LLMs) exceed human-level capabilities, it becomes
increasingly challenging to provide full-scale and accurate supervision for these models …

Your Weak LLM is Secretly a Strong Teacher for Alignment

L Tao, Y Li - arXiv preprint arXiv:2409.08813, 2024 - arxiv.org
The burgeoning capabilities of large language models (LLMs) have underscored the need
for alignment to ensure these models act in accordance with human values and intentions …

MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

A Opedal, H Shirakami, B Schölkopf, A Saparov… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but
little is known about how well they generalize to problems that are more complex than the …

EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?

A Agrawal, M Ding, Z Che, C Deng, A Satheesh… - arXiv preprint arXiv …, 2024 - arxiv.org
How can we harness the collective capabilities of multiple Large Language Models (LLMs)
to create an even more powerful model? This question forms the foundation of our research …

Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models

A Muhamed, M Diab, V Smith - arXiv preprint arXiv:2411.00743, 2024 - arxiv.org
Understanding and mitigating the potential risks associated with foundation models (FMs)
hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have …

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Z Yang, Y Zhang, T Liu, J Yang, J Lin, C Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer
from inconsistency issues (e.g., LLMs can react differently to disturbances like rephrasing or …

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

M Ding, C Deng, J Choo, Z Wu, A Agrawal… - arXiv preprint arXiv …, 2024 - arxiv.org
While generalization over tasks from easy to hard is crucial for profiling large language models
(LLMs), datasets with fine-grained difficulty annotations for each problem across a broad …

Aligning LLMs with Domain Invariant Reward Models

D Wu, S Choudhury - arXiv preprint arXiv:2501.00911, 2025 - arxiv.org
Aligning large language models (LLMs) to human preferences is challenging in domains
where preference data is unavailable. We address the problem of learning reward models …

Evaluating Fine-Tuning Efficiency of Human-Inspired Learning Strategies in Medical Question Answering

Y Yang, AM Bean, R McCraith, A Mahdi - arXiv preprint arXiv:2408.07888, 2024 - arxiv.org
Fine-tuning Large Language Models (LLMs) incurs considerable training costs, driving the
need for data-efficient training with optimised data ordering. Human-inspired strategies offer …