Anytime-Valid Confidence Sequences for Consistent Uncertainty Estimation in Early-Exit Neural Networks

M Jazbec, P Forré, S Mandt, D Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Early-exit neural networks (EENNs) facilitate adaptive inference by producing predictions at
multiple stages of the forward pass. In safety-critical applications, these predictions are only …
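
The snippet only sketches the mechanism, so here is a rough illustration of a confidence-thresholded early-exit forward pass; the module layout and the max-probability gate are assumptions made for the sketch, not the paper's method (the paper is about attaching anytime-valid confidence sequences to such exits):

```python
# Minimal early-exit sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn


class TinyEENN(nn.Module):
    """A toy early-exit network: every block gets its own classifier head."""

    def __init__(self, dim: int = 32, num_classes: int = 10, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)
        )
        self.heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(depth))

    @torch.no_grad()
    def forward(self, x: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
        # Run blocks in order; stop as soon as a head is confident enough.
        # Assumes a single example for simplicity.
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            probs = torch.softmax(head(x), dim=-1)
            if probs.max().item() >= threshold:  # confidence gate
                return probs  # early exit: skip the remaining blocks
        return probs  # fell through to the final head
```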

Energy-Efficient Inference With Software-Hardware Co-Design for Sustainable Artificial Intelligence of Things

S Dai, Z Luo, W Luo, S Wang, C Dai… - IEEE Internet of …, 2024 - ieeexplore.ieee.org
The emerging field of Artificial Intelligence of Things (AIoT) is propelled by the remarkable
success of deep learning and hardware evolution, which has a significant impact on our …

On the Role of Depth and Looping for In-Context Learning with Task Diversity

K Gatmiry, N Saunshi, SJ Reddi, S Jegelka… - arXiv preprint arXiv …, 2024 - arxiv.org
The intriguing in-context learning (ICL) abilities of deep Transformer models have lately
garnered significant attention. By studying in-context linear regression on unimodal …

Fast yet Safe: Early-Exiting with Risk Control

M Jazbec, A Timans, TH Veljković, K Sakmann… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling machine learning models significantly improves their performance. However, such
gains come at the cost of inference being slow and resource-intensive. Early-exit neural …

DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach

DG Fernández, RA Matişan, AM Muñoz… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have achieved unprecedented performance in image generation, yet they
suffer from slow inference due to their iterative sampling process. To address this, early …

RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference

L Huang, S Wu, Y Cui, Y Xiong, X Liu, TW Kuo… - arXiv preprint arXiv …, 2024 - arxiv.org
Deploying large language models for inference remains challenging due to their high
computational overhead. Early exiting accelerates model inference by adaptively reducing …

Cascade-Aware Training of Language Models

C Wang, S Augenstein, K Rush, W Jitkrittum… - arXiv preprint arXiv …, 2024 - arxiv.org
Reducing serving cost and latency is a fundamental concern for the deployment of language
models (LMs) in business applications. To address this, cascades of LMs offer an effective …
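
As a rough picture of how such a cascade routes queries, the sketch below answers with a small model and defers to a large one when a confidence score falls short; the function names and the deferral rule are illustrative assumptions, and the paper's actual contribution, training the small model to be cascade-aware, is not shown:

```python
# Two-model LM cascade sketch (names and thresholds are assumptions).
def cascade_generate(prompt, small_lm, large_lm, confidence, tau=0.8):
    """Answer with the small model; defer to the large one when unsure.

    small_lm / large_lm: callables mapping a prompt to a completion.
    confidence: callable scoring the small model's output in [0, 1],
                e.g. its mean token probability.
    """
    draft = small_lm(prompt)
    if confidence(prompt, draft) >= tau:
        return draft            # cheap path: small model is confident
    return large_lm(prompt)     # expensive path: defer to the large model
```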

Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling

YC Yu, CC Kuo, Z Ye, YC Chang, YS Li - arXiv preprint arXiv:2406.12585, 2024 - arxiv.org
Ensembling multiple models has always been an effective approach to push the limits of
existing performance and is widely used in classification tasks by simply averaging the …
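
The averaging the snippet alludes to can be made concrete at the token level: treat each decoding step as a classification over a shared vocabulary and average the models' next-token distributions. The sketch below is a generic version of that recipe, not necessarily the paper's exact scheme:

```python
# Token-level ensembling by averaging next-token distributions.
import torch


def ensemble_next_token(logits_per_model: list[torch.Tensor]) -> int:
    """Average per-model next-token distributions and pick the argmax.

    Each tensor has shape (vocab_size,); the models must share a
    vocabulary for this naive averaging to be well defined.
    """
    probs = torch.stack([torch.softmax(l, dim=-1) for l in logits_per_model])
    return int(probs.mean(dim=0).argmax().item())
```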

Dynamic Vocabulary Pruning in Early-Exit LLMs

J Vincenti, KA Sadek, J Velja, M Nulli… - arXiv preprint arXiv …, 2024 - arxiv.org
Increasing the size of large language models (LLMs) has been shown to lead to better
performance. However, this comes at the cost of slower and more expensive inference. Early …
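
Judging from the title, the idea is to shrink the exit classifier's softmax from the full vocabulary to a small candidate set, since scoring every token at every exit is what makes exit decisions expensive in LLMs. The sketch below is a hypothetical rendering of that idea; the candidate-selection rule (e.g. top-k from an early layer) is left to the caller and is an assumption, not the paper's method:

```python
# Hypothetical pruned-vocabulary exit-confidence computation.
import torch


def pruned_exit_confidence(hidden: torch.Tensor,
                           unembedding: torch.Tensor,
                           candidate_ids: torch.Tensor) -> torch.Tensor:
    """Exit confidence over a pruned vocabulary instead of the full one.

    hidden:        (d,) hidden state at the early exit.
    unembedding:   (V, d) full output-embedding matrix.
    candidate_ids: (k,) indices of the tokens kept after pruning, k << V.
    """
    logits = unembedding[candidate_ids] @ hidden   # (k,) instead of (V,)
    return torch.softmax(logits, dim=-1).max()
```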