Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

H Xia, Z Yang, Q Dong, P Wang, Y Li, T Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
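
Most of the entries below share the same draft-then-verify loop, so a minimal greedy sketch may help fix ideas before the individual papers. It is a toy under assumptions: draft_next and target_next stand in for greedy next-token functions of a small draft model and the large target model, and gamma is the draft length; no paper's actual API is implied.

    from typing import Callable, List

    def speculative_decode_greedy(
        prefix: List[int],
        draft_next: Callable[[List[int]], int],   # assumed greedy next-token fn of a cheap draft model
        target_next: Callable[[List[int]], int],  # assumed greedy next-token fn of the large target model
        gamma: int = 4,                           # tokens drafted per iteration
        max_new: int = 32,
    ) -> List[int]:
        """Draft-then-verify: the draft model proposes gamma tokens, the target
        model checks them, and every iteration commits at least one target-
        approved token, so the output matches plain greedy decoding exactly."""
        out = list(prefix)
        produced = 0
        while produced < max_new:
            # Draft phase: the cheap model proposes gamma tokens autoregressively.
            draft, ctx = [], list(out)
            for _ in range(gamma):
                t = draft_next(ctx)
                draft.append(t)
                ctx.append(t)
            # Verify phase: a real system scores all draft positions in one
            # parallel forward pass of the target model; we loop for clarity.
            for t in draft:
                committed = target_next(out)  # what the target would emit here
                out.append(committed)
                produced += 1
                if committed != t or produced >= max_new:
                    break  # first mismatch invalidates the rest of the draft
        return out

The speed-up comes from the verify phase being one parallel pass rather than gamma serial ones: when the drafter agrees with the target often, most iterations commit several tokens for roughly the cost of a single large-model step.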

Medusa: Simple LLM inference acceleration framework with multiple decoding heads

T Cai, Y Li, Z Geng, H Peng, JD Lee, D Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference process in Large Language Models (LLMs) is often limited due to the absence
of parallelism in the auto-regressive decoding process, resulting in most operations being …
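
As a rough illustration of the "multiple decoding heads" in the title: small extra heads read the backbone's final hidden state and each guesses a token one step further ahead, giving a multi-token candidate that the backbone can verify in a single pass. The residual-MLP shape below follows the paper's description, but the sizes and wiring are assumptions of this sketch.

    import torch
    import torch.nn as nn

    class MedusaStyleHeads(nn.Module):
        """K lightweight heads on top of a frozen backbone's final hidden
        state; head k guesses the token k+1 positions ahead of the one the
        base LM head predicts."""
        def __init__(self, hidden: int, vocab: int, num_heads: int = 4):
            super().__init__()
            self.mlps = nn.ModuleList(
                nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU())
                for _ in range(num_heads)
            )
            self.lm_heads = nn.ModuleList(
                nn.Linear(hidden, vocab, bias=False) for _ in range(num_heads)
            )

        def forward(self, h: torch.Tensor):
            # h: (batch, hidden) hidden state at the current position.
            # Residual MLP, then a per-head vocabulary projection.
            return [lm(h + mlp(h)) for mlp, lm in zip(self.mlps, self.lm_heads)]

    heads = MedusaStyleHeads(hidden=512, vocab=32000)
    guesses = [logits.argmax(-1) for logits in heads(torch.randn(1, 512))]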

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation

H Xia, T Ge, P Wang, SQ Chen, F Wei… - Findings of the …, 2023 - aclanthology.org
We propose Speculative Decoding (SpecDec), for the first time ever, to formally
study exploiting the idea of speculative execution to accelerate autoregressive (AR) …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
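
The full lookahead algorithm runs Jacobi-style parallel iterations to build an n-gram pool and verifies candidates in the same forward pass; both halves are beyond a snippet. As a toy of just the verification branch, under assumed names (pool maps a token to a previously observed n-gram starting with it; target_next is a greedy next-token function):

    from typing import Callable, Dict, List, Tuple

    def verify_from_pool(
        out: List[int],
        pool: Dict[int, Tuple[int, ...]],
        target_next: Callable[[List[int]], int],
    ) -> int:
        """Append the target's next token, then let a cached n-gram that
        starts with it 'pre-pay' the following steps: each cached token is
        kept only if the target model would have emitted it anyway, so the
        output is unchanged and only the latency drops."""
        first = target_next(out)
        out.append(first)
        accepted = 1
        for guess in pool.get(first, ())[1:]:
            nxt = target_next(out)  # one parallel pass in the real algorithm
            out.append(nxt)
            accepted += 1
            if nxt != guess:
                break  # mismatch: the rest of the cached n-gram is discarded
        return accepted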

Chain-of-thought reasoning without prompting

X Wang, D Zhou - arXiv preprint arXiv:2402.10200, 2024 - arxiv.org
In enhancing the reasoning capabilities of large language models (LLMs), prior research
primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of …

Eagle: Speculative sampling requires rethinking feature uncertainty

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2401.15077, 2024 - arxiv.org
Auto-regressive decoding makes the inference of Large Language Models (LLMs)
time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
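
The snippet cuts off before the method, but EAGLE's central move is drafting at the feature level: a small head extrapolates the backbone's next hidden feature from the current feature plus the embedding of the token just sampled, and the frozen LM head turns that feature into a draft token. The single-linear fusion and the dimensions below are assumptions of this sketch, not the paper's exact head.

    import torch
    import torch.nn as nn

    class EagleStyleDraftHead(nn.Module):
        """Extrapolates the next hidden feature from (current feature,
        embedding of the sampled token); a frozen LM head then maps the
        predicted feature to draft-token logits."""
        def __init__(self, hidden: int, embed: int):
            super().__init__()
            self.fuse = nn.Linear(hidden + embed, hidden)

        def forward(self, feat: torch.Tensor, tok_emb: torch.Tensor) -> torch.Tensor:
            return self.fuse(torch.cat([feat, tok_emb], dim=-1))

    hidden, embed, vocab = 512, 512, 32000
    head, lm_head = EagleStyleDraftHead(hidden, embed), nn.Linear(hidden, vocab)
    next_feat = head(torch.randn(1, hidden), torch.randn(1, embed))
    draft_logits = lm_head(next_feat)  # feed next_feat back in to draft further ahead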

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

C Zhang, Z Liu, D Song - arXiv preprint arXiv:2404.14897, 2024 - arxiv.org
As (causal) large language models (LLMs) grow to ever larger scales, inference
efficiency has become a core concern alongside their improved performance. In contrast to …

Kangaroo: Lossless self-speculative decoding via double early exiting

F Liu, Y Tang, Z Liu, Y Ni, K Han, Y Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Speculative decoding has demonstrated its effectiveness in accelerating the inference of
large language models while maintaining a consistent sampling distribution. However, the …
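
The "consistent sampling distribution" claim here, like the lossless guarantees elsewhere in this list, typically rests on the modified rejection-sampling rule of speculative sampling; in the usual notation, with p the target model's next-token distribution and q the drafter's:

    \[
      \Pr[\text{accept drafted } x] = \min\!\left(1, \frac{p(x)}{q(x)}\right),
      \qquad
      x' \sim \frac{\max(0,\; p(\cdot) - q(\cdot))}{\sum_{y}\max(0,\; p(y) - q(y))}
      \quad \text{on rejection.}
    \]

Accepting with that probability and resampling from the normalized residual on rejection makes each emitted token exactly distributed as p, which is why self-speculative methods such as Kangaroo can discard drafts without changing the model's outputs.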