Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

H Xia, Z Yang, Q Dong, P Wang, Y Li, T Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
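
Most of the entries below share the same draft-then-verify loop, so a minimal greedy sketch may help fix ideas before the individual papers. It is a toy under assumptions: draft_next and target_next stand in for greedy next-token functions of a small draft model and the large target model, and gamma is the draft length; no paper's actual API is implied.

    from typing import Callable, List

    def speculative_decode_greedy(
        prefix: List[int],
        draft_next: Callable[[List[int]], int],   # assumed greedy next-token fn of a cheap draft model
        target_next: Callable[[List[int]], int],  # assumed greedy next-token fn of the large target model
        gamma: int = 4,                           # tokens drafted per iteration
        max_new: int = 32,
    ) -> List[int]:
        """Draft-then-verify: the draft model proposes gamma tokens, the target
        model checks them, and every iteration commits at least one target-
        approved token, so the output matches plain greedy decoding exactly."""
        out = list(prefix)
        produced = 0
        while produced < max_new:
            # Draft phase: the cheap model proposes gamma tokens autoregressively.
            draft, ctx = [], list(out)
            for _ in range(gamma):
                t = draft_next(ctx)
                draft.append(t)
                ctx.append(t)
            # Verify phase: a real system scores all draft positions in one
            # parallel forward pass of the target model; we loop for clarity.
            for t in draft:
                committed = target_next(out)  # what the target would emit here
                out.append(committed)
                produced += 1
                if committed != t or produced >= max_new:
                    break  # first mismatch invalidates the rest of the draft
        return out

The speed-up comes from the verify phase being one parallel pass rather than gamma serial ones: when the drafter agrees with the target often, most iterations commit several tokens for roughly the cost of a single large-model step.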

Medusa: Simple LLM inference acceleration framework with multiple decoding heads

T Cai, Y Li, Z Geng, H Peng, JD Lee, D Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The inference process in Large Language Models (LLMs) is often limited due to the absence
of parallelism in the auto-regressive decoding process, resulting in most operations being …
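
As a rough illustration of the "multiple decoding heads" in the title: small extra heads read the backbone's final hidden state and each guesses a token one step further ahead, giving a multi-token candidate that the backbone can verify in a single pass. The residual-MLP shape below follows the paper's description, but the sizes and wiring are assumptions of this sketch.

    import torch
    import torch.nn as nn

    class MedusaStyleHeads(nn.Module):
        """K lightweight heads on top of a frozen backbone's final hidden
        state; head k guesses the token k+1 positions ahead of the one the
        base LM head predicts."""
        def __init__(self, hidden: int, vocab: int, num_heads: int = 4):
            super().__init__()
            self.mlps = nn.ModuleList(
                nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU())
                for _ in range(num_heads)
            )
            self.lm_heads = nn.ModuleList(
                nn.Linear(hidden, vocab, bias=False) for _ in range(num_heads)
            )

        def forward(self, h: torch.Tensor):
            # h: (batch, hidden) hidden state at the current position.
            # Residual MLP, then a per-head vocabulary projection.
            return [lm(h + mlp(h)) for mlp, lm in zip(self.mlps, self.lm_heads)]

    heads = MedusaStyleHeads(hidden=512, vocab=32000)
    guesses = [logits.argmax(-1) for logits in heads(torch.randn(1, 512))]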

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation

H Xia, T Ge, P Wang, SQ Chen, F Wei… - Findings of the …, 2023 - aclanthology.org
We propose Speculative Decoding (SpecDec), for the first time ever, to formally
study exploiting the idea of speculative execution to accelerate autoregressive (AR) …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
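
The full lookahead algorithm runs Jacobi-style parallel iterations to build an n-gram pool and verifies candidates in the same forward pass; both halves are beyond a snippet. As a toy of just the verification branch, under assumed names (pool maps a token to a previously observed n-gram starting with it; target_next is a greedy next-token function):

    from typing import Callable, Dict, List, Tuple

    def verify_from_pool(
        out: List[int],
        pool: Dict[int, Tuple[int, ...]],
        target_next: Callable[[List[int]], int],
    ) -> int:
        """Append the target's next token, then let a cached n-gram that
        starts with it 'pre-pay' the following steps: each cached token is
        kept only if the target model would have emitted it anyway, so the
        output is unchanged and only the latency drops."""
        first = target_next(out)
        out.append(first)
        accepted = 1
        for guess in pool.get(first, ())[1:]:
            nxt = target_next(out)  # one parallel pass in the real algorithm
            out.append(nxt)
            accepted += 1
            if nxt != guess:
                break  # mismatch: the rest of the cached n-gram is discarded
        return accepted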

Chain-of-thought reasoning without prompting

X Wang, D Zhou - arXiv preprint arXiv:2402.10200, 2024 - arxiv.org
In enhancing the reasoning capabilities of large language models (LLMs), prior research
primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of …

Eagle: Speculative sampling requires rethinking feature uncertainty

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2401.15077, 2024 - arxiv.org
Auto-regressive decoding makes the inference of Large Language Models (LLMs)
time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
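
The snippet cuts off before the method, but EAGLE's central move is drafting at the feature level: a small head extrapolates the backbone's next hidden feature from the current feature plus the embedding of the token just sampled, and the frozen LM head turns that feature into a draft token. The single-linear fusion and the dimensions below are assumptions of this sketch, not the paper's exact head.

    import torch
    import torch.nn as nn

    class EagleStyleDraftHead(nn.Module):
        """Extrapolates the next hidden feature from (current feature,
        embedding of the sampled token); a frozen LM head then maps the
        predicted feature to draft-token logits."""
        def __init__(self, hidden: int, embed: int):
            super().__init__()
            self.fuse = nn.Linear(hidden + embed, hidden)

        def forward(self, feat: torch.Tensor, tok_emb: torch.Tensor) -> torch.Tensor:
            return self.fuse(torch.cat([feat, tok_emb], dim=-1))

    hidden, embed, vocab = 512, 512, 32000
    head, lm_head = EagleStyleDraftHead(hidden, embed), nn.Linear(hidden, vocab)
    next_feat = head(torch.randn(1, hidden), torch.randn(1, embed))
    draft_logits = lm_head(next_feat)  # feed next_feat back in to draft further ahead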

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

C Zhang, Z Liu, D Song - arXiv preprint arXiv:2404.14897, 2024 - arxiv.org
As (causal) large language models (LLMs) grow to ever larger scales, inference
efficiency has become a core concern alongside their improved performance. In contrast to …

Kangaroo: Lossless self-speculative decoding via double early exiting

F Liu, Y Tang, Z Liu, Y Ni, K Han, Y Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Speculative decoding has demonstrated its effectiveness in accelerating the inference of
large language models while maintaining a consistent sampling distribution. However, the …
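
The "consistent sampling distribution" claim here, like the lossless guarantees elsewhere in this list, typically rests on the modified rejection-sampling rule of speculative sampling; in the usual notation, with p the target model's next-token distribution and q the drafter's:

    \[
      \Pr[\text{accept drafted } x] = \min\!\left(1, \frac{p(x)}{q(x)}\right),
      \qquad
      x' \sim \frac{\max(0,\; p(\cdot) - q(\cdot))}{\sum_{y}\max(0,\; p(y) - q(y))}
      \quad \text{on rejection.}
    \]

Accepting with that probability and resampling from the normalized residual on rejection makes each emitted token exactly distributed as p, which is why self-speculative methods such as Kangaroo can discard drafts without changing the model's outputs.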