The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being …
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
H Xia, T Ge, P Wang, SQ Chen, F Wei… - Findings of the …, 2023 - aclanthology.org
Abstract: We propose Speculative Decoding (SpecDec), the first work to formally study exploiting the idea of speculative execution to accelerate autoregressive (AR) …
Autoregressive decoding of large language models (LLMs) is memory-bandwidth-bound, resulting in high latency and significant waste of the parallel processing power of modern …
X Wang, D Zhou - arXiv preprint arXiv:2402.10200, 2024 - arxiv.org
In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of …
Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2401.15077, 2024 - arxiv.org
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory …
C Zhang, Z Liu, D Song - arXiv preprint arXiv:2404.14897, 2024 - arxiv.org
With the increasingly large scale of (causal) large language models (LLMs), inference efficiency has become one of the core concerns alongside improved performance. In contrast to …
Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the …
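The snippets above all revolve around the same draft-and-verify idea: a cheap draft model proposes several tokens, and the target model checks them in one parallel pass, with the guarantee that the final output matches what the target model alone would have produced. A minimal greedy-decoding sketch of that loop is below; it is illustrative only (the toy next-token functions and the `speculative_decode` helper are assumptions for this example, not the implementation of any cited paper):

```python
# Minimal greedy speculative-decoding sketch (illustrative only; the toy
# next-token "models" are stand-ins for real LLM forward passes).

def speculative_decode(target_next, draft_next, prompt, k=4, max_new=12):
    """Draft k tokens with the cheap model, verify them against the target
    model, and accept the longest matching prefix plus one corrected token.
    Output is identical to pure greedy decoding with target_next."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft phase: the small model proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: score every drafted position with the target model.
        # In a real system this is a single parallel forward pass, which is
        # where the latency savings come from.
        accepted = 0
        for i in range(k):
            if target_next(out + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        if accepted < k:
            # Replace the first rejected token with the target's own choice,
            # preserving exact equivalence with target-only greedy decoding.
            out.append(target_next(out))
    return out[:len(prompt) + max_new]
```

Because every accepted token passes the same test greedy decoding would apply, and every rejection is replaced by the target model's own pick, the output distribution is unchanged; the stochastic-sampling variant in the literature achieves the same guarantee via rejection sampling rather than exact token matching.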