[PDF][PDF] Skeleton-of-thought: Large language models can do parallel decoding

X Ning, Z Lin, Z Zhou, Z Wang, H Yang… - Proceedings ENLSP …, 2023 - lirias.kuleuven.be
This work aims at decreasing the end-to-end generation latency of large language models
(LLMs). One of the major causes of the high generation latency is the sequential decoding …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Large language model inference acceleration: A comprehensive hardware perspective

J Li, J Xu, S Huang, Y Chen, W Li, J Liu, Y Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …

A survey of resource-efficient llm and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …

Resource-efficient Algorithms and Systems of Foundation Models: A Survey

M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024 - dl.acm.org
Large foundation models, including large language models, vision transformers, diffusion,
and LLM-based multimodal models, are revolutionizing the entire machine learning …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Memserve: Context caching for disaggregated llm serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …

Llm for mobile: An initial roadmap

D Chen, Y Liu, M Zhou, Y Zhao, H Wang… - ACM Transactions on …, 2024 - dl.acm.org
When mobile meets LLMs, mobile app users deserve to have more intelligent usage
experiences. For this to happen, we argue that there is a strong need to apply LLMs for the …

Inference without interference: Disaggregate llm inference for mixed downstream workloads

C Hu, H Huang, L Xu, X Chen, J Xu, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language model (LLM) inference serving is now the backbone of
many cloud services. LLM inference consists of a prefill phase and a decode phase …