Efficient large language models: A survey

Z Wan, X Wang, C Liu, S Alam, Y Zheng, J Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Cost-efficient large language model serving for multi-turn conversations with CachedAttention

B Gao, Z He, P Sharma, Q Kang, D Jevdjic… - 2024 USENIX Annual …, 2024 - usenix.org
Interacting with humans through multi-turn conversations is a fundamental feature of large
language models (LLMs). However, existing LLM serving engines executing multi-turn …

Transformers are multi-state RNNs

M Oren, M Hassid, N Yarden, Y Adi… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers are considered conceptually different from the previous generation of state-of-
the-art NLP models, recurrent neural networks (RNNs). In this work, we demonstrate that …

Efficiently Programming Large Language Models using SGLang

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

Personal llm agents: Insights and survey about the capability, efficiency and security

Y Li, H Wen, W Wang, X Li, Y Yuan, G Liu, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Since the advent of personal computing devices, intelligent personal assistants (IPAs) have
been one of the key technologies that researchers and engineers have focused on, aiming …

MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

Mobile edge intelligence for large language models: A contemporary survey

G Qu, Q Chen, W Wei, Z Lin, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs), referring to running LLMs on edge devices, have
raised considerable interest owing to their superior privacy, reduced latency, and bandwidth …

SGLang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, C Sun, J Huang… - arXiv preprint arXiv …, 2024 - minjiazhang.github.io
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …