On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large language models (LLMs) has revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Mooncake: A KVCache-centric disaggregated architecture for LLM serving

R Qin, Z Li, W He, M Zhang, Y Wu, W Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It
features a KVCache-centric disaggregated architecture that separates the prefill and …
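The disaggregation the snippet describes, with prefill and decode running on separate worker pools and the KV cache handed from one to the other, can be pictured with the toy sketch below. The PrefillWorker/DecodeWorker classes and the arithmetic standing in for model forward passes are invented for illustration; this is not Mooncake's implementation.

```python
from dataclasses import dataclass, field

# Toy prefill/decode disaggregation sketch (hypothetical classes, not Mooncake's API).

@dataclass
class KVCache:
    # A real KV cache holds per-layer key/value tensors; here it is just token ids.
    tokens: list = field(default_factory=list)

class PrefillWorker:
    """Runs the compute-heavy prefill pass, emitting the first token and the KV cache."""
    def prefill(self, prompt_tokens):
        cache = KVCache(tokens=list(prompt_tokens))
        first_token = sum(prompt_tokens) % 100        # stand-in for a real forward pass
        return first_token, cache

class DecodeWorker:
    """Consumes a transferred KV cache and generates the remaining tokens one by one."""
    def decode(self, first_token, cache, max_new_tokens=4):
        output = [first_token]
        for _ in range(max_new_tokens - 1):
            nxt = (output[-1] * 31 + len(cache.tokens)) % 100   # stand-in decode step
            cache.tokens.append(nxt)
            output.append(nxt)
        return output

# A request is prefilled on one pool, then its KV cache is shipped to the decode pool.
prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
first, kv = prefill_pool.prefill([12, 7, 42])
print(decode_pool.decode(first, kv))
```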

MemServe: Context caching for disaggregated LLM serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …
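Context caching, as named in the snippet, amounts to keying stored KV state on a shared prompt prefix so that repeated prefixes skip recomputation. The sketch below is a rough illustration under that assumption; the kv_pool dictionary and compute_kv stand-in are hypothetical and do not reflect MemServe's elastic memory pool.

```python
import hashlib

# Minimal context/prefix caching sketch (hypothetical helpers, not MemServe's API).
kv_pool = {}  # maps prefix hash -> precomputed "KV state" for that prefix

def prefix_key(tokens):
    return hashlib.sha256(str(tokens).encode()).hexdigest()

def compute_kv(tokens):
    # Stand-in for the expensive prefill forward pass over `tokens`.
    return {"length": len(tokens), "state": sum(tokens)}

def prefill_with_cache(prompt_tokens, shared_prefix_len):
    prefix = prompt_tokens[:shared_prefix_len]
    key = prefix_key(prefix)
    if key not in kv_pool:              # cache miss: pay the prefill cost once
        kv_pool[key] = compute_kv(prefix)
    prefix_kv = kv_pool[key]            # cache hit on later requests
    suffix_kv = compute_kv(prompt_tokens[shared_prefix_len:])
    return prefix_kv, suffix_kv         # only the suffix is recomputed per request

# Two requests sharing the same system prompt reuse one cached prefix state.
prefill_with_cache([1, 2, 3, 10, 11], shared_prefix_len=3)
prefill_with_cache([1, 2, 3, 99], shared_prefix_len=3)
print(len(kv_pool))  # 1: the shared prefix was computed only once
```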

NEO: Saving GPU memory crisis with CPU offloading for online LLM inference

X Jiang, Y Zhou, S Cao, I Stoica, M Yu - arXiv preprint arXiv:2411.01142, 2024 - arxiv.org
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …
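The request batching the snippet mentions can be illustrated with a minimal continuous-batching loop: requests join the running batch whenever a slot frees up and leave as soon as they finish. The step function and request records below are made up for the sketch; Neo's actual contribution, CPU offloading, is not modeled here.

```python
from collections import deque

# Continuous-batching sketch (illustrative only; `step` fakes a per-token model call).

def step(active):
    # One decode iteration over the whole batch: each request gets one more token.
    for req in active:
        req["out"].append((len(req["out"]) * 13 + req["id"]) % 50)

def serve(requests, max_tokens=4, batch_slots=8):
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        # Admit waiting requests whenever the running batch has a free slot.
        while waiting and len(active) < batch_slots:
            active.append(waiting.popleft())
        step(active)
        # Finished requests leave immediately, freeing their slots for new arrivals.
        still_running = []
        for req in active:
            (done if len(req["out"]) >= max_tokens else still_running).append(req)
        active = still_running
    return done

requests = [{"id": i, "out": []} for i in range(10)]
print([r["id"] for r in serve(requests)])
```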

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

A Agrawal, N Kedia, A Panwar, J Mohan… - … USENIX Symposium on …, 2024 - usenix.org
Each LLM serving request goes through two phases. The first is prefill, which processes the
entire input prompt and produces the first output token, and the second is decode, which …
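The two phases named in the abstract can be sketched as a plain generation loop: a single prefill pass over the whole prompt yields the first output token, and decode then produces one token per iteration. The forward function below is a hypothetical stand-in for a real model call, not Sarathi-Serve's scheduler.

```python
# Two-phase generation sketched in plain Python: prefill processes the whole prompt
# and yields the first output token; decode then emits one token per iteration.
# `forward` is a hypothetical stand-in for a real model call.

def forward(tokens):
    # Fake forward pass: derives a "next token" from the context seen so far.
    return (sum(tokens) * 7 + len(tokens)) % 50

def generate(prompt_tokens, max_new_tokens=5):
    # Phase 1: prefill, compute-bound, runs once over the entire prompt.
    context = list(prompt_tokens)
    first = forward(context)
    context.append(first)
    output = [first]

    # Phase 2: decode, memory-bound, produces one token per step.
    for _ in range(max_new_tokens - 1):
        nxt = forward(context)
        context.append(nxt)
        output.append(nxt)
    return output

print(generate([3, 1, 4, 1, 5]))
```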

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

C Guo, F Cheng, Z Du, J Kiessling, J Ku, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models (LLMs) has significantly transformed the
field of artificial intelligence, demonstrating remarkable capabilities in natural language …