On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large language models (LLMs) has revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Mooncake: A KVCache-centric disaggregated architecture for LLM serving

R Qin, Z Li, W He, M Zhang, Y Wu, W Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It
features a KVCache-centric disaggregated architecture that separates the prefill and …
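The disaggregation the snippet describes, with prefill and decode running on separate worker pools and the KV cache handed from one to the other, can be pictured with the toy sketch below. The PrefillWorker/DecodeWorker classes and the arithmetic standing in for model forward passes are invented for illustration; this is not Mooncake's implementation.

```python
from dataclasses import dataclass, field

# Toy prefill/decode disaggregation sketch (hypothetical classes, not Mooncake's API).

@dataclass
class KVCache:
    # A real KV cache holds per-layer key/value tensors; here it is just token ids.
    tokens: list = field(default_factory=list)

class PrefillWorker:
    """Runs the compute-heavy prefill pass, emitting the first token and the KV cache."""
    def prefill(self, prompt_tokens):
        cache = KVCache(tokens=list(prompt_tokens))
        first_token = sum(prompt_tokens) % 100        # stand-in for a real forward pass
        return first_token, cache

class DecodeWorker:
    """Consumes a transferred KV cache and generates the remaining tokens one by one."""
    def decode(self, first_token, cache, max_new_tokens=4):
        output = [first_token]
        for _ in range(max_new_tokens - 1):
            nxt = (output[-1] * 31 + len(cache.tokens)) % 100   # stand-in decode step
            cache.tokens.append(nxt)
            output.append(nxt)
        return output

# A request is prefilled on one pool, then its KV cache is shipped to the decode pool.
prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
first, kv = prefill_pool.prefill([12, 7, 42])
print(decode_pool.decode(first, kv))
```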

MemServe: Context caching for disaggregated LLM serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …
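Context caching, as named in the snippet, amounts to keying stored KV state on a shared prompt prefix so that repeated prefixes skip recomputation. The sketch below is a rough illustration under that assumption; the kv_pool dictionary and compute_kv stand-in are hypothetical and do not reflect MemServe's elastic memory pool.

```python
import hashlib

# Minimal context/prefix caching sketch (hypothetical helpers, not MemServe's API).
kv_pool = {}  # maps prefix hash -> precomputed "KV state" for that prefix

def prefix_key(tokens):
    return hashlib.sha256(str(tokens).encode()).hexdigest()

def compute_kv(tokens):
    # Stand-in for the expensive prefill forward pass over `tokens`.
    return {"length": len(tokens), "state": sum(tokens)}

def prefill_with_cache(prompt_tokens, shared_prefix_len):
    prefix = prompt_tokens[:shared_prefix_len]
    key = prefix_key(prefix)
    if key not in kv_pool:              # cache miss: pay the prefill cost once
        kv_pool[key] = compute_kv(prefix)
    prefix_kv = kv_pool[key]            # cache hit on later requests
    suffix_kv = compute_kv(prompt_tokens[shared_prefix_len:])
    return prefix_kv, suffix_kv         # only the suffix is recomputed per request

# Two requests sharing the same system prompt reuse one cached prefix state.
prefill_with_cache([1, 2, 3, 10, 11], shared_prefix_len=3)
prefill_with_cache([1, 2, 3, 99], shared_prefix_len=3)
print(len(kv_pool))  # 1: the shared prefix was computed only once
```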

NEO: Saving GPU memory crisis with CPU offloading for online LLM inference

X Jiang, Y Zhou, S Cao, I Stoica, M Yu - arXiv preprint arXiv:2411.01142, 2024 - arxiv.org
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …
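The request batching the snippet mentions can be illustrated with a minimal continuous-batching loop: requests join the running batch whenever a slot frees up and leave as soon as they finish. The step function and request records below are made up for the sketch; Neo's actual contribution, CPU offloading, is not modeled here.

```python
from collections import deque

# Continuous-batching sketch (illustrative only; `step` fakes a per-token model call).

def step(active):
    # One decode iteration over the whole batch: each request gets one more token.
    for req in active:
        req["out"].append((len(req["out"]) * 13 + req["id"]) % 50)

def serve(requests, max_tokens=4, batch_slots=8):
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        # Admit waiting requests whenever the running batch has a free slot.
        while waiting and len(active) < batch_slots:
            active.append(waiting.popleft())
        step(active)
        # Finished requests leave immediately, freeing their slots for new arrivals.
        still_running = []
        for req in active:
            (done if len(req["out"]) >= max_tokens else still_running).append(req)
        active = still_running
    return done

requests = [{"id": i, "out": []} for i in range(10)]
print([r["id"] for r in serve(requests)])
```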

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

A Agrawal, N Kedia, A Panwar, J Mohan… - … USENIX Symposium on …, 2024 - usenix.org
Each LLM serving request goes through two phases. The first is prefill, which processes the
entire input prompt and produces the first output token, and the second is decode, which …
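The two phases named in the abstract can be sketched as a plain generation loop: a single prefill pass over the whole prompt yields the first output token, and decode then produces one token per iteration. The forward function below is a hypothetical stand-in for a real model call, not Sarathi-Serve's scheduler.

```python
# Two-phase generation sketched in plain Python: prefill processes the whole prompt
# and yields the first output token; decode then emits one token per iteration.
# `forward` is a hypothetical stand-in for a real model call.

def forward(tokens):
    # Fake forward pass: derives a "next token" from the context seen so far.
    return (sum(tokens) * 7 + len(tokens)) % 50

def generate(prompt_tokens, max_new_tokens=5):
    # Phase 1: prefill, compute-bound, runs once over the entire prompt.
    context = list(prompt_tokens)
    first = forward(context)
    context.append(first)
    output = [first]

    # Phase 2: decode, memory-bound, produces one token per step.
    for _ in range(max_new_tokens - 1):
        nxt = forward(context)
        context.append(nxt)
        output.append(nxt)
    return output

print(generate([3, 1, 4, 1, 5]))
```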

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

C Guo, F Cheng, Z Du, J Kiessling, J Ku, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models (LLMs) has significantly transformed the
field of artificial intelligence, demonstrating remarkable capabilities in natural language …