Splitwise: Efficient generative LLM inference using phase splitting

P Patel, E Choukse, C Zhang, A Shah… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Generative large language model (LLM) applications are growing rapidly, leading to
large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM …

Resource-efficient algorithms and systems of foundation models: A survey

M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024 - dl.acm.org
Large foundation models, including large language models, vision transformers, diffusion,
and LLM-based multimodal models, are revolutionizing the entire machine learning …

LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

Cost-efficient large language model serving for multi-turn conversations with CachedAttention

B Gao, Z He, P Sharma, Q Kang, D Jevdjic… - 2024 USENIX Annual …, 2024 - usenix.org
Interacting with humans through multi-turn conversations is a fundamental feature of large
language models (LLMs). However, existing LLM serving engines executing multi-turn …

LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

The efficiency spectrum of large language models: An algorithmic survey

T Ding, T Chen, H Zhu, J Jiang, Y Zhong… - arXiv preprint arXiv …, 2023 - researchgate.net
The rapid growth of Large Language Models (LLMs) has been a driving force in
transforming various domains, reshaping the artificial general intelligence landscape …

Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache

B Lin, C Zhang, T Peng, H Zhao, W Xiao, M Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid proliferation of Large Language Models (LLMs) has been a driving force in the
growth of cloud-based LLM services, which are now integral to advancing AI applications …

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …

Salute the classic: Revisiting challenges of machine translation in the age of large language models

J Pang, F Ye, DF Wong, D Yu, S Shi, Z Tu… - Transactions of the …, 2025 - direct.mit.edu
The evolution of Neural Machine Translation (NMT) has been significantly
influenced by six core challenges (Koehn and Knowles) that have acted as benchmarks for …