Efficient large language models: A survey

Z Wan, X Wang, C Liu, S Alam, Y Zheng… - arXiv preprint arXiv …, 2023 - researchgate.net
Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in
important tasks such as natural language understanding, language generation, and …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded,
resulting in high latency and significant waste of the parallel processing power of modern …
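
The paper's central move is to replace strictly sequential token generation with Jacobi-style parallel refinement. Below is a minimal sketch of that idea, assuming a toy deterministic next-token function in place of a real LLM forward pass; in lookahead decoding proper, each refinement step is a single batched forward pass and n-gram caching makes the parallelism pay off.

```python
# Sketch of Jacobi-style parallel decoding, the idea behind lookahead
# decoding. `toy_next_token` is an assumed stand-in for one LLM forward
# step; the window size and fixed-point test are illustrative only.

def toy_next_token(context):
    """Stand-in for one LLM decoding step: next token given a context."""
    return (sum(context) * 31 + 7) % 1000

def autoregressive(prompt, n):
    seq = list(prompt)
    for _ in range(n):                 # n strictly sequential model calls
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, max_iters=50):
    # Guess all n future tokens at once, then refine every position in
    # parallel until the window stops changing (a fixed point). Each
    # refinement would be one *batched* pass in a real system, and at
    # least one more leading token becomes correct per iteration.
    guess = [0] * n
    for _ in range(max_iters):
        new = [toy_next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:               # fixed point: whole window verified
            break
        guess = new
    return guess

prompt = [1, 2, 3]
assert jacobi_decode(prompt, 8) == autoregressive(prompt, 8)
```

Because position i depends only on positions before it, the verified prefix grows by at least one token per iteration, so the parallel refinement can never be slower than sequential decoding in iteration count.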

LLM Inference Serving: Survey of Recent Advances and Opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

INSS: An intelligent scheduling orchestrator for multi-GPU inference with spatio-temporal sharing

Z Han, R Zhou, C Xu, Y Zeng… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
As the applications of AI proliferate, it is critical to increase the throughput of online DNN
inference services. Multi-Process Service (MPS) improves the utilization rate of GPU …
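
As a rough illustration of spatio-temporal sharing (not INSS's actual scheduling algorithm), the toy packer below assigns each inference job a compute fraction of a GPU, as MPS does, plus a window of time slots, co-locating jobs whenever their fractions fit. All names and the greedy policy are illustrative assumptions.

```python
# Toy greedy scheduler: spatial sharing = fraction of a GPU's compute
# (as with MPS), temporal sharing = a window of discrete time slots.

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    gpu_frac: float      # fraction of SMs needed, e.g. 0.4 = 40% via MPS
    duration: int        # number of time slots needed

@dataclass
class Gpu:
    name: str
    # capacity[t] = free compute fraction at time slot t
    capacity: list = field(default_factory=lambda: [1.0] * 10)

def place(job, gpus):
    """Earliest slot where some GPU has enough free compute for the
    job's whole duration; co-locates jobs when their fractions fit."""
    for gpu in gpus:
        for t in range(len(gpu.capacity) - job.duration + 1):
            window = gpu.capacity[t:t + job.duration]
            if all(c >= job.gpu_frac for c in window):
                for s in range(t, t + job.duration):
                    gpu.capacity[s] -= job.gpu_frac   # reserve the share
                return gpu.name, t
    return None

gpus = [Gpu("gpu0"), Gpu("gpu1")]
jobs = [Job("bert", 0.4, 3), Job("resnet", 0.5, 2), Job("llm", 0.8, 4)]
for job in jobs:
    print(job.name, "->", place(job, gpus))
```

Here "bert" and "resnet" share gpu0's slots spatially (0.4 + 0.5 ≤ 1.0), while "llm" is deferred to later slots temporally; a real orchestrator would also model interference between co-located jobs.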

Efficient training and inference: Techniques for large language models using llama

SR Cunningham, D Archambault, A Kung - Authorea Preprints, 2024 - techrxiv.org
Enhancing the efficiency of language models involves optimizing their training and
inference processes to reduce computational demands while maintaining high performance …

Teola: Towards End-to-End Optimization of LLM-based Applications

X Tan, Y Jiang, Y Yang, H Xu - arXiv preprint arXiv:2407.00326, 2024 - arxiv.org
Large language model (LLM)-based applications consist of both LLM and non-LLM
components, each contributing to the end-to-end latency. Despite great efforts to optimize …
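
The end-to-end framing can be illustrated with a small dataflow executor: instead of running an LLM application stage by stage, express it as a graph of primitives and launch every primitive the moment its dependencies are met. The graph, primitive names, and thread-based executor below are illustrative assumptions, not Teola's runtime.

```python
# Dependency-driven execution of an LLM-application dataflow graph:
# independent primitives (e.g. system-prompt prefill vs. retrieval)
# overlap instead of running back to back.

import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def primitive(name, secs):
    def run():
        time.sleep(secs)          # stand-in for real work (I/O or GPU)
        return name
    return run

# node -> (fn, dependencies); prefill_sys has no edge to retrieval,
# so the executor is free to overlap the two.
graph = {
    "embed_query": (primitive("embed_query", 0.1), []),
    "retrieve":    (primitive("retrieve", 0.3), ["embed_query"]),
    "rerank":      (primitive("rerank", 0.2), ["retrieve"]),
    "prefill_sys": (primitive("prefill_sys", 0.5), []),
    "decode":      (primitive("decode", 0.2), ["rerank", "prefill_sys"]),
}

def run_graph(graph):
    done, running = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(graph):
            # launch every node whose dependencies are all satisfied
            for name, (fn, deps) in graph.items():
                if name not in done and name not in running \
                        and all(d in done for d in deps):
                    running[name] = pool.submit(fn)
            # block until any running node finishes, then retire it
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, f in running.items() if f in finished]:
                done.add(name)
                del running[name]

start = time.time()
run_graph(graph)
sequential = 0.1 + 0.3 + 0.2 + 0.5 + 0.2
print(f"overlapped: {time.time() - start:.2f}s vs sequential: {sequential:.2f}s")
```

With these toy latencies the critical path (embed, retrieve, rerank, decode, roughly 0.8s) fully hides the 0.5s prefill, versus 1.3s when the components run one after another.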

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Z Wang, Z Wang, L Le, HS Zheng, S Mishra… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval augmented generation (RAG) combines the generative abilities of large language
models (LLMs) with external knowledge sources to provide more accurate and up-to-date …
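
The drafting scheme named in the title can be illustrated as follows: a small drafter produces candidate answers from subsets of the retrieved documents, and the large model only scores the finished drafts rather than generating every token itself. `draft_answer` and `verify_score` below are placeholder assumptions standing in for real model calls.

```python
# Draft-then-verify for RAG: cheap specialist drafts, expensive
# generalist verifies once per draft instead of once per token.

def draft_answer(question, docs):
    """Small RAG drafter: cheap call, sees only a document subset."""
    return f"answer({question!r}) from {sorted(docs)}"

def verify_score(question, answer):
    """Large verifier: scores a complete draft in one pass, without
    generating tokens itself."""
    return len(answer) % 7  # toy stand-in for a model confidence score

def speculative_rag(question, retrieved, k=3):
    # Partition the retrieved docs into k subsets so each draft is
    # grounded in different evidence, then keep the best-scoring draft.
    subsets = [retrieved[i::k] for i in range(k)]
    drafts = [draft_answer(question, s) for s in subsets]
    return max(drafts, key=lambda d: verify_score(question, d))

docs = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]
print(speculative_rag("Who wrote X?", docs))
```

The drafts are independent, so they can be generated in parallel, and the large model's context never has to hold all retrieved documents at once.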

ELMS: Elasticized Large Language Models On Mobile Devices

W Yin, R Yi, D Xu, G Huang, M Xu, X Liu - arXiv preprint arXiv:2409.09071, 2024 - arxiv.org
On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling
applications such as UI automation while addressing privacy concerns. Currently, the …

Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

A Sharma, J Geiping - arXiv preprint arXiv:2409.15097, 2024 - arxiv.org
Transformers are widely used across various applications, many of which yield sparse or
partially filled attention matrices. Examples include attention masks designed to reduce the …
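
A NumPy sketch of the underlying dispatch idea, assuming a simple tiling scheme: partition the attention mask into tiles, skip fully masked tiles, and apply mask logic only inside partially filled ones. The paper's contribution is doing this inside fused Flash-Attention GPU kernels, which this sketch does not model.

```python
# Tile-level dispatch over a partially filled attention mask: empty
# tiles are skipped, dense tiles need no mask op, and only mixed tiles
# pay for per-element masking.

import numpy as np

def masked_scores(q, k, mask, tile=4):
    n = q.shape[0]
    scores = np.full((n, n), -np.inf)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            m = mask[i:i+tile, j:j+tile]
            if not m.any():
                continue                      # empty tile: skip entirely
            s = q[i:i+tile] @ k[j:j+tile].T   # compute this tile of QK^T
            if m.all():
                scores[i:i+tile, j:j+tile] = s              # dense tile
            else:
                scores[i:i+tile, j:j+tile] = np.where(m, s, -np.inf)
    return scores

rng = np.random.default_rng(0)
n, d = 8, 16
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
mask = np.tril(np.ones((n, n), dtype=bool))   # causal mask: block-sparse
dense = np.where(mask, q @ k.T, -np.inf)
assert np.allclose(masked_scores(q, k, mask), dense)
```

For a causal mask, roughly half the tiles are empty and most of the rest are fully dense, so almost no tile pays the per-element masking cost; sparser masks (packed documents, sliding windows) skip even more work.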

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

X Luo, Y Wang, Q Zhu, Z Zhang, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid growth in the parameters of large language models (LLMs) has made inference
latency a fundamental bottleneck, limiting broader application of LLMs. Speculative …
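
A minimal sketch of the recycling idea, assuming a deliberately predictable toy model and a chain-shaped draft (the method itself drafts token trees from an adjacency matrix of candidates): successors observed during earlier steps are reused as free drafts, and the model verifies each draft in what would be a single batched forward pass.

```python
# Token recycling, schematically: candidate tokens "thrown away" by
# earlier steps are cached and replayed as drafts, so accepted drafts
# advance several tokens per verification pass.

def toy_next_token(context):
    """Stand-in for one LLM forward pass; depends only on the last
    token so the toy cache can actually predict it."""
    return (context[-1] * 31 + 7) % 20

successors = {}   # token -> last observed successor (the recycled cache)

def decode(prompt, n, draft_len=4):
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n:
        # Draft a chain from the recycled cache: costs no model calls.
        draft, tok = [], seq[-1]
        while len(draft) < draft_len and tok in successors:
            tok = successors[tok]
            draft.append(tok)
        # Verify the draft. In a real system this whole loop is ONE
        # batched forward pass; we count it as a single pass.
        passes += 1
        prev = seq[-1]
        for i in range(len(draft) + 1):
            true_tok = toy_next_token(seq)
            successors[prev] = true_tok        # recycle for later drafts
            seq.append(true_tok)
            prev = true_tok
            if i >= len(draft) or draft[i] != true_tok:
                break                          # first mismatch ends the pass
    return seq[len(prompt):len(prompt) + n], passes

tokens, passes = decode([1, 2, 3], 30)
print(f"generated {len(tokens)} tokens in {passes} passes (vs 30 sequential)")
```

Once the cache warms up, each verification pass commits up to draft_len accepted tokens plus one corrected token, which is where the speedup over plain autoregressive decoding comes from.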