A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - Transactions of the Association for …, 2024 - direct.mit.edu
Large Language Models (LLMs) have successfully transformed natural language processing
tasks. Yet, their large size and high computational needs pose challenges for …

LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

A survey of low-bit large language models: Basics, systems, and algorithms

R Gong, Y Ding, Z Wang, C Lv, X Zheng, J Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable advancements in natural
language processing, showcasing exceptional performance across various tasks. However …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

ThinK: Thinner key cache by query-driven pruning

Y Xu, Z Jie, H Dong, L Wang, X Lu, A Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have revolutionized the field of natural language
processing, achieving unprecedented performance across a variety of applications …

The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for …

S Pashangpour, G Nejat - Robotics, 2024 - mdpi.com
The potential use of large language models (LLMs) in healthcare robotics can help address
the significant demand put on healthcare systems around the world with respect to an aging …

Dynamic memory compression: Retrofitting LLMs for accelerated inference

P Nawrot, A Łańcucki, M Chochowski, D Tarjan… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have emerged as the backbone of large language models (LLMs). However,
generation remains inefficient due to the need to store in memory a cache of key-value …

Beyond KV caching: Shared attention for efficient LLMs

B Liao, DV Vargas - arXiv preprint arXiv:2407.12866, 2024 - arxiv.org
The efficiency of large language models (LLMs) remains a critical challenge, particularly in
contexts where computational resources are limited. Traditional attention mechanisms in …

LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

J Cho, M Kim, H Choi, G Heo… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Recently, there has been an extensive research effort in building efficient large language
model (LLM) inference serving systems. These efforts not only include innovations in the …

Unifying KV cache compression for large language models with LeanKV

Y Zhang, Y Hu, R Zhao, J Lui, H Chen - arXiv preprint arXiv:2412.03131, 2024 - arxiv.org
Large language models (LLMs) demonstrate exceptional performance but incur high serving
costs due to substantial memory demands, with the key-value (KV) cache being a primary …