Efficient large language models: A survey

Z Wan, X Wang, C Liu, S Alam, Y Zheng, J Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities in important
tasks such as natural language understanding and language generation, and thus have the …

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Cost-efficient large language model serving for multi-turn conversations with CachedAttention

B Gao, Z He, P Sharma, Q Kang, D Jevdjic… - 2024 USENIX Annual …, 2024 - usenix.org
Interacting with humans through multi-turn conversations is a fundamental feature of large
language models (LLMs). However, existing LLM serving engines executing multi-turn …

Transformers are multi-state RNNs

M Oren, M Hassid, N Yarden, Y Adi… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers are considered conceptually different from the previous generation of state-of-
the-art NLP models, recurrent neural networks (RNNs). In this work, we demonstrate that …

Efficiently Programming Large Language Models using SGLang

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

Personal llm agents: Insights and survey about the capability, efficiency and security

Y Li, H Wen, W Wang, X Li, Y Yuan, G Liu, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Since the advent of personal computing devices, intelligent personal assistants (IPAs) have
been one of the key technologies that researchers and engineers have focused on, aiming …

MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …

Mobile edge intelligence for large language models: A contemporary survey

G Qu, Q Chen, W Wei, Z Lin, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs), referring to running LLMs on edge devices, have
raised considerable interest owing to their superior privacy, reduced latency, and bandwidth …

SGLang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, C Sun, J Huang… - arXiv preprint arXiv …, 2024 - minjiazhang.github.io
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …