On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large language models (LLMs) has revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

Resource-efficient algorithms and systems of foundation models: A survey

M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024 - dl.acm.org
Large foundation models, including large language models, vision transformers, diffusion,
and LLM-based multimodal models, are revolutionizing the entire machine learning …

NanoFlow: Towards optimal large language model serving throughput

K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

Mooncake: A KVCache-centric disaggregated architecture for LLM serving

R Qin, Z Li, W He, M Zhang, Y Wu, W Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It
features a KVCache-centric disaggregated architecture that separates the prefill and …

MemServe: Context caching for disaggregated LLM serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …

Integrating LLMs With ITS: Recent Advances, Potentials, Challenges, and Future Directions

D Mahmud, H Hajmohamed… - IEEE Transactions …, 2025 - ieeexplore.ieee.org
Intelligent Transportation Systems (ITS) are crucial for the development and operation of
smart cities, addressing key challenges in efficiency, productivity, and environmental …