Large language models and games: A survey and roadmap

R Gallotta, G Todd, M Zammit, S Earle, A Liapis… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have seen an explosive increase in research on large language models
(LLMs), and accompanying public engagement on the topic. While starting as a niche area …

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

X Miao, G Oliaro, Z Zhang, X Cheng, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces SpecInfer, a system that accelerates generative large language model
(LLM) serving with tree-based speculative inference and verification. The key idea behind …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded,
resulting in high latency and significant wastes of the parallel processing power of modern …

A comprehensive survey of large language models and multimodal large language models in medicine

H Xiao, F Zhou, X Liu, T Liu, Z Li, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal
large language models (MLLMs) have garnered significant attention due to their powerful …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Reducing LLM hallucination using knowledge distillation: A case study with Mistral Large and MMLU benchmark

D McDonald, R Papadopoulos, L Benningfield - Authorea Preprints, 2024 - techrxiv.org
The application of knowledge distillation to reduce hallucination in large language models
represents a novel and significant advancement in enhancing the reliability and accuracy of …

WKVQuant: Quantizing weight and key/value cache for large language models gains more

Y Yue, Z Yuan, H Duanmu, S Zhou, J Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) face significant deployment challenges due to their
substantial memory requirements and the computational demands of auto-regressive text …

A survey on the memory mechanism of large language model based agents

Z Zhang, X Bo, C Ma, R Li, X Chen, Q Dai, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) based agents have recently attracted much attention from the
research and industry communities. Compared with original LLMs, LLM-based agents are …