Efficient large language models: A survey

Z Wan, X Wang, C Liu, S Alam, Y Zheng… - arXiv preprint arXiv …, 2023 - researchgate.net
Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in
important tasks such as natural language understanding, language generation, and …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded,
resulting in high latency and significant waste of the parallel processing power of modern …
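
The paper's central move is to replace strictly sequential token generation with Jacobi-style parallel refinement. Below is a minimal sketch of that idea, assuming a toy deterministic next-token function in place of a real LLM forward pass; in lookahead decoding proper, each refinement step is a single batched forward pass and n-gram caching makes the parallelism pay off.

```python
# Sketch of Jacobi-style parallel decoding, the idea behind lookahead
# decoding. `toy_next_token` is an assumed stand-in for one LLM forward
# step; the window size and fixed-point test are illustrative only.

def toy_next_token(context):
    """Stand-in for one LLM decoding step: next token given a context."""
    return (sum(context) * 31 + 7) % 1000

def autoregressive(prompt, n):
    seq = list(prompt)
    for _ in range(n):                 # n strictly sequential model calls
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, max_iters=50):
    # Guess all n future tokens at once, then refine every position in
    # parallel until the window stops changing (a fixed point). Each
    # refinement would be one *batched* pass in a real system, and at
    # least one more leading token becomes correct per iteration.
    guess = [0] * n
    for _ in range(max_iters):
        new = [toy_next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:               # fixed point: whole window verified
            break
        guess = new
    return guess

prompt = [1, 2, 3]
assert jacobi_decode(prompt, 8) == autoregressive(prompt, 8)
```

Because position i depends only on positions before it, the verified prefix grows by at least one token per iteration, so the parallel refinement can never be slower than sequential decoding in iteration count.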

LLM Inference Serving: Survey of Recent Advances and Opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

INSS: An intelligent scheduling orchestrator for multi-GPU inference with spatio-temporal sharing

Z Han, R Zhou, C Xu, Y Zeng… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
As the applications of AI proliferate, it is critical to increase the throughput of online DNN
inference services. Multi-Process Service (MPS) improves the utilization rate of GPU …
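
As a rough illustration of spatio-temporal sharing (not INSS's actual scheduling algorithm), the toy packer below assigns each inference job a compute fraction of a GPU, as MPS does, plus a window of time slots, co-locating jobs whenever their fractions fit. All names and the greedy policy are illustrative assumptions.

```python
# Toy greedy scheduler: spatial sharing = fraction of a GPU's compute
# (as with MPS), temporal sharing = a window of discrete time slots.

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    gpu_frac: float      # fraction of SMs needed, e.g. 0.4 = 40% via MPS
    duration: int        # number of time slots needed

@dataclass
class Gpu:
    name: str
    # capacity[t] = free compute fraction at time slot t
    capacity: list = field(default_factory=lambda: [1.0] * 10)

def place(job, gpus):
    """Earliest slot where some GPU has enough free compute for the
    job's whole duration; co-locates jobs when their fractions fit."""
    for gpu in gpus:
        for t in range(len(gpu.capacity) - job.duration + 1):
            window = gpu.capacity[t:t + job.duration]
            if all(c >= job.gpu_frac for c in window):
                for s in range(t, t + job.duration):
                    gpu.capacity[s] -= job.gpu_frac   # reserve the share
                return gpu.name, t
    return None

gpus = [Gpu("gpu0"), Gpu("gpu1")]
jobs = [Job("bert", 0.4, 3), Job("resnet", 0.5, 2), Job("llm", 0.8, 4)]
for job in jobs:
    print(job.name, "->", place(job, gpus))
```

Here "bert" and "resnet" share gpu0's slots spatially (0.4 + 0.5 ≤ 1.0), while "llm" is deferred to later slots temporally; a real orchestrator would also model interference between co-located jobs.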

Efficient training and inference: Techniques for large language models using llama

SR Cunningham, D Archambault, A Kung - Authorea Preprints, 2024 - techrxiv.org
Enhancing the efficiency of language models involves optimizing their training and
inference processes to reduce computational demands while maintaining high performance …

Teola: Towards End-to-End Optimization of LLM-based Applications

X Tan, Y Jiang, Y Yang, H Xu - arXiv preprint arXiv:2407.00326, 2024 - arxiv.org
Large language model (LLM)-based applications consist of both LLM and non-LLM
components, each contributing to the end-to-end latency. Despite great efforts to optimize …
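
The end-to-end framing can be illustrated with a small dataflow executor: instead of running an LLM application stage by stage, express it as a graph of primitives and launch every primitive the moment its dependencies are met. The graph, primitive names, and thread-based executor below are illustrative assumptions, not Teola's runtime.

```python
# Dependency-driven execution of an LLM-application dataflow graph:
# independent primitives (e.g. system-prompt prefill vs. retrieval)
# overlap instead of running back to back.

import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def primitive(name, secs):
    def run():
        time.sleep(secs)          # stand-in for real work (I/O or GPU)
        return name
    return run

# node -> (fn, dependencies); prefill_sys has no edge to retrieval,
# so the executor is free to overlap the two.
graph = {
    "embed_query": (primitive("embed_query", 0.1), []),
    "retrieve":    (primitive("retrieve", 0.3), ["embed_query"]),
    "rerank":      (primitive("rerank", 0.2), ["retrieve"]),
    "prefill_sys": (primitive("prefill_sys", 0.5), []),
    "decode":      (primitive("decode", 0.2), ["rerank", "prefill_sys"]),
}

def run_graph(graph):
    done, running = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(graph):
            # launch every node whose dependencies are all satisfied
            for name, (fn, deps) in graph.items():
                if name not in done and name not in running \
                        and all(d in done for d in deps):
                    running[name] = pool.submit(fn)
            # block until any running node finishes, then retire it
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, f in running.items() if f in finished]:
                done.add(name)
                del running[name]

start = time.time()
run_graph(graph)
sequential = 0.1 + 0.3 + 0.2 + 0.5 + 0.2
print(f"overlapped: {time.time() - start:.2f}s vs sequential: {sequential:.2f}s")
```

With these toy latencies the critical path (embed, retrieve, rerank, decode, roughly 0.8s) fully hides the 0.5s prefill, versus 1.3s when the components run one after another.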

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Z Wang, Z Wang, L Le, HS Zheng, S Mishra… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval augmented generation (RAG) combines the generative abilities of large language
models (LLMs) with external knowledge sources to provide more accurate and up-to-date …
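
The drafting scheme named in the title can be illustrated as follows: a small drafter produces candidate answers from subsets of the retrieved documents, and the large model only scores the finished drafts rather than generating every token itself. `draft_answer` and `verify_score` below are placeholder assumptions standing in for real model calls.

```python
# Draft-then-verify for RAG: cheap specialist drafts, expensive
# generalist verifies once per draft instead of once per token.

def draft_answer(question, docs):
    """Small RAG drafter: cheap call, sees only a document subset."""
    return f"answer({question!r}) from {sorted(docs)}"

def verify_score(question, answer):
    """Large verifier: scores a complete draft in one pass, without
    generating tokens itself."""
    return len(answer) % 7  # toy stand-in for a model confidence score

def speculative_rag(question, retrieved, k=3):
    # Partition the retrieved docs into k subsets so each draft is
    # grounded in different evidence, then keep the best-scoring draft.
    subsets = [retrieved[i::k] for i in range(k)]
    drafts = [draft_answer(question, s) for s in subsets]
    return max(drafts, key=lambda d: verify_score(question, d))

docs = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]
print(speculative_rag("Who wrote X?", docs))
```

The drafts are independent, so they can be generated in parallel, and the large model's context never has to hold all retrieved documents at once.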

ELMS: Elasticized Large Language Models On Mobile Devices

W Yin, R Yi, D Xu, G Huang, M Xu, X Liu - arXiv preprint arXiv:2409.09071, 2024 - arxiv.org
On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling
applications such as UI automation while addressing privacy concerns. Currently, the …

Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

A Sharma, J Geiping - arXiv preprint arXiv:2409.15097, 2024 - arxiv.org
Transformers are widely used across various applications, many of which yield sparse or
partially filled attention matrices. Examples include attention masks designed to reduce the …
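
A NumPy sketch of the underlying dispatch idea, assuming a simple tiling scheme: partition the attention mask into tiles, skip fully masked tiles, and apply mask logic only inside partially filled ones. The paper's contribution is doing this inside fused Flash-Attention GPU kernels, which this sketch does not model.

```python
# Tile-level dispatch over a partially filled attention mask: empty
# tiles are skipped, dense tiles need no mask op, and only mixed tiles
# pay for per-element masking.

import numpy as np

def masked_scores(q, k, mask, tile=4):
    n = q.shape[0]
    scores = np.full((n, n), -np.inf)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            m = mask[i:i+tile, j:j+tile]
            if not m.any():
                continue                      # empty tile: skip entirely
            s = q[i:i+tile] @ k[j:j+tile].T   # compute this tile of QK^T
            if m.all():
                scores[i:i+tile, j:j+tile] = s              # dense tile
            else:
                scores[i:i+tile, j:j+tile] = np.where(m, s, -np.inf)
    return scores

rng = np.random.default_rng(0)
n, d = 8, 16
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
mask = np.tril(np.ones((n, n), dtype=bool))   # causal mask: block-sparse
dense = np.where(mask, q @ k.T, -np.inf)
assert np.allclose(masked_scores(q, k, mask), dense)
```

For a causal mask, roughly half the tiles are empty and most of the rest are fully dense, so almost no tile pays the per-element masking cost; sparser masks (packed documents, sliding windows) skip even more work.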

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

X Luo, Y Wang, Q Zhu, Z Zhang, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid growth in the parameters of large language models (LLMs) has made inference
latency a fundamental bottleneck, limiting broader application of LLMs. Speculative …
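
A minimal sketch of the recycling idea, assuming a deliberately predictable toy model and a chain-shaped draft (the method itself drafts token trees from an adjacency matrix of candidates): successors observed during earlier steps are reused as free drafts, and the model verifies each draft in what would be a single batched forward pass.

```python
# Token recycling, schematically: candidate tokens "thrown away" by
# earlier steps are cached and replayed as drafts, so accepted drafts
# advance several tokens per verification pass.

def toy_next_token(context):
    """Stand-in for one LLM forward pass; depends only on the last
    token so the toy cache can actually predict it."""
    return (context[-1] * 31 + 7) % 20

successors = {}   # token -> last observed successor (the recycled cache)

def decode(prompt, n, draft_len=4):
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n:
        # Draft a chain from the recycled cache: costs no model calls.
        draft, tok = [], seq[-1]
        while len(draft) < draft_len and tok in successors:
            tok = successors[tok]
            draft.append(tok)
        # Verify the draft. In a real system this whole loop is ONE
        # batched forward pass; we count it as a single pass.
        passes += 1
        prev = seq[-1]
        for i in range(len(draft) + 1):
            true_tok = toy_next_token(seq)
            successors[prev] = true_tok        # recycle for later drafts
            seq.append(true_tok)
            prev = true_tok
            if i >= len(draft) or draft[i] != true_tok:
                break                          # first mismatch ends the pass
    return seq[len(prompt):len(prompt) + n], passes

tokens, passes = decode([1, 2, 3], 30)
print(f"generated {len(tokens)} tokens in {passes} passes (vs 30 sequential)")
```

Once the cache warms up, each verification pass commits up to draft_len accepted tokens plus one corrected token, which is where the speedup over plain autoregressive decoding comes from.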