Deep learning in electron microscopy

JM Ede - Machine Learning: Science and Technology, 2021 - iopscience.iop.org
Deep learning is transforming most areas of science and technology, including electron
microscopy. This review paper offers a practical perspective aimed at developers with …

Efficient memory management for large language model serving with PagedAttention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
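The snippet above attributes the serving bottleneck to the key-value cache. A minimal sketch of the paged-KV-cache idea that PagedAttention is built on: each sequence's cache is split into fixed-size blocks allocated on demand from a shared pool, so memory is not reserved for the maximum sequence length up front. This is an illustrative toy, not the paper's implementation; all names and the block size are made up.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of paged KV caching."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id: int):
        """Reserve the slot for one new token; return (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:   # current block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())  # allocate on demand
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated per token batch rather than per worst-case sequence, a finished or short sequence wastes at most one partially filled block, which is what lets many more requests be batched together.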

Orca: A distributed serving system for Transformer-Based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Rammer: Enabling holistic deep learning compiler optimizations with rTasks

L Ma, Z Xie, Z Yang, J Xue, Y Miao, W Cui… - … USENIX Symposium on …, 2020 - usenix.org
Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is
challenging. Existing DNN frameworks and compilers often treat the DNN operators in a …

TurboTransformers: an efficient GPU serving system for transformer models

J Fang, Y Yu, C Zhao, J Zhou - Proceedings of the 26th ACM SIGPLAN …, 2021 - dl.acm.org
The transformer is the most important algorithmic innovation in the Natural Language
Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models …

Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache

B Lin, C Zhang, T Peng, H Zhao, W Xiao, M Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid proliferation of Large Language Models (LLMs) has been a driving force in the
growth of cloud-based LLM services, which are now integral to advancing AI applications …

Optimizing inference serving on serverless platforms

A Ali, R Pinciroli, F Yan, E Smirni - Proceedings of the VLDB Endowment, 2022 - par.nsf.gov
Serverless computing is gaining popularity for machine learning (ML) serving workloads due
to its autonomous resource scaling, ease of use, and pay-per-use cost model. Existing …

DVABatch: Diversity-aware Multi-Entry Multi-Exit batching for efficient processing of DNN services on GPUs

W Cui, H Zhao, Q Chen, H Wei, Z Li, D Zeng… - 2022 USENIX Annual …, 2022 - usenix.org
DNN inferences are often batched to better utilize the hardware in existing DNN
serving systems. However, DNN serving exhibits diversity in many aspects, such as input …

TBDB: Token bucket-based dynamic batching for resource scheduling supporting neural network inference in intelligent consumer electronics

H Gao, B Qiu, Y Wang, S Yu, Y Xu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Consumer electronics such as mobile phones, wearable devices, and vehicle electronics
use many intelligent applications such as voice commands, machine translation, and face …
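The title above names a token-bucket mechanism for admission into inference batches. A minimal sketch of the generic token-bucket primitive, under the assumption (mine, not the paper's) that each batch dispatch must acquire tokens that refill at a fixed rate up to a capacity, bounding burstiness; class and parameter names are illustrative.

```python
import time

class TokenBucket:
    """Generic token bucket: tokens accrue at `rate` per second up to
    `capacity`; an action proceeds only if it can pay its `cost`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now              # injectable clock, useful for testing
        self.tokens = capacity      # start full
        self.last = now()

    def try_acquire(self, cost: float) -> bool:
        """Refill based on elapsed time, then spend `cost` if possible."""
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A batching scheduler would call `try_acquire` with a cost proportional to the batch size before dispatching it to the accelerator, deferring the batch (and letting it grow) when the bucket is empty.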