Deep learning in electron microscopy

JM Ede - Machine Learning: Science and Technology, 2021 - iopscience.iop.org
Deep learning is transforming most areas of science and technology, including electron
microscopy. This review paper offers a practical perspective aimed at developers with …

Efficient memory management for large language model serving with PagedAttention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
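The snippet above attributes the serving bottleneck to the key-value cache. A minimal sketch of the paged-KV-cache idea that PagedAttention is built on: each sequence's cache is split into fixed-size blocks allocated on demand from a shared pool, so memory is not reserved for the maximum sequence length up front. This is an illustrative toy, not the paper's implementation; all names and the block size are made up.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of paged KV caching."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id: int):
        """Reserve the slot for one new token; return (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:   # current block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())  # allocate on demand
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated per token batch rather than per worst-case sequence, a finished or short sequence wastes at most one partially filled block, which is what lets many more requests be batched together.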

Orca: A distributed serving system for Transformer-Based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Rammer: Enabling holistic deep learning compiler optimizations with rTasks

L Ma, Z Xie, Z Yang, J Xue, Y Miao, W Cui… - … USENIX Symposium on …, 2020 - usenix.org
Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is
challenging. Existing DNN frameworks and compilers often treat the DNN operators in a …

TurboTransformers: an efficient GPU serving system for transformer models

J Fang, Y Yu, C Zhao, J Zhou - Proceedings of the 26th ACM SIGPLAN …, 2021 - dl.acm.org
The transformer is the most important algorithmic innovation in the Natural Language
Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models …

Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache

B Lin, C Zhang, T Peng, H Zhao, W Xiao, M Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid proliferation of Large Language Models (LLMs) has been a driving force in the
growth of cloud-based LLM services, which are now integral to advancing AI applications …

Optimizing inference serving on serverless platforms

A Ali, R Pinciroli, F Yan, E Smirni - Proceedings of the VLDB Endowment, 2022 - par.nsf.gov
Serverless computing is gaining popularity for machine learning (ML) serving workloads due
to its autonomous resource scaling, ease of use, and pay-per-use cost model. Existing …

DVABatch: Diversity-aware Multi-Entry Multi-Exit batching for efficient processing of DNN services on GPUs

W Cui, H Zhao, Q Chen, H Wei, Z Li, D Zeng… - 2022 USENIX Annual …, 2022 - usenix.org
DNN inferences are often batched to better utilize the hardware in existing DNN
serving systems. However, DNN serving exhibits diversity in many aspects, such as input …

TBDB: Token bucket-based dynamic batching for resource scheduling supporting neural network inference in intelligent consumer electronics

H Gao, B Qiu, Y Wang, S Yu, Y Xu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Consumer electronics such as mobile phones, wearable devices, and vehicle electronics
use many intelligent applications such as voice commands, machine translation, and face …
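The title above names a token-bucket mechanism for admission into inference batches. A minimal sketch of the generic token-bucket primitive, under the assumption (mine, not the paper's) that each batch dispatch must acquire tokens that refill at a fixed rate up to a capacity, bounding burstiness; class and parameter names are illustrative.

```python
import time

class TokenBucket:
    """Generic token bucket: tokens accrue at `rate` per second up to
    `capacity`; an action proceeds only if it can pay its `cost`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now              # injectable clock, useful for testing
        self.tokens = capacity      # start full
        self.last = now()

    def try_acquire(self, cost: float) -> bool:
        """Refill based on elapsed time, then spend `cost` if possible."""
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A batching scheduler would call `try_acquire` with a cost proportional to the batch size before dispatching it to the accelerator, deferring the batch (and letting it grow) when the bucket is empty.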