AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing

S Choi, S Lee, Y Kim, J Park, Y Kwon… - 2022 USENIX Annual …, 2022 - usenix.org
As machine learning (ML) techniques are applied to a widening range of applications, high
throughput ML inference serving has become critical for online services. Such ML inference …

Tabi: An efficient multi-level inference system for large language models

Y Wang, K Chen, H Tan, K Guo - Proceedings of the Eighteenth …, 2023 - dl.acm.org
Today's trend of building ever larger language models (LLMs), while pushing the
performance of natural language processing, adds significant latency to the inference stage …

S³: Increasing GPU Utilization during Generative Inference for Higher Throughput

Y Jin, CF Wu, D Brooks, GY Wei - Advances in Neural …, 2023 - proceedings.neurips.cc
Generating texts with a large language model (LLM) consumes massive amounts of
memory. Apart from the already-large model parameters, the key/value (KV) cache that …

Optimizing video analytics with declarative model relationships

F Romero, J Hauswald, A Partap, D Kang… - Proceedings of the …, 2022 - dl.acm.org
The availability of vast video collections and the accuracy of ML models has generated
significant interest in video analytics systems. Since naively processing all frames using …

Cloud-Native Computing: A Survey from the Perspective of Services

S Deng, H Zhao, B Huang, C Zhang… - Proceedings of the …, 2024 - ieeexplore.ieee.org
The development of cloud computing delivery models inspires the emergence of cloud-
native computing. Cloud-native computing, as the most influential development principle for …

EdgeAdaptor: Online configuration adaption, model selection and resource provisioning for edge DNN inference serving at scale

K Zhao, Z Zhou, X Chen, R Zhou… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
The accelerating convergence of artificial intelligence and edge computing has sparked a
recent wave of interest in edge intelligence. While pilot efforts focused on edge DNN …

StepConf: SLO-aware dynamic resource configuration for serverless function workflows

Z Wen, Y Wang, F Liu - IEEE INFOCOM 2022-IEEE Conference …, 2022 - ieeexplore.ieee.org
Function-as-a-Service (FaaS) offers a fine-grained resource provision model, enabling
developers to build highly elastic cloud applications. User requests are handled by a series …

NeuroScaler: Neural video enhancement at scale

H Yeo, H Lim, J Kim, Y Jung, J Ye, D Han - Proceedings of the ACM …, 2022 - dl.acm.org
High-definition live streaming has experienced tremendous growth. However, the video
quality of live video is often limited by the streamer's uplink bandwidth. Recently, neural …