A survey of large-scale deep learning serving system optimization: Challenges and opportunities

F Yu, D Wang, L Shangguan, M Zhang, X Tang… - arXiv preprint arXiv …, 2021 - arxiv.org
Deep Learning (DL) models have achieved superior performance in many application
domains, including vision, language, medical, commercial ads, entertainment, etc. With the …

A survey of multi-tenant deep learning inference on gpu

F Yu, D Wang, L Shangguan, M Zhang, C Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep Learning (DL) models have achieved superior performance. Meanwhile, computing
hardware like NVIDIA GPUs also demonstrated strong computing scaling trends with 2x …

Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms

Y Xue, Y Liu, L Nai, J Huang - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Cloud platforms today have been deploying hardware accelerators like neural processing
units (NPUs) for powering machine learning (ML) inference services. To maximize the …

Miriam: Exploiting elastic kernels for real-time multi-dnn inference on edge gpu

Z Zhao, N Ling, N Guan, G Xing - … of the 21st ACM Conference on …, 2023 - dl.acm.org
Many applications, such as autonomous driving and augmented reality, require the
concurrent running of multiple deep neural networks (DNNs) that pose different levels of real …

GMorph: Accelerating Multi-DNN Inference via Model Fusion

Q Yang, T Yang, M Xiang, L Zhang, H Wang… - Proceedings of the …, 2024 - dl.acm.org
AI-powered applications often involve multiple deep neural network (DNN)-based prediction
tasks to support application-level functionalities. However, executing multi-DNNs can be …

AccuMO: Accuracy-centric multitask offloading in edge-assisted mobile augmented reality

ZJ Kong, Q Xu, J Meng, YC Hu - Proceedings of the 29th Annual …, 2023 - dl.acm.org
Immersive applications such as Augmented Reality (AR) and Mixed Reality (MR) often need
to perform multiple latency-critical tasks on every frame captured by the camera, which all …

Rosgm: A real-time gpu management framework with plug-in policies for ros 2

R Li, T Hu, X Jiang, L Li, W Xing… - 2023 IEEE 29th Real …, 2023 - ieeexplore.ieee.org
Robot Operating System (ROS) is a prevailing software framework for robotic application
development. Graphics Processing Unit (GPU) is widely used in many ROS applications as …

Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs

A Chen, F Xu, L Han, Y Dong, L Chen… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
GPUs have become the de facto hardware devices for accelerating Deep Neural Network
(DNN) inference workloads. However, the conventional sequential execution mode of DNN …

Jigsaw: Taming bev-centric perception on dual-soc for autonomous driving

L Sun, C Li, X Hou, T Huang, C Xu… - 2024 IEEE Real …, 2024 - ieeexplore.ieee.org
Real-time perception is important for autonomous driving. We observe an emerging trend
using one large and critical fusion-based Bird's-Eye-View (BEV) Deep Neural Network …

Boosting dnn cold inference on edge devices

R Yi, T Cao, A Zhou, X Ma, S Wang, M Xu - Proceedings of the 21st …, 2023 - dl.acm.org
DNNs are ubiquitous on edge devices nowadays. With its increasing importance and use
cases, it's not likely to pack all DNNs into device memory and expect that each inference has …