Nanily: A QoS-aware scheduling for DNN inference workload in clouds

X Tang, P Wang, Q Liu, W Wang… - 2019 IEEE 21st …, 2019 - ieeexplore.ieee.org
DNN inference is widely emerging as a service and must run with sub-second latency,
which requires GPU hardware for parallel acceleration. Prior works to improve the …
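
The QoS-aware placement this abstract alludes to reduces to budgeting each request's remaining deadline and picking a GPU that still fits it. A minimal sketch; the GPU names, latencies, and deadlines below are hypothetical, not taken from the paper:

```python
# Illustrative only: deadline-budget GPU selection in the spirit of QoS-aware
# inference scheduling. All names and numbers are invented.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    name: str
    queue_ms: float                              # estimated delay of work already queued
    exec_ms: dict = field(default_factory=dict)  # profiled per-model latency

def pick_gpu(gpus, model, deadline_ms, elapsed_ms):
    """Pick a GPU whose queueing + execution time fits the remaining budget."""
    budget = deadline_ms - elapsed_ms
    feasible = [g for g in gpus if g.queue_ms + g.exec_ms[model] <= budget]
    if not feasible:
        return None  # deadline cannot be met; caller may scale out or reject
    # Keep the busiest feasible GPU so idler ones retain slack for
    # future requests with tighter deadlines.
    return max(feasible, key=lambda g: g.queue_ms)

gpus = [Gpu("gpu0", queue_ms=40.0, exec_ms={"resnet50": 25.0}),
        Gpu("gpu1", queue_ms=5.0,  exec_ms={"resnet50": 25.0})]
choice = pick_gpu(gpus, "resnet50", deadline_ms=100.0, elapsed_ms=10.0)
print(choice.name if choice else "reject")  # -> gpu0
```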

Jily: Cost-aware AutoScaling of heterogeneous GPU for DNN inference in public cloud

Z Wang, X Tang, Q Liu, J Han - 2019 IEEE 38th International …, 2019 - ieeexplore.ieee.org
Recently, a large number of DNN inference services have emerged in public clouds, making
the low-cost deployment of DNN inference services a hot research topic. Previous studies …
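
Cost-aware autoscaling over heterogeneous GPUs amounts to choosing the cheapest fleet that sustains the offered load. A toy sketch with two invented instance types; the prices and throughputs are made up for illustration:

```python
# Toy cost-aware fleet selection over two hypothetical GPU instance types.
import math

GPU_TYPES = {"t4": (0.35, 400.0), "v100": (2.48, 1500.0)}  # ($/hour, req/s)

def cheapest_fleet(target_rps):
    """Exhaustive search; fine for two types and small counts."""
    best_cost, best_plan = float("inf"), None
    max_t4 = math.ceil(target_rps / GPU_TYPES["t4"][1])
    for n_t4 in range(max_t4 + 1):
        remaining = max(0.0, target_rps - n_t4 * GPU_TYPES["t4"][1])
        n_v100 = math.ceil(remaining / GPU_TYPES["v100"][1])
        cost = n_t4 * GPU_TYPES["t4"][0] + n_v100 * GPU_TYPES["v100"][0]
        if cost < best_cost:
            best_cost, best_plan = cost, {"t4": n_t4, "v100": n_v100}
    return best_plan, best_cost

print(cheapest_fleet(2000))  # -> ({'t4': 5, 'v100': 0}, 1.75)
```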

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

F Xu, J Xu, J Chen, L Chen, R Shang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
GPUs are essential to accelerating the latency-sensitive deep neural network (DNN)
inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of …
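
Spatial sharing of a GPU between inference workloads is commonly realized with NVIDIA MPS, which can cap each client's share of SMs. A minimal sketch of how a provisioner might launch two workers with different shares; the worker scripts and percentages are placeholders, and an MPS daemon is assumed to be running:

```python
# Launch inference workers under NVIDIA MPS with per-process SM caps.
import os
import subprocess

def launch_worker(script, sm_percentage):
    env = os.environ.copy()
    # Caps the fraction of SMs this client may use under the MPS daemon.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(["python", script], env=env)

# e.g., give a latency-critical model 60% of SMs and a batch model 40%:
# launch_worker("serve_resnet.py", 60)
# launch_worker("serve_bert.py", 40)
```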

Sla-driven ml inference framework for clouds with heterogeneous accelerators

J Cho, D Zad Tootaghaj, L Cao… - … of Machine Learning …, 2022 - proceedings.mlsys.org
The current design of serverless computing frameworks assumes that all requests and
the underlying compute hardware are homogeneous. This homogeneity assumption causes two …
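
Dropping that homogeneity assumption means routing each request to an accelerator that meets its SLA at the lowest cost. A sketch with invented per-accelerator cost and latency profiles:

```python
# Route each request to the cheapest accelerator whose profiled latency for
# the model still meets the request's SLA. Numbers are hypothetical.
PROFILE = {            # accelerator: (cost per 1k requests in $, latency ms)
    "cpu":  (0.02, 180.0),
    "t4":   (0.10, 22.0),
    "a100": (0.45, 6.0),
}

def route(sla_ms):
    ok = [(cost, lat, name) for name, (cost, lat) in PROFILE.items() if lat <= sla_ms]
    return min(ok)[2] if ok else None  # cheapest feasible, else reject

for sla in (200, 50, 5):
    print(sla, "->", route(sla))  # cpu, t4, None
```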

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

A Dhakal, SG Kulkarni, KK Ramakrishnan - arXiv preprint arXiv …, 2023 - arxiv.org
Hardware accelerators such as GPUs are required for real-time, low-latency inference with
deep neural networks (DNNs). However, due to the inherent limits to the parallelism they …
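
Spatio-temporal scheduling packs models along two axes: a fraction of the GPU's SMs (space) and a slot in a scheduling window (time). A deliberately naive toy packer; the shares and durations are invented, and the overlap check is pairwise only, whereas a real scheduler tracks total SM occupancy at every instant:

```python
# Toy spatio-temporal packing of models into a fixed scheduling window.
WINDOW_MS = 10.0

def pack(models):
    """models: list of (name, duration_ms, sm_share in (0, 1])."""
    plan = []  # (name, start_ms, end_ms, sm_share)
    for name, dur, share in sorted(models, key=lambda m: -m[2]):
        start = 0.0
        for _name, _start, end, sh in plan:
            if sh + share > 1.0:        # cannot share SMs; serialize in time
                start = max(start, end)
        if start + dur > WINDOW_MS:
            raise ValueError(f"{name} does not fit in the window")
        plan.append((name, start, start + dur, share))
    return plan

for entry in pack([("det", 4.0, 0.6), ("cls", 3.0, 0.5), ("seg", 2.0, 0.3)]):
    print(entry)  # det runs 0-4, cls 4-7, seg overlaps det at 0-2
```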

Automated runtime-aware scheduling for multi-tenant DNN inference on GPU

F Yu, S Bray, D Wang, L Shangguan… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
With the rapid development of deep neural networks (DNNs), many real-world applications
adopt multiple models to perform compound tasks, such as co-running classification …
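
Runtime-aware multi-tenant scheduling can operate at operator granularity, dispatching the next operator of whichever tenant is most urgent. A toy earliest-deadline-first sketch with invented operator durations:

```python
# Dispatch operators of co-running models in earliest-deadline-first order.
import heapq

def schedule(tenants):
    """tenants: dict name -> (deadline_ms, [op durations in ms]). Returns trace."""
    heap = [(dl, name, 0) for name, (dl, _) in tenants.items()]
    heapq.heapify(heap)
    clock, trace = 0.0, []
    while heap:
        dl, name, idx = heapq.heappop(heap)
        ops = tenants[name][1]
        clock += ops[idx]                      # run this tenant's next operator
        trace.append((round(clock, 1), name, idx))
        if idx + 1 < len(ops):
            heapq.heappush(heap, (dl, name, idx + 1))
    return trace

print(schedule({"detector":   (30.0, [2.0, 2.0, 2.0]),
                "classifier": (15.0, [1.0, 1.0])}))
```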

Ebird: Elastic batch for improving responsiveness and throughput of deep learning services

W Cui, M Wei, Q Chen, X Tang, J Leng… - 2019 IEEE 37th …, 2019 - ieeexplore.ieee.org
GPUs have been widely adopted to serve online deep learning-based services that have
stringent QoS requirements. However, emerging deep learning serving systems often result …
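
Elastic batching grows the batch only as far as the tightest outstanding deadline allows. A sketch against an invented profiled latency table:

```python
# Pick the largest batch size whose profiled latency fits the strictest
# remaining deadline in the queue. The latency table is hypothetical.
LAT_MS = {1: 5.0, 2: 6.0, 4: 8.0, 8: 13.0, 16: 24.0}  # profiled batch latency

def pick_batch(queued, slack_ms):
    """queued: #waiting requests; slack_ms: tightest remaining budget."""
    best = 1
    for b, lat in sorted(LAT_MS.items()):
        if b <= queued and lat <= slack_ms:
            best = b
    return best

print(pick_batch(queued=10, slack_ms=15.0))  # -> 8
```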

CODA: Improving resource utilization by slimming and co-locating DNN and CPU jobs

H Zhao, W Cui, Q Chen, J Leng, K Yu… - 2020 IEEE 40th …, 2020 - ieeexplore.ieee.org
While deep neural network (DNN) models are often trained on GPUs, many companies and
research institutes build GPU clusters that are shared by different groups. On such GPU …
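
Co-locating CPU jobs with GPU jobs is safe only while the DNN job's input pipeline keeps enough cores. A toy admission check; the core counts and headroom are hypothetical:

```python
# Admit a CPU batch job onto a GPU node only if enough cores remain for the
# DNN job's data pipeline, with headroom reserved for the OS and driver.
def can_colocate(node_cores, dnn_pipeline_cores, cpu_job_cores, reserve=2):
    free = node_cores - dnn_pipeline_cores - reserve
    return cpu_job_cores <= free

print(can_colocate(node_cores=32, dnn_pipeline_cores=12, cpu_job_cores=16))  # True
print(can_colocate(node_cores=32, dnn_pipeline_cores=12, cpu_job_cores=20))  # False
```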

QoS-aware scheduling of heterogeneous servers for inference in deep neural networks

Z Fang, T Yu, OJ Mengshoel, RK Gupta - Proceedings of the 2017 ACM …, 2017 - dl.acm.org
Deep neural networks (DNNs) are popular in diverse fields such as computer vision and
natural language processing. DNN inference tasks are emerging as a service provided by …

S³DNN: Supervised streaming and scheduling for GPU-accelerated real-time DNN workloads

H Zhou, S Bateni, C Liu - 2018 IEEE Real-Time and Embedded …, 2018 - ieeexplore.ieee.org
Deep neural networks (DNNs) are being widely applied in many advanced embedded
systems that require autonomous decision making, e.g., autonomous driving and robotics. To …
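
The streaming side of such systems builds on concurrent CUDA streams. A minimal PyTorch illustration of two inferences queued on separate streams; it requires a CUDA device, and the linear layers stand in for real models:

```python
# Queue two DNN inferences on separate CUDA streams so they may overlap on GPU.
import torch

if torch.cuda.is_available():
    dev = torch.device("cuda")
    net_a = torch.nn.Linear(1024, 1024).to(dev)
    net_b = torch.nn.Linear(1024, 1024).to(dev)
    x = torch.randn(64, 1024, device=dev)
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(s1):
        y_a = net_a(x)            # may overlap with the work queued on s2
    with torch.cuda.stream(s2):
        y_b = net_b(x)
    torch.cuda.synchronize()      # wait for both streams before reading results
    print(y_a.shape, y_b.shape)
else:
    print("CUDA device required for this sketch")
```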