AutoInfer: Self-Driving Management for Resource-Efficient, SLO-Aware Machine Learning Inference in GPU Clusters

B Cai, Q Guo, X Dong - IEEE Internet of Things Journal, 2022 - ieeexplore.ieee.org
As the Internet of Things (IoT) keeps growing, IoT-side intelligence services, such as intelligent
personal assistants, healthcare surveillance, and smart home services, offload more and more …

WattWiser: Power & Resource-Efficient Scheduling for Multi-Model Multi-GPU Inference Servers

A Jahanshahi, M Rezvani, D Wong - … of the 14th International Green and …, 2023 - dl.acm.org
With the increasing integration of Machine Learning (ML) applications into cloud services,
providing high-throughput ML inference serving has become a major …

HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving

H Mo, L Zhu, L Shi, S Tan, S Wang - Electronics, 2023 - mdpi.com
To accelerate the inference of machine-learning (ML) model serving, clusters of machines
require the use of expensive hardware accelerators (e.g., GPUs) to reduce execution time …

MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters

Q Weng, W Xiao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With sustained technological advances in machine learning (ML) and the recent availability of
massive datasets, tech companies are deploying large ML-as-a-Service (MLaaS) …

The OoO VLIW JIT compiler for GPU inference

P Jain, X Mo, A Jain, A Tumanov, JE Gonzalez… - arXiv preprint arXiv …, 2019 - arxiv.org
Current trends in Machine Learning (ML) inference on hardware-accelerated devices (e.g.,
GPUs, TPUs) point to alarmingly low utilization. As ML inference is increasingly time …

Efficient GPU Resource Management under Latency and Power Constraints for Deep Learning Inference

D Liu, Z Ma, A Zhang, K Zheng - … on Mobile Ad Hoc and Smart …, 2023 - ieeexplore.ieee.org
The recent rapid development of deep learning (DL) applications imposes stringent requirements
on DL inference services provided by GPU servers. On one hand, a high volume of different …

Know Your Enemy To Save Cloud Energy: Energy-Performance Characterization of Machine Learning Serving

J Yu, J Kim, E Seo - 2023 IEEE International Symposium on …, 2023 - ieeexplore.ieee.org
The proportion of machine learning (ML) inference in modern cloud workloads is rapidly
increasing, and graphics processing units (GPUs) are the preferred computational …

Horus: Interference-aware and prediction-based scheduling in deep learning systems

G Yeung, D Borowiec, R Yang, A Friday… - … on Parallel and …, 2021 - ieeexplore.ieee.org
To accelerate the training of Deep Learning (DL) models, clusters of machines equipped
with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of …

BatOpt: Optimizing GPU-based deep learning inference using dynamic batch processing

D Zhang, Y Luo, Y Wang, X Kui… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Deep learning (DL) has been applied in billions of mobile devices due to its astonishing
performance in image, text, and audio processing. However, limited by the computing …

Toward interference-aware GPU container co-scheduling learning from application profiles

S Kim, Y Kim - … Conference on Autonomic Computing and Self …, 2020 - ieeexplore.ieee.org
Efficiently operating Graphics Processing Unit (GPU) applications and improving overall system
throughput in a GPU cluster environment remain open problems. The platform may …