Elastic resource management for deep learning applications in a container cluster

Y Mao, V Sharma, W Zheng, L Cheng… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
… This work aims to improve the resource contention problem for concurrent deep learning …
-wide resource management. The learning tasks considered are implemented as containerized …

DEARS: A deep learning based elastic and automatic resource scheduling framework for cloud applications

M Hassan, H Chen, Y Liu - … Processing with Applications …, 2018 - ieeexplore.ieee.org
… the uncertainty in real-life resource management and deal with the bursty or dramatic
decrease on workloads, we analyze the violation of SLAs after resource scheduling, and regard it …

Elastic deep learning in multi-tenant GPU clusters

Y Wu, K Ma, X Yan, Z Liu, Z Cai… - … on Parallel and …, 2021 - ieeexplore.ieee.org
… Input: Job pending queue groups: G, compaction threshold: N, parallelism for job j: pj, minimum
number of GPUs for job j: minj, resource manager that manages the free GPUs in cluster: …

DERP: A deep reinforcement learning cloud system for elastic resource provisioning

C Bitsakos, I Konstantinou… - 2018 IEEE international …, 2018 - ieeexplore.ieee.org
… we presented a Deep Reinforcement Learning agent for cloud elasticity problems, called …
algorithmic techniques in both deep learning and cloud resource management areas. Our …

Efficient autonomic and elastic resource management techniques in cloud environment: taxonomy and analysis

MAN Saif, SK Niranjan, HDE Al-Ariki - Wireless Networks, 2021 - Springer
… efficient autonomic and elastic resource management approaches in the … resource
management, such as provisioning, allocation, scheduling, and monitoring, and also the elasticity

Lyra: Elastic scheduling for deep learning clusters

J Li, H Xu, Y Zhu, Z Liu, C Guo, C Wang - Proceedings of the Eighteenth …, 2023 - dl.acm.org
… Lyra is a GPU cluster scheduler that exploits capacity loaning with elastic job scheduling. It
runs on top of a cluster resource manager such as YARN [51] and Kubernetes [23] to execute …

Flowcon: Elastic flow configuration for containerized deep learning applications

W Zheng, M Tynes, H Gorelick, Y Mao… - Proceedings of the 48th …, 2019 - dl.acm.org
… resources utilizing resources effectively to achieve high-performance data analytics becomes
desirable. Although cluster resource management … machine learning (ML) / deep learning (…

Deep reinforcement learning based elasticity-compatible heterogeneous resource management for time-critical computing

Z Liu, L Wang, G Quan - … of the 49th International Conference on Parallel …, 2020 - dl.acm.org
… deep reinforcement learning (DRL) techniques in this work to obtain a resource management
… • We propose a deep reinforcement learning based approach utilizing LSTM model and …

Artificial intelligence for elastic management and orchestration of 5G networks

DM Gutierrez-Estevez, M Gramaglia… - IEEE wireless …, 2019 - ieeexplore.ieee.org
… • Slice-aware resource management … the management and orchestration of future networks
proposed by ETSI. Then we discuss the application of AI in the context of resource elasticity

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
… learning. ElasticFlow provides a serverless interface with two distinct features: (i) users specify
only the deep neural network … level, and manual resource management for deep learning …