A Survey on Scheduling Techniques in Computing and Network Convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand of massive applications has led to the ubiquitous deployment of
computing power. This trend creates an urgent need for higher-level computing resource …

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

Z Chen, X Zhao, C Zhi, J Yin - IEEE Transactions on Parallel …, 2023 - ieeexplore.ieee.org
Deep learning tasks (DLT) include training and inference tasks, where training DLTs aim to
minimize average job completion time (JCT) while inference tasks need …
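
The training/inference split hinges on the average-JCT objective. A minimal sketch below (a generic illustration, not DeepBoot's scheduler; the job durations are hypothetical) shows how ordering alone changes average JCT on a single GPU:

    # Generic illustration (not DeepBoot's scheduler): average job completion
    # time (JCT) on one GPU, comparing arrival order vs. shortest-job-first.
    def average_jct(durations):
        # Jobs run back-to-back; with all jobs arriving at time 0, a job's JCT
        # is the cumulative runtime up to and including it.
        t, total = 0.0, 0.0
        for d in durations:
            t += d
            total += t
        return total / len(durations)

    jobs = [8.0, 1.0, 4.0]                  # hypothetical durations in hours
    print(average_jct(jobs))                # arrival order (FCFS): 10.0
    print(average_jct(sorted(jobs)))        # shortest-job-first: ~6.33

When all jobs are present at time zero, shortest-job-first minimizes average JCT, which is why many DL schedulers invest in estimating remaining job duration.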

Enabling switch memory management for distributed training with in-network aggregation

B Zhao, C Liu, J Dong, Z Cao, W Nie… - IEEE INFOCOM 2023 …, 2023 - ieeexplore.ieee.org
Distributed training (DT) in shared clusters usually relies on a scheduler to allocate
resources to multiple concurrent jobs. Meanwhile, a recent acceleration primitive, In-Network …
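
In-network aggregation keeps per-job aggregation state in scarce switch SRAM, so that pool must be shared across concurrent jobs. A hedged sketch, assuming a simple proportional-share policy and a hypothetical per-job gradient-traffic metric (the paper's actual mechanism differs):

    # Hedged sketch (not the paper's mechanism): split a fixed pool of switch
    # aggregator slots across concurrent training jobs in proportion to each
    # job's gradient traffic, so no single job monopolizes in-network memory.
    def partition_switch_memory(total_slots, jobs):
        # jobs: name -> gradient bytes per iteration (hypothetical metric)
        total = sum(jobs.values())
        alloc = {name: max(1, int(total_slots * size / total))
                 for name, size in jobs.items()}
        # Trim any overshoot caused by the max(1, ...) floor, largest job first.
        while sum(alloc.values()) > total_slots:
            biggest = max(alloc, key=alloc.get)
            alloc[biggest] -= 1
        return alloc

    print(partition_switch_memory(1024, {"jobA": 4e9, "jobB": 1e9, "jobC": 5e8}))

Proportional share is only one plausible policy; the broader point is that aggregator slots, like GPUs, become a first-class scheduled resource.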

Towards GPU Memory Efficiency for Distributed Training at Scale

R Cheng, C Cai, S Yilmaz, R Mitra, M Bag… - Proceedings of the …, 2023 - dl.acm.org
The scale of deep learning models has grown tremendously in recent years. State-of-the-art
models have reached billions of parameters and terabyte-scale model sizes. Training of …
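
A back-of-envelope calculation makes the memory pressure concrete: with the commonly cited ~16 bytes of weight, gradient, and optimizer state per parameter for mixed-precision Adam training, even a mid-sized model overflows a single GPU. The numbers below are illustrative, not from the paper:

    # Back-of-envelope only (activations, buffers, and fragmentation excluded):
    # mixed-precision Adam training keeps roughly 16 bytes of state per
    # parameter (fp16 weights + fp16 grads + fp32 master weights, momentum,
    # and variance).
    def training_state_gib(n_params, bytes_per_param=16, shards=1):
        return n_params * bytes_per_param / shards / 2**30

    n = 10e9                                         # hypothetical 10B-param model
    print(f"{training_state_gib(n):.0f} GiB unsharded")          # ~149 GiB
    print(f"{training_state_gib(n, shards=8):.0f} GiB per GPU")  # ~19 GiB, 8-way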

Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster

Z Yang, Z Ye, T Fu, J Luo, X Wei, Y Luo… - 2022 IEEE 40th …, 2022 - ieeexplore.ieee.org
With the proliferation of deep learning, there is a strong need to efficiently operate GPU
clusters for deep learning production in giant AI companies, as well as for research and …

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

M Xing, H Mao, S Yin, L Pan, Z Zhang, Z Xiao… - Proceedings of the 29th …, 2023 - dl.acm.org
Public cloud GPU clusters are emerging as platforms for training distributed deep
learning jobs. Under this training paradigm, the job scheduler is a crucial component to …
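
To ground the RL framing, the toy loop below shows the generic pattern such schedulers follow: observe the queue and free resources, pick a launchable job, and update a value estimate from a delay-based reward. This is a deliberately simplified single-agent sketch, not the paper's dual-agent design; the state encoding and all names are hypothetical.

    # Deliberately simplified single-agent toy (the paper's design is
    # dual-agent): epsilon-greedy choice of which queued job to launch next,
    # with a value-table update driven by a delay-based reward.
    import random
    from collections import defaultdict

    q_table = defaultdict(float)       # (queue length, job GPUs) -> value estimate

    def pick_job(queue, free_gpus, eps=0.1):
        feasible = [j for j in queue if j["gpus"] <= free_gpus]
        if not feasible:
            return None
        if random.random() < eps:      # explore occasionally
            return random.choice(feasible)
        return max(feasible, key=lambda j: q_table[(len(queue), j["gpus"])])

    def update(state, reward, lr=0.1): # nudge the estimate toward the reward
        q_table[state] += lr * (reward - q_table[state])

    queue = [{"gpus": 4}, {"gpus": 8}, {"gpus": 2}]
    print(pick_job(queue, free_gpus=6))  # one of the jobs needing <= 6 GPUs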

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

F Liang, Z Zhang, H Lu, C Li, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With distributed deep learning workloads rapidly increasing in large-scale data centers,
efficient framework-level strategies for resource allocation and workload …

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

W Gao, Z Ye, P Sun, T Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The growth of deep learning training (DLT) jobs in modern GPU clusters calls for efficient
deep learning (DL) scheduler designs. Due to the extensive applications of DL technology …

vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

J Bang, Y Choi, M Kim, Y Kim, M Rhu - arXiv preprint arXiv:2312.12391, 2023 - arxiv.org
As large language models (LLMs) become widespread in various application domains, a
critical challenge facing the AI community is how to train these large AI models in a cost …
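
Before reaching for a full simulator, a first-order cost estimate comes from the widely used ~6·N·D FLOPs approximation for transformer training. vTrain itself does much finer-grained, profiling-driven simulation; the sketch below is only that coarse rule of thumb, with hypothetical cluster parameters:

    # Coarse analytical estimate (vTrain does trace-driven simulation; this is
    # only the ~6*N*D FLOPs rule of thumb): wall-clock days to train N
    # parameters on D tokens at a given sustained cluster throughput.
    def training_days(n_params, n_tokens, gpus, flops_per_gpu=312e12, mfu=0.4):
        total_flops = 6 * n_params * n_tokens          # forward + backward
        seconds = total_flops / (gpus * flops_per_gpu * mfu)
        return seconds / 86400

    # Hypothetical run: 7B params, 1T tokens, 256 A100s (312 TFLOPS bf16), 40% MFU.
    print(f"{training_days(7e9, 1e12, 256):.1f} days")   # ~15 days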

On a Meta Learning-based Scheduler for Deep Learning Clusters

J Yang, L Bao, W Liu, R Yang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Deep learning (DL) has become a dominant type of workload on AI computing platforms.
The performance of such platforms depends heavily on how distributed DL jobs are …