CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs

Q Sun, Y Liu, H Yang, R Zhang, M Dun… - … Conference for High …, 2022 - ieeexplore.ieee.org
Graph neural networks (GNNs) suffer from low GPU utilization due to frequent memory
accesses. Existing concurrent training mechanisms cannot be directly adapted to GNNs …

Characterizing Power Management Opportunities for LLMs in the Cloud

P Patel, E Choukse, C Zhang, Í Goiri, B Warrier… - Proceedings of the 29th …, 2024 - dl.acm.org
Recent innovations in large language models (LLMs) and their myriad use cases have
rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and …

FuncPipe: A Pipelined Serverless Framework for Fast and Cost-Efficient Training of Deep Learning Models

Y Liu, B Jiang, T Guo, Z Huang, W Ma, X Wang… - Proceedings of the …, 2022 - dl.acm.org
Training deep learning (DL) models in the cloud has become the norm. With the emergence of
serverless computing and its benefits of true pay-as-you-go pricing and scalability, systems …

SchedTune: A Heterogeneity-Aware GPU Scheduler for Deep Learning

H Albahar, S Dongare, Y Du, N Zhao… - 2022 22nd IEEE …, 2022 - ieeexplore.ieee.org
Modern cluster management systems, such as Kubernetes, support heterogeneous
workloads and resources. However, existing resource schedulers in these systems do not …

Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters

Z Ye, P Sun, W Gao, T Zhang, X Wang… - … on Parallel and …, 2021 - ieeexplore.ieee.org
Modern GPU clusters are designed to support distributed Deep Learning jobs from multiple
tenants concurrently. Each tenant may have varied and dynamic resource demands …

How Different Are the Cloud Workloads? Characterizing Large-Scale Private and Public Cloud Workloads

X Qin, M Ma, Y Zhao, J Zhang, C Du… - 2023 53rd Annual …, 2023 - ieeexplore.ieee.org
With the rapid development of cloud systems, an increasing number of service workloads
are deployed in the private cloud and/or public cloud. Although large cloud providers such …

Characterizing Multi-Instance GPU for Machine Learning Workloads

B Li, V Gadepally, S Samsi… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
As machine learning (ML) becomes increasingly popular, datacenter operators use
hardware accelerators such as GPUs to tackle the high computational demand of ML …

Hydra: Deadline-Aware and Efficiency-Oriented Scheduling for Deep Learning Jobs on Heterogeneous GPUs

Z Yang, H Wu, Y Xu, Y Wu, H Zhong… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
With the rapid proliferation of deep learning (DL) jobs running on heterogeneous GPUs,
scheduling DL jobs to meet various scheduling requirements, such as meeting deadlines …

TapFinger: Task Placement and Fine-Grained Resource Allocation for Edge Machine Learning

Y Li, T Zeng, X Zhang, J Duan… - IEEE INFOCOM 2023 …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing
networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …

Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters

Z Mo, H Xu, C Xu - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such
as computation and communication. This heterogeneity poses a significant challenge for the …