Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters

MSR Di Zhang, B Xie, S Di, D Dai - webpages.charlotte.edu
Amid the growing prevalence of artificial intelligence (AI) and deep learning (DL) across
industries and science disciplines, high-performance computing (HPC) clusters are …

Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster

A Sultana, F Xu, X Yuan, L Chen, NF Tzeng - prefer-nsf.org
With the wide adoption of deep neural network (DNN) models for various applications,
enterprises, and cloud providers have built deep learning clusters and increasingly …

Dissecting I/O Burstiness in Machine Learning Cloud Platform: A Case Study on Alibaba's MLaaS

Q Zou, Y Deng, Y Zhu, Y Zhou, J Cai, S He - msstconference.org
With advancements in machine learning (ML) technology and the availability of large
ML-as-a-Service (MLaaS) clouds, accurately understanding the I/O behaviors in the storage …

Optimizing Resource Management for Machine Learning Workloads in High-Performance Clusters

D Zhang, D Dai - researchgate.net
Resource management and job scheduling are the key to high-performance computing
(HPC) clusters for high system utilization, short user wait time, and fair resource allocation …

ASSIST-IoT Technical Report #13

Federated learning (FL) was proposed to facilitate the training of models in a distributed
environment. It supports the protection of (local) data privacy and uses local resources for …