Machine learning-based temperature prediction for runtime thermal management across system components

K Zhang, A Guliani, S Ogrenci-Memik… - … on parallel and …, 2017 - ieeexplore.ieee.org
Elevated temperatures limit the peak performance of systems because of frequent
interventions by thermal throttling. Non-uniform thermal states across system nodes also …

Not all GPUs are created equal: characterizing variability in large-scale, accelerator-rich systems

P Sinha, A Guliani, R Jain, B Tran… - … Conference for High …, 2022 - ieeexplore.ieee.org
Scientists are increasingly exploring and utilizing the massive parallelism of general-
purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters …

Semi-dynamic load balancing: Efficient distributed learning in non-dedicated environments

C Chen, Q Weng, W Wang, B Li, B Li - … of the 11th ACM Symposium on …, 2020 - dl.acm.org
Machine learning (ML) models are increasingly trained in clusters with non-dedicated
workers possessing heterogeneous resources. In such scenarios, model training efficiency …

Accelerating distributed learning in non-dedicated environments

C Chen, Q Weng, W Wang, B Li… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Machine learning (ML) models are increasingly trained with distributed workers possessing
heterogeneous resources. In such scenarios, model training efficiency may be negatively …

PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters

R Jain, B Tran, K Chen, MD Sinclair… - … Conference for High …, 2024 - ieeexplore.ieee.org
Large-scale computing systems are increasingly using accelerators such as GPUs to enable
peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and …

Distributed Online Min-Max Load Balancing with Risk-Averse Assistance

J Wang, B Liang - 2023 IEEE 43rd International Conference on …, 2023 - ieeexplore.ieee.org
Motivated by a wide range of applications from parallel computing to distributed learning, we
study distributed online load balancing among multiple workers. We aim to minimize the …

Proactive, Accuracy-aware Straggler Mitigation in Machine Learning Clusters

S Tairin, H Shen, A Iyer - 2024 IEEE International Parallel and …, 2024 - ieeexplore.ieee.org
Slower workers, known as stragglers, can significantly prolong training time in Machine
Learning (ML) clusters. We present SMS, a proactive straggler mitigation system with four …

Energy‐efficient load balancing for divisible tasks on heterogeneous clusters

Y Zhang, M Li, F Tong - Transactions on Emerging …, 2023 - Wiley Online Library
With the growing power consumption of heterogeneous clusters, energy‐aware resource
management has been a hot research topic recently. Among various candidate techniques …

SciLance: Mitigate Load Imbalance for Parallel Scientific Applications in Cloud Environments

X Wang, L Wan, S Klasky, D Zhao… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Elastic cloud computing provides new opportunities for accelerating the process of scientific
discovery. However, unlike high-performance computing (HPC) systems that are built and …

A Case for Criticality Models in Exascale Systems

B Kocoloski, L Piga, W Huang, I Paul… - … Conference on Cluster …, 2016 - ieeexplore.ieee.org
Performance variation is a significant problem for large scale HPC systems and will increase
on future exascale systems. In this work, we show that performance variation impacts the …