The emerging field of Artificial Intelligence for IT Operations (AIOps) uses monitoring data, big data platforms, and machine learning to automate operations and maintenance (O&M) …
A Javadpour, G Wang, S Rezaei, KC Li - The Journal of Supercomputing, 2020 - Springer
Straggler task detection is one of the main challenges in applying MapReduce for parallelizing and distributing large-scale data processing. It is defined as detecting running …
J Yu, D Feng, W Tong, P Lv, Y Xiong - Proceedings of the 50th …, 2021 - dl.acm.org
It is common to deploy multiple workloads in one cluster to achieve high resource utilization, which tends to bring more resource contentions and performance interferences. If the …
J Yu, W Tong, P Lv, D Feng - Journal of Parallel and Distributed Computing, 2022 - Elsevier
Resource contentions and performance interferences can lead to workload performance degradation in mixed-workload deployment clusters. Previous work guarantees the resource …
In this paper, we present a comprehensive analysis of GPU cluster traces from Alibaba, released in 2023, focusing on understanding the detailed settings of nodes and pods and …
C Liu, J Zheng, W Wu, B Zhao, W Nie… - 2023 IEEE 29th …, 2023 - ieeexplore.ieee.org
With the emergence of recently popular large language models, distributed training (DT) optimizes performance by using different parallelization strategies, resource schedulers …
N Nguyen, T Dang, J Hass… - 2019 IEEE/ACM Industry …, 2019 - ieeexplore.ieee.org
Scheduling, visualizing, and balancing resource allocations in High-Performance Computing Centers are complicated tasks due to a large amount of data and the dynamic …
HA Joshiara, CS Thaker, SM Shah… - … Journal of Data …, 2022 - inderscienceonline.com
This paper plans to implement intelligent techniques for finding straggler tasks and speculating on their execution. Here, the LFCSO-LVQ is proposed to effectively identify …
H Shen, C Li - 2019 IEEE International Symposium on …, 2019 - ieeexplore.ieee.org
While workload colocation improves cluster utilization in cloud environments, it introduces performance-impacting contentions on unmanaged resources. We address the problem of …