Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres

SS Gill, X Ouyang, P Garraghan - The Journal of Supercomputing, 2020 - Springer
Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to
execute them in a parallel manner using clusters to improve execution time. However, such …

Multi-source distributed system data for AI-powered analytics

S Nedelkoski, J Bogatinovski, AK Mandapati… - Service-Oriented and …, 2020 - Springer
The emerging field of Artificial Intelligence for IT Operations (AIOps) utilizes monitoring data,
big data platforms, and machine learning, to automate operations and maintenance (O&M) …

Detecting straggler MapReduce tasks in big data processing infrastructure by neural network

A Javadpour, G Wang, S Rezaei, KC Li - The Journal of Supercomputing, 2020 - Springer
Straggler task detection is one of the main challenges in applying MapReduce for
parallelizing and distributing large-scale data processing. It is defined as detecting running …

CERES: Container-based elastic resource management system for mixed workloads

J Yu, D Feng, W Tong, P Lv, Y Xiong - Proceedings of the 50th …, 2021 - dl.acm.org
It is common to deploy multiple workloads in one cluster to achieve high resource utilization,
which tends to bring more resource contentions and performance interferences. If the …

TERMS: Task management policies to achieve high performance for mixed workloads using surplus resources

J Yu, W Tong, P Lv, D Feng - Journal of Parallel and Distributed Computing, 2022 - Elsevier
Resource contentions and performance interferences can lead to workload performance
degradation in mixed-workload deployment clusters. Previous work guarantees the resource …

GPU cluster dynamics: insights from Alibaba's 2023 trace release

A Siavashi, M Momtazpour - Computing, 2025 - Springer
In this paper, we present a comprehensive analysis of GPU cluster traces from Alibaba,
released in 2023, focusing on understanding the detailed settings of nodes and pods and …

Identifying Performance Bottleneck in Shared In-Network Aggregation during Distributed Training

C Liu, J Zheng, W Wu, B Zhao, W Nie… - 2023 IEEE 29th …, 2023 - ieeexplore.ieee.org
As the emergence of recently popular large language model, distributed training (DT)
optimizes the performance via using different parallelization strategies, resource schedulers …

HiperJobViz: Visualizing resource allocations in high-performance computing center via multivariate health-status data

N Nguyen, T Dang, J Hass… - 2019 IEEE/ACM Industry …, 2019 - ieeexplore.ieee.org
Scheduling, visualizing, and balancing resource allocations in High-Performance
Computing Centers are complicated tasks due to a large amount of data and the dynamic …

Detection of stragglers and optimal rescheduling of slow running tasks in big data environment using LFCSO-LVQ classifier and enhanced PSO algorithm

HA Joshiara, CS Thaker, SM Shah… - … Journal of Data …, 2022 - inderscienceonline.com
This paper plans to implement intelligent techniques in finding straggler tasks along with
speculating their way of execution. Here, the LFCSO-LVQ is proposed to effectively identify …

Detecting last-level cache contention in workload colocation with meta learning

H Shen, C Li - 2019 IEEE International Symposium on …, 2019 - ieeexplore.ieee.org
While workload colocation improves cluster utilization in cloud environments, it introduces
performance-impacting contentions on unmanaged resources. We address the problem of …