Zeno: A straggler diagnosis system for distributed computing using machine learning

Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres

SS Gill, X Ouyang, P Garraghan - The Journal of Supercomputing, 2020 - Springer

Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to
execute them in a parallel manner using clusters to improve execution time. However, such …

Cited by 30 · All 8 versions

Multi-source distributed system data for AI-powered analytics

S Nedelkoski, J Bogatinovski, AK Mandapati… - Service-Oriented and …, 2020 - Springer

The emerging field of Artificial Intelligence for IT Operations (AIOps) utilizes monitoring data,
big data platforms, and machine learning, to automate operations and maintenance (O&M) …

Cited by 49 · All 8 versions

Detecting straggler MapReduce tasks in big data processing infrastructure by neural network

A Javadpour, G Wang, S Rezaei, KC Li - The Journal of Supercomputing, 2020 - Springer

Straggler task detection is one of the main challenges in applying MapReduce for
parallelizing and distributing large-scale data processing. It is defined as detecting running …

Cited by 36 · All 6 versions

CERES: Container-based elastic resource management system for mixed workloads

J Yu, D Feng, W Tong, P Lv, Y Xiong - Proceedings of the 50th …, 2021 - dl.acm.org

It is common to deploy multiple workloads in one cluster to achieve high resource utilization,
which tends to bring more resource contentions and performance interferences. If the …

Cited by 7 · All 2 versions

Terms: Task management policies to achieve high performance for mixed workloads using surplus resources

J Yu, W Tong, P Lv, D Feng - Journal of Parallel and Distributed Computing, 2022 - Elsevier

Resource contentions and performance interferences can lead to workload performance
degradation in mixed-workload deployment clusters. Previous work guarantees the resource …

Cited by 1 · All 2 versions

GPU cluster dynamics: insights from Alibaba's 2023 trace release

A Siavashi, M Momtazpour - Computing, 2025 - Springer

In this paper, we present a comprehensive analysis of GPU cluster traces from Alibaba,
released in 2023, focusing on understanding the detailed settings of nodes and pods and …

Cited by 1 · All 2 versions

Identifying Performance Bottleneck in Shared In-Network Aggregation during Distributed Training

C Liu, J Zheng, W Wu, B Zhao, W Nie… - 2023 IEEE 29th …, 2023 - ieeexplore.ieee.org

As the emergence of recently popular large language model, distributed training (DT)
optimizes the performance via using different parallelization strategies, resource schedulers …

Cited by 1

Hiperjobviz: Visualizing resource allocations in high-performance computing center via multivariate health-status data

N Nguyen, T Dang, J Hass… - 2019 IEEE/ACM Industry …, 2019 - ieeexplore.ieee.org

Scheduling, visualizing, and balancing resource allocations in High-Performance
Computing Centers are complicated tasks due to a large amount of data and the dynamic …

Cited by 5 · All 3 versions

Detection of stragglers and optimal rescheduling of slow running tasks in big data environment using LFCSO-LVQ classifier and enhanced PSO algorithm

HA Joshiara, CS Thaker, SM Shah… - … Journal of Data …, 2022 - inderscienceonline.com

This paper plans to implement intelligent techniques in finding straggler tasks along with
speculating their way of execution. Here, the LFCSO-LVQ is proposed to effectively identify …

Cited by 1 · All 6 versions

Detecting last-level cache contention in workload colocation with meta learning

H Shen, C Li - 2019 IEEE International Symposium on …, 2019 - ieeexplore.ieee.org

While workload colocation improves cluster utilization in cloud environments, it introduces
performance-impacting contentions on unmanaged resources. We address the problem of …

Cited by 5 · All 2 versions
