End-to-end I/O monitoring on leading supercomputers

B Yang, W Xue, T Zhang, S Liu, X Ma, X Wang… - ACM Transactions on …, 2023 - dl.acm.org
This paper offers a solution to overcome the complexities of production system I/O
performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and …

Systematically inferring I/O performance variability by examining repetitive job behavior

E Costa, T Patel, B Schwaller, JM Brandt… - Proceedings of the …, 2021 - dl.acm.org
Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …

Access patterns and performance behaviors of multi-layer supercomputer I/O subsystems under production load

JL Bez, AM Karimi, AK Paul, B Xie, S Byna… - Proceedings of the 31st …, 2022 - dl.acm.org
Scientific computing workloads at HPC facilities have been shifting from traditional
numerical simulations to AI/ML applications for training and inference while processing and …

Understanding HPC application I/O behavior using system level statistics

AK Paul, O Faaland, A Moody… - 2020 IEEE 27th …, 2020 - ieeexplore.ieee.org
The processor performance of high performance computing (HPC) systems is increasing at
a much higher rate than storage performance. This imbalance leads to I/O performance …

StRAID: Stripe-threaded Architecture for Parity-based RAIDs with Ultra-fast SSDs

S Wang, Q Cao, Z Lu, H Jiang, J Yao… - 2022 USENIX Annual …, 2022 - usenix.org
Popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID (e.g.,
RAID5 and RAID6) assigns one or more centralized worker threads to efficiently process all …

tf-Darshan: Understanding fine-grained I/O performance in machine learning workloads

SWD Chien, A Podobas, IB Peng… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
Machine Learning applications on HPC systems have been gaining popularity in recent
years. The upcoming large scale systems will offer tremendous parallelism for training …

Full lifecycle data analysis on a large-scale and leadership supercomputer: what can we learn from it?

B Yang, H Wei, W Zhu, Y Zhang, W Liu… - 2024 USENIX Annual …, 2024 - usenix.org
The system architecture of contemporary supercomputers is growing increasingly intricate
with the ongoing evolution of system-wide network and storage technologies, making it …

Explorations and Exploitation for Parity-based RAIDs with Ultra-fast SSDs

S Wang, Q Cao, H Jiang, Z Lu, J Yao, Y Chen… - ACM Transactions on …, 2024 - dl.acm.org
Following a conventional design principle that pays more fast-CPU-cycles for fewer slow-
I/Os, popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID …

ScaleCache: A Scalable Page Cache for Multiple Solid-State Drives

KT Pham, S Cho, S Lee, LA Nguyen, H Yeo… - Proceedings of the …, 2024 - dl.acm.org
This paper presents a scalable page cache called ScaleCache for improving SSD
scalability. Specifically, we first propose a concurrent data structure of page cache based on …

DaYu: Optimizing Distributed Scientific Workflows by Decoding Dataflow Semantics and Dynamics

M Tang, J Cernuda, J Ye, L Guo… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
The combination of ever-growing scientific datasets and distributed workflow complexity
creates I/O performance bottlenecks due to data volume, velocity, and variety. Although the …