Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org
Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

GPU lifetimes on Titan supercomputer: Survival analysis and reliability

G Ostrouchov, D Maxwell, RA Ashraf… - … Conference for High …, 2020 - ieeexplore.ieee.org
The Cray XK7 Titan was the top supercomputer system in the world for a long time and
remained critically important throughout its nearly seven year life. It was an interesting …

What does power consumption behavior of HPC jobs reveal?: Demystifying, quantifying, and predicting power consumption characteristics

T Patel, A Wagenhäuser, C Eibel… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
As we approach exascale computing, large-scale HPC systems are becoming increasingly
power-constrained, requiring them to run HPC workloads in an energy-efficient manner. The …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Power-capping aware checkpointing: On the interplay among power-capping, temperature, reliability, performance, and energy

K Tang, D Tiwari, S Gupta, P Huang… - 2016 46th Annual …, 2016 - ieeexplore.ieee.org
Checkpoint and restart mechanisms have been widely used in large scientific simulation
applications to make forward progress in case of failures. However, none of the prior works …

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …

CoREC: Scalable and resilient in-memory data staging for in-situ workflows

S Duan, P Subedi, P Davis, K Teranishi… - ACM Transactions on …, 2020 - dl.acm.org
The dramatic increase in the scale of current and planned high-end HPC systems is leading to
new challenges, such as the growing costs of data movement and IO, and the reduced mean …