Reducing waste in extreme scale systems through introspective analysis

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 172 · Related articles · All 12 versions

[PDF] google.com

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org

HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Cited by 53 · Related articles · All 4 versions

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier

The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

[PDF] umn.edu

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org

Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Cited by 52 · Related articles · All 11 versions

[PDF] osti.gov

GPU lifetimes on Titan supercomputer: Survival analysis and reliability

G Ostrouchov, D Maxwell, RA Ashraf… - … Conference for High …, 2020 - ieeexplore.ieee.org

The Cray XK7 Titan was the top supercomputer system in the world for a long time and
remained critically important throughout its nearly seven year life. It was an interesting …

Cited by 30 · Related articles · All 6 versions

[PDF] google.com

What does power consumption behavior of HPC jobs reveal?: Demystifying, quantifying, and predicting power consumption characteristics

T Patel, A Wagenhäuser, C Eibel… - 2020 IEEE …, 2020 - ieeexplore.ieee.org

As we approach exascale computing, large-scale HPC systems are becoming increasingly
power-constrained, requiring them to run HPC workloads in an energy-efficient manner. The …

Cited by 28 · Related articles · All 3 versions

[HTML] hep.com.cn

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 2 · Related articles · All 4 versions

[PDF] christian-engelmann.info

Power-capping aware checkpointing: On the interplay among power-capping, temperature, reliability, performance, and energy

K Tang, D Tiwari, S Gupta, P Huang… - 2016 46th Annual …, 2016 - ieeexplore.ieee.org

Checkpoint and restart mechanisms have been widely used in large scientific simulation
applications to make forward progress in case of failures. However, none of the prior works …

Cited by 28 · Related articles · All 6 versions

[PDF] hal.science

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …

Cited by 23 · Related articles · All 11 versions

[PDF] acm.org

CoREC: Scalable and resilient in-memory data staging for in-situ workflows

S Duan, P Subedi, P Davis, K Teranishi… - ACM Transactions on …, 2020 - dl.acm.org

The dramatic increase in the scale of current and planned high-end HPC systems is leading to
new challenges, such as the growing costs of data movement and IO, and the reduced mean …

Cited by 12 · Related articles · All 4 versions
