Combining partial redundancy and checkpointing for HPC

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org

Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Cited by 427 Related articles All 14 versions

[PDF] utk.edu

[BOOK][B] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer

This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Cited by 262 Related articles All 22 versions

[PDF] ncsu.edu

Detection and correction of silent data corruption for large-scale high-performance computing

D Fiala, F Mueller, C Engelmann… - SC'12: Proceedings …, 2012 - ieeexplore.ieee.org

Faults have become the norm rather than the exception for high-end computing clusters.
Exacerbating this situation, some of these faults remain undetected, manifesting themselves …

Cited by 389 Related articles All 31 versions

[PDF] acm.org

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 176 Related articles All 12 versions

[PDF] google.com

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org

HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Cited by 57 Related articles All 4 versions

[PDF] brad.ac.uk

Failure prediction using machine learning in a virtualised HPC system and application

B Mohammed, I Awan, H Ugail, M Younas - Cluster Computing, 2019 - Springer

Failure is an increasingly important issue in high performance computing and cloud
systems. As large-scale systems continue to grow in scale and complexity, mitigating the …

Cited by 89 Related articles All 10 versions

[PDF] pasalabs.org

Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool

D Li, JS Vetter, W Yu - SC'12: Proceedings of the International …, 2012 - ieeexplore.ieee.org

Extreme-scale scientific applications are at a significant risk of being hit by soft errors on
supercomputers as the scale of these systems and the component density continues to …

Cited by 123 Related articles All 12 versions

[PDF] uth.gr

APOGEE: Adaptive prefetching on GPUs for energy efficiency

A Sethia, G Dasika, M Samadi… - Proceedings of the 22nd …, 2013 - ieeexplore.ieee.org

Modern graphics processing units (GPUs) combine large amounts of parallel hardware with
fast context switching among thousands of active threads to achieve high performance …

Cited by 94 Related articles All 11 versions

[PDF] umn.edu

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org

Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Cited by 54 Related articles All 11 versions

[PDF] arxiv.org

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com

This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Cited by 9 Related articles All 22 versions
