Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Failure prediction for HPC systems and applications: Current situation and open issues

A Gainaru, F Cappello, M Snir… - … International journal of …, 2013 - journals.sagepub.com
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on
providing fault-tolerance strategies that aim to minimize the effects of faults on applications. By far …

Exploring automatic, online failure recovery for scientific applications at extreme scales

M Gamell, DS Katz, H Kolla, J Chen… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …

Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems

D Tiwari, S Gupta, SS Vazhkudai - 2014 44th Annual IEEE/IFIP …, 2014 - ieeexplore.ieee.org
Continuing increase in the computational power of supercomputers has enabled large-scale
scientific applications in the areas of astrophysics, fusion, climate and combustion to run …

Machine learning models for GPU error prediction in a large scale HPC system

B Nie, J Xue, S Gupta, T Patel… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
GPUs are widely deployed on large-scale HPC systems to provide powerful computational
capability for scientific applications from various domains. As those applications are …

Optimization of multi-level checkpoint model for large scale HPC applications

S Di, MS Bouguerra, L Bautista-Gomez… - 2014 IEEE 28th …, 2014 - ieeexplore.ieee.org
The HPC community projects that future extreme-scale systems will be much less stable than
current Petascale systems, thus requiring sophisticated fault tolerance to guarantee the …

Anomaly detection and anticipation in high performance computing systems

A Borghesi, M Molan, M Milano… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly
becoming larger and more complex, together with the issues concerning their maintenance …

Doomsday: Predicting which node will fail when on supercomputers

A Das, F Mueller, P Hargrove… - … Conference for High …, 2018 - ieeexplore.ieee.org
Predicting which node will fail and how soon remains a challenge for HPC resilience, yet
may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing …

Digging deeper into cluster system logs for failure prediction and root cause diagnosis

X Fu, R Ren, SA McKee, J Zhan… - 2014 IEEE International …, 2014 - ieeexplore.ieee.org
As the sizes of supercomputers and data centers grow towards exascale, failures become
normal. System logs play a critical role in the increasingly complex tasks of automatic failure …

Toward reliable and rapid elasticity for streaming dataflows on clouds

A Shukla, Y Simmhan - 2018 IEEE 38th International …, 2018 - ieeexplore.ieee.org
The pervasive availability of streaming data is driving Fast Data platforms for low-latency
streaming applications. Such applications need to respond to dynamism in the input rates …