[HTML] Toward exascale resilience: 2014 update

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

[Book] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Ultra low power magnetic flip-flop based on checkpointing/power gating and self-enable mechanisms

D Chabi, W Zhao, E Deng, Y Zhang… - … on Circuits and …, 2014 - ieeexplore.ieee.org
Advanced computing systems suffer from high static power due to the rapidly rising leakage
currents in deep sub-micron MOS technologies. Fast access non-volatile memories (NVM) …

ACR: Automatic checkpoint/restart for soft and hard error protection

X Ni, E Meneses, N Jain, LV Kalé - Proceedings of the international …, 2013 - dl.acm.org
As machines increase in scale, many researchers have predicted that failure rates will
correspondingly increase. Soft errors do not inhibit execution, but may silently generate …

Using migratable objects to enhance fault tolerance schemes in supercomputers

E Meneses, X Ni, G Zheng… - IEEE transactions on …, 2014 - ieeexplore.ieee.org
Supercomputers have seen an exponential increase in their size in the last two decades.
Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But …

Benchmarking variables for checkpointing in HPC applications

X Fu, X Huang, W Xu, W Zhang, S Meng… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Checkpoint/Restart (C/R) is a widely used fault tolerance mechanism in converged systems
of cloud, edge, and HPC. However, users often rely on their experience to determine which …

On the design of reactive approach with flexible checkpoint interval to tolerate faults in cloud computing systems

M Amoon, N El-Bahnasawy, S Sadi… - Journal of Ambient …, 2019 - Springer
The likelihood of failures rises in cloud computing systems as a result of their unstable
nature. Additionally, the size of a cloud computing system varies with time and thus failures …

Accelerating seismic redatuming using tile low-rank approximations on NEC SX-Aurora TSUBASA

Y Hong, H Ltaief, M Ravasi, L Gatineau, DE Keyes - 2021 - repository.kaust.edu.sa
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of
the Earth must be numerically re-positioned at locations in the subsurface where reflections …

[PDF] Perspective shape-from-shading by fast marching

A Tankus, N Sochen, Y Yeshurun - CVPR (1), 2004 - courses.cs.tau.ac.il
Shape-from-Shading (SfS) is a fundamental problem in Computer Vision. At its
basis lies the image irradiance equation. Recently, the authors proposed to base the image …