Checkpointing strategies for parallel jobs

M Bougeret, H Casanova, M Rabie, Y Robert… - Proceedings of 2011 …, 2011 - dl.acm.org
This work provides an analysis of checkpointing strategies for minimizing expected job
execution times in an environment that is subject to processor failures. In the case of both …

HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …

Correlated set coordination in fault tolerant message logging protocols

A Bouteiller, T Herault, G Bosilca… - Euro-Par 2011 Parallel …, 2011 - Springer
Based on our current expectation for the exascale systems, composed of hundreds of
thousands of many-core nodes, the mean time between failures will become small, even …

On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications

T Ropars, A Guermouche, B Uçar, E Meneses… - Euro-Par 2011 Parallel …, 2011 - Springer
Fault tolerance is becoming a major concern in HPC systems. The two traditional
approaches for message passing applications, coordinated checkpointing and message …

Long term activity monitoring with a wearable sensor node

KV Laerhoven, HW Gellersen… - … Workshop on Wearable …, 2006 - ieeexplore.ieee.org
This paper introduces an encapsulated sensor node that is devised to monitor and record
motion patterns over long, quotidian periods of time with potential application in …

Towards management of energy consumption in HPC systems with fault tolerance

M Morán, J Balladini, D Rexachs… - 2020 IEEE Congreso …, 2020 - ieeexplore.ieee.org
High-performance computing continues to increase its computing power and energy
efficiency. However, energy consumption continues to rise and finding ways to limit and/or …

Fault tolerance at system level based on RADIC architecture

M Castro-León, H Meyer, D Rexachs… - Journal of Parallel and …, 2015 - Elsevier
The increasing failure rate in High Performance Computing encourages the investigation of
fault tolerance mechanisms to guarantee the execution of an application in spite of node …

Work-in-progress: Optimal checkpointing strategy for real-time systems with both logical and timing correctness

L Zhang, Z Wang, F Kong - 2022 IEEE Real-Time Systems …, 2022 - ieeexplore.ieee.org
This paper proposes an optimal checkpoint scheme for fault resilience in real-time systems,
in which we consider both logical consistency and timing correctness. First, we partition …

Hybrid Message Pessimistic Logging. Improving current pessimistic message logging protocols

H Meyer, R Muresano, M Castro-León… - Journal of Parallel and …, 2017 - Elsevier
With the growing scale of HPC applications, there has been an increase in the number of
interruptions as a consequence of hardware failures. The remarkable decrease of Mean …

Correlated set coordination in fault tolerant message logging protocols for many‐core clusters

A Bouteiller, T Herault, G Bosilca… - Concurrency and …, 2013 - Wiley Online Library
With our current expectation for the exascale systems, composed of hundreds of thousands of
many‐core nodes, the mean time between failures will become small, even under the most …