Scalable group-based checkpoint/restart for large-scale message-passing systems

M Bougeret, H Casanova, M Rabie, Y Robert… - Proceedings of 2011 …, 2011 - dl.acm.org

This work provides an analysis of checkpointing strategies for minimizing expected job
execution times in an environment that is subject to processor failures. In the case of both …

Cited by 126 · Related articles · All 17 versions

[PDF] hal.science

HydEE: Failure containment without event logging for large scale send-deterministic MPI applications

A Guermouche, T Ropars, M Snir… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org

High performance computing will probably reach exascale in this decade. At this scale,
mean time between failures is expected to be a few hours. Existing fault tolerant protocols for …

Cited by 75 · Related articles · All 17 versions

[PDF] idsa.prd.fr

Correlated set coordination in fault tolerant message logging protocols

A Bouteiller, T Herault, G Bosilca… - Euro-Par 2011 Parallel …, 2011 - Springer

Based on our current expectation for the exascale systems, composed of hundreds of
thousands of many-core nodes, the mean time between failures will become small, even …

Cited by 59 · Related articles · All 12 versions

[PDF] hal.science

On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications

T Ropars, A Guermouche, B Uçar, E Meneses… - Euro-Par 2011 Parallel …, 2011 - Springer

Fault tolerance is becoming a major concern in HPC systems. The two traditional
approaches for message passing applications, coordinated checkpointing and message …

Cited by 60 · Related articles · All 12 versions

[PDF] core.ac.uk

Long term activity monitoring with a wearable sensor node

KV Laerhoven, HW Gellersen… - … Workshop on Wearable …, 2006 - ieeexplore.ieee.org

This paper introduces an encapsulated sensor node that is devised to monitor and record
motion patterns over long, quotidian periods of time with potential application in …

Cited by 73 · Related articles · All 17 versions

[PDF] arxiv.org

Towards management of energy consumption in HPC systems with fault tolerance

M Morán, J Balladini, D Rexachs… - 2020 IEEE Congreso …, 2020 - ieeexplore.ieee.org

High-performance computing continues to increase its computing power and energy
efficiency. However, energy consumption continues to rise and finding ways to limit and/or …

Cited by 10 · Related articles · All 7 versions

[HTML] sciencedirect.com

[HTML] Fault tolerance at system level based on RADIC architecture

M Castro-León, H Meyer, D Rexachs… - Journal of Parallel and …, 2015 - Elsevier

The increasing failure rate in High Performance Computing encourages the investigation of
fault tolerance mechanisms to guarantee the execution of an application in spite of node …

Cited by 19 · Related articles · All 13 versions

[PDF] nsf.gov

Work-in-progress: Optimal checkpointing strategy for real-time systems with both logical and timing correctness

L Zhang, Z Wang, F Kong - 2022 IEEE Real-Time Systems …, 2022 - ieeexplore.ieee.org

This paper proposes an optimal checkpoint scheme for fault resilience in real-time systems,
in which we consider both logical consistency and timing correctness. First, we partition …

Cited by 4 · Related articles · All 4 versions

[PDF] upc.edu

Hybrid Message Pessimistic Logging. Improving current pessimistic message logging protocols

H Meyer, R Muresano, M Castro-León… - Journal of Parallel and …, 2017 - Elsevier

With the growing scale of HPC applications, there has been an increase in the number of
interruptions as a consequence of hardware failures. The remarkable decrease of Mean …

Cited by 15 · Related articles · All 10 versions

[PDF] psu.edu

Correlated set coordination in fault tolerant message logging protocols for many‐core clusters

A Bouteiller, T Herault, G Bosilca… - Concurrency and …, 2013 - Wiley Online Library

With our current expectation for the exascale systems, composed of hundreds of thousands of
many‐core nodes, the mean time between failures will become small, even under the most …

Cited by 24 · Related articles · All 10 versions
