Related articles - Academic resource search

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer

Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Cited by: 33 - Related articles - All 6 versions

[PDF] semanticscholar.org

Fault-tolerant MPI

A Bouteiller - Fault-Tolerance Techniques for High-Performance …, 2015 - Springer

As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI standard remains distressingly vague on the consequence of failures …

Cited by: 10 - Related articles - All 5 versions

[PDF] arxiv.org

Checkpoint/restart approaches for a thread-based MPI runtime

J Adam, M Kermarquer, JB Besnard… - Parallel Computing, 2019 - Elsevier

Fault-tolerance has always been an important topic when it comes to running massively
parallel programs at scale. Statistically, hardware and software failures are expected to …

Cited by: 14 - Related articles - All 5 versions

[PDF] udc.es

Resilient MPI applications using an application-level checkpointing framework and ULFM

N Losada, I Cores, MJ Martín, P González - The Journal of …, 2017 - Springer

Future exascale systems, formed by millions of cores, will present high failure rates, and
long-running applications will need to make use of new fault tolerance techniques to ensure …

Cited by: 40 - Related articles - All 8 versions

[PDF] researchgate.net

Transparent high-speed network checkpoint/restart in MPI

J Adam, JB Besnard, AD Malony, S Shende… - Proceedings of the 25th …, 2018 - dl.acm.org

Fault-tolerance has always been an important topic when it comes to running massively
parallel programs at scale. Statistically, hardware and software failures are expected to …

Cited by: 12 - Related articles - All 2 versions

[PDF] osti.gov

[PDF] A global exception fault tolerance model for MPI

I Laguna, T Gamblin, K Mohror, M Schulz, H Pritchard… - 2014 - osti.gov

Driven both by the anticipated hardware reliability constraints for exascale systems, and the
desire to use MPI in a broader application space, there is an ongoing effort to incorporate …

Cited by: 13 - Related articles - All 3 versions

[PDF] utk.edu

Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI

W Bland, P Du, A Bouteiller, T Herault… - Concurrency and …, 2013 - Wiley Online Library

Most predictions of exascale machines picture billion ways parallelism, encompassing not
only millions of cores but also tens of thousands of nodes. Even considering extremely …

Cited by: 30 - Related articles - All 12 versions

[PDF] sciencedirect.com

Local rollback for resilient MPI applications with application-level checkpointing and message logging

N Losada, G Bosilca, A Bouteiller, P González… - Future Generation …, 2019 - Elsevier

The resilience approach generally used in high-performance computing (HPC) relies on
coordinated checkpoint/restart, a global rollback of all the processes that are running the …

Cited by: 34 - Related articles - All 7 versions

[PDF] psu.edu

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com

As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

Cited by: 276 - Related articles - All 8 versions

[PDF] osti.gov

Toward local failure local recovery resilience model using MPI-ULFM

K Teranishi, MA Heroux - Proceedings of the 21st European MPI Users' …, 2014 - dl.acm.org

The current system reaction to the loss of a single MPI process is to kill all the remaining
processes and restart the application from the most recent checkpoint. This approach will …

Cited by: 111 - Related articles - All 2 versions
