Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Failure recovery in resilient X10

D Grove, SS Hamouda, B Herta, A Iyengar… - ACM Transactions on …, 2019 - dl.acm.org
Cloud computing has made the resources needed to execute large-scale in-memory
distributed computations widely available. Specialized programming models, e.g. …

Runtime level failure detection and propagation in HPC systems

D Zhong, A Bouteiller, X Luo, G Bosilca - Proceedings of the 26th …, 2019 - dl.acm.org
As the scale of high-performance computing (HPC) systems continues to grow, mean-time-to-failure
(MTTF) of these HPC systems is negatively impacted and tends to decrease. In order …

A malleable and fault-tolerant task pool framework for X10

M Bungart, C Fohry - 2017 IEEE International Conference on …, 2017 - ieeexplore.ieee.org
Current HPC environments require parallel programs that are both malleable and fault-tolerant.
Malleability denotes the ability to embrace system-initiated resource changes, and …

[HTML] Fault tolerance for lifeline-based global load balancing

C Fohry, M Bungart, P Plock - Journal of Software Engineering and …, 2017 - scirp.org
Fault tolerance has become an important issue in parallel computing. It is often addressed at
system level, but application-level approaches receive increasing attention. We consider a …

HOPE: a parallel execution model based on hierarchical omission

M Yasugi, D Muraoka, T Hiraishi, S Umatani… - Proceedings of the 48th …, 2019 - dl.acm.org
This paper presents a new approach to fault-tolerant language systems without a single
point of failure for irregular parallel applications. Work-stealing frameworks provide good …

Elastic deep learning through resilient collective operations

J Li, G Bosilca, A Bouteiller, B Nicolae - … of the SC'23 Workshops of The …, 2023 - dl.acm.org
A robust solution that incorporates fault tolerance and elastic scaling capabilities for
distributed deep learning. Taking advantage of MPI resilient capabilities, a.k.a. User-Level …

[PDF] Multitier reactive programming in high performance computing

D Sokolowski, P Martens… - 6th Workshop on …, 2019 - programming-group.com
High Performance Computing (HPC) is crucial in a number of sectors, including
weather forecasts, particle simulations and fluid dynamics. Existing programming …

[PDF] Fehlertoleranz und Elastizität für ein Framework zur globalen Lastenbalancierung

M Bungart - 2018 - kobra.uni-kassel.de
The number of compute nodes in high-performance computers is growing steadily. In
such systems, fault tolerance becomes increasingly important, since the probability …

Authenticated key exchange with group support for wireless sensor networks

P Svenda, V Matyas - … Conference on Mobile Adhoc and Sensor …, 2007 - ieeexplore.ieee.org
This paper targets the area of wireless sensor networks. Probabilistic key pre-distribution
schemes were developed to deal with limited memory of a single node and high number of …