Resilient X10: Efficient failure-aware programming

D Cunningham, D Grove, B Herta, A Iyengar… - Proceedings of the 19th …, 2014 - dl.acm.org
Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes
can fail. Computations using traditional libraries such as MPI fail when any component …

Scalable replay with partial-order dependencies for message-logging fault tolerance

J Lifflander, E Meneses, H Menon… - 2014 IEEE …, 2014 - ieeexplore.ieee.org
Deterministic replay of a parallel application is commonly used for discovering bugs or to
recover from a hard fault with message-logging fault tolerance. For message passing …

Fault-tolerant termination detection with Safra's algorithm

G Karlos, W Fokkink, P Fuchs - International Conference on Networked …, 2021 - Springer
Safra's distributed termination detection algorithm employs a logical token ring structure
within a distributed network; only passive nodes forward the token, and a counter in the …

Car driving behaviour observation using an immersive car driving simulator

Y Tateyama, Y Mori, K Yamamoto, T Ogi… - … Conference on P2P …, 2010 - ieeexplore.ieee.org
Using a car driving simulator, we can observe drivers' behaviors in dangerous situations
safely. We constructed an immersive car driving simulator. We conducted an experiment in a …

Resilient optimistic termination detection for the async-finish model

SS Hamouda, J Milthorpe - … , ISC High Performance 2019, Frankfurt/Main …, 2019 - Springer
Driven by increasing core count and decreasing mean-time-to-failure in supercomputers,
HPC runtime systems must improve support for dynamic task-parallel execution and …

Towards resilient Chapel: Design and implementation of a transparent resilience mechanism for Chapel

K Panagiotopoulou, HW Loidl - … of the 3rd International Conference on …, 2015 - dl.acm.org
The exponential increase of components in modern High Performance Computing (HPC)
systems poses a challenge on their resilience: predictions of time between failures on …

Extreme-scale viability of collective communication for resilient task scheduling and work stealing

J Wilke, J Bennett, H Kolla, K Teranishi… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
Extreme-scale computing will bring significant changes to high performance computing
system architectures. In particular, the increased number of system components is creating a …

Coordination languages and MPI perturbation theory: The FOX tuple space framework for resilience

JJ Wilke - 2014 IEEE International Parallel & Distributed …, 2014 - ieeexplore.ieee.org
Coordination languages are an established programming model for distributed computing,
but have been largely eclipsed by message passing (MPI) in scientific computing. In contrast …

A fault-tolerant variant of the Mahapatra-Dutt termination detection algorithm

KB Ardal - 2017 - cs.vu.nl
In distributed systems it is important to know when a distributed algorithm has finished its
computation. No single process can decide when an algorithm has terminated with only its …

Improving Tseng's Fault-Tolerant Termination Detection Algorithm

L Taglialatela - 2021 - cs.vu.nl
Distributed systems are networks of computers that communicate among each other through
message-passing. Such systems can be particularly useful for computing high-workloads …