Tree‐based fault‐tolerant collective operations for MPI

A Margolin, A Barak - Concurrency and Computation: Practice …, 2021 - Wiley Online Library
With the increase in size and complexity of high‐performance computing systems, the
probability of failures and the cost of recovery grow. Parallel applications running on these …

An Algorithm-Based Fault Tolerance Strategy for the Bitonic Sort Parallel Algorithm

ET Camargo, EP Duarte - 2021 10th Latin-American …, 2021 - ieeexplore.ieee.org
High Performance Computing (HPC) systems are employed to solve hard problems and rely
on parallel algorithms which present very long execution times, up to several days. These …