Fault tolerance for long-running applications is critical to guard against failure of either compute resources or a network. Accomplishing this task in software is non-trivial and there …
Fault tolerance is an active area of research for OpenSHMEM programs. In this work, we present the first approach using system-level transparent checkpointing. This complements …
C Coti, AD Malony - Parallel Processing and Applied Mathematics: 13th …, 2020 - Springer
Future HPC programming systems must address the challenge of how to integrate shared and distributed memory parallelism. The growing number of server cores argues in favor of …
F Butelle, C Coti - 2018 International Conference on High …, 2018 - ieeexplore.ieee.org
Traditional interprocess communication requires cooperation and synchronization between sender and receiver. The One-sided communication model is a new way and very promising …
C Coti, AD Malony - Concurrency and Computation: Practice …, 2021 - Wiley Online Library
In this article, we introduce DiPOSH, a multi‐network, distributed implementation of the OpenSHMEM standard. The core idea behind DiPOSH is to have an API‐to‐network …
Solving real-world problems such as climate simulation in a timely fashion requires High Performance Computing (HPC) systems with tens of thousands of processors running for …
The importance of fault tolerance continues to increase for HPC applications. The continued growth in size and complexity of HPC systems, and of the applications themselves, is …
The Partitioned Global Address Space (PGAS) has emerged recently for parallel programming at large scale. The PGAS ecosystem contains libraries, and languages (often …
Replay of parallel execution is required by HPC debuggers and resilience mechanisms. Up-to-date, there is no existing deterministic replay solution for one-sided communication. The …