SReplay: Deterministic sub-group replay for one-sided communication

X Qian, K Sen, P Hargrove, C Iancu - Proceedings of the 2016 …, 2016 - dl.acm.org
Replay of parallel execution is required by HPC debuggers and resilience mechanisms. Up-
to-date, there is no existing deterministic replay solution for one-sided communication. The …

Check-pointing approach for fault tolerance in OpenSHMEM

P Hao, S Pophale, P Shamis, T Curtis… - … 2015, Annapolis, MD …, 2015 - Springer
Fault tolerance for long running applications is critical to guard against failure of either
compute resources or a network. Accomplishing this task in software is non-trivial and there …

System-level transparent checkpointing for OpenSHMEM

R Garg, J Vienne, G Cooperman - … 2016, Baltimore, MD, USA, August 2–4 …, 2016 - Springer
Fault tolerance is an active area of research for OpenSHMEM programs. In this work, we
present the first approach using system-level transparent checkpointing. This complements …

On the road to DiPOSH: Adventures in high-performance OpenSHMEM

C Coti, AD Malony - Parallel Processing and Applied Mathematics: 13th …, 2020 - Springer
Future HPC programming systems must address the challenge of how to integrate shared
and distributed memory parallelism. The growing number of server cores argues in favor of …

Distributed snapshot for rollback-recovery with one-sided communications

F Butelle, C Coti - 2018 International Conference on High …, 2018 - ieeexplore.ieee.org
Traditional interprocess communication requires cooperation and synchronization between
sender and receiver. The One-sided communication model is a new way and very promising …

DiPOSH: A portable OpenSHMEM implementation for short API‐to‐network path

C Coti, AD Malony - Concurrency and Computation: Practice …, 2021 - Wiley Online Library
In this article, we introduce DiPOSH, a multi‐network, distributed implementation of the
OpenSHMEM standard. The core idea behind DiPOSH is to have an API‐to‐network …

Achieving Resilience and Maintaining Performance in OpenSHMEM+X Applications

MAS Bari - 2021 - search.proquest.com
Solving real-world problems such as climate simulation in a timely fashion requires High
Performance Computing (HPC) systems with tens of thousands of processors running for …

Checkpointing OpenSHMEM Programs Using Compiler Analysis

MAS Bari, D Basu, W Lu, T Curtis… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
The importance of fault tolerance continues to increase for HPC applications. The continued
growth in size and complexity of HPC systems, and of the applications themselves, is …

A Checkpointing Restart Approach for OpenSHMEM Fault Tolerance

P Hao - 2016 - uh-ir.tdl.org
The Partitioned Global Address Space (PGAS) has emerged recently for parallel
programming at large scale. The PGAS ecosystem contains libraries, and languages (often …

SReplay

X Qian, K Sen, P Hargrove, C Iancu - 2016 - escholarship.org
Replay of parallel execution is required by HPC debuggers and resilience mechanisms. Up-
to-date, there is no existing deterministic replay solution for one-sided communication. The …