Fault tolerance for openshmem

X Qian, K Sen, P Hargrove, C Iancu - Proceedings of the 2016 …, 2016 - dl.acm.org

Replay of parallel execution is required by HPC debuggers and resilience mechanisms. To
date, there is no deterministic replay solution for one-sided communication. The …

Cited by 11 · Related articles · All 9 versions

Check-pointing approach for fault tolerance in OpenSHMEM

P Hao, S Pophale, P Shamis, T Curtis… - … 2015, Annapolis, MD …, 2015 - Springer

Fault tolerance for long running applications is critical to guard against failure of either
compute resources or a network. Accomplishing this task in software is non-trivial and there …

Cited by 7 · Related articles · All 4 versions

System-level transparent checkpointing for OpenSHMEM

R Garg, J Vienne, G Cooperman - … 2016, Baltimore, MD, USA, August 2–4 …, 2016 - Springer

Fault tolerance is an active area of research for OpenSHMEM programs. In this work, we
present the first approach using system-level transparent checkpointing. This complements …

Cited by 4 · Related articles · All 4 versions

On the road to DiPOSH: Adventures in high-performance OpenSHMEM

C Coti, AD Malony - Parallel Processing and Applied Mathematics: 13th …, 2020 - Springer

Future HPC programming systems must address the challenge of how to integrate shared
and distributed memory parallelism. The growing number of server cores argues in favor of …

Cited by 2 · Related articles · All 4 versions

Distributed snapshot for rollback-recovery with one-sided communications

F Butelle, C Coti - 2018 International Conference on High …, 2018 - ieeexplore.ieee.org

Traditional interprocess communication requires cooperation and synchronization between
sender and receiver. The One-sided communication model is a new way and very promising …

Cited by 2 · Related articles · All 9 versions

DiPOSH: A portable OpenSHMEM implementation for short API‐to‐network path

C Coti, AD Malony - Concurrency and Computation: Practice …, 2021 - Wiley Online Library

In this article, we introduce DiPOSH, a multi‐network, distributed implementation of the
OpenSHMEM standard. The core idea behind DiPOSH is to have an API‐to‐network …

Cited by 1 · Related articles · All 3 versions

Achieving Resilience and Maintaining Performance in OpenSHMEM+X Applications

MAS Bari - 2021 - search.proquest.com

Solving real-world problems such as climate simulation in a timely fashion requires High
Performance Computing (HPC) systems with tens of thousands of processors running for …

Checkpointing OpenSHMEM Programs Using Compiler Analysis

MAS Bari, D Basu, W Lu, T Curtis… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org

The importance of fault tolerance continues to increase for HPC applications. The continued
growth in size and complexity of HPC systems, and of the applications themselves, is …

Cited by 1 · Related articles · All 4 versions

A Checkpointing Restart Approach for OpenSHMEM Fault Tolerance

P Hao - 2016 - uh-ir.tdl.org

The Partitioned Global Address Space (PGAS) has emerged recently for parallel
programming at large scale. The PGAS ecosystem contains libraries, and languages (often …

Cited by 1 · Related articles · All 3 versions

SReplay

X Qian, K Sen, P Hargrove, C Iancu - 2016 - escholarship.org