- Academic resource search

FTI: High performance fault tolerance interface for hybrid systems

L Bautista-Gomez, S Tsuboi, D Komatitsch… - Proceedings of 2011 …, 2011 - dl.acm.org

Large scientific applications deployed on current petascale systems expend a significant
amount of their execution time dumping checkpoint files to remote storage. New fault tolerant …

Cited by 445 - Related articles - All 10 versions

[PDF] github.io

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

N Maruyama, T Nomura, K Sato… - Proceedings of 2011 …, 2011 - dl.acm.org

This paper proposes a compiler-based programming framework that automatically translates
user-written structured grid code into scalable parallel implementation code for GPU …

Cited by 257 - Related articles - All 10 versions

[PDF] psu.edu

Efficient verification of real-time systems: Compact data structure and state-space reduction

KG Larsen, F Larsson, P Pettersson… - Proceedings Real-Time …, 1997 - ieeexplore.ieee.org

During the past few years, a number of verification tools have been developed for real-time
systems in the framework of timed automata (e.g. KRONOS and UPPAAL). One of the major …

Cited by 220 - Related articles - All 20 versions

[PDF] academia.edu

Optimization of multi-level checkpoint model for large scale HPC applications

S Di, MS Bouguerra, L Bautista-Gomez… - 2014 IEEE 28th …, 2014 - ieeexplore.ieee.org

HPC community projects that future extreme scale systems will be much less stable than
current Petascale systems, thus requiring sophisticated fault tolerance to guarantee the …

Cited by 114 - Related articles - All 10 versions

[PDF] hal.science

Checkpointing strategies for parallel jobs

M Bougeret, H Casanova, M Rabie, Y Robert… - Proceedings of 2011 …, 2011 - dl.acm.org

This work provides an analysis of checkpointing strategies for minimizing expected job
execution times in an environment that is subject to processor failures. In the case of both …

Cited by 123 - Related articles - All 17 versions

[PDF] ieee.org

Toward an optimal online checkpoint solution under a two-level HPC checkpoint model

S Di, Y Robert, F Vivien… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org

The traditional single-level checkpointing method suffers from significant overhead on
large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in …

Cited by 74 - Related articles - All 13 versions

[PDF] arxiv.org

CRUM: Checkpoint-restart support for CUDA's unified memory

R Garg, A Mohan, M Sullivan… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org

Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal
GPU. The older CUDA programming style is akin to older large-memory UNIX applications …

Cited by 43 - Related articles - All 6 versions

Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing

MS Bouguerra, A Gainaru, LB Gomez… - 2013 IEEE 27th …, 2013 - ieeexplore.ieee.org

As the failure frequency is increasing with the components count in modern and future
supercomputers, resilience is becoming critical for extreme scale systems. The association …

Cited by 78 - Related articles - All 5 versions

[PDF] arxiv.org

CRAC: Checkpoint-restart architecture for CUDA with streams and UVM

T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …

Cited by 24 - Related articles - All 7 versions

[PDF] sjtu.edu.cn

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org

Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …

Cited by 23 - Related articles - All 6 versions
