Related articles - Academic Resource Search

An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org

Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Cited by 13 - Related articles - All 2 versions

[PDF] tsinghua.edu.cn

Self-checkpoint: An in-memory checkpoint method using less space and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen, W Zheng - ACM SIGPLAN Notices, 2017 - dl.acm.org

Fault tolerance is increasingly important in high performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Cited by 15 - Related articles - All 5 versions

[PDF] uiuc.edu

A scalable double in-memory checkpoint and restart scheme towards exascale

G Zheng, X Ni, LV Kalé - IEEE/IFIP International Conference on …, 2012 - ieeexplore.ieee.org

As the size of supercomputers increases, the probability of system failure grows
substantially, posing an increasingly significant challenge for scalability. It is important to …

Cited by 163 - Related articles - All 10 versions

Asynchronous checkpointing by dedicated checkpoint threads

F Shahzad, M Wittmann, T Zeiser, G Wellein - Recent Advances in the …, 2012 - Springer

Checkpoint/restart (C/R) is a classical approach to introduce fault tolerance in large
HPC applications. Although it is relatively easy as compared to other fault tolerance …

Cited by 14 - Related articles - All 5 versions

[PDF] psu.edu

FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

G Zheng, L Shi, LV Kalé - … on cluster computing (ieee cat. no …, 2004 - ieeexplore.ieee.org

As high performance clusters continue to grow in size, the mean time between failures
shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the …

Cited by 291 - Related articles - All 11 versions

[PDF] shiftleft.com

[PDF] A case study of incremental and background hybrid in-memory checkpointing

X Dong, Y Xie, N Muralimanohar… - Proc. of the 2010 Exascale …, 2010 - shiftleft.com

Future exascale computing systems will have high failure rates due to the sheer number of
components present in the system. A classic fault-tolerance technique used in today's …

Cited by 19 - Related articles - All 4 versions

[PDF] osti.gov

libhashckpt: hash-based incremental checkpointing using gpu's

KB Ferreira, R Riesen, R Brightwell, P Bridges… - European MPI Users' …, 2011 - Springer

Concern is beginning to grow in the high-performance computing (HPC) community
regarding the reliability guarantees of future large-scale systems. Disk-based coordinated …

Cited by 67 - Related articles - All 11 versions

[PDF] academia.edu

Low-overhead diskless checkpoint for hybrid computing systems

LB Gomez, A Nukada, N Maruyama… - … Conference on High …, 2010 - ieeexplore.ieee.org

As the size of new supercomputers scales to tens of thousands of sockets, the mean time
between failures (MTBF) is decreasing to just several hours and long executions need some …

Cited by 32 - Related articles - All 7 versions

[PDF] google.com

Techniques for efficient in-memory checkpointing

D Vogt, C Giuffrida, H Bos, AS Tanenbaum - ACM SIGOPS Operating …, 2014 - dl.acm.org

Checkpointing is a pivotal technique in system research, with applications ranging from
crash recovery to replay debugging. In this paper, we evaluate a number of in-memory …

Cited by 14 - Related articles - All 18 versions

[PDF] meacse.org

[PDF] Reducing checkpoint creation overhead using data similarity

A Kongmunvattana - Int J Comput, 2015 - meacse.org

Checkpoint/restart is a common technique deployed in the high-performance computing
(HPC) systems to provide a fault-tolerant capability. The most widely deployed …

Cited by 3 - Related articles - All 3 versions
