Compiler-enhanced incremental checkpointing for OpenMP applications

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org

Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Cited by 425 · Related articles · All 14 versions

Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

X Dong, N Muralimanohar, N Jouppi… - Proceedings of the …, 2009 - dl.acm.org

The scalability of future massively parallel processing (MPP) systems is challenged by high
failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or …

Cited by 207 · Related articles · All 12 versions

A scalable double in-memory checkpoint and restart scheme towards exascale

G Zheng, X Ni, LV Kalé - IEEE/IFIP International Conference on …, 2012 - ieeexplore.ieee.org

As the size of supercomputers increases, the probability of system failure grows
substantially, posing an increasingly significant challenge for scalability. It is important to …

Cited by 163 · Related articles · All 10 versions

Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

J Chung, I Lee, M Sullivan, JH Ryoo… - Scientific …, 2013 - content.iospress.com

This paper describes and evaluates a scalable and efficient resilience scheme based on the
concept of containment domains. Containment domains are a programming construct that …

Cited by 133 · Related articles · All 22 versions

Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems.

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net

In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Cited by 116 · Related articles · All 4 versions

Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

X Dong, Y Xie, N Muralimanohar… - ACM Transactions on …, 2011 - dl.acm.org

The scalability of future Massively Parallel Processing (MPP) systems is being severely
challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing …

Cited by 84 · Related articles · All 7 versions

On the viability of compression for reducing the overheads of checkpoint/restart-based fault tolerance

D Ibtesham, D Arnold, PG Bridges… - 2012 41st …, 2012 - ieeexplore.ieee.org

The increasing size and complexity of high performance computing (HPC) systems have led
to major concerns over fault frequencies and the mechanisms necessary to tolerate these …

Cited by 71 · Related articles · All 14 versions

Proactive failure management by integrated unsupervised and semi-supervised learning for dependable cloud systems

Q Guan, Z Zhang, S Fu - 2011 Sixth International Conference …, 2011 - ieeexplore.ieee.org

Cloud computing systems continue to grow in their scale and complexity. They are changing
dynamically as well due to the addition and removal of system components, changing …

Cited by 61 · Related articles · All 4 versions

libhashckpt: hash-based incremental checkpointing using gpu's

KB Ferreira, R Riesen, R Brightwell, P Bridges… - European MPI Users' …, 2011 - Springer

Concern is beginning to grow in the high-performance computing (HPC) community
regarding the reliability guarantees of future large-scale systems. Disk-based coordinated …

Cited by 67 · Related articles · All 11 versions

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 1 · Related articles · All 3 versions