[HTML] Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

X Dong, N Muralimanohar, N Jouppi… - Proceedings of the …, 2009 - dl.acm.org
The scalability of future massively parallel processing (MPP) systems is challenged by high
failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or …

A scalable double in-memory checkpoint and restart scheme towards exascale

G Zheng, X Ni, LV Kalé - IEEE/IFIP International Conference on …, 2012 - ieeexplore.ieee.org
As the size of supercomputers increases, the probability of system failure grows
substantially, posing an increasingly significant challenge for scalability. It is important to …

Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

J Chung, I Lee, M Sullivan, JH Ryoo… - Scientific …, 2013 - content.iospress.com
This paper describes and evaluates a scalable and efficient resilience scheme based on the
concept of containment domains. Containment domains are a programming construct that …

[PDF] Ensemble of Bayesian predictors and decision trees for proactive failure management in cloud computing systems.

Q Guan, Z Zhang, S Fu - J. Commun., 2012 - researchgate.net
In modern cloud computing systems, hundreds and even thousands of cloud servers are
interconnected by multi-layer networks. In such large-scale and complex systems, failures …

Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

X Dong, Y Xie, N Muralimanohar… - ACM Transactions on …, 2011 - dl.acm.org
The scalability of future Massively Parallel Processing (MPP) systems is being severely
challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing …

On the viability of compression for reducing the overheads of checkpoint/restart-based fault tolerance

D Ibtesham, D Arnold, PG Bridges… - 2012 41st …, 2012 - ieeexplore.ieee.org
The increasing size and complexity of high performance computing (HPC) systems have led
to major concerns over fault frequencies and the mechanisms necessary to tolerate these …

Proactive failure management by integrated unsupervised and semi-supervised learning for dependable cloud systems

Q Guan, Z Zhang, S Fu - 2011 Sixth International Conference …, 2011 - ieeexplore.ieee.org
Cloud computing systems continue to grow in their scale and complexity. They are changing
dynamically as well due to the addition and removal of system components, changing …

libhashckpt: hash-based incremental checkpointing using GPU's

KB Ferreira, R Riesen, R Brightwell, P Bridges… - European MPI Users' …, 2011 - Springer
Concern is beginning to grow in the high-performance computing (HPC) community
regarding the reliability guarantees of future large-scale systems. Disk-based coordinated …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …