Toward a smart cloud: A review of fault-tolerance methods in cloud systems

MA Mukwevho, T Celik - IEEE Transactions on Services …, 2018 - ieeexplore.ieee.org
This paper presents a comprehensive survey of the state-of-the-art work on fault tolerance
methods proposed for cloud computing. The survey classifies fault-tolerance methods into …

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Fast error-bounded lossy HPC data compression with SZ

S Di, F Cappello - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org
Today's HPC applications are producing extremely large amounts of data, thus it is
necessary to use an efficient compression before storing them to parallel file systems. In this …

CheckFreq: Frequent, Fine-Grained DNN Checkpointing

J Mohan, A Phanishayee, V Chidambaram - 19th USENIX Conference …, 2021 - usenix.org
Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task.
During training, the model performs computation at the GPU to learn weights, repeatedly …

Check-N-Run: A checkpointing system for training deep learning recommendation models

A Eisenman, KK Matam, S Ingram, D Mudigere… - … USENIX Symposium on …, 2022 - usenix.org
Checkpoints play an important role in training long running machine learning (ML) models.
Checkpoints take a snapshot of an ML model and store it in a non-volatile memory so that …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Hybrid workload scheduling on HPC systems

Y Fan, Z Lan, P Rich, W Allcock… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Traditionally, on-demand, rigid, and malleable applications have been scheduled and
executed on separate systems. The ever-growing workload demands and rapidly …

Toward an optimal online checkpoint solution under a two-level HPC checkpoint model

S Di, Y Robert, F Vivien… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
The traditional single-level checkpointing method suffers from significant overhead on
large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in …

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

Improving performance of iterative methods by lossy checkpointing

D Tao, S Di, X Liang, Z Chen, F Cappello - Proceedings of the 27th …, 2018 - dl.acm.org
Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …