Optimization of multi-level checkpoint model for large scale HPC applications

MA Mukwevho, T Celik - IEEE Transactions on Services …, 2018 - ieeexplore.ieee.org

This paper presents a comprehensive survey of the state-of-the-art work on fault tolerance
methods proposed for cloud computing. The survey classifies fault-tolerance methods into …

Cited by 114 - Related articles - All 4 versions

[PDF] acm.org

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org

The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Cited by 64 - Related articles - All 4 versions

[PDF] anl.gov

Fast error-bounded lossy HPC data compression with SZ

S Di, F Cappello - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org

Today's HPC applications are producing extremely large amounts of data, thus it is
necessary to use an efficient compression before storing them to parallel file systems. In this …

Cited by 505 - Related articles - All 6 versions

[PDF] usenix.org

CheckFreq: Frequent, Fine-Grained DNN Checkpointing

J Mohan, A Phanishayee, V Chidambaram - 19th USENIX Conference …, 2021 - usenix.org

Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task.
During training, the model performs computation at the GPU to learn weights, repeatedly …

Cited by 62 - Related articles - All 6 versions

[PDF] usenix.org

Check-N-Run: A checkpointing system for training deep learning recommendation models

A Eisenman, KK Matam, S Ingram, D Mudigere… - … USENIX Symposium on …, 2022 - usenix.org

Checkpoints play an important role in training long running machine learning (ML) models.
Checkpoints take a snapshot of an ML model and store it in a non-volatile memory so that …

Cited by 49 - Related articles - All 8 versions

[PDF] upv.es

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org

Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Cited by 31 - Related articles - All 12 versions

[PDF] arxiv.org

Hybrid workload scheduling on HPC systems

Y Fan, Z Lan, P Rich, W Allcock… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org

Traditionally, on-demand, rigid, and malleable applications have been scheduled and
executed on separate systems. The ever-growing workload demands and rapidly …

Cited by 23 - Related articles - All 8 versions

[PDF] ieee.org

Toward an optimal online checkpoint solution under a two-level HPC checkpoint model

S Di, Y Robert, F Vivien… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org

The traditional single-level checkpointing method suffers from significant overhead on
large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in …

Cited by 74 - Related articles - All 13 versions

Adaptive impact-driven detection of silent data corruption for HPC applications

S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org

For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …

Cited by 71 - Related articles - All 5 versions

[PDF] acm.org

Improving performance of iterative methods by lossy checkponting

D Tao, S Di, X Liang, Z Chen, F Cappello - Proceedings of the 27th …, 2018 - dl.acm.org

Iterative methods are commonly used approaches to solve large, sparse linear systems,
which are fundamental operations for many modern scientific simulations. When the large …

Cited by 49 - Related articles - All 10 versions
