Marriage between coordinated and uncoordinated checkpointing for the exascale era

A systematic survey on fault-tolerant solutions for distributed data analytics: Taxonomy, comparison, and future directions

S Isukapalli, SN Srirama - Computer Science Review, 2024 - Elsevier

Fault tolerance is becoming increasingly important for upcoming exascale systems,
supporting distributed data processing, due to the expected decrease in the Mean Time …

Unified fault-tolerance framework for hybrid task-parallel message-passing applications

O Subasi, T Martsinkevich… - … Journal of High …, 2018 - journals.sagepub.com

We present a unified fault-tolerance framework for task-parallel message-passing
applications to mitigate transient errors. First, we propose a fault-tolerant message-logging …

Cited by 27 - Related articles - All 7 versions

Toward a general theory of optimal checkpoint placement

O Subasi, G Kestor… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing
frequency is most often optimized by assuming an exponential failure distribution. However …

Cited by 24 - Related articles - All 2 versions

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org

As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Cited by 31 - Related articles - All 7 versions

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier

As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Cited by 18 - Related articles - All 7 versions

FPGA architecture characterization for system level performance analysis

D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org

We present a modular and scalable approach for automatically extracting actual
performance information from a set of FPGA-based architecture topologies. This information …

Cited by 28 - Related articles - All 14 versions

Incorporating fault-tolerance awareness into system-level modeling and simulation

T Johnson, H Lam - 2021 IEEE/ACM 11th Workshop on Fault …, 2021 - ieeexplore.ieee.org

As the design space for high-performance computer (HPC) systems grows larger and more
complex, modeling and simulation (MODSIM) techniques become more important to better …

Cited by 3 - Related articles - All 5 versions

Analysis of parallel application checkpoint storage for system configuration

B León, D Franco, D Rexachs, E Luque - The Journal of Supercomputing, 2021 - Springer

The use of fault tolerance strategies such as checkpoints is essential to maintain the
availability of systems and their applications in high-performance computing environments …

Cited by 2 - Related articles - All 4 versions

Enhancing Fault Tolerance in High-Performance Computing: A real hardware case study on a RISC-V Vector Processing Unit

M Barbirotta, F Minervini, CR Morales, A Cristal… - Authorea …, 2024 - techrxiv.org

High-Performance Computing (HPC) systems are designed for large-scale processing and
complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating …

Cited by 3 - Related articles - All 3 versions

On the theory of speculative checkpointing: time and energy considerations

O Subasi, S Krishnamoorthy - Proceedings of the 15th ACM International …, 2018 - dl.acm.org

Collective checkpoint/rollback is the most popular approach for dealing with fail-stop errors
on high-performance computing platforms. Prior work has focused on choosing checkpoint …

Cited by 3 - Related articles - All 2 versions
