Checkpoint restart support for heterogeneous HPC applications

Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment

MA Shahid, N Islam, MM Alam, MS Mazliham… - Computer Science …, 2021 - Elsevier

Fault Tolerance (FT) is one of the cloud's very critical problems for providing security
assistance. Due to the diverse service architecture, detailed architectures & multiple …

Cited by 40 Related articles All 2 versions

[PDF] hal.science

GPU-enabled asynchronous multi-level checkpoint caching and prefetching

A Maurya, MM Rafique, T Tonellot, HJ AlSalem… - Proceedings of the …, 2023 - dl.acm.org

Checkpointing is an I/O intensive operation increasingly used by High-Performance
Computing (HPC) applications to revisit previous intermediate datasets at scale. Unlike the …

Cited by 5 Related articles All 7 versions

[PDF] arxiv.org

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

A Maurya, R Underwood, MM Rafique… - arXiv preprint arXiv …, 2024 - arxiv.org

LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

Cited by 1 Related articles All 2 versions

[PDF] nsf.gov

Canary: fault-tolerant FaaS for stateful time-sensitive applications

M Arif, K Assogba, MM Rafique - … : International Conference for …, 2022 - ieeexplore.ieee.org

Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful
applications have been migrated to FaaS platforms due to their ease of deployment …

Cited by 7 Related articles All 5 versions

[PDF] hal.science

Towards efficient I/O scheduling for collaborative multi-level checkpointing

A Maurya, B Nicolae, MM Rafique… - … , and Simulation of …, 2021 - ieeexplore.ieee.org

Efficient checkpointing of distributed data structures periodically at key moments during
runtime is a recurring fundamental pattern in a large number of use cases: fault tolerance …

Cited by 8 Related articles All 8 versions

[PDF] wiley.com

Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support

N Eiling, J Baude, S Lankes… - … and Computation: Practice …, 2022 - Wiley Online Library

In high‐performance computing and cloud computing the introduction of heterogeneous
computing resources, such as GPU accelerators, has led to a dramatic increase in …

Cited by 9 Related articles All 5 versions

[PDF] hal.science

Towards Efficient Cache Allocation for High-Frequency Checkpointing

A Maurya, B Nicolae, MM Rafique… - 2022 IEEE 29th …, 2022 - ieeexplore.ieee.org

While many HPC applications are known to have long runtimes, this is not always because
of single large runs: in many cases, this is due to ensembles composed of many short runs …

Cited by 4 Related articles All 7 versions

[PDF] acm.org

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

N Tan, J Luettgau, J Marquez, K Teranishi… - Proceedings of the …, 2023 - dl.acm.org

Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many
HPC workflows. This pattern introduces high I/O overheads and results in increased storage …

Cited by 1 Related articles All 8 versions

[HTML] sciencedirect.com

[HTML] DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems

C Santana, RCF Araújo, IM Sardina, ÍAS Assis… - Computers & …, 2024 - Elsevier

Many geophysical imaging applications, such as full-waveform inversion, often rely on high-
performance computing to meet their demanding computational requirements. The failure of …

Towards Efficient I/O Pipelines using Accumulated Compression

A Maurya, B Nicolae, MM Rafique… - 2023 IEEE 30th …, 2023 - ieeexplore.ieee.org

High-Performance Computing (HPC) workloads generate large volumes of data at high-
frequency during their execution, which needs to be captured concurrently at scale. These …

Cited by 1 Related articles All 8 versions
