Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment

MA Shahid, N Islam, MM Alam, MS Mazliham… - Computer Science …, 2021 - Elsevier
Fault tolerance (FT) is one of the most critical problems in the cloud for providing security
assistance. Due to the diverse service architecture, detailed architectures and multiple …

GPU-enabled asynchronous multi-level checkpoint caching and prefetching

A Maurya, MM Rafique, T Tonellot, HJ AlSalem… - Proceedings of the …, 2023 - dl.acm.org
Checkpointing is an I/O intensive operation increasingly used by High-Performance
Computing (HPC) applications to revisit previous intermediate datasets at scale. Unlike the …

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

A Maurya, R Underwood, MM Rafique… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

Canary: fault-tolerant FaaS for stateful time-sensitive applications

M Arif, K Assogba, MM Rafique - … : International Conference for …, 2022 - ieeexplore.ieee.org
Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful
applications have been migrated to FaaS platforms due to their ease of deployment …

Towards efficient I/O scheduling for collaborative multi-level checkpointing

A Maurya, B Nicolae, MM Rafique… - … , and Simulation of …, 2021 - ieeexplore.ieee.org
Efficient checkpointing of distributed data structures periodically at key moments during
runtime is a recurring fundamental pattern in a large number of use cases: fault tolerance …

Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support

N Eiling, J Baude, S Lankes… - … and Computation: Practice …, 2022 - Wiley Online Library
In high‐performance computing and cloud computing, the introduction of heterogeneous
computing resources, such as GPU accelerators, has led to a dramatic increase in …

Towards Efficient Cache Allocation for High-Frequency Checkpointing

A Maurya, B Nicolae, MM Rafique… - 2022 IEEE 29th …, 2022 - ieeexplore.ieee.org
While many HPC applications are known to have long runtimes, this is not always because
of single large runs: in many cases, this is due to ensembles composed of many short runs …

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

N Tan, J Luettgau, J Marquez, K Teranishi… - Proceedings of the …, 2023 - dl.acm.org
Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many
HPC workflows. This pattern introduces high I/O overheads and results in increased storage …

DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems

C Santana, RCF Araújo, IM Sardina, ÍAS Assis… - Computers & …, 2024 - Elsevier
Many geophysical imaging applications, such as full-waveform inversion, often rely on high-
performance computing to meet their demanding computational requirements. The failure of …

Towards Efficient I/O Pipelines using Accumulated Compression

A Maurya, B Nicolae, MM Rafique… - 2023 IEEE 30th …, 2023 - ieeexplore.ieee.org
High-Performance Computing (HPC) workloads generate large volumes of data at high
frequency during their execution, which need to be captured concurrently at scale. These …