GPU-enabled asynchronous multi-level checkpoint caching and prefetching

A Maurya, MM Rafique, T Tonellot, HJ AlSalem… - Proceedings of the …, 2023 - dl.acm.org
Checkpointing is an I/O intensive operation increasingly used by High-Performance
Computing (HPC) applications to revisit previous intermediate datasets at scale. Unlike the …

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

A Maurya, R Underwood, MM Rafique… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

HashCache: Accelerating Serverless Computing by Skipping Duplicated Function Execution

Z Wu, Y Deng, Y Zhou, L Cui… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Serverless computing is a leading force behind deploying and managing software in cloud
computing. One inherent challenge in serverless computing is the increased overall latency …

PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning

K Assogba, E Lima, MM Rafique… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Accurately predicting the training time of deep learning (DL) workloads is critical for
optimizing the utilization of data centers and allocating the required cluster resources for …

Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering

K Assogba, B Nicolae… - 2023 IEEE 30th …, 2023 - ieeexplore.ieee.org
Despite significant advances, training deep learning models remains a time-consuming and
resource-intensive task. One of the key challenges in this context is the ingestion of the …

An Intelligent Framework for Efficiently Utilizing Distributed Heterogeneous Resources to Improve HPC Application Performance

M Arif - 2024 - repository.rit.edu
High-Performance Computing (HPC) workloads are being widely used to solve
complex problems in scientific applications from diverse domains, such as weather …