DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

A Maurya, R Underwood, MM Rafique… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end
high-performance computing (HPC) infrastructures and ingest massive amounts of input data …

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

N Tan, J Luettgau, J Marquez, K Teranishi… - Proceedings of the …, 2023 - dl.acm.org
Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many
HPC workflows. This pattern introduces high I/O overheads and results in increased storage …

Towards Efficient I/O Pipelines using Accumulated Compression

A Maurya, B Nicolae, MM Rafique… - 2023 IEEE 30th …, 2023 - ieeexplore.ieee.org
High-Performance Computing (HPC) workloads generate large volumes of data at high
frequency during their execution, which needs to be captured concurrently at scale. These …

Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics

K Assogba, B Nicolae, H Van Dam… - … of the SC'23 Workshops of …, 2023 - dl.acm.org
High-performance computing applications are increasingly integrating checkpointing
libraries for reproducibility analytics. However, capturing an entire checkpoint history for …

[CITATION] Quantization-Based Deep Learning Checkpointing Technique

이상헌, 강동현 - Proceedings of the Korean Institute of Information Scientists and Engineers (KIISE) Conference, 2023 - dbpia.co.kr
Abstract: As deep learning models grow in scale, the cost of checkpointing is steadily increasing.
Various techniques have been proposed to address this cost, but optimization for I/O contention …