On the understandability of machine learning practices in deep learning and reinforcement learning based systems

E Ntentos, SJ Warnett, U Zdun - Journal of Systems and Software, 2025 - Elsevier
Abstract: Machine learning (ML) has emerged as a transformative subject, using various
algorithms to help systems analyze data and make predictions. Deep Learning (DL) uses …

ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking

W Li, X Chen, H Shu, Y Tang, Y Wang - arXiv preprint arXiv:2406.11257, 2024 - arxiv.org
Large language models (LLM) have recently attracted significant attention in the field of
artificial intelligence. However, the training process of these models poses significant …

TranLogs: Lossless Failure Recovery Empowered by Training Logs

X Liu, L Zeng - … on Networking, Architecture and Storage (NAS), 2024 - ieeexplore.ieee.org
When running deep learning training jobs, in order to prevent training loss due to
software/hardware failures, a checkpointing mechanism is usually used to periodically store …