MPress: Democratizing billion-scale model training on multi-GPU servers via memory-saving inter-operator parallelism

Q Zhou, H Wang, X Yu, C Li, Y Bai… - … Symposium on High …, 2023 - ieeexplore.ieee.org
It remains challenging to train billion-scale DNN models on a single modern multi-GPU
server due to the GPU memory wall. Unfortunately, existing memory-saving techniques such …
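For context, inter-operator parallelism places different operators (layers) of one model on different GPUs, so no single device has to hold all parameters and activations. Below is a minimal PyTorch sketch of that general idea, assuming two CUDA devices are available; it is not MPress's system, which additionally makes memory-saving placement decisions.

```python
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    """Toy two-stage model split across two GPUs (naive inter-operator parallelism)."""
    def __init__(self):
        super().__init__()
        # Each stage's parameters and activations live on its own device.
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Only this boundary activation crosses the GPU-GPU link.
        return self.stage1(x.to("cuda:1"))

model = TwoStageNet()
out = model(torch.randn(8, 1024))
out.sum().backward()  # autograd routes gradients back across both devices
```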

XEngine: Optimal tensor rematerialization for neural networks in heterogeneous environments

M Schuler, R Membarth, P Slusallek - ACM Transactions on Architecture …, 2022 - dl.acm.org
Memory efficiency is crucial in training deep learning networks on resource-restricted
devices. During backpropagation, forward tensors are used to calculate gradients. Despite …
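For context, a minimal PyTorch sketch of tensor rematerialization (activation checkpointing), assuming a recent PyTorch version: only segment-boundary tensors are kept during the forward pass, and interior activations are recomputed during backward. This shows the general technique only, not XEngine's optimal scheduling across heterogeneous devices.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Eight blocks; without checkpointing, every block's output stays
# alive until backward needs it to compute gradients.
model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                        for _ in range(8)])
x = torch.randn(32, 512, requires_grad=True)

# With two segments, only the segment boundaries are stored; interior
# activations are recomputed (rematerialized) during backward, trading
# extra FLOPs for a smaller peak memory footprint.
y = checkpoint_sequential(model, 2, x, use_reentrant=False)
y.sum().backward()
```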

Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU

J Liao, M Li, H Yang, Q Sun, B Sun… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Larger deep learning models usually lead to higher model quality, but at the cost of an ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have …
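To illustrate why input tensor dynamics matter: when input shapes vary across iterations, a fixed checkpointing plan recomputes activations even in iterations with low memory pressure. Below is a toy PyTorch sketch, not the paper's algorithm, that checkpoints a block only when the current input exceeds a hypothetical size threshold (`budget_elems` is an assumed parameter for illustration).

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class AdaptiveBlock(nn.Module):
    """Checkpoint a block only when its input is large this iteration."""
    def __init__(self, dim, budget_elems=1_000_000):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.budget_elems = budget_elems  # hypothetical threshold, not from the paper

    def forward(self, x):
        if x.requires_grad and x.numel() > self.budget_elems:
            # Large input: store only x, recompute the block in backward.
            return checkpoint(self.body, x, use_reentrant=False)
        # Small input: keep activations, avoid the recomputation cost.
        return self.body(x)

block = AdaptiveBlock(512)
for batch in (torch.randn(16, 512, requires_grad=True),     # small: no checkpoint
              torch.randn(4096, 512, requires_grad=True)):  # large: checkpointed
    block(batch).sum().backward()
```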

[BOOK][B] Compiler and Runtime Techniques for Optimizing Deep Learning Applications

SS Lyubomirsky - 2022 - search.proquest.com
As the scaling and performance demands for deep learning systems have grown, system
designers have struggled to incorporate innovations at opposite ends of the system stack …