S Tairin, H Shen, A Iyer - 2024 IEEE International Parallel and …, 2024 - ieeexplore.ieee.org
Slower workers, known as stragglers, can significantly prolong training time in Machine
Learning (ML) clusters. We present SMS, a proactive straggler mitigation system with four …