Peta-scale embedded photonics architecture for distributed deep learning applications

Z Wu, LY Dai, A Novick, M Glick, Z Zhu… - Journal of Lightwave …, 2023 - ieeexplore.ieee.org
As Deep Learning (DL) models grow larger and more complex, training jobs are
increasingly distributed across multiple Computing Units (CU) such as GPUs and TPUs …

Impact of RoCE congestion control policies on distributed training of DNNs

T Khan, S Rashidi, S Sridharan… - … IEEE Symposium on …, 2022 - ieeexplore.ieee.org
Abstract-RDMA over Converged Ethernet (RoCE) has gained significant attraction for
datacenter networks due to its compatibility with conventional Ethernet-based fabric …

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

W Won, M Elavazhagan, S Srinivasan, A Durg… - arXiv preprint arXiv …, 2023 - arxiv.org
The surge of artificial intelligence, specifically large language models, has led to a rapid
advent towards the development of large-scale machine learning training clusters …