RAMP: a flat nanosecond optical network and MPI operations for distributed deep learning systems

A Ottino, J Benjamin, G Zervas - Optical Switching and Networking, 2024 - Elsevier
Distributed deep learning (DDL) systems strongly depend on network performance. Current
electronic packet switched (EPS) network architectures and technologies suffer from …

Modoru: Clos nanosecond optical switching for distributed deep training

C Wang, N Yoshikane, D Elson… - Journal of Optical …, 2023 - ieeexplore.ieee.org
Distributed deep training has become a significant consumer of bandwidth across
datacenter-scale networks. The diverse parallel strategies employed in deep training require …

Fast and scalable all-optical network architecture for distributed deep learning

W Li, G Yuan, Z Wang, G Tan, P Zhang… - Journal of Optical …, 2024 - opg.optica.org
With the ever-increasing size of training models and datasets, network communication has
emerged as a major bottleneck in distributed deep learning training. To address this …

Software-defined optical networking applications enabled by programmable integrated photonics

Z Xie, D Sánchez-Jácome, L Torrijos-Morán… - Journal of Optical …, 2024 - opg.optica.org
Data center networks are experiencing unprecedented exponential growth, mostly driven by
the continuous computing demands in machine learning and artificial intelligence …

Flexible silicon photonic architecture for accelerating distributed deep learning

Z Wu, LY Dai, Y Wang, S Wang… - Journal of Optical …, 2024 - opg.optica.org
The increasing size and complexity of deep learning (DL) models have led to the wide
adoption of distributed training methods in datacenters (DCs) and high-performance …

On the feasibility of hybrid electrical/optical switch architecture for large-scale training of distributed deep learning

TT Nguyen, R Takano - 2019 IEEE/ACM Workshop on …, 2019 - ieeexplore.ieee.org
Data parallelism is the dominant method used to train deep learning (DL) models on High-
Performance Computing systems such as large-scale GPU clusters. When training a DL …

OSDL: dedicated optical slice provisioning in support of distributed deep learning

C Wang, N Yoshikane, F Balasis, T Tsuritani - Computer Networks, 2022 - Elsevier
Networks are the well-known bottlenecks for distributed deep learning (DDL) jobs. DDL
jobs require topologies matching their communication patterns as well as high and stable …

Which can accelerate distributed machine learning faster: Hybrid optical/electrical or optical reconfigurable DCN?

H Yang, Z Zhu, R Proietti… - 2022 Optical Fiber …, 2022 - ieeexplore.ieee.org
We run various distributed machine learning (DML) architectures in a hybrid
optical/electrical DCN and an optical DCN based on Hyper-FleX-LION. Experimental results …

Distributed deep learning training using silicon photonic switched architectures

Z Zhu, MY Teh, Z Wu, MS Glick, S Yan, M Hattink… - APL Photonics, 2022 - pubs.aip.org
The scaling trends of deep learning models and distributed training workloads are
challenging network capacities in today's datacenters and high-performance computing …

Peta-scale embedded photonics architecture for distributed deep learning applications

Z Wu, LY Dai, A Novick, M Glick, Z Zhu… - Journal of Lightwave …, 2023 - ieeexplore.ieee.org
As Deep Learning (DL) models grow larger and more complex, training jobs are
increasingly distributed across multiple Computing Units (CU) such as GPUs and TPUs …