Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

RIS Khan, AH Yazdani, Y Fu, AK Paul, B Ji… - … USENIX Conference on …, 2023 - usenix.org
Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose
new challenges for storage system design. DLT is I/O intensive since data samples need to …

iCache: An importance-sampling-informed cache for accelerating I/O-bound DNN model training

W Chen, S He, Y Xu, X Zhang, S Yang… - … Symposium on High …, 2023 - ieeexplore.ieee.org
Fetching a large amount of DNN training data from storage systems incurs long I/O latency
and fetch stalls of GPUs. Importance sampling in DNN training can reduce the amount of …
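
The snippet stops mid-sentence, but the mechanism it names is concrete: biasing batch selection toward high-importance samples means fewer samples must be fetched from storage per unit of training progress. Below is a minimal, hypothetical Python sketch of loss-based importance sampling for a data loader; the scoring scheme, function names, and update rule are illustrative assumptions, not iCache's actual design or API.

    # Hypothetical sketch: loss-based importance sampling for data loading.
    # Assumption: a sample's importance is approximated by its most recent loss.
    import numpy as np

    rng = np.random.default_rng(0)

    num_samples = 10_000           # size of the training set
    scores = np.ones(num_samples)  # importance scores, refreshed from observed losses

    def sample_batch_indices(batch_size):
        """Draw a batch biased toward high-importance samples.

        Only the drawn indices need to be fetched from storage, so skewing
        the distribution toward 'important' samples shrinks the I/O working set.
        """
        probs = scores / scores.sum()
        return rng.choice(num_samples, size=batch_size, replace=False, p=probs)

    def update_scores(indices, losses):
        """After a training step, refresh importance with the new per-sample losses."""
        scores[indices] = losses

    # Usage: fetch only `idx` from storage, run the training step, then feed
    # the resulting per-sample losses back into the sampler.
    idx = sample_batch_indices(batch_size=32)

A cache layered under such a sampler can then prioritize keeping the high-score samples resident, which is the coupling the paper's title points at.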

HVAC: Removing I/O bottleneck for large-scale deep learning applications

A Khan, AK Paul, C Zimmer, S Oral… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Scientific communities are increasingly adopting deep learning (DL) models in their
applications to accelerate scientific discovery processes. However, with rapid growth in the …

A quantitative study of deep learning training on heterogeneous supercomputers

J Han, L Xu, M Rafique, AR Butt, SH Lim - 2019 - osti.gov
Deep learning (DL) has become a key technique for solving complex problems in scientific
research and discovery. DL training for science is substantially challenging because it has to …

Scaling HPC networks with co-packaged optics

P Maniotis, L Schares, BG Lee… - Optical Fiber …, 2020 - opg.optica.org
We propose an HPC network architecture with co-packaged optics enabling 128-port
51.2-Tb/s switches. Simulations for a >34,000-GPU system show up to 11.2× throughput …

Quantifying and improving performance of distributed deep learning with cloud storage

N Krichevsky, R St Louis, T Guo - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Cloud computing provides a powerful yet low-cost environment for distributed deep learning
workloads. However, training complex deep learning models often requires accessing large …

Accelerating ML/DL applications with hierarchical caching on deduplication storage clusters

P Hamandawana, A Khan, J Kim… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Large scale machine learning (ML) and deep learning (DL) platforms face challenges when
integrated with deduplication enabled storage clusters. In the quest to achieve smart and …

Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis

P Maniotis, DM Kuchta - Journal of Optical Communications …, 2024 - ieeexplore.ieee.org
We investigate the advantages of using co-packaged optics in next-generation data center
and AI supercomputer networks. The increased escape bandwidth offered by co-packaged …