Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

RIS Khan, AH Yazdani, Y Fu, AK Paul, B Ji… - … USENIX Conference on …, 2023 - usenix.org
Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose
new challenges for storage system design. DLT is I/O intensive since data samples need to …

iCache: An importance-sampling-informed cache for accelerating I/O-bound DNN model training

W Chen, S He, Y Xu, X Zhang, S Yang… - … Symposium on High …, 2023 - ieeexplore.ieee.org
Fetching a large amount of DNN training data from storage systems incurs long I/O latency
and fetch stalls of GPUs. Importance sampling in DNN training can reduce the amount of …
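
The snippet stops mid-sentence, but the mechanism it names is concrete: biasing batch selection toward high-importance samples means fewer samples must be fetched from storage per unit of training progress. Below is a minimal, hypothetical Python sketch of loss-based importance sampling for a data loader; the scoring scheme, function names, and update rule are illustrative assumptions, not iCache's actual design or API.

    # Hypothetical sketch: loss-based importance sampling for data loading.
    # Assumption: a sample's importance is approximated by its most recent loss.
    import numpy as np

    rng = np.random.default_rng(0)

    num_samples = 10_000           # size of the training set
    scores = np.ones(num_samples)  # importance scores, refreshed from observed losses

    def sample_batch_indices(batch_size):
        """Draw a batch biased toward high-importance samples.

        Only the drawn indices need to be fetched from storage, so skewing
        the distribution toward 'important' samples shrinks the I/O working set.
        """
        probs = scores / scores.sum()
        return rng.choice(num_samples, size=batch_size, replace=False, p=probs)

    def update_scores(indices, losses):
        """After a training step, refresh importance with the new per-sample losses."""
        scores[indices] = losses

    # Usage: fetch only `idx` from storage, run the training step, then feed
    # the resulting per-sample losses back into the sampler.
    idx = sample_batch_indices(batch_size=32)

A cache layered under such a sampler can then prioritize keeping the high-score samples resident, which is the coupling the paper's title points at.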

HVAC: Removing I/O bottleneck for large-scale deep learning applications

A Khan, AK Paul, C Zimmer, S Oral… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Scientific communities are increasingly adopting deep learning (DL) models in their
applications to accelerate scientific discovery processes. However, with rapid growth in the …

A quantitative study of deep learning training on heterogeneous supercomputers

J Han, L Xu, M Rafique, AR Butt, SH Lim - 2019 - osti.gov
Deep learning (DL) has become a key technique for solving complex problems in scientific
research and discovery. DL training for science is substantially challenging because it has to …

Scaling HPC networks with co-packaged optics

P Maniotis, L Schares, BG Lee… - Optical Fiber …, 2020 - opg.optica.org
We propose an HPC network architecture with co-packaged optics enabling 128-port
51.2-Tb/s switches. Simulations for a >34,000-GPU system show up to 11.2× throughput …

Quantifying and improving performance of distributed deep learning with cloud storage

N Krichevsky, R St Louis, T Guo - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Cloud computing provides a powerful yet low-cost environment for distributed deep learning
workloads. However, training complex deep learning models often requires accessing large …

Accelerating ML/DL applications with hierarchical caching on deduplication storage clusters

P Hamandawana, A Khan, J Kim… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Large scale machine learning (ML) and deep learning (DL) platforms face challenges when
integrated with deduplication enabled storage clusters. In the quest to achieve smart and …

Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis

P Maniotis, DM Kuchta - Journal of Optical Communications …, 2024 - ieeexplore.ieee.org
We investigate the advantages of using co-packaged optics in next-generation data center
and AI supercomputer networks. The increased escape bandwidth offered by co-packaged …