Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both of the …
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to …
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a …
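The snippet is cut off before it names the primitive, but as a rough, hypothetical illustration of the kind of collective communication primitive that data-parallel training depends on, here is a minimal ring all-reduce sketch in plain Python/NumPy; the function name, chunking, and loop structure are illustrative assumptions, not the design described in the paper.

```python
# Minimal ring all-reduce sketch (illustrative only; not the primitive
# from the truncated snippet above). Every worker ends with the
# elementwise sum of all workers' gradients while exchanging only
# chunk-sized messages with its ring neighbor.
import numpy as np

def ring_allreduce(worker_grads):
    n = len(worker_grads)
    # Split each worker's gradient vector into n chunks.
    chunks = [np.array_split(np.asarray(g, dtype=float), n)
              for g in worker_grads]

    # Reduce-scatter: after n-1 steps, worker i holds the fully
    # reduced chunk (i + 1) % n.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t - 1) % n  # chunk worker i receives and accumulates
            chunks[i][c] = chunks[i][c] + chunks[(i - 1) % n][c]

    # All-gather: circulate the reduced chunks around the ring so that
    # every worker ends up holding all of them.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n      # chunk worker i receives and overwrites
            chunks[i][c] = chunks[(i - 1) % n][c]

    return [np.concatenate(cs) for cs in chunks]

if __name__ == "__main__":
    grads = [np.arange(8.0) * (w + 1) for w in range(4)]
    reduced = ring_allreduce(grads)
    expected = sum(grads)
    assert all(np.allclose(r, expected) for r in reduced)
```

A known property of the ring algorithm is that each worker transmits roughly 2(n-1)/n times its gradient size in total, independent of worker count, which is why ring-style primitives are a common baseline that systems work in this area tries to beat.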
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are …
Distributed deep neural network training (DT) systems are widely deployed in clusters where the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes …
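To make the contention point concrete, here is a toy timeline model, with purely hypothetical numbers and function names, of two data-parallel jobs sharing one link: each job alternates a compute phase and a communication burst, and whether the bursts collide depends only on how the phases line up.

```python
# Toy model of two jobs sharing one link. Unit-length time slots; a job
# occupies the link only during its communication bursts. Hypothetical
# sketch, not the scheduling mechanism of the paper excerpted above.
def comm_slots(compute, comm, periods, offset=0):
    """Return the set of time slots in which the job uses the link."""
    slots = set()
    t = offset
    for _ in range(periods):
        t += compute                      # compute phase: link idle
        slots.update(range(t, t + comm))  # communication burst
        t += comm
    return slots

a = comm_slots(compute=3, comm=2, periods=4)
b_aligned = comm_slots(compute=3, comm=2, periods=4, offset=0)
b_staggered = comm_slots(compute=3, comm=2, periods=4, offset=2)

print(len(a & b_aligned))    # -> 8: every burst collides
print(len(a & b_staggered))  # -> 0: bursts interleave cleanly
```

Staggering one job by part of an iteration period interleaves the bursts so each job sees the full link bandwidth; exploiting this compute/communication phase structure is one intuition behind network-aware scheduling of co-located DT jobs.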
Z Wang, L Luo, Q Ning, C Zeng, W Li, X Wan… - … USENIX Symposium on …, 2023 - usenix.org
RDMA is expected to be highly scalable: to perform well in large-scale data center networks where packet losses are inevitable (i.e., high network scalability), and to support a large …
Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure …
R Gu, Y Chen, S Liu, H Dai, G Chen… - … on Parallel and …, 2021 - ieeexplore.ieee.org
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, self-driving automobiles, etc. GPUs can train DL models efficiently …
Y Zhao, Y Liu, Y Peng, Y Zhu, X Liu, X Jin - Proceedings of the ACM …, 2022 - dl.acm.org
Training a Deep Learning (DL) model requires multiple resource types, including CPUs, GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of …
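Since this snippet lists the resource types a training job cycles through, a minimal sketch of why overlapping them helps is given below: a bounded prefetch queue lets the storage/CPU stage of upcoming batches run concurrently with the compute stage of the current one. The `load_batch` and `train_step` names are hypothetical placeholders, not this paper's API.

```python
# Minimal prefetching sketch: overlap the storage/CPU stage of upcoming
# batches with the compute stage of the current one. `load_batch` and
# `train_step` are hypothetical stand-ins for real IO and GPU work.
import queue
import threading
import time

def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches while a background thread loads the next ones."""
    q = queue.Queue(maxsize=depth)   # bounds how far IO runs ahead
    done = object()                  # sentinel marking end of stream

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not done:
        yield item

if __name__ == "__main__":
    def load_batch(i):               # stand-in storage/CPU stage
        time.sleep(0.01)
        return i

    def train_step(batch):           # stand-in compute stage
        time.sleep(0.01)

    start = time.time()
    for batch in prefetching_loader(load_batch, num_batches=50):
        train_step(batch)
    # With overlap, wall time approaches max(IO, compute) per batch
    # rather than their sum.
    print(f"elapsed: {time.time() - start:.2f}s")
```

The bounded queue (`depth=2`) is the key design choice: it keeps at most two batches in flight, so memory stays bounded while both resource types stay busy concurrently.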