Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

Towards a Manageable Intra-Host Network

X Kong, J Lou, W Bai, NS Kim, D Zhuo - … of the 19th Workshop on Hot …, 2023 - dl.acm.org
Intra-host networks, including heterogeneous devices and interconnect fabrics, have
become increasingly complex and crucial. However, intra-host networks today do not …

ProactMP: A Proactive Multipath Transport Protocol for Low-Latency Datacenters

R Zhuang, J Han, K Xue, J Li, Q Sun… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
With the development of datacenter networks (DCNs) towards high bandwidth and low
latency, the demands of high-level datacenter applications are heading towards high …

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …