ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Fault and self-repair for high reliability in die-to-die interconnection of 2.5D/3D IC

R Song, J Zhang, Z Zhu, G Shan, Y Yang - Microelectronics Reliability, 2024 - Elsevier
Bringing dies closer by die-to-die interconnection is a way that reduces latency and energy
per bit transmitted, while increasing bandwidth per mm of chip. Heterogeneous integration …

Heterogeneous Die-to-Die Interfaces: Enabling More Flexible Chiplet Interconnection Systems

Y Feng, D Xiang, K Ma - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
The chiplet architecture is one of the emerging methodologies and is believed to be scalable
and economical. However, most current multi-chiplet systems are based on one uniform die …

MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

S Hsia, A Golden, B Acun, N Ardalani… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Training and deploying large-scale machine learning models is time-consuming, requires
significant distributed computing infrastructures, and incurs high operational costs. Our …

Leveraging Memory Expansion to Accelerate Large-Scale DL Training

D Kadiyala, S Rashidi, T Heo… - … Analysis of Systems …, 2024 - ieeexplore.ieee.org
Modern Deep Learning (DL) models require massive clusters of specialized, high-end
nodes to train. Designing such clusters to maximize both performance and utilization is a …