Blink: Fast and Generic Collectives for Distributed ML

G. Wang, S. Venkataraman, A. Phanishayee, N. Devanur, J. Thelin, I. Stoica
Proceedings of Machine Learning and Systems (MLSys), 2020. proceedings.mlsys.org
Abstract
Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this issue, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for hybrid, and faster, data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8× faster model synchronization (AllReduce), and reduce end-to-end DNN training time for image classification tasks by up to 40%.
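To make the spanning-tree-packing idea concrete, below is a minimal, hypothetical sketch in Python. It is not Blink's actual construction (the paper formulates tree packing as an optimization over the detected hardware topology and minimizes the number of trees); this greedy version only illustrates the core intuition: traffic for a collective can be split across several spanning trees, each carrying data at the rate of its bottleneck link, so that aggregate bandwidth exceeds what any single tree or ring achieves. The topology, link capacities, and all function names are illustrative assumptions.

```python
# Hedged sketch of spanning-tree packing on a toy GPU interconnect graph.
# NOT Blink's algorithm; a greedy stand-in to illustrate the idea only.

def max_capacity_spanning_tree(nodes, residual):
    """Prim-style tree that greedily follows the highest-residual-capacity links."""
    root = next(iter(nodes))
    in_tree = {root}
    tree_edges = []
    while len(in_tree) < len(nodes):
        best = None
        for (u, v), cap in residual.items():
            if cap <= 0:
                continue
            # Pick the fattest link crossing the current cut.
            if (u in in_tree) != (v in in_tree):
                if best is None or cap > residual[best]:
                    best = (u, v)
        if best is None:
            return None  # graph disconnected under remaining capacity
        tree_edges.append(best)
        in_tree.update(best)
    return tree_edges

def pack_spanning_trees(nodes, capacity, max_trees=8):
    """Repeatedly carve out spanning trees; each is used at its bottleneck rate."""
    residual = dict(capacity)
    trees = []
    for _ in range(max_trees):
        tree = max_capacity_spanning_tree(nodes, residual)
        if not tree:
            break
        rate = min(residual[e] for e in tree)  # bottleneck link limits this tree
        for e in tree:
            residual[e] -= rate
        trees.append((tree, rate))
    return trees

if __name__ == "__main__":
    # Toy 4-GPU topology with made-up, asymmetric link capacities (GB/s).
    gpus = {0, 1, 2, 3}
    links = {(0, 1): 50, (1, 2): 50, (2, 3): 50, (0, 3): 25, (0, 2): 25, (1, 3): 25}
    for edges, rate in pack_spanning_trees(gpus, links):
        print(f"tree {edges} carries {rate} GB/s")
```

On this toy topology the sketch yields two trees (one at 50 GB/s over the NVLink-like chain and one at 25 GB/s over the slower cross links), giving 75 GB/s of aggregate broadcast bandwidth where a single tree would be capped at 50 GB/s; Blink applies the same principle with an exact formulation and then reduces traffic across heterogeneous channels.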