Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …

Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

Z Chen, X Zhao, C Zhi, J Yin - IEEE Transactions on Parallel …, 2023 - ieeexplore.ieee.org
Deep learning tasks (DLT) include training and inference tasks, where training DLTs have
requirements on minimizing average job completion time (JCT) and inference tasks need …

Transom: An efficient fault-tolerant system for training LLMs

B Wu, L Xia, Q Li, K Li, X Chen, Y Guo, T Xiang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) represented by ChatGPT have achieved profound
applications and breakthroughs in various fields. This demonstrates that LLMs with …

A Comprehensive Study of Deep Learning and Performance Comparison of Deep Neural Network Models (YOLO, RetinaNet).

NI Nife, M Chtourou - International Journal of Online & …, 2023 - search.ebscohost.com
This paper presents the latest advances in machine learning techniques and highlights
deep learning (DL) methods in recent studies. This technology has recently received great …

Elastic deep learning through resilient collective operations

J Li, G Bosilca, A Bouteiller, B Nicolae - … of the SC'23 Workshops of The …, 2023 - dl.acm.org
A robust solution that incorporates fault tolerance and elastic scaling capabilities for
distributed deep learning. Taking advantage of MPI resilient capabilities, a.k.a. User-Level …

Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning

J Li - 2023 - trace.tennessee.edu
In the realm of distributed computing, collective operations involve coordinated
communication and synchronization among multiple processing units, enabling efficient …

Design a border Surveillance System based on Autonomous Unmanned Aerial Vehicles (UAV)

SS Abood, KQ Hussein, MT Gaata - Al-Iraqia Journal for Scientific …, 2023 - iasj.net
After the spread of autonomous driving systems in cars, the next step is the development of
autonomous drone systems. This will have an application in many military and civil fields …

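Analysis of the causes of training interruptions during dynamic scaling of distributed training clusters and techniques for mitigating them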

임영훈, 유준열, 서의성 - 정보과학회논문지, 2023 - dbpia.co.kr
Dynamic scaling is important for managing GPU cluster resources efficiently.
Checkpoint-based stop-and-resume scaling has been widely used, but in recent frameworks the trained …