Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R. Gu, K. Zhang, Z. Xu, Y. Che, B. Fan, H. Hou, H. Dai, L. Yi, Y. Ding, G. Chen, Y. Huang
2022 IEEE 38th International Conference on Data Engineering (ICDE), 2022. ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models on cloud-native platforms that actively leverage containerization and orchestration technologies for high elasticity, low and flexible operation cost, and many other benefits. However, this paradigm also faces new challenges; our work focuses on those related to I/O throughput for training, including complex data access that requires intricate performance tuning, insufficient cache capacity and specialized hardware to match the high and dynamic I/O demands of training, and inefficient I/O resource sharing across different training jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a data abstraction, called Fluid Dataset, to access training data from heterogeneous sources in a unified manner, with transparent and elastic data acceleration powered by auto-tuned cache runtimes. In addition, Fluid includes an on-the-fly cache autoscaler that intelligently scales cache capacity up and down to match the online training speed of each individual DL job. To improve the overall performance of multiple DL jobs, Fluid co-orchestrates data caches and DL jobs by scheduling jobs in an appropriate order. Our experimental results show significant performance improvements for individual DL jobs running on dynamic computing resources with Fluid. In addition, when scheduling multiple DL jobs that share the same datasets, Fluid delivers around a 2x speedup when integrated with existing widely used and cutting-edge scheduling solutions. Fluid is now an open-source project hosted by the Cloud Native Computing Foundation (CNCF), with production adopters including Alibaba Cloud, Tencent Cloud, Weibo.com, and China Telecom.
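As a concrete illustration of the Fluid Dataset abstraction, the sketch below declares a Dataset backed by an Alluxio cache runtime through the Kubernetes API. It assumes the v1alpha1 custom resources published by the open-source Fluid project; the bucket path, replica count, and cache quota are hypothetical placeholders, not recommended values.

```python
# Minimal sketch: declaring a Fluid Dataset plus a cache runtime from Python.
# Assumes the Fluid CRDs (API group data.fluid.io/v1alpha1) are installed.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
api = client.CustomObjectsApi()

# A Dataset unifies heterogeneous sources (here, an S3-style bucket) behind
# one mount point that training pods consume like a local path.
dataset = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "imagenet"},
    "spec": {
        "mounts": [{
            "mountPoint": "s3://example-bucket/imagenet/",  # hypothetical source
            "name": "imagenet",
        }],
    },
}

# A cache runtime (Alluxio-backed in this sketch) supplies the elastic,
# auto-tuned cache layer behind the Dataset.
runtime = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "AlluxioRuntime",
    "metadata": {"name": "imagenet"},
    "spec": {
        "replicas": 2,
        "tieredstore": {
            "levels": [{"mediumtype": "MEM", "path": "/dev/shm", "quota": "20Gi"}],
        },
    },
}

for body in (dataset, runtime):
    api.create_namespaced_custom_object(
        group="data.fluid.io", version="v1alpha1", namespace="default",
        plural=body["kind"].lower() + "s", body=body,
    )
```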
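The cache autoscaler's core idea, matching cache capacity to each job's online training speed, can be sketched as a simple reconcile loop. This is an illustrative approximation, not Fluid's actual controller: the per-worker bandwidth, headroom factor, and both stub functions are hypothetical.

```python
import random
import time

PER_WORKER_BANDWIDTH_MBPS = 800   # assumed serving capacity of one cache worker
HEADROOM = 1.2                    # keep 20% slack above observed demand
MIN_REPLICAS, MAX_REPLICAS = 1, 16

def observed_io_demand_mbps() -> float:
    # Stub: in practice this would read the job's data-loading throughput
    # from a metrics system; here we simulate fluctuating demand.
    return random.uniform(200.0, 6000.0)

def set_cache_replicas(n: int) -> None:
    # Stub: in practice this would patch the cache runtime's replica count.
    print(f"scaling cache workers to {n}")

def reconcile_once(current: int) -> int:
    demand = observed_io_demand_mbps()
    # Ceiling-divide the headroom-adjusted demand by per-worker bandwidth,
    # then clamp to the allowed replica range.
    needed = -(-int(demand * HEADROOM) // PER_WORKER_BANDWIDTH_MBPS)
    needed = max(MIN_REPLICAS, min(MAX_REPLICAS, needed))
    if needed != current:
        set_cache_replicas(needed)
    return needed

replicas = MIN_REPLICAS
for _ in range(5):   # a real controller would loop for the job's lifetime
    replicas = reconcile_once(replicas)
    time.sleep(1)
```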
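The co-orchestration gain for jobs sharing datasets comes largely from ordering: if jobs that read the same Fluid Dataset run back to back, only the first pays the cold-cache cost. A toy sketch of such dataset-aware ordering, with hypothetical job and dataset names, is shown below; Fluid's actual scheduler integration is more involved.

```python
from collections import defaultdict

jobs = [
    {"name": "job-a", "dataset": "imagenet"},
    {"name": "job-b", "dataset": "criteo"},
    {"name": "job-c", "dataset": "imagenet"},
]

# Group jobs by the Dataset they read, preserving submission order.
by_dataset = defaultdict(list)
for job in jobs:
    by_dataset[job["dataset"]].append(job)

# Within a shared dataset, the first job warms the cache and the rest
# run at cached speed.
schedule = [job for group in by_dataset.values() for job in group]
print([j["name"] for j in schedule])  # ['job-a', 'job-c', 'job-b']
```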