Optimizing throughput of Seq2Seq model training on the IPU platform for AI-accelerated CFD simulations

P Rościszewski, A Krzywaniak, S Iserte, K Rojek… - Future Generation …, 2023 - Elsevier
Future Generation Computer Systems, 2023, Elsevier
Abstract
Intelligence Processing Units (IPUs) have proven useful for many AI applications. In this paper, we evaluate them within the emerging field of AI for simulation, where traditional numerical simulations are supported by artificial intelligence approaches. We focus specifically on a program for training machine learning models supporting a computational fluid dynamics application. We use the custom TensorFlow build provided by the Poplar Software Development Kit to adapt the program for the IPU-POD16 platform and investigate its ease of use and performance scalability. Training a model on data from OpenFOAM simulations yields accurate simulation state predictions at test time. We describe how to optimize multi-threading runtime options and utilize the popdist library to overcome a performance bottleneck in feeding training data to the IPU on the host side. Due to communication overheads, using data parallelism to utilize two IPUs instead of one does not improve the throughput. However, once the intra-IPU costs have been paid, the hardware capabilities for inter-IPU communication allow for good scalability. Increasing the number of IPUs from two to 16 improves the throughput from 560.8 to 2805.8 samples/s. Additionally, the experimental results show that reducing the precision of input data storage from FP32 to FP16 improves training throughput by 12%, while tuning selected runtime variables improves it by up to 6.3%.
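To put the reported scaling figures in perspective, the following minimal Python sketch derives the speedup and parallel efficiency implied by the two throughput values quoted in the abstract. The helper name `scaling_summary` is hypothetical, and using the two-IPU configuration as the baseline is an assumption made here for illustration, not a metric stated by the authors.

```python
def scaling_summary(base_ipus, base_tput, scaled_ipus, scaled_tput):
    """Return (speedup, parallel efficiency) relative to a baseline run.

    Assumption: ideal scaling is linear in the number of IPUs, so the
    ideal speedup is scaled_ipus / base_ipus.
    """
    speedup = scaled_tput / base_tput
    ideal_speedup = scaled_ipus / base_ipus
    return speedup, speedup / ideal_speedup


# Throughput values (samples/s) are the ones quoted in the abstract.
speedup, efficiency = scaling_summary(2, 560.8, 16, 2805.8)
print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.1%}")
```

Under these assumptions, the eightfold increase in IPU count corresponds to roughly a fivefold increase in throughput.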