HPC AI500: The methodology, tools, roofline performance models, and metrics for benchmarking HPC AI systems

Z Jiang, L Wang, X Xiong, W Gao, C Luo… - arXiv preprint arXiv …, 2020 - arxiv.org
The recent years witness a trend of applying large-scale distributed deep learning in both
business and scientific computing areas, whose goal is to speed up the training time to …

MMBench: Benchmarking End-to-End Multi-modal DNNs and Understanding Their Hardware-Software Implications

C Xu, X Hou, J Liu, C Li, T Huang, X Zhu… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
The explosive growth of various types of big data and advances in AI technologies have
catalyzed a new type of workloads called multi-modal DNNs. Multi-modal DNNs are capable …

Coordination-aware assurance for end-to-end machine learning systems: the R3E approach

HL Truong - AI Assurance, 2023 - Elsevier
Concerns of robustness, reliability, resilience, and elasticity in Machine Learning (ML)
systems are important, and they must be considered in trade-off with efficiency factors …

[CITATION] R3E - An Approach to Robustness, Reliability, Resilience and Elasticity Engineering for End-to-End Machine Learning Systems

HL Truong - 2020