Reproducible BLAS routines with tunable accuracy using Ozaki scheme for many-core architectures

D. Mukunoki, T. Ogita, K. Ozaki. Parallel Processing and Applied Mathematics: 13th International Conference, Bialystok, Poland, September 8–11, 2019. Springer, 2020.
Abstract
Floating-point computations generally involve rounding errors; the results may be inaccurate and may differ between runs or platforms (i.e., they are non-reproducible). Heterogeneous computing, in particular, introduces many factors that affect reproducibility. The loss of accuracy and reproducibility can become a crucial issue when debugging complex codes and assessing the reliability of computations. In this paper, we propose high-performance implementations of reproducible basic linear algebra subprograms (BLAS) routines with tunable accuracy for many-core architectures. Our approach is based on an accurate matrix-multiplication method, the Ozaki scheme, which can be built on top of level-3 BLAS routines that perform standard floating-point operations. We demonstrate the performance of three routines: inner product (DOT), matrix-vector multiplication (GEMV), and matrix multiplication (GEMM) on NVIDIA's Volta GPU, comparing them with the standard routines provided by the vendor. Furthermore, we demonstrate the reproducibility of the results between CPU and GPU, as well as their accuracy.
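The abstract only names the Ozaki scheme without describing it. The following is a minimal sketch of the general idea it refers to, shown for a dot product in Python/NumPy rather than for GPU GEMM. The splitting constants, the exhaustive slice-pair products, and the final summation (math.fsum here) are simplifying assumptions for illustration; they are not the authors' implementation, which batches the slice products into vendor level-3 BLAS calls.

# Sketch of an Ozaki-style error-free splitting applied to a dot product.
# Simplified for illustration; not the paper's GPU implementation.
import math
import numpy as np

def split(x, n_terms):
    # Each slice keeps at most `beta` leading significand bits per element,
    # so a length-n dot product of two such slices accumulates with no
    # rounding error in double precision.
    t = 53                                            # double-precision significand bits
    beta = (t - math.ceil(math.log2(n_terms))) // 2   # bits kept per slice
    slices = []
    r = np.asarray(x, dtype=np.float64).copy()
    while np.any(r != 0.0):
        _, e = np.frexp(np.max(np.abs(r)))            # max |r| < 2**e
        sigma = 2.0 ** (int(e) + t - beta)            # extraction constant
        hi = (r + sigma) - sigma                      # top ~beta bits of each element
        slices.append(hi)
        r = r - hi                                    # remainder, computed exactly
    return slices

def ozaki_dot(x, y):
    # Every slice-pair product below is exact; in the real scheme these
    # partial products are mapped onto standard level-3 BLAS (GEMM) calls.
    n = len(x)
    xs, ys = split(x, n), split(y, n)
    partials = [float(np.dot(xp, yq)) for xp in xs for yq in ys]
    # Final accumulation: an exact summation here; accuracy can be tuned by
    # limiting how many slices are kept.
    return math.fsum(partials)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000) * 10.0 ** rng.integers(0, 10, 1000)
y = rng.standard_normal(1000)
print(ozaki_dot(x, y), np.dot(x, y))

Because each slice-pair product is computed without rounding error, its value does not depend on how the underlying BLAS parallelizes the call, and a fixed final accumulation order then yields bitwise-identical results across platforms; dropping low-order slices trades accuracy for speed, which is roughly the sense in which the accuracy is tunable.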