Matrix engines for high performance computing: A paragon of performance or grasping at straws?

J Domke, E Vatai, A Drozd, P Chen… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Matrix engines or units, in different forms and affinities, are becoming a reality in modern
processors, CPUs and otherwise. The current and dominant algorithmic approach to Deep …

Reducing numerical precision requirements in quantum chemistry calculations

W Dawson, K Ozaki, J Domke… - Journal of Chemical …, 2024 - ACS Publications
The abundant demand for deep learning compute resources has created a renaissance in
low-precision hardware. Going forward, it will be essential for simulation software to run on …

DGEMM using tensor cores, and its accurate and reproducible versions

D Mukunoki, K Ozaki, T Ogita, T Imamura - International Conference on …, 2020 - Springer
This paper proposes a method for implementing dense matrix multiplication on FP64
(DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA's graphics processing units …
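
As background for this entry, the underlying technique recovers FP64-accurate GEMM from lower-precision matrix engines by splitting each operand into slices with few significant bits, so that every pairwise slice product can be formed without rounding error and the results summed back in high precision. The NumPy sketch below illustrates that idea under stated assumptions: the function names, slice count, and the choice of rho (bits kept per slice) are illustrative, not the authors' implementation, and every partial product is kept in FP64 here purely so the sketch runs anywhere.

```python
import numpy as np

def ozaki_split(A, num_slices=4, rho=11):
    """Split A (FP64) into slices whose entries carry at most ~rho significant
    bits each, via the add-a-large-power-of-two rounding trick. The split is
    exact: the slices sum back to A."""
    slices = []
    R = A.copy()
    for _ in range(num_slices - 1):
        mu = np.max(np.abs(R), axis=1, keepdims=True)
        mu[mu == 0.0] = np.finfo(np.float64).tiny
        # sigma is chosen so that (R + sigma) - sigma keeps only the top ~rho bits
        sigma = 2.0 ** (np.ceil(np.log2(mu)) + 53 - rho)
        S = (R + sigma) - sigma
        slices.append(S)
        R = R - S          # exact: S reproduces the leading bits of R
    slices.append(R)       # final slice carries the remaining low-order bits
    return slices

def ozaki_gemm(A, B, num_slices=4, rho=11):
    """Rebuild C = A @ B from pairwise products of low-bit slices. On a matrix
    engine, each slice product would be dispatched to low-precision hardware;
    here every product stays in FP64 for demonstration."""
    A_slices = ozaki_split(A, num_slices, rho)
    B_slices = ozaki_split(B.T, num_slices, rho)   # split B column-wise via B.T
    C = np.zeros((A.shape[0], B.shape[1]))
    for Ai in A_slices:
        for Bj in B_slices:
            C += Ai @ Bj.T
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(np.max(np.abs(ozaki_gemm(A, B) - A @ B)))   # deviation from plain FP64 GEMM is tiny
```
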

General framework for re-assuring numerical reliability in parallel Krylov solvers: A case of bi-conjugate gradient stabilized methods

R Iakymchuk, S Graillat… - The international journal …, 2024 - journals.sagepub.com
Parallel implementations of Krylov subspace methods often help to accelerate the procedure
of finding an approximate solution of a linear system. However, such parallelization coupled …

DGEMM on integer matrix multiplication unit

H Ootomo, K Ozaki, R Yokota - The International Journal of …, 2024 - journals.sagepub.com
Deep learning hardware achieves high throughput and low power consumption by reducing
computing precision and specializing in matrix multiplication. For machine learning …

Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme

D Mukunoki, K Ozaki, T Ogita, T Imamura - Proceedings of the 50th …, 2021 - dl.acm.org
Although IEEE 754-2008 binary128 (with a 15-bit exponent and 113-bit significand, i.e.,
quadruple-precision) is not currently implemented on x86 in hardware, software emulation is …

Reproducibility strategies for parallel preconditioned conjugate gradient

R Iakymchuk, M Barreda, M Wiesenberger… - … of Computational and …, 2020 - Elsevier
The Preconditioned Conjugate Gradient method is often used in numerical
simulations. While being widely used, the solver is also known for its lack of accuracy while …

Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments

R Iakymchuk, MB Vayá, S Graillat… - … Journal of High …, 2020 - journals.sagepub.com
The Preconditioned Conjugate Gradient method is often employed for the solution of linear
systems of equations arising in numerical simulations of physical phenomena. While being …
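
For context on the two preconditioned CG entries above, the sketch below is a textbook preconditioned conjugate gradient loop in NumPy. It is only the baseline, non-reproducible formulation; the dot products and norms in the loop are the parallel reductions whose non-deterministic evaluation order causes the run-to-run variability these papers address. The function name and the Jacobi preconditioner in the usage lines are illustrative assumptions.

```python
import numpy as np

def pcg(A, b, apply_Minv, tol=1e-10, max_iter=1000):
    """Textbook preconditioned conjugate gradient. apply_Minv(v) applies the
    inverse of the preconditioner M to a vector v."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = apply_Minv(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Example: SPD system with a Jacobi (diagonal) preconditioner.
n = 200
rng = np.random.default_rng(1)
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)
x = pcg(A, b, lambda v: v / np.diag(A))
print(np.linalg.norm(A @ x - b))
```
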

Compensated summation and dot product algorithms for floating-point vectors on parallel architectures: Error bounds, implementation and application in the Krylov …

NM Evstigneev, OI Ryabkov, AN Bocharov… - … of Computational and …, 2022 - Elsevier
The aim of the paper is to improve parallel algorithms that obtain higher precision in floating
point reduction-type operations while working within the basic floating point type. The …
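
To make the idea of compensated reductions concrete, here is a short pure-Python sketch of the standard error-free transformations (Knuth's TwoSum and a Dekker/Veltkamp-style TwoProd) and the compensated dot product built from them. This is only the scalar textbook algorithm, not the parallel implementation developed in the paper, and the function names are illustrative.

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): returns s, e with a + b = s + e exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def split(a):
    """Veltkamp splitting of a double into 27-bit high and low parts."""
    c = 134217729.0 * a          # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation (Dekker): returns p, e with a * b = p + e exactly."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e

def dot2(x, y):
    """Compensated dot product: as accurate as if computed in roughly twice the
    working precision, then rounded once."""
    p, s = 0.0, 0.0
    for xi, yi in zip(x, y):
        h, r = two_prod(xi, yi)
        p, q = two_sum(p, h)
        s += q + r
    return p + s

# Ill-conditioned example: the naive dot product cancels to zero, dot2 does not.
x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print(sum(a * b for a, b in zip(x, y)), dot2(x, y))   # 0.0 vs 1.0
```
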

Conjugate gradient solvers with high accuracy and bit-wise reproducibility between CPU and GPU using Ozaki scheme

D Mukunoki, K Ozaki, T Ogita… - … Conference on High …, 2021 - dl.acm.org
In Krylov subspace methods such as the Conjugate Gradient (CG) method, the number of
iterations until convergence may increase due to the loss of computational accuracy caused …