Matrix engines for high performance computing: A paragon of performance or grasping at straws?

J Domke, E Vatai, A Drozd, P Chen… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Matrix engines or units, in different forms and affinities, are becoming a reality in modern
processors, CPUs and otherwise. The current and dominant algorithmic approach to Deep …

Reducing numerical precision requirements in quantum chemistry calculations

W Dawson, K Ozaki, J Domke… - Journal of Chemical …, 2024 - ACS Publications
The abundant demand for deep learning compute resources has created a renaissance in
low-precision hardware. Going forward, it will be essential for simulation software to run on …

DGEMM using tensor cores, and its accurate and reproducible versions

D Mukunoki, K Ozaki, T Ogita, T Imamura - International Conference on …, 2020 - Springer
This paper proposes a method for implementing dense matrix multiplication on FP64
(DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA's graphics processing units …
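
As background for this entry, the underlying technique recovers FP64-accurate GEMM from lower-precision matrix engines by splitting each operand into slices with few significant bits, so that every pairwise slice product can be formed without rounding error and the results summed back in high precision. The NumPy sketch below illustrates that idea under stated assumptions: the function names, slice count, and the choice of rho (bits kept per slice) are illustrative, not the authors' implementation, and every partial product is kept in FP64 here purely so the sketch runs anywhere.

```python
import numpy as np

def ozaki_split(A, num_slices=4, rho=11):
    """Split A (FP64) into slices whose entries carry at most ~rho significant
    bits each, via the add-a-large-power-of-two rounding trick. The split is
    exact: the slices sum back to A."""
    slices = []
    R = A.copy()
    for _ in range(num_slices - 1):
        mu = np.max(np.abs(R), axis=1, keepdims=True)
        mu[mu == 0.0] = np.finfo(np.float64).tiny
        # sigma is chosen so that (R + sigma) - sigma keeps only the top ~rho bits
        sigma = 2.0 ** (np.ceil(np.log2(mu)) + 53 - rho)
        S = (R + sigma) - sigma
        slices.append(S)
        R = R - S          # exact: S reproduces the leading bits of R
    slices.append(R)       # final slice carries the remaining low-order bits
    return slices

def ozaki_gemm(A, B, num_slices=4, rho=11):
    """Rebuild C = A @ B from pairwise products of low-bit slices. On a matrix
    engine, each slice product would be dispatched to low-precision hardware;
    here every product stays in FP64 for demonstration."""
    A_slices = ozaki_split(A, num_slices, rho)
    B_slices = ozaki_split(B.T, num_slices, rho)   # split B column-wise via B.T
    C = np.zeros((A.shape[0], B.shape[1]))
    for Ai in A_slices:
        for Bj in B_slices:
            C += Ai @ Bj.T
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(np.max(np.abs(ozaki_gemm(A, B) - A @ B)))   # deviation from plain FP64 GEMM is tiny
```
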

General framework for re-assuring numerical reliability in parallel Krylov solvers: A case of bi-conjugate gradient stabilized methods

R Iakymchuk, S Graillat… - The international journal …, 2024 - journals.sagepub.com
Parallel implementations of Krylov subspace methods often help to accelerate the procedure
of finding an approximate solution of a linear system. However, such parallelization coupled …

DGEMM on integer matrix multiplication unit

H Ootomo, K Ozaki, R Yokota - The International Journal of …, 2024 - journals.sagepub.com
Deep learning hardware achieves high throughput and low power consumption by reducing
computing precision and specializing in matrix multiplication. For machine learning …

Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme

D Mukunoki, K Ozaki, T Ogita, T Imamura - Proceedings of the 50th …, 2021 - dl.acm.org
Although IEEE 754-2008 binary128 (with a 15-bit exponent and 113-bit significand, i.e.,
quadruple-precision) is not currently implemented on x86 in hardware, software emulation is …

Reproducibility strategies for parallel preconditioned conjugate gradient

R Iakymchuk, M Barreda, M Wiesenberger… - … of Computational and …, 2020 - Elsevier
The Preconditioned Conjugate Gradient method is often used in numerical
simulations. While being widely used, the solver is also known for its lack of accuracy while …

Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments

R Iakymchuk, MB Vayá, S Graillat… - … Journal of High …, 2020 - journals.sagepub.com
The Preconditioned Conjugate Gradient method is often employed for the solution of linear
systems of equations arising in numerical simulations of physical phenomena. While being …
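
For context on the two preconditioned CG entries above, the sketch below is a textbook preconditioned conjugate gradient loop in NumPy. It is only the baseline, non-reproducible formulation; the dot products and norms in the loop are the parallel reductions whose non-deterministic evaluation order causes the run-to-run variability these papers address. The function name and the Jacobi preconditioner in the usage lines are illustrative assumptions.

```python
import numpy as np

def pcg(A, b, apply_Minv, tol=1e-10, max_iter=1000):
    """Textbook preconditioned conjugate gradient. apply_Minv(v) applies the
    inverse of the preconditioner M to a vector v."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = apply_Minv(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Example: SPD system with a Jacobi (diagonal) preconditioner.
n = 200
rng = np.random.default_rng(1)
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)
x = pcg(A, b, lambda v: v / np.diag(A))
print(np.linalg.norm(A @ x - b))
```
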

Compensated summation and dot product algorithms for floating-point vectors on parallel architectures: Error bounds, implementation and application in the Krylov …

NM Evstigneev, OI Ryabkov, AN Bocharov… - … of Computational and …, 2022 - Elsevier
The aim of the paper is to improve parallel algorithms that obtain higher precision in floating
point reduction-type operations while working within the basic floating point type. The …
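
To make the idea of compensated reductions concrete, here is a short pure-Python sketch of the standard error-free transformations (Knuth's TwoSum and a Dekker/Veltkamp-style TwoProd) and the compensated dot product built from them. This is only the scalar textbook algorithm, not the parallel implementation developed in the paper, and the function names are illustrative.

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): returns s, e with a + b = s + e exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def split(a):
    """Veltkamp splitting of a double into 27-bit high and low parts."""
    c = 134217729.0 * a          # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation (Dekker): returns p, e with a * b = p + e exactly."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e

def dot2(x, y):
    """Compensated dot product: as accurate as if computed in roughly twice the
    working precision, then rounded once."""
    p, s = 0.0, 0.0
    for xi, yi in zip(x, y):
        h, r = two_prod(xi, yi)
        p, q = two_sum(p, h)
        s += q + r
    return p + s

# Ill-conditioned example: the naive dot product cancels to zero, dot2 does not.
x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print(sum(a * b for a, b in zip(x, y)), dot2(x, y))   # 0.0 vs 1.0
```
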

Conjugate gradient solvers with high accuracy and bit-wise reproducibility between CPU and GPU using Ozaki scheme

D Mukunoki, K Ozaki, T Ogita… - … Conference on High …, 2021 - dl.acm.org
In Krylov subspace methods such as the Conjugate Gradient (CG) method, the number of
iterations until convergence may increase due to the loss of computational accuracy caused …