Performance, design, and autotuning of batched GEMM for GPUs

A Abdelfattah, A Haidar, S Tomov… - … Conference, ISC High …, 2016 - Springer
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …
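
The batched operation these kernels target is simply standard GEMM applied independently to every member of a batch. The plain-C loop below is a minimal reference sketch of those semantics for equally sized, column-major matrices; the function name and pointer-array layout are illustrative assumptions, not the authors' GPU kernel or any library's API.

```c
/* Reference (CPU) semantics of a fixed-size batched DGEMM:
 * for each b in [0, batch_count):  C[b] = alpha*A[b]*B[b] + beta*C[b]
 * All matrices are column-major; A[b] is m-by-k, B[b] is k-by-n, C[b] is m-by-n. */
static void dgemm_batched_ref(int m, int n, int k, double alpha,
                              const double *const *A, int lda,
                              const double *const *B, int ldb,
                              double beta,
                              double *const *C, int ldc,
                              int batch_count)
{
    for (int b = 0; b < batch_count; ++b)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                double acc = 0.0;
                for (int p = 0; p < k; ++p)
                    acc += A[b][i + p * lda] * B[b][p + j * ldb];
                C[b][i + j * ldc] = alpha * acc + beta * C[b][i + j * ldc];
            }
}
```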

RETRACTED: Batched matrix computations on hardware accelerators based on GPUs

A Haidar, T Dong, P Luszczek… - … Journal of High …, 2015 - journals.sagepub.com
Scientific applications require solvers that work on many small-size problems that are
independent of each other. At the same time, the high-end hardware evolves rapidly and …

Parallel programming models for dense linear algebra on heterogeneous systems

J Dongarra, M Abalenkovs, A Abdelfattah… - Supercomputing …, 2015 - superfri.susu.ru
We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …

A guide for achieving high performance with very small matrices on GPU: a case study of batched LU and Cholesky factorizations

A Haidar, A Abdelfattah, M Zounon… - … on Parallel and …, 2017 - ieeexplore.ieee.org
We present a high-performance GPU kernel with a substantial speedup over vendor
libraries for very small matrix computations. In addition, we discuss most of the challenges …
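
For context, the per-problem work in such a batch is a standard unblocked factorization applied to a matrix small enough to fit in registers or shared memory. The sketch below shows the Cholesky case as a plain-C CPU reference only (hypothetical names; the paper's contribution is the GPU kernel design, not this loop). On a GPU, each problem would typically be mapped to its own thread block.

```c
#include <math.h>

/* Unblocked lower Cholesky of one small n-by-n SPD matrix (column-major),
 * overwriting the lower triangle with L such that A = L*L^T.
 * Returns 0 on success, j+1 if the leading minor of order j+1 is not positive. */
static int cholesky_unblocked(int n, double *A, int lda)
{
    for (int j = 0; j < n; ++j) {
        double d = A[j + j * lda];
        for (int p = 0; p < j; ++p)
            d -= A[j + p * lda] * A[j + p * lda];
        if (d <= 0.0)
            return j + 1;
        d = sqrt(d);
        A[j + j * lda] = d;
        for (int i = j + 1; i < n; ++i) {
            double s = A[i + j * lda];
            for (int p = 0; p < j; ++p)
                s -= A[i + p * lda] * A[j + p * lda];
            A[i + j * lda] = s / d;
        }
    }
    return 0;
}

/* Batched driver: factor each matrix independently. */
static void cholesky_batched_ref(int n, double *const *A, int lda,
                                 int *info, int batch_count)
{
    for (int b = 0; b < batch_count; ++b)
        info[b] = cholesky_unblocked(n, A[b], lda);
}
```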

A proposed API for batched basic linear algebra subprograms

J Dongarra, I Duff, M Gates, A Haidar, S Hammarling… - 2016 - drive.google.com
This paper proposes an API for Batched Basic Linear Algebra Subprograms (Batched
BLAS). We focus on many independent BLAS operations on small matrices that are grouped …
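
The central idea indicated by the snippet is to expose many small, independent BLAS calls through a single interface in which problems sharing the same parameters are grouped. The exact interface is specified in the paper; the code below is only a hypothetical sketch of such a grouped DGEMM (column-major, no transpose options), with every name and argument assumed for illustration.

```c
/* Hypothetical grouped batched DGEMM sketch -- not the paper's API text.
 * Problems are partitioned into group_count groups; all problems in group g
 * share m[g], n[g], k[g], alpha[g], beta[g], and leading dimensions, while the
 * pointer arrays A, B, C list every matrix across all groups consecutively. */
static void dgemm_batch_grouped_ref(
    const int *m, const int *n, const int *k,
    const double *alpha,
    const double *const *A, const int *lda,
    const double *const *B, const int *ldb,
    const double *beta,
    double *const *C, const int *ldc,
    int group_count, const int *group_size)
{
    int idx = 0;                         /* running index into A/B/C */
    for (int g = 0; g < group_count; ++g)
        for (int s = 0; s < group_size[g]; ++s, ++idx)
            for (int j = 0; j < n[g]; ++j)
                for (int i = 0; i < m[g]; ++i) {
                    double acc = 0.0;
                    for (int p = 0; p < k[g]; ++p)
                        acc += A[idx][i + p * lda[g]] * B[idx][p + j * ldb[g]];
                    C[idx][i + j * ldc[g]] = alpha[g] * acc
                                           + beta[g] * C[idx][i + j * ldc[g]];
                }
}
```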

A process model to support continuous certification of cloud services

I Kunz, P Stephanow - 2017 IEEE 31st International …, 2017 - ieeexplore.ieee.org
Current research on cloud service certification is working on techniques to continuously, i.e.,
automatically and repeatedly, assess whether cloud services satisfy certification criteria …

Optimization for performance and energy for batched matrix computations on GPUs

A Haidar, T Dong, P Luszczek, S Tomov… - Proceedings of the 8th …, 2015 - dl.acm.org
As modern hardware keeps evolving, an increasingly effective approach to developing
energy-efficient and high-performance solvers is to design them to work on many small size …

On the development of variable size batched computation for heterogeneous parallel architectures

A Abdelfattah, A Haidar, S Tomov… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Many scientific applications, ranging from national security to medical advances, require
solving a number of relatively small-size independent problems. As the size of each …
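
Variable-size ("vbatched") routines drop the assumption that all problems in a batch share one set of dimensions: every problem carries its own sizes and leading dimension. As a minimal, hedged illustration (hypothetical names, and using a matrix-vector product for brevity rather than the paper's full set of routines), the reference semantics look like the following.

```c
/* CPU reference sketch of a variable-size batched DGEMV:
 * y[b] = alpha*A[b]*x[b] + beta*y[b], where A[b] is m[b]-by-n[b], column-major,
 * and each problem b has its own dimensions m[b], n[b] and leading dimension lda[b]. */
static void dgemv_vbatched_ref(const int *m, const int *n, double alpha,
                               const double *const *A, const int *lda,
                               const double *const *x, double beta,
                               double *const *y, int batch_count)
{
    for (int b = 0; b < batch_count; ++b)
        for (int i = 0; i < m[b]; ++i) {
            double acc = 0.0;
            for (int j = 0; j < n[b]; ++j)
                acc += A[b][i + j * lda[b]] * x[b][j];
            y[b][i] = alpha * acc + beta * y[b][i];
        }
}
```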

Optimizing the SVD bidiagonalization process for a batch of small matrices

T Dong, A Haidar, S Tomov, J Dongarra - Procedia Computer Science, 2017 - Elsevier
A challenging class of problems arising in many GPU applications, called batched problems,
involves linear algebra operations on many small-sized matrices. We designed batched …

On the design, development, and analysis of optimized matrix-vector multiplication routines for coprocessors

K Kabir, A Haidar, S Tomov, J Dongarra - High Performance Computing …, 2015 - Springer
The manycore paradigm shift, and the resulting change in modern computer architectures,
has made the development of optimal numerical routines extremely challenging. In this …