BLIS: A framework for rapidly instantiating BLAS functionality

FG Van Zee, RA Van de Geijn - ACM Transactions on Mathematical …, 2015 - dl.acm.org
The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for
rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental …

DAGuE: A generic distributed DAG engine for high performance computing

G Bosilca, A Bouteiller, A Danalis, T Herault… - Parallel Computing, 2012 - Elsevier
The frenetic development of the current architectures places a strain on the current
state-of-the-art programming environments. Harnessing the full potential of such architectures is a …

A class of parallel tiled linear algebra algorithms for multicore architectures

A Buttari, J Langou, J Kurzak, J Dongarra - Parallel Computing, 2009 - Elsevier
As multicore systems continue to gain ground in the high performance computing world,
linear algebra algorithms have to be reformulated or new algorithms have to be developed …

Elemental: A new framework for distributed memory dense matrix computations

J Poulson, B Marker, RA Van de Geijn… - ACM Transactions on …, 2013 - dl.acm.org
Parallelizing dense matrix computations to distributed memory architectures is a
well-studied subject and generally considered to be among the best understood domains of …

Hierarchical task-based programming with StarSs

J Planas, RM Badia, E Ayguadé… - … International Journal of …, 2009 - journals.sagepub.com
Programming models for multicore and many-core systems are listed as one of the main
challenges in the near future for computing research. These programming models should be …

Parallel tiled QR factorization for multicore architectures

A Buttari, J Langou, J Kurzak… - … Practice and Experience, 2008 - Wiley Online Library
As multicore systems continue to gain ground in the high‐performance computing world,
linear algebra algorithms have to be reformulated or new algorithms have to be developed …

Programming matrix algorithms-by-blocks for thread-level parallelism

G Quintana-Ortí, ES Quintana-Ortí… - ACM Transactions on …, 2009 - dl.acm.org
With the emergence of thread-level parallelism as the primary means for continued
performance improvement, the programmability issue has reemerged as an obstacle to the …

Provably good multicore cache performance for divide-and-conquer algorithms

GE Blelloch, RA Chowdhury, PB Gibbons… - Proceedings of the …, 2008 - cs.cmu.edu
This paper presents a multicore-cache model that reflects the reality that multicore
processors have both per-processor private (L1) caches and a large shared (L2) cache on …

Extreme-scale task-based Cholesky factorization toward climate and weather prediction applications

Q Cao, Y Pei, K Akbudak, A Mikhalev… - Proceedings of the …, 2020 - dl.acm.org
Climate and weather can be predicted statistically via geospatial Maximum Likelihood
Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE …

SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

E Chan, FG Van Zee, P Bientinesi… - Proceedings of the 13th …, 2008 - dl.acm.org
This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for
SMP and/or multi-core architectures. We use this system to demonstrate how code …