Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

G Kwasniewski, M Kabić, M Besta… - Proceedings of the …, 2019 - dl.acm.org
We propose COSMA: a parallel matrix-matrix multiplication algorithm that is near
communication-optimal for all combinations of matrix dimensions, processor counts, and …

On the parallel i/o optimality of linear algebra kernels: Near-optimal matrix factorizations

G Kwasniewski, M Kabic, T Ben-Nun… - Proceedings of the …, 2021 - dl.acm.org
Matrix factorizations are among the most important building blocks of scientific computing.
However, state-of-the-art libraries are not communication-optimal, underutilizing current …

Pebbles, graphs, and a pinch of combinatorics: Towards tight I/O lower bounds for statically analyzable programs

G Kwasniewski, T Ben-Nun, L Gianinazzi… - Proceedings of the 33rd …, 2021 - dl.acm.org
Determining I/O lower bounds is a crucial step in obtaining communication-efficient parallel
algorithms, both across the memory hierarchy and between processors. Current approaches …

Reducing communication in the conjugate gradient method: a case study on high-order finite elements

M Karp, N Jansson, A Podobas, P Schlatter… - Proceedings of the …, 2022 - dl.acm.org
Currently, a major bottleneck for several scientific computations is communication, both
communication between different processors, so-called horizontal communication, and …

On characterizing the data access complexity of programs

V Elango, F Rastello, LN Pouchet… - Proceedings of the …, 2015 - dl.acm.org
Technology trends will cause data movement to account for the majority of energy
expenditure and execution time on emerging computers. Therefore, computational …

Deinsum: Practically I/O optimal multi-linear algebra

AN Ziogas, G Kwasniewski, T Ben-Nun… - … Conference for High …, 2022 - ieeexplore.ieee.org
Multilinear algebra kernel performance on modern massively-parallel systems is determined
mainly by data movement. However, deriving data movement-optimal distributed schedules …

A tight i/o lower bound for matrix multiplication

TM Smith, B Lowery, J Langou… - arXiv preprint arXiv …, 2017 - arxiv.org
A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication
on a processor with two layers of memory is established. Prior work obtained weaker lower …

On the parallel I/O optimality of linear algebra kernels: Near-optimal LU factorization

G Kwasniewski, T Ben-Nun, AN Ziogas… - Proceedings of the 26th …, 2021 - dl.acm.org
Dense linear algebra kernels are fundamental components of many scientific computing
applications. In this work we present a novel method of deriving parallel I/O lower bounds for …

On characterizing the data movement complexity of computational DAGs for parallel execution

V Elango, F Rastello, LN Pouchet… - Proceedings of the 26th …, 2014 - dl.acm.org
Technology trends are making the cost of data movement increasingly dominant, both in
terms of energy and time, over the cost of performing arithmetic operations in computer …

On the hardness of red-blue pebble games

PA Papp, R Wattenhofer - Proceedings of the 32nd ACM Symposium on …, 2020 - dl.acm.org
Red-blue pebble games model the computation cost of a two-level memory hierarchy. We
present various hardness results in different red-blue pebbling variants, with a focus on the …