Mesh-tensorflow: Deep learning for supercomputers

N Shazeer, Y Cheng, N Parmar… - Advances in neural …, 2018 - proceedings.neurips.cc
Abstract: Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network
(DNN) training strategy, due to its universal applicability and its amenability to Single …

Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis

J de Fine Licht, G Kwasniewski, T Hoefler - Proceedings of the 2020 …, 2020 - dl.acm.org
Data movement is the dominating factor affecting performance and energy in modern
computing systems. Consequently, many algorithms have been developed to minimize the …

Algorithmic and optimization aspects of Brascamp-Lieb inequalities, via operator scaling

A Garg, L Gurvits, R Oliveira, A Wigderson - Proceedings of the 49th …, 2017 - dl.acm.org
The celebrated Brascamp-Lieb (BL) inequalities [BL76, Lie90], and their reverse form of
Barthe [Bar98], are an important mathematical tool, unifying and generalizing numerous in …

[BOOK][B] Communication-avoiding Krylov subspace methods in theory and practice

EC Carson - 2015 - search.proquest.com
Advancements in the field of high-performance scientific computing are necessary to
address the most important challenges we face in the 21st century. From physical modeling …

On the parallel I/O optimality of linear algebra kernels: Near-optimal matrix factorizations

G Kwasniewski, M Kabic, T Ben-Nun… - Proceedings of the …, 2021 - dl.acm.org
Matrix factorizations are among the most important building blocks of scientific computing.
However, state-of-the-art libraries are not communication-optimal, underutilizing current …

Parallel matrix multiplication: A systematic journey

MD Schatz, RA Van de Geijn, J Poulson - SIAM Journal on Scientific …, 2016 - SIAM
We expose a systematic approach for developing distributed-memory parallel matrix-matrix
multiplication algorithms. The journey starts with a description of how matrices are …

Communication lower bounds for matricized tensor times Khatri-Rao product

G Ballard, N Knight, K Rouse - 2018 IEEE International Parallel …, 2018 - ieeexplore.ieee.org
The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical
bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop …

Numerical algorithms for high-performance computational science

J Dongarra, L Grigori… - … Transactions of the …, 2020 - royalsocietypublishing.org
A number of features of today's high-performance computers make it challenging to exploit
these machines fully for computational science. These include increasing core counts but …

Pebbles, graphs, and a pinch of combinatorics: Towards tight I/O lower bounds for statically analyzable programs

G Kwasniewski, T Ben-Nun, L Gianinazzi… - Proceedings of the 33rd …, 2021 - dl.acm.org
Determining I/O lower bounds is a crucial step in obtaining communication-efficient parallel
algorithms, both across the memory hierarchy and between processors. Current approaches …

Write-avoiding algorithms

E Carson, J Demmel, L Grigori, N Knight… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
Communication, i.e., moving data between levels of a memory hierarchy or between
processors over a network, is much more expensive (in time or energy) than arithmetic …