Mesh-tensorflow: Deep learning for supercomputers

N Shazeer, Y Cheng, N Parmar… - Advances in neural …, 2018 - proceedings.neurips.cc
Abstract: Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network
(DNN) training strategy, due to its universal applicability and its amenability to Single …

Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis

J de Fine Licht, G Kwasniewski, T Hoefler - Proceedings of the 2020 …, 2020 - dl.acm.org
Data movement is the dominating factor affecting performance and energy in modern
computing systems. Consequently, many algorithms have been developed to minimize the …

Algorithmic and optimization aspects of Brascamp-Lieb inequalities, via operator scaling

A Garg, L Gurvits, R Oliveira, A Wigderson - Proceedings of the 49th …, 2017 - dl.acm.org
The celebrated Brascamp-Lieb (BL) inequalities [BL76, Lie90], and their reverse form of
Barthe [Bar98], are an important mathematical tool, unifying and generalizing numerous in …

[BOOK][B] Communication-avoiding Krylov subspace methods in theory and practice

EC Carson - 2015 - search.proquest.com
Advancements in the field of high-performance scientific computing are necessary to
address the most important challenges we face in the 21st century. From physical modeling …

On the parallel I/O optimality of linear algebra kernels: Near-optimal matrix factorizations

G Kwasniewski, M Kabic, T Ben-Nun… - Proceedings of the …, 2021 - dl.acm.org
Matrix factorizations are among the most important building blocks of scientific computing.
However, state-of-the-art libraries are not communication-optimal, underutilizing current …

Parallel matrix multiplication: A systematic journey

MD Schatz, RA Van de Geijn, J Poulson - SIAM Journal on Scientific …, 2016 - SIAM
We expose a systematic approach for developing distributed-memory parallel matrix-matrix
multiplication algorithms. The journey starts with a description of how matrices are …

Communication lower bounds for matricized tensor times Khatri-Rao product

G Ballard, N Knight, K Rouse - 2018 IEEE International Parallel …, 2018 - ieeexplore.ieee.org
The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical
bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop …

Numerical algorithms for high-performance computational science

J Dongarra, L Grigori… - … Transactions of the …, 2020 - royalsocietypublishing.org
A number of features of today's high-performance computers make it challenging to exploit
these machines fully for computational science. These include increasing core counts but …

Pebbles, graphs, and a pinch of combinatorics: Towards tight I/O lower bounds for statically analyzable programs

G Kwasniewski, T Ben-Nun, L Gianinazzi… - Proceedings of the 33rd …, 2021 - dl.acm.org
Determining I/O lower bounds is a crucial step in obtaining communication-efficient parallel
algorithms, both across the memory hierarchy and between processors. Current approaches …

Write-avoiding algorithms

E Carson, J Demmel, L Grigori, N Knight… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
Communication, i.e., moving data between levels of a memory hierarchy or between
processors over a network, is much more expensive (in time or energy) than arithmetic …