There's plenty of room at the Top: What will drive computer performance after Moore's law?

CE Leiserson, NC Thompson, JS Emer, BC Kuszmaul… - Science, 2020 - science.org
BACKGROUND Improvements in computing power can claim a large share of the credit for
many of the things that we take for granted in our modern lives: cellphones that are more …

External memory algorithms and data structures: Dealing with massive data

JS Vitter - ACM Computing Surveys (CSUR), 2001 - dl.acm.org
Data sets in large applications are often too massive to fit completely inside the computer's
internal memory. The resulting input/output communication (or I/O) between fast internal …

FlexGen: High-throughput generative inference of large language models with a single GPU

Y Sheng, L Zheng, B Yuan, Z Li… - International …, 2023 - proceedings.mlr.press
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …

Data movement is all you need: A case study on optimizing transformers

A Ivanov, N Dryden, T Ben-Nun, S Li… - … of Machine Learning …, 2021 - proceedings.mlsys.org
Transformers are one of the most important machine learning workloads today. Training one
is a very compute-intensive task, often taking days or weeks, and significant attention has …

[BOOK][B] Applied numerical linear algebra

JW Demmel - 1997 - SIAM
This textbook covers both direct and iterative methods for the solution of linear systems, least
squares problems, eigenproblems, and the singular value decomposition. Earlier versions …

[BOOK][B] Why systolic architecture?

HT Kung - 1982 - eecs.harvard.edu
Roughly, the cycle for developing a special-purpose system can be divided into three
phases–task definition, design, and implementation. During task definition, some system …

The input/output complexity of sorting and related problems

A Aggarwal, JS Vitter - Communications of the ACM, 1988 - dl.acm.org
We provide tight upper and lower bounds, up to a constant factor, for the number of inputs
and outputs (I/Os) between internal memory and secondary storage required for five sorting …

Cache-oblivious algorithms

M Frigo, CE Leiserson, H Prokop… - … on Foundations of …, 1999 - ieeexplore.ieee.org
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT,
and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms …

[BOOK][B] Space-filling curves: an introduction with applications in scientific computing

M Bader - 2012 - books.google.com
The present book provides an introduction to using space-filling curves (SFC) as tools in
scientific computing. Special focus is laid on the representation of SFC and on resulting …

[PDF][PDF] The cache performance and optimizations of blocked algorithms

MD Lam, EE Rothberg, ME Wolf - ACM SIGOPS Operating Systems …, 1991 - dl.acm.org
Blocking is a well-known optimization technique for improving the effectiveness of memory
hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms …