MLIR: Scaling compiler infrastructure for domain specific computation

C Lattner, M Amini, U Bondhugula… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
This work presents MLIR, a novel approach to building reusable and extensible compiler
infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous …

MLIR: A compiler infrastructure for the end of Moore's law

C Lattner, M Amini, U Bondhugula, A Cohen… - arXiv preprint arXiv …, 2020 - arxiv.org
This work presents MLIR, a novel approach to building reusable and extensible compiler
infrastructure. MLIR aims to address software fragmentation, improve compilation for …

Kernel operations on the GPU, with autodiff, without memory overflows

B Charlier, J Feydy, JA Glaunes, FD Collin… - Journal of Machine …, 2021 - jmlr.org
The KeOps library provides fast and memory-efficient GPU support for tensors whose
entries are given by a mathematical formula, such as kernel and distance matrices. KeOps …

GraphIt: A high-performance graph DSL

Y Zhang, M Yang, R Baghdadi, S Kamil… - Proceedings of the …, 2018 - dl.acm.org
The performance bottlenecks of graph applications depend not only on the algorithm and
the underlying hardware, but also on the size and structure of the input graph. As a result …

Exocompilation for productive programming of hardware accelerators

Y Ikarashi, GL Bernstein, A Reinking, H Genc… - Proceedings of the 43rd …, 2022 - dl.acm.org
High-performance kernel libraries are critical to exploiting accelerators and specialized
instructions in many applications. Because compilers are difficult to extend to support …

DietCode: Automatic optimization for dynamic tensor programs

B Zheng, Z Jiang, CH Yu, H Shen… - Proceedings of …, 2022 - proceedings.mlsys.org
Achieving high performance for compute-intensive operators in machine learning (ML)
workloads is a crucial but challenging task. Many ML and system practitioners rely on …

Accelerating reduction and scan using tensor core units

A Dakkak, C Li, J Xiong, I Gelado, W Hwu - Proceedings of the ACM …, 2019 - dl.acm.org
Driven by deep learning, there has been a surge of specialized processors for matrix
multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of …

Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies

B Hagedorn, J Lenfers, T Koehler, X Qin… - Proceedings of the …, 2020 - dl.acm.org
Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for
many applications. The predominantly used imperative languages, like C or OpenCL, force …

Optimizing tensor programs on flexible storage

M Schleich, A Shaikhha, D Suciu - … of the ACM on Management of Data, 2023 - dl.acm.org
Tensor programs often need to process large tensors (vectors, matrices, or higher order
tensors) that require a specialized storage format for their memory layout. Several such …

Optimizing ordered graph algorithms with GraphIt

Y Zhang, A Brahmakshatriya, X Chen… - Proceedings of the 18th …, 2020 - dl.acm.org
Many graph problems can be solved using ordered parallel graph algorithms that achieve
significant speedup over their unordered counterparts by reducing redundant work. This …