MLIR: Scaling compiler infrastructure for domain specific computation

C Lattner, M Amini, U Bondhugula… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
This work presents MLIR, a novel approach to building reusable and extensible compiler
infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous …

MLIR: A compiler infrastructure for the end of Moore's law

C Lattner, M Amini, U Bondhugula, A Cohen… - arXiv preprint arXiv …, 2020 - arxiv.org
This work presents MLIR, a novel approach to building reusable and extensible compiler
infrastructure. MLIR aims to address software fragmentation, improve compilation for …

Kernel operations on the GPU, with autodiff, without memory overflows

B Charlier, J Feydy, JA Glaunes, FD Collin… - Journal of Machine …, 2021 - jmlr.org
The KeOps library provides fast and memory-efficient GPU support for tensors whose
entries are given by a mathematical formula, such as kernel and distance matrices. KeOps …

GraphIt: A high-performance graph DSL

Y Zhang, M Yang, R Baghdadi, S Kamil… - Proceedings of the …, 2018 - dl.acm.org
The performance bottlenecks of graph applications depend not only on the algorithm and
the underlying hardware, but also on the size and structure of the input graph. As a result …

Exocompilation for productive programming of hardware accelerators

Y Ikarashi, GL Bernstein, A Reinking, H Genc… - Proceedings of the 43rd …, 2022 - dl.acm.org
High-performance kernel libraries are critical to exploiting accelerators and specialized
instructions in many applications. Because compilers are difficult to extend to support …

DietCode: Automatic optimization for dynamic tensor programs

B Zheng, Z Jiang, CH Yu, H Shen… - Proceedings of …, 2022 - proceedings.mlsys.org
Achieving high performance for compute-intensive operators in machine learning (ML)
workloads is a crucial but challenging task. Many ML and system practitioners rely on …

Accelerating reduction and scan using tensor core units

A Dakkak, C Li, J Xiong, I Gelado, W Hwu - Proceedings of the ACM …, 2019 - dl.acm.org
Driven by deep learning, there has been a surge of specialized processors for matrix
multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of …

Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies

B Hagedorn, J Lenfers, T Koehler, X Qin… - Proceedings of the …, 2020 - dl.acm.org
Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for
many applications. The predominantly used imperative languages, like C or OpenCL, force …

Optimizing tensor programs on flexible storage

M Schleich, A Shaikhha, D Suciu - … of the ACM on Management of Data, 2023 - dl.acm.org
Tensor programs often need to process large tensors (vectors, matrices, or higher order
tensors) that require a specialized storage format for their memory layout. Several such …

Optimizing ordered graph algorithms with GraphIt

Y Zhang, A Brahmakshatriya, X Chen… - Proceedings of the 18th …, 2020 - dl.acm.org
Many graph problems can be solved using ordered parallel graph algorithms that achieve
significant speedup over their unordered counterparts by reducing redundant work. This …