AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures

Z Zheng, X Yang, P Zhao, G Long, K Zhu… - Proceedings of the 27th …, 2022 - dl.acm.org
This work reveals that memory-intensive computation is a rising performance-critical factor in
recent machine learning models. Due to a unique set of new challenges, existing ML …

A full-stack search technique for domain optimized deep learning accelerators

D Zhang, S Huda, E Songhori, K Prabhu, Q Le… - Proceedings of the 27th …, 2022 - dl.acm.org
The rapidly-changing deep learning landscape presents a unique opportunity for building
inference accelerators optimized for specific datacenter-scale workloads. We propose Full …

Bolt: Bridging the gap between auto-tuners and hardware-native performance

J Xing, L Wang, S Zhang, J Chen… - … of Machine Learning …, 2022 - proceedings.mlsys.org
Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating
a large search space to identify effective implementations, but they do so with opaque …

Fusionstitching: boosting memory intensive computations for deep learning workloads

Z Zheng, P Zhao, G Long, F Zhu, K Zhu, W Zhao… - arXiv preprint arXiv …, 2020 - arxiv.org
We show in this work that memory intensive computations can result in severe performance
problems due to off-chip memory access and CPU-GPU context switch overheads in a wide …

Optimizing DNN compilation for distributed training with joint OP and tensor fusion

X Yi, S Zhang, L Diao, C Wu, Z Zheng… - … on Parallel and …, 2022 - ieeexplore.ieee.org
This article proposes DisCo, an automatic deep learning compilation module for data-
parallel distributed training. Unlike most deep learning compilers that focus on training or …

Collage: Seamless integration of deep learning backends with automatic placement

B Jeon, S Park, P Liao, S Xu, T Chen, Z Jia - Proceedings of the …, 2022 - dl.acm.org
The strong demand for efficient and performant deployment of Deep Learning (DL)
applications prompts the rapid development of a rich DL ecosystem. To keep up with this fast …

A Literature Review on Combining Neural Architecture Search and Compiler Optimizations for Neural Network Acceleration

I Bachiri, R Baghdadi, PS Niar, H Ouarnoughi, AA ESI - researchgate.net
Designing efficient deep learning architectures is a challenging task that requires balancing
performance and hardware efficiency. Neural Architecture Search (NAS) has emerged as a …

Hardware Aware Neural Architecture Search with Automatic Code Optimization in the MLIR Compiler

I Bachiri, R Baghdadi, PS Niar, H Ouarnoughi, AA ESI - researchgate.net
Deep learning has achieved remarkable success across various domains, leading to the
development of increasingly complex and resource-intensive models. For that, these models …

Accelerating a Deep Learning Framework with Tiramisu

H Benmeziane - 2020 - researchgate.net
Today, machine learning offers a variety of services in industry, including research,
translation, recommendation systems and security. Deep learning in particular has led to …

Learning Local Advantage Functions for Generalizable Graph Optimizations

Y Wu, Y Zhou, PM Phothilimthana, H Liu, S Roy… - cs.cmu.edu
Abstract Machine learning compilers rely on making optimized decisions in order to
generate efficient code for a given computation graph. Many of these decision making …