AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures

Z Zheng, X Yang, P Zhao, G Long, K Zhu… - Proceedings of the 27th …, 2022 - dl.acm.org
This work reveals that memory-intensive computation is a rising performance-critical factor in
recent machine learning models. Due to a unique set of new challenges, existing ML …

A full-stack search technique for domain optimized deep learning accelerators

D Zhang, S Huda, E Songhori, K Prabhu, Q Le… - Proceedings of the 27th …, 2022 - dl.acm.org
The rapidly-changing deep learning landscape presents a unique opportunity for building
inference accelerators optimized for specific datacenter-scale workloads. We propose Full …

Bolt: Bridging the gap between auto-tuners and hardware-native performance

J Xing, L Wang, S Zhang, J Chen… - … of Machine Learning …, 2022 - proceedings.mlsys.org
Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating
a large search space to identify effective implementations, but they do so with opaque …

Fusionstitching: boosting memory intensive computations for deep learning workloads

Z Zheng, P Zhao, G Long, F Zhu, K Zhu, W Zhao… - arXiv preprint arXiv …, 2020 - arxiv.org
We show in this work that memory intensive computations can result in severe performance
problems due to off-chip memory access and CPU-GPU context switch overheads in a wide …

Optimizing DNN compilation for distributed training with joint OP and tensor fusion

X Yi, S Zhang, L Diao, C Wu, Z Zheng… - … on Parallel and …, 2022 - ieeexplore.ieee.org
This article proposes DisCo, an automatic deep learning compilation module for data-
parallel distributed training. Unlike most deep learning compilers that focus on training or …

Collage: Seamless integration of deep learning backends with automatic placement

B Jeon, S Park, P Liao, S Xu, T Chen, Z Jia - Proceedings of the …, 2022 - dl.acm.org
The strong demand for efficient and performant deployment of Deep Learning (DL)
applications prompts the rapid development of a rich DL ecosystem. To keep up with this fast …

A Literature Review on Combining Neural Architecture Search and Compiler Optimizations for Neural Network Acceleration

I Bachiri, R Baghdadi, PS Niar, H Ouarnoughi, AA ESI - researchgate.net
Designing efficient deep learning architectures is a challenging task that requires balancing
performance and hardware efficiency. Neural Architecture Search (NAS) has emerged as a …

Hardware Aware Neural Architecture Search with Automatic Code Optimization in the MLIR Compiler

I Bachiri, R Baghdadi, PS Niar, H Ouarnoughi, AA ESI - researchgate.net
Deep learning has achieved remarkable success across various domains, leading to the
development of increasingly complex and resource-intensive models. For that, these models …

Accelerating a Deep Learning Framework with Tiramisu

H Benmeziane - 2020 - researchgate.net
Today, machine learning offers a variety of services in industry, including research,
translation, recommendation systems and security. Deep learning in particular has led to …

Learning Local Advantage Functions for Generalizable Graph Optimizations

Y Wu, Y Zhou, PM Phothilimthana, H Liu, S Roy… - cs.cmu.edu
Abstract Machine learning compilers rely on making optimized decisions in order to
generate efficient code for a given computation graph. Many of these decision making …