ROLLER: Fast and efficient tensor compilation for deep learning

H Zhu, R Wu, Y Diao, S Ke, H Li, C Zhang… - … USENIX Symposium on …, 2022 - usenix.org
Despite recent advances in tensor compilers, it often costs hours to generate an efficient
kernel for an operator, a compute-intensive sub-task in a deep neural network (DNN), on …

EdgeMoE: Fast on-device inference of MoE-based large language models

R Yi, L Guo, S Wei, A Zhou, S Wang, M Xu - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) such as GPTs and LLaMa have ushered in a revolution in
machine intelligence, owing to their exceptional capabilities in a wide range of machine …

MobiDepth: Real-time depth estimation using on-device dual cameras

J Zhang, H Yang, J Ren, D Zhang, B He, T Cao… - Proceedings of the 28th …, 2022 - dl.acm.org
Real-time depth estimation is critical for the increasingly popular augmented reality and
virtual reality applications on mobile devices. Yet existing solutions are insufficient as they …

LUT-NN: Empower efficient neural network inference with centroid learning and table lookup

X Tang, Y Wang, T Cao, LL Zhang, Q Chen… - Proceedings of the 29th …, 2023 - dl.acm.org
On-device Deep Neural Network (DNN) inference consumes significant computing
resources and development efforts. To alleviate that, we propose LUT-NN, the first system to …

Safe and Practical GPU Computation in TrustZone

H Park, FX Lin - Proceedings of the Eighteenth European Conference …, 2023 - dl.acm.org
For mobile devices, it is compelling to run sensitive GPU computation within a TrustZone
trusted execution environment (TEE). To minimize GPU software deployed in TEE, the …

Heterogeneous Parallel Acceleration for Edge Intelligence Systems: Challenges and Solutions

J Zhang, C Zhou, H Yang, D Zhang… - IEEE Consumer …, 2024 - ieeexplore.ieee.org
The rapid advancement of edge artificial intelligence (AI) can be attributed to the widespread
use of edge consumer devices and the enhancement in System-on-Chip (SoC) capabilities …

DynaSpa: Exploiting Spatial Sparsity for Efficient Dynamic DNN Inference on Devices

R Liu, Y Leng, S Tian, S Hu, CF Chen… - Proceedings of the 22nd …, 2024 - dl.acm.org
Recent advancements in exploring machine learning models' dynamic spatial sparsity have
demonstrated great potential for superior efficiency and adaptability without compromising …

Empowering In-Browser Deep Learning Inference on Edge Through Just-In-Time Kernel Optimization

F Jia, S Jiang, T Cao, W Cui, T Xia, X Cao, Y Li… - Proceedings of the …, 2024 - dl.acm.org
Web is increasingly becoming the primary platform to deliver AI services onto edge devices,
making in-browser deep learning (DL) inference more prominent. Nevertheless, the …

SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

W Niu, MMR Sanim, Z Shu, J Guan, X Shen… - Proceedings of the 29th …, 2024 - dl.acm.org
This work is motivated by recent developments in Deep Neural Networks, particularly the
Transformer architectures underlying applications such as ChatGPT, and the need for …

NLTSP: A cost model for tensor program tuning using nested loop trees

X Qin, Y Li, F Lin, W Li - Journal of Systems Architecture, 2025 - Elsevier
This paper introduces NLTSP, a deep learning-based cost model designed to optimize
tensor program performance in deep learning compilers. NLTSP, short for Nested Loop Tree …