Squeezeformer: An efficient transformer for automatic speech recognition

S Kim, A Gholami, A Shaw, N Lee… - Advances in …, 2022 - proceedings.neurips.cc
The recently proposed Conformer model has become the de facto backbone model for
various downstream speech tasks owing to its hybrid attention-convolution architecture that …

Full stack optimization of transformer inference: a survey

S Kim, C Hooper, T Wattanawong, M Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …

A 95.6-TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm

B Keller, R Venkatesan, S Dai, SG Tell… - IEEE Journal of Solid …, 2023 - ieeexplore.ieee.org
The energy efficiency of deep neural network (DNN) inference can be improved with custom
accelerators. DNN inference accelerators often employ specialized hardware techniques to …
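
The per-vector scaling named in the title is easy to sketch in software: instead of one quantization scale per tensor or per channel, each short vector of elements gets its own scale, which tightens the dynamic range that 4 bits must cover. Below is a minimal NumPy sketch of symmetric per-vector 4-bit quantization; the vector length of 16 and the single-level scales are illustrative assumptions, not the exact scheme of this accelerator.

```python
import numpy as np

def quantize_per_vector(w, vec_len=16, bits=4):
    # One scale per vec_len-element vector instead of per tensor or channel.
    # Illustrative sketch: single-level float scales, symmetric range.
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    flat = w.reshape(-1, vec_len)                   # assumes size % vec_len == 0
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0.0, 1.0, scales)   # guard all-zero vectors
    q = np.clip(np.round(flat / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(64, 64).astype(np.float32)
q, scales = quantize_per_vector(w)
print("mean abs error:", np.abs(w - dequantize(q, scales, w.shape)).mean())
```

Because each scale only has to cover 16 values, outliers in one vector no longer inflate the quantization step for the rest of the tensor.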

Approximate computing and the efficient machine learning expedition

J Henkel, H Li, A Raghunathan, MB Tahoori… - Proceedings of the 41st …, 2022 - dl.acm.org
Approximate computing (AxC) has long been accepted as a design alternative for efficient
system implementation at the cost of relaxed accuracy requirements. Despite the AxC …

A Mixed-Precision Transformer Accelerator With Vector Tiling Systolic Array for License Plate Recognition in Unconstrained Scenarios

J Li, D Yan, F He, Z Dong… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Power efficiency for license plate recognition (LPR) under unconstrained scenarios is a
crucial factor in many edge-based real-world applications, e.g., autonomous vehicles whose …

SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference

W Wang, S Zhou, W Sun, P Sun… - 2023 IEEE/ACM …, 2023 - ieeexplore.ieee.org
Transformers have shown remarkable performance in both natural language processing
(NLP) and computer vision (CV) tasks. However, their real-time inference speed and …
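
For context, these are the two operations the title refers to, written as the standard floating-point reference in NumPy; the exponentials, divisions, and reciprocal square roots in them are what make naive implementations costly in hardware. This is the textbook formulation, not SOLE's hardware-friendly variant.

```python
import numpy as np

def softmax(x, axis=-1):
    # Max subtraction keeps exp() in range; the row-wise exponentials
    # and the division are the expensive parts in hardware.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layernorm(x, gamma, beta, eps=1e-5):
    # Two reductions (mean, variance) plus a reciprocal square root per row.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

scores = np.random.randn(4, 8)
print(softmax(scores).sum(axis=-1))   # each row sums to 1
```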

Enabling and accelerating dynamic vision transformer inference for real-time applications

K Sreedhar, J Clemons, R Venkatesan… - arXiv preprint arXiv …, 2022 - arxiv.org
Many state-of-the-art deep learning models for computer vision tasks are based on the
transformer architecture. Such models can be computationally expensive and are typically …

Genetic quantization-aware approximation for non-linear operations in Transformers

P Dong, Y Tan, D Zhang, T Ni, X Liu, Y Liu… - Proceedings of the 61st …, 2024 - dl.acm.org
Non-linear functions are prevalent in Transformers and their lightweight variants, incurring
substantial and frequently underestimated hardware costs. Previous state-of-the-art works …
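
A common baseline for this line of work is fitting a low-degree polynomial to a non-linear activation over a bounded input range, then evaluating only multiplies and adds at inference time. The sketch below fits GELU with ordinary least squares via np.polyfit; the degree, range, and fitting method are illustrative stand-ins for the genetic, quantization-aware search the paper describes.

```python
import numpy as np
from math import erf

def gelu(x):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * x * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

xs = np.linspace(-4.0, 4.0, 1024)
coeffs = np.polyfit(xs, gelu(xs), deg=4)    # least squares, not genetic search
poly = np.poly1d(coeffs)
print("max abs error on [-4, 4]:", np.abs(poly(xs) - gelu(xs)).max())
```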

ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers

C Chen, L Li, MMS Aly - 2024 Design, Automation & Test in …, 2024 - ieeexplore.ieee.org
Transformer-based DNNs have dominated several AI fields with remarkable performance.
However, scaling Transformer models up to trillions of parameters and computation …

Auto-LUT: Auto Approximation of Non-Linear Operations for Neural Networks on FPGA

H Lu, Q Mei, K Wang - 2023 IEEE International Symposium on …, 2023 - ieeexplore.ieee.org
The approximation of non-linear operations can simplify the logic design and save system
resources during neural network inference on a Field-Programmable Gate Array (FPGA) …
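
The general pattern behind LUT approximation is easy to demonstrate: sample the function into a table once, then replace every runtime evaluation with a lookup plus linear interpolation. The sketch below does this for a sigmoid with a 256-entry table; the function, range, and table size are illustrative assumptions, not Auto-LUT's automated selection.

```python
import numpy as np

def build_lut(fn, lo, hi, entries=256):
    # Sample the function once into a uniform table over [lo, hi].
    xs = np.linspace(lo, hi, entries)
    return xs, fn(xs)

def lut_eval(x, xs, ys):
    # Lookup with piecewise-linear interpolation; np.interp clamps
    # out-of-range inputs to the endpoint values.
    return np.interp(x, xs, ys)

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
xs, ys = build_lut(sigmoid, -8.0, 8.0)
test = np.random.uniform(-10.0, 10.0, 10000)
print("max abs error:", np.abs(lut_eval(test, xs, ys) - sigmoid(test)).max())
```

On an FPGA the table lives in block RAM and the interpolation is one multiply-add, which is the resource saving the abstract alludes to.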