Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs

X Fu, W Yang, D Dong, X Su - Proceedings of the 38th ACM International …, 2024 - dl.acm.org
Transformers reign supreme in natural language processing, representing a milestone
innovation in deep learning. For high-performance model inference, optimizing the time …

Characterizing and optimizing transformer inference on ARM many-core processor

J Jiang, J Du, D Huang, D Li, J Zheng… - Proceedings of the 51st …, 2022 - dl.acm.org
Transformer has experienced tremendous success and revolutionized the field of natural
language processing (NLP). While GPU has become the de facto standard for deep learning …

Full-stack optimizing transformer inference on ARM many-core CPU

J Jiang, J Du, D Huang, Z Chen, Y Lu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The past several years have witnessed tremendous success of transformer models in
natural language processing (NLP), and their current landscape is increasingly diverse …

Improving Transformers with Dynamically Composable Multi-Head Attention

D Xiao, Q Meng, S Li, X Yuan - arXiv preprint arXiv:2405.08553, 2024 - arxiv.org
Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads
work independently, causing problems such as low-rank bottleneck of attention score …
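For context on what "heads work independently" means here: in standard multi-head attention, each head computes scaled dot-product attention over its own slice of the model dimension, and the heads only interact through the final output projection. Below is a minimal NumPy sketch of that baseline (the weight names, shapes, and toy sizes are illustrative, not taken from the paper); the paper's "dynamically composable" variant modifies exactly this independent-heads structure.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
        # Standard MHA: each head attends independently over its own
        # d_model/num_heads slice; outputs are concatenated and projected.
        seq_len, d_model = X.shape
        d_head = d_model // num_heads
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # (seq_len, d_model) each

        def split(M):
            # Reshape to (num_heads, seq_len, d_head).
            return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

        Qh, Kh, Vh = split(Q), split(K), split(V)
        # Scaled dot-product attention per head, with no cross-head mixing.
        scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
        out = softmax(scores) @ Vh                               # (heads, seq, d_head)
        # Concatenate heads and apply the output projection.
        out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
        return out @ Wo

    # Toy usage with random weights.
    rng = np.random.default_rng(0)
    seq_len, d_model, heads = 8, 64, 4
    X = rng.standard_normal((seq_len, d_model))
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
    print(multi_head_attention(X, Wq, Wk, Wv, Wo, heads).shape)  # (8, 64)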

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

E Kabir, MA Kabir, ARJ Downey, JD Bakos… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer neural networks (TNNs) are being applied across a widening range of
application domains, including natural language processing (NLP), machine translation, and …

FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

J Shah, G Bikshandi, Y Zhang, V Thakkar… - arXiv preprint arXiv …, 2024 - arxiv.org
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for
large language models and long-context applications. FlashAttention elaborated an …
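To illustrate the bottleneck this line of work targets: naive attention materializes a full (seq_len x seq_len) score matrix, while FlashAttention-style kernels tile over key/value blocks and use an online softmax so that matrix never exists in memory. The sketch below shows only that algorithmic idea in NumPy as an assumption-laden illustration; the actual papers implement it as fused, asynchronous, low-precision GPU kernels, and the block size and shapes here are arbitrary.

    import numpy as np

    def attention_reference(Q, K, V):
        # Reference implementation: materializes the full (seq, seq) score matrix.
        d = Q.shape[-1]
        S = Q @ K.T / np.sqrt(d)
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        return (P / P.sum(axis=-1, keepdims=True)) @ V

    def attention_tiled(Q, K, V, block=32):
        # Process K/V in blocks with an online softmax, so the full
        # (seq, seq) attention matrix is never materialized.
        seq, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        m = np.full((seq, 1), -np.inf)   # running row-wise max of scores
        l = np.zeros((seq, 1))           # running softmax denominator
        acc = np.zeros_like(Q)           # running un-normalized output
        for j in range(0, K.shape[0], block):
            Kj, Vj = K[j:j + block], V[j:j + block]
            S = (Q @ Kj.T) * scale                        # (seq, block) partial scores
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            corr = np.exp(m - m_new)                      # rescale earlier partial sums
            P = np.exp(S - m_new)
            l = l * corr + P.sum(axis=-1, keepdims=True)
            acc = acc * corr + P @ Vj
            m = m_new
        return acc / l

    # Check the tiled version against the reference on random inputs.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
    print(np.allclose(attention_tiled(Q, K, V), attention_reference(Q, K, V)))  # True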

Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors

C Zhang, B Sun, X Yu, Z Xie, W Zheng… - Proceedings of the SC' …, 2023 - dl.acm.org
Transformer models have achieved remarkable success in various machine learning tasks
but suffer from high computational complexity and resource requirements. The quadratic …

Layer-wise pruning of transformer attention heads for efficient language modeling

K Shim, I Choi, W Sung, J Choi - 2021 18th International SoC …, 2021 - ieeexplore.ieee.org
Recently, the necessity of multiple attention heads in transformer architecture has been
questioned [1]. Removing less important heads from a large network is a promising strategy …

DTQAtten: Leveraging dynamic token-based quantization for efficient attention architecture

T Yang, D Li, Z Song, Y Zhao, F Liu… - … , Automation & Test …, 2022 - ieeexplore.ieee.org
Models based on the attention mechanism, i.e., transformers, have shown extraordinary
performance in Natural Language Processing (NLP) tasks. However, their memory footprint …

Software and Hardware Fusion Multi-Head Attention

W Hu, D Xu, F Liu, Z Fan - International Conference on Knowledge …, 2022 - Springer
Recently, the Transformer has achieved state-of-the-art results in several research areas such
as Natural Language Processing and Computer Vision. Because the Transformer has a very large …