Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs

X Fu, W Yang, D Dong, X Su - Proceedings of the 38th ACM International …, 2024 - dl.acm.org
Transformers reign supreme in natural language processing, representing a milestone
innovation in deep learning. For high-performance model inference, optimizing the time …

Characterizing and optimizing transformer inference on ARM many-core processor

J Jiang, J Du, D Huang, D Li, J Zheng… - Proceedings of the 51st …, 2022 - dl.acm.org
Transformer has experienced tremendous success and revolutionized the field of natural
language processing (NLP). While GPU has become the de facto standard for deep learning …

Full-stack optimizing transformer inference on ARM many-core CPU

J Jiang, J Du, D Huang, Z Chen, Y Lu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The past several years have witnessed tremendous success of transformer models in
natural language processing (NLP), and their current landscape is increasingly diverse …

Improving Transformers with Dynamically Composable Multi-Head Attention

D Xiao, Q Meng, S Li, X Yuan - arXiv preprint arXiv:2405.08553, 2024 - arxiv.org
Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads
work independently, causing problems such as low-rank bottleneck of attention score …
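For context on what "heads work independently" means here: in standard multi-head attention, each head computes scaled dot-product attention over its own slice of the model dimension, and the heads only interact through the final output projection. Below is a minimal NumPy sketch of that baseline (the weight names, shapes, and toy sizes are illustrative, not taken from the paper); the paper's "dynamically composable" variant modifies exactly this independent-heads structure.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
        # Standard MHA: each head attends independently over its own
        # d_model/num_heads slice; outputs are concatenated and projected.
        seq_len, d_model = X.shape
        d_head = d_model // num_heads
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # (seq_len, d_model) each

        def split(M):
            # Reshape to (num_heads, seq_len, d_head).
            return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

        Qh, Kh, Vh = split(Q), split(K), split(V)
        # Scaled dot-product attention per head, with no cross-head mixing.
        scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
        out = softmax(scores) @ Vh                               # (heads, seq, d_head)
        # Concatenate heads and apply the output projection.
        out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
        return out @ Wo

    # Toy usage with random weights.
    rng = np.random.default_rng(0)
    seq_len, d_model, heads = 8, 64, 4
    X = rng.standard_normal((seq_len, d_model))
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
    print(multi_head_attention(X, Wq, Wk, Wv, Wo, heads).shape)  # (8, 64)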

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

E Kabir, MA Kabir, ARJ Downey, JD Bakos… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer neural networks (TNNs) are being applied across a widening range of
application domains, including natural language processing (NLP), machine translation, and …

FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

J Shah, G Bikshandi, Y Zhang, V Thakkar… - arXiv preprint arXiv …, 2024 - arxiv.org
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for
large language models and long-context applications. FlashAttention elaborated an …
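To illustrate the bottleneck this line of work targets: naive attention materializes a full (seq_len x seq_len) score matrix, while FlashAttention-style kernels tile over key/value blocks and use an online softmax so that matrix never exists in memory. The sketch below shows only that algorithmic idea in NumPy as an assumption-laden illustration; the actual papers implement it as fused, asynchronous, low-precision GPU kernels, and the block size and shapes here are arbitrary.

    import numpy as np

    def attention_reference(Q, K, V):
        # Reference implementation: materializes the full (seq, seq) score matrix.
        d = Q.shape[-1]
        S = Q @ K.T / np.sqrt(d)
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        return (P / P.sum(axis=-1, keepdims=True)) @ V

    def attention_tiled(Q, K, V, block=32):
        # Process K/V in blocks with an online softmax, so the full
        # (seq, seq) attention matrix is never materialized.
        seq, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        m = np.full((seq, 1), -np.inf)   # running row-wise max of scores
        l = np.zeros((seq, 1))           # running softmax denominator
        acc = np.zeros_like(Q)           # running un-normalized output
        for j in range(0, K.shape[0], block):
            Kj, Vj = K[j:j + block], V[j:j + block]
            S = (Q @ Kj.T) * scale                        # (seq, block) partial scores
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            corr = np.exp(m - m_new)                      # rescale earlier partial sums
            P = np.exp(S - m_new)
            l = l * corr + P.sum(axis=-1, keepdims=True)
            acc = acc * corr + P @ Vj
            m = m_new
        return acc / l

    # Check the tiled version against the reference on random inputs.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
    print(np.allclose(attention_tiled(Q, K, V), attention_reference(Q, K, V)))  # True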

Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors

C Zhang, B Sun, X Yu, Z Xie, W Zheng… - Proceedings of the SC' …, 2023 - dl.acm.org
Transformer models have achieved remarkable success in various machine learning tasks
but suffer from high computational complexity and resource requirements. The quadratic …

Layer-wise pruning of transformer attention heads for efficient language modeling

K Shim, I Choi, W Sung, J Choi - 2021 18th International SoC …, 2021 - ieeexplore.ieee.org
Recently, the necessity of multiple attention heads in transformer architecture has been
questioned [1]. Removing less important heads from a large network is a promising strategy …

DTQAtten: Leveraging dynamic token-based quantization for efficient attention architecture

T Yang, D Li, Z Song, Y Zhao, F Liu… - … , Automation & Test …, 2022 - ieeexplore.ieee.org
Models based on the attention mechanism, i.e., transformers, have shown extraordinary
performance in Natural Language Processing (NLP) tasks. However, their memory footprint …

Software and Hardware Fusion Multi-Head Attention

W Hu, D Xu, F Liu, Z Fan - International Conference on Knowledge …, 2022 - Springer
Recently, the Transformer has achieved state-of-the-art results in several research areas such
as Natural Language Processing and Computer Vision. Because the Transformer has a very large …