Accelerating Transformer-based deep learning models on FPGAs using column balanced block pruning

H Peng, S Huang, T Geng, A Li, W Jiang… - … on Quality Electronic …, 2021 - ieeexplore.ieee.org
Although Transformer-based language representations achieve state-of-the-art accuracy on
various natural language processing (NLP) tasks, the large model size has been …
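As a rough illustration of the column balanced block pruning named in the title, the numpy sketch below prunes whole blocks of a weight matrix while keeping the same number of surviving blocks in every block column, so the per-column workload stays balanced. The block size, L2-norm ranking, and keep ratio are illustrative assumptions, not the paper's exact algorithm or FPGA mapping.

import numpy as np

def column_balanced_block_prune(w, block=4, keep_ratio=0.5):
    # Zero out whole block x block tiles of w, keeping the same number of
    # tiles in every block column so per-column workload stays balanced.
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    n_block_rows = rows // block
    keep = max(1, int(round(n_block_rows * keep_ratio)))  # tiles kept per block column
    mask = np.zeros_like(w)
    for j in range(cols // block):
        col = w[:, j * block:(j + 1) * block]
        # Rank the tiles in this block column by their L2 norm.
        norms = np.array([np.linalg.norm(col[i * block:(i + 1) * block])
                          for i in range(n_block_rows)])
        for i in np.argsort(norms)[-keep:]:  # keep the strongest tiles
            mask[i * block:(i + 1) * block, j * block:(j + 1) * block] = 1.0
    return w * mask

# Example: an 8x8 matrix pruned to one surviving 4x4 tile per block column.
w = np.random.default_rng(0).standard_normal((8, 8))
print(column_balanced_block_prune(w, block=4, keep_ratio=0.5))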

Accommodating Transformer onto FPGA: Coupling the balanced model compression and FPGA-implementation optimization

P Qi, Y Song, H Peng, S Huang, Q Zhuge… - Proceedings of the 2021 …, 2021 - dl.acm.org
Recently, Transformers have gradually gained popularity and performed outstandingly on many Natural
Language Processing (NLP) tasks. However, Transformers suffer from heavy computation …

A length adaptive algorithm-hardware co-design of Transformer on FPGA through sparse attention and dynamic pipelining

H Peng, S Huang, S Chen, B Li, T Geng, A Li… - Proceedings of the 59th …, 2022 - dl.acm.org
Transformers have been considered among the most important deep learning models since 2018, in
part because they establish state-of-the-art (SOTA) records and could potentially replace …

Hardware acceleration of fully quantized BERT for efficient natural language processing

Z Liu, G Li, J Cheng - 2021 Design, Automation & Test in …, 2021 - ieeexplore.ieee.org
BERT is the most recent Transformer-based model that achieves state-of-the-art
performance in various NLP tasks. In this paper, we investigate the hardware acceleration of …

E.T.: re-thinking self-attention for Transformer models on GPUs

S Chen, S Huang, S Pandey, B Li, GR Gao… - Proceedings of the …, 2021 - dl.acm.org
Transformer-based deep learning models have become a ubiquitous vehicle to drive a
variety of Natural Language Processing (NLP) related tasks beyond their accuracy ceiling …

Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity

S Cao, C Zhang, Z Yao, W Xiao, L Nie, D Zhan… - Proceedings of the …, 2019 - dl.acm.org
Neural networks based on Long Short-Term Memory (LSTM) are widely deployed in latency-
sensitive language and speech applications. To speed up LSTM inference, previous …
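As a rough illustration of the bank-balanced sparsity named in the title, the numpy sketch below splits each weight row into equal-size banks and keeps the same number of largest-magnitude weights in every bank, so parallel lanes see identical work. The bank size and sparsity level are illustrative assumptions, not the paper's training procedure or FPGA decoding scheme.

import numpy as np

def bank_balanced_prune(w, bank=4, sparsity=0.5):
    # Split every row into banks of `bank` weights and keep the same number
    # of largest-magnitude weights in each bank.
    rows, cols = w.shape
    assert cols % bank == 0
    keep = bank - int(round(bank * sparsity))  # weights kept per bank
    out = np.zeros_like(w)
    banks = w.reshape(rows, cols // bank, bank)
    for r in range(rows):
        for b in range(cols // bank):
            idx = np.argsort(np.abs(banks[r, b]))[-keep:]
            out[r, b * bank + idx] = banks[r, b, idx]
    return out

# Example: every bank of four weights keeps exactly two nonzeros.
w = np.random.default_rng(0).standard_normal((2, 8))
print(bank_balanced_prune(w, bank=4, sparsity=0.5))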

FTRANS: energy-efficient acceleration of Transformers using FPGA

B Li, S Pandey, H Fang, Y Lyv, J Li, J Chen… - Proceedings of the …, 2020 - dl.acm.org
In natural language processing (NLP), the "Transformer" architecture was proposed as the
first transduction model relying entirely on self-attention mechanisms without using …

Q8BERT: Quantized 8-bit BERT

O Zafrir, G Boudoukh, P Izsak… - 2019 Fifth Workshop on …, 2019 - ieeexplore.ieee.org
Recently, pre-trained Transformer [1] based language models such as BERT [2] and GPT [3]
have shown great improvement in many Natural Language Processing (NLP) tasks …
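As a rough illustration of 8-bit quantization for a model like the one this entry describes, the numpy sketch below applies symmetric per-tensor quantization and dequantization. The scale choice and simple rounding are illustrative assumptions, not Q8BERT's quantization-aware training recipe.

import numpy as np

def quantize_int8(x):
    # Map float values to int8 codes with a symmetric per-tensor scale.
    scale = np.max(np.abs(x)) / 127.0 if np.any(x) else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the int8 codes.
    return q.astype(np.float32) * scale

# Example: the round-trip error stays within half a quantization step.
w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))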

Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer

S Lu, M Wang, S Liang, J Lin… - 2020 IEEE 33rd …, 2020 - ieeexplore.ieee.org
Designing hardware accelerators for deep neural networks (DNNs) has been in high demand.
Nonetheless, most existing accelerators are built for either convolutional neural …

ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration

X Yang, B Yan, H Li, Y Chen - … of the 39th International Conference on …, 2020 - dl.acm.org
Transformer has emerged as a popular deep neural network (DNN) model for Natural
Language Processing (NLP) applications and demonstrated excellent performance in …