An overview of neural network compression

J O'Neill - arXiv preprint arXiv:2006.03669, 2020 - arxiv.org
Overparameterized networks trained to convergence have shown impressive performance
in domains such as computer vision and natural language processing. Pushing state of the …

Rethinking attention with performers

K Choromanski, V Likhosherstov, D Dohan… - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce Performers, Transformer architectures which can estimate regular (softmax)
full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to …
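
The kernel-feature idea behind Performers can be sketched briefly. The snippet below is a minimal NumPy illustration, not the authors' FAVOR+ implementation: it approximates softmax attention with positive random features so the cost grows linearly in sequence length. The function names, feature count, and scaling choices shown are assumptions made for illustration.

import numpy as np

def positive_random_features(x, W):
    # Positive random features (FAVOR+-style sketch): phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m).
    m = W.shape[0]
    proj = x @ W.T                                      # (n, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - sq_norm) / np.sqrt(m)

def linear_attention(Q, K, V, n_features=256, seed=0):
    # Approximates softmax attention with linear (rather than quadratic) cost in sequence length.
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, d))
    scale = d ** -0.25                                  # split the usual 1/sqrt(d) between Q and K
    q_prime = positive_random_features(Q * scale, W)    # (n, m)
    k_prime = positive_random_features(K * scale, W)    # (n, m)
    kv = k_prime.T @ V                                  # (m, d_v); the n x n attention matrix is never formed
    normalizer = q_prime @ k_prime.sum(axis=0)          # (n,)
    return (q_prime @ kv) / normalizer[:, None]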

CTformer: convolution-free Token2Token dilated vision transformer for low-dose CT denoising

D Wang, F Fan, Z Wu, R Liu, F Wang… - Physics in Medicine & …, 2023 - iopscience.iop.org
Objective. Low-dose computed tomography (LDCT) denoising is an important problem in CT
research. Compared to the normal dose CT, LDCT images are subjected to severe noise …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

Audio ALBERT: A lite BERT for self-supervised learning of audio representation

PH Chi, PH Chung, TH Wu, CC Hsieh… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
Self-supervised speech models are powerful speech representation extractors for
downstream applications. Recently, larger models have been utilized in acoustic model …

ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration

X Yang, B Yan, H Li, Y Chen - … of the 39th International Conference on …, 2020 - dl.acm.org
Transformer has emerged as a popular deep neural network (DNN) model for Natural
Language Processing (NLP) applications and demonstrated excellent performance in …

Lessons on parameter sharing across layers in transformers

S Takase, S Kiyono - arXiv preprint arXiv:2104.06022, 2021 - arxiv.org
We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The
proposed approach relaxes a widely used technique, which shares parameters for one layer …
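
As a rough illustration of cross-layer parameter sharing, the sketch below reuses a small pool of Transformer layers across a deeper stack with a simple cyclic assignment. This is not the authors' exact method; the class name, the PyTorch layer choice, and the cyclic assignment are assumptions made for illustration (the paper studies several assignment strategies).

import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    # Sketch: N encoder positions share a pool of M < N unique layers.
    def __init__(self, d_model=512, nhead=8, num_unique=3, num_positions=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_unique)
        )
        # Cyclic assignment: position i uses layer i mod M, so parameters repeat as 0,1,2,0,1,2,...
        self.assignment = [i % num_unique for i in range(num_positions)]

    def forward(self, x):
        for idx in self.assignment:
            x = self.layers[idx](x)
        return x

# Usage: a 6-layer-deep encoder holding only 3 layers' worth of parameters.
encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 10, 512))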

Masked language modeling for proteins via linearly scalable long-context transformers

K Choromanski, V Likhosherstov, D Dohan… - arXiv preprint arXiv …, 2020 - arxiv.org
Transformer models have achieved state-of-the-art results across a diverse range of
domains. However, concern over the cost of training the attention mechanism to learn …

Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

M Behnke, K Heafield - The 2020 Conference on Empirical …, 2020 - research.ed.ac.uk
The attention mechanism is the crucial component of the transformer architecture. Recent
research shows that most attention heads are not confident in their decisions and can be …
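
A hedged sketch of confidence-based head pruning: the functions below score each head by its average maximum attention weight and keep only the most confident ones. The confidence proxy, function names, and keep ratio are assumptions for illustration, not the paper's exact criterion.

import torch

def head_confidence(attn_probs):
    # attn_probs: (batch, heads, tgt_len, src_len) attention distributions.
    # Assumed confidence proxy: a head's average maximum attention weight per query.
    return attn_probs.max(dim=-1).values.mean(dim=(0, 2))   # (heads,)

def head_mask(attn_probs, keep_ratio=0.5):
    # Keep the most confident heads and zero out the rest (multiply head outputs by this mask).
    conf = head_confidence(attn_probs)
    k = max(1, int(keep_ratio * conf.numel()))
    mask = torch.zeros_like(conf)
    mask[conf.topk(k).indices] = 1.0
    return mask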