The impact of positional encoding on length generalization in transformers

A Kazemnejad, I Padhi… - Advances in …, 2024 - proceedings.neurips.cc
Length generalization, the ability to generalize from small training context sizes to larger
ones, is a critical challenge in the development of Transformer-based language models …
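
To make the object of study concrete, here is a minimal NumPy sketch of classic sinusoidal absolute positional encoding, one of the schemes such comparisons typically cover. The dimensions and the train/test lengths are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic fixed sinusoidal encoding from 'Attention Is All You Need'.

    Returns an array of shape (seq_len, d_model) that is added to the
    token embeddings to inject absolute position information.
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Illustrative usage: a test sequence longer than anything seen in training
train_pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
test_pe = sinusoidal_positional_encoding(seq_len=512, d_model=64)  # unseen positions
```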

Transformers can achieve length generalization but not robustly

Y Zhou, U Alon, X Chen, X Wang, R Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Length generalization, defined as the ability to extrapolate from shorter training sequences
to longer test ones, is a significant challenge for language models. This issue persists even …
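
A minimal sketch of the evaluation protocol this line of work uses, shown on a toy addition task: train only on short inputs and test on strictly longer ones. The task, digit counts, and dataset sizes here are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_addition_example(n_digits):
    """One toy sequence-to-sequence addition example of a given digit length."""
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"{a}+{b}=", str(a + b)

# Length-generalization protocol: train only on short problems,
# then evaluate on problems longer than anything seen in training.
TRAIN_MAX_DIGITS = 10
TEST_DIGITS = 20  # extrapolation regime

train_set = [make_addition_example(random.randint(1, TRAIN_MAX_DIGITS)) for _ in range(10_000)]
test_set = [make_addition_example(TEST_DIGITS) for _ in range(1_000)]
```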

CLEX: Continuous length extrapolation for large language models

G Chen, X Li, Z Meng, S Liang, L Bing - arXiv preprint arXiv:2310.16450, 2023 - arxiv.org
Transformer-based Large Language Models (LLMs) are pioneering advances in many
natural language processing tasks; however, their exceptional capabilities are restricted …
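
For context, the simplest length-extrapolation scaling that continuous approaches of this kind generalize is linear position interpolation on rotary embeddings (RoPE): compress unseen positions back into the trained range. The sketch below shows that baseline idea, not CLEX itself; the context lengths are illustrative.

```python
import numpy as np

def rope_angles(positions, d_head, scale=1.0, base=10000.0):
    """Rotary (RoPE) rotation angles with optional position-interpolation scaling.

    scale > 1 compresses positions back into the trained range, the simple
    length-extrapolation trick that continuous scaling approaches generalize.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, d_head, 2) / d_head))  # (d_head/2,)
    return np.outer(positions / scale, inv_freq)                   # (seq, d_head/2)

def apply_rope(x, positions, scale=1.0):
    """Rotate query/key features pairwise by position-dependent angles."""
    seq, d_head = x.shape
    theta = rope_angles(positions, d_head, scale)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Extrapolating from a 4k training context to 16k: compress positions by 4x.
q = np.random.randn(16384, 64)
q_rotated = apply_rope(q, np.arange(16384), scale=16384 / 4096)
```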

CAPE: Encoding relative positions with continuous augmented positional embeddings

T Likhomanenko, Q Xu, G Synnaeve… - Advances in …, 2021 - proceedings.neurips.cc
Without positional information, attention-based Transformer neural networks are permutation-
invariant. Absolute or relative positional embeddings are the most popular ways to feed …
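
The continuous-augmentation idea can be sketched in a few lines: mean-center the positions, then randomly shift and rescale them as a whole during training before they reach the sinusoidal embedding. This is a simplified rendering; the paper's exact augmentations (including local per-token shifts) and their ranges differ.

```python
import numpy as np

def augmented_positions(seq_len, max_global_shift=5.0, max_scale=1.1, training=True):
    """Continuous position augmentation in the spirit of CAPE (simplified).

    Positions are mean-centered, then randomly shifted and rescaled as a
    whole during training, so the model cannot latch onto exact absolute
    indices. At inference the positions are deterministic.
    """
    pos = np.arange(seq_len, dtype=np.float64)
    pos = pos - pos.mean()                       # mean-centering
    if training:
        pos = pos + np.random.uniform(-max_global_shift, max_global_shift)
        pos = pos * np.exp(np.random.uniform(-np.log(max_scale), np.log(max_scale)))
    return pos  # fed to a sinusoidal embedding instead of integer indices
```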

Your transformer may not be as powerful as you expect

S Luo, S Li, S Zheng, TY Liu… - Advances in Neural …, 2022 - proceedings.neurips.cc
Relative Positional Encoding (RPE), which encodes the relative distance between
any pair of tokens, is one of the most successful modifications to the original Transformer. As …
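
One common concrete form of RPE: a learned scalar bias indexed by the clipped relative distance j - i, added to the attention logits before the softmax. The sketch below is a generic single-head version with illustrative sizes, not the specific variants analyzed in the paper.

```python
import numpy as np

def rpe_attention(q, k, v, rel_bias, max_dist=16):
    """Single-head attention with a learned relative-position bias.

    rel_bias has shape (2 * max_dist + 1,): one scalar per clipped
    relative distance j - i in [-max_dist, max_dist], added to the logits.
    """
    seq, d = q.shape
    logits = q @ k.T / np.sqrt(d)                              # (seq, seq)
    rel = np.clip(np.arange(seq)[None, :] - np.arange(seq)[:, None],
                  -max_dist, max_dist) + max_dist              # indices into rel_bias
    logits = logits + rel_bias[rel]
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq, d = 10, 8
out = rpe_attention(np.random.randn(seq, d), np.random.randn(seq, d),
                    np.random.randn(seq, d), rel_bias=np.random.randn(2 * 16 + 1))
```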

Mega: Moving average equipped gated attention

X Ma, C Zhou, X Kong, J He, L Gui, G Neubig… - arXiv preprint arXiv …, 2022 - arxiv.org
The design choices in the Transformer attention mechanism, including weak inductive bias
and quadratic computational complexity, have limited its application for modeling long …
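
Mega's extra inductive bias comes from a damped exponential moving average applied to the inputs before gated attention. Below is a single-factor version of that recurrence, y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, with scalar alpha and delta for readability; the model itself uses a learned, multi-dimensional EMA inside a full gated-attention block.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped exponential moving average along the time axis.

    y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}

    This is the kind of recurrence Mega applies (per embedding dimension,
    with learned alpha and damping delta) before its gated attention.
    """
    seq_len, dim = x.shape
    y = np.zeros_like(x)
    prev = np.zeros(dim)
    for t in range(seq_len):
        prev = alpha * x[t] + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y

smoothed = damped_ema(np.random.randn(128, 16), alpha=0.3, delta=0.9)
```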

Functional interpolation for relative positions improves long context transformers

S Li, C You, G Guruganesh, J Ainslie… - arXiv preprint arXiv …, 2023 - arxiv.org
Preventing the performance decay of Transformers on inputs longer than those used for
training has been an important challenge in extending the context length of these models …
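
The functional-interpolation idea can be sketched as a relative-position bias of the form f_theta(psi(i - j) / psi(max(i, L))) with psi(x) = log(cx + 1), so relative distances are progressively normalized by the query position. The code below is a simplified rendering with a stand-in for the learned MLP f_theta and illustrative constants, not the paper's exact parameterization.

```python
import numpy as np

def fire_like_bias(seq_len, mlp, c=1.0, L=32.0):
    """FIRE-style functional relative-position bias (simplified sketch).

    For causal attention, the bias for query i attending to key j <= i is
    mlp( psi(i - j) / psi(max(i, L)) ) with psi(x) = log(c * x + 1), so
    relative distances are normalized by (a thresholded) query position.
    """
    psi = lambda x: np.log(c * x + 1.0)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    rel = np.maximum(i - j, 0)                     # causal: only j <= i matter
    normalized = psi(rel) / psi(np.maximum(i, L))
    return mlp(normalized)                         # (seq_len, seq_len) bias

# Illustrative stand-in for the learned MLP f_theta.
tiny_mlp = lambda z: np.tanh(z * 2.0 - 1.0)
bias = fire_like_bias(seq_len=64, mlp=tiny_mlp)
```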

Randomized positional encodings boost length generalization of transformers

A Ruoss, G Delétang, T Genewein… - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers have impressive generalization capabilities on tasks with a fixed context
length. However, they fail to generalize to sequences of arbitrary length, even for seemingly …
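
The randomization trick itself is short: during training, replace positions 0..n-1 with an ordered random subset of a much larger position range, so the model has already seen large position values by the time longer test sequences arrive. The maximum range and sequence length below are illustrative.

```python
import numpy as np

def randomized_positions(seq_len, max_len=2048, rng=None):
    """Randomized positional encoding trick (simplified sketch).

    Instead of positions 0..seq_len-1, sample an ordered random subset of
    size seq_len from a much larger range [0, max_len). Short training
    sequences thus already exercise large position values.
    """
    rng = rng or np.random.default_rng()
    positions = rng.choice(max_len, size=seq_len, replace=False)
    return np.sort(positions)  # keep the left-to-right order

pos = randomized_positions(seq_len=40, max_len=2048)
```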

Combiner: Full attention transformer with sparse computation cost

H Ren, H Dai, Z Dai, M Yang… - Advances in …, 2021 - proceedings.neurips.cc
Transformers provide a class of expressive architectures that are extremely effective for
sequence modeling. However, the key limitation of transformers is their quadratic memory …
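
As a rough illustration of the local-plus-summary structure such factorizations build on, the sketch below lets each query attend to its own block directly and to mean-pooled summaries of all blocks. Combiner's actual factorization is more careful (it recovers a proper attention distribution over every token), so treat this only as the shape of the computation; block size and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_plus_summary_attention(q, k, v, block=16):
    """Sparse attention with per-block summaries (simplified sketch).

    Each query attends directly to the keys in its own block and to one
    mean-pooled summary key/value per block, so cost grows with
    seq * (block + n_blocks) instead of seq**2.
    """
    seq, d = q.shape
    n_blocks = seq // block
    k_blocks = k[: n_blocks * block].reshape(n_blocks, block, d)
    v_blocks = v[: n_blocks * block].reshape(n_blocks, block, d)
    k_sum = k_blocks.mean(axis=1)                 # (n_blocks, d) summary keys
    v_sum = v_blocks.mean(axis=1)                 # (n_blocks, d) summary values

    out = np.zeros_like(q)
    for i in range(seq):
        b = min(i // block, n_blocks - 1)
        keys = np.concatenate([k_blocks[b], k_sum], axis=0)
        vals = np.concatenate([v_blocks[b], v_sum], axis=0)
        w = softmax(q[i] @ keys.T / np.sqrt(d))
        out[i] = w @ vals
    return out

seq, d = 64, 32
out = local_plus_summary_attention(np.random.randn(seq, d), np.random.randn(seq, d),
                                   np.random.randn(seq, d))
```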

Stable, fast and accurate: Kernelized attention with relative positional encoding

S Luo, S Li, T Cai, D He, D Peng… - Advances in …, 2021 - proceedings.neurips.cc
The attention module, which is a crucial component in Transformer, cannot scale efficiently
to long sequences due to its quadratic complexity. Many works focus on approximating the …
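
For reference, the kernelized (linear) attention being extended here replaces softmax(QK^T) with phi(Q)phi(K)^T so the phi(K)^T V product can be computed once, dropping the cost from O(n^2 d) to O(n d^2). The sketch below shows that plain mechanism with an ELU+1 feature map; the paper's FFT-based way of folding relative positional encoding into it is not reproduced.

```python
import numpy as np

def kernelized_attention(q, k, v,
                         feature_map=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Linear-time kernelized attention (simplified sketch, no RPE).

    With a positive feature map phi (here ELU(x) + 1), attention becomes
    phi(Q) (phi(K)^T V) normalized row-wise, avoiding the n x n matrix.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)   # (n, d) each
    kv = phi_k.T @ v                                # (d, d), computed once
    normalizer = phi_q @ phi_k.sum(axis=0)          # (n,)
    return (phi_q @ kv) / normalizer[:, None]

n, d = 256, 32
out = kernelized_attention(np.random.randn(n, d), np.random.randn(n, d),
                           np.random.randn(n, d))
```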