The impact of positional encoding on length generalization in transformers

A Kazemnejad, I Padhi… - Advances in …, 2024 - proceedings.neurips.cc
Length generalization, the ability to generalize from small training context sizes to larger
ones, is a critical challenge in the development of Transformer-based language models …
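
To make the object of study concrete, here is a minimal NumPy sketch of classic sinusoidal absolute positional encoding, one of the schemes such comparisons typically cover. The dimensions and the train/test lengths are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic fixed sinusoidal encoding from 'Attention Is All You Need'.

    Returns an array of shape (seq_len, d_model) that is added to the
    token embeddings to inject absolute position information.
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Illustrative usage: a test sequence longer than anything seen in training
train_pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
test_pe = sinusoidal_positional_encoding(seq_len=512, d_model=64)  # unseen positions
```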

Transformers can achieve length generalization but not robustly

Y Zhou, U Alon, X Chen, X Wang, R Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Length generalization, defined as the ability to extrapolate from shorter training sequences
to longer test ones, is a significant challenge for language models. This issue persists even …
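
A minimal sketch of the evaluation protocol this line of work uses, shown on a toy addition task: train only on short inputs and test on strictly longer ones. The task, digit counts, and dataset sizes here are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_addition_example(n_digits):
    """One toy sequence-to-sequence addition example of a given digit length."""
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"{a}+{b}=", str(a + b)

# Length-generalization protocol: train only on short problems,
# then evaluate on problems longer than anything seen in training.
TRAIN_MAX_DIGITS = 10
TEST_DIGITS = 20  # extrapolation regime

train_set = [make_addition_example(random.randint(1, TRAIN_MAX_DIGITS)) for _ in range(10_000)]
test_set = [make_addition_example(TEST_DIGITS) for _ in range(1_000)]
```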

CLEX: Continuous length extrapolation for large language models

G Chen, X Li, Z Meng, S Liang, L Bing - arXiv preprint arXiv:2310.16450, 2023 - arxiv.org
Transformer-based Large Language Models (LLMs) are pioneering advances in many
natural language processing tasks; however, their exceptional capabilities are restricted …
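
For context, the simplest length-extrapolation scaling that continuous approaches of this kind generalize is linear position interpolation on rotary embeddings (RoPE): compress unseen positions back into the trained range. The sketch below shows that baseline idea, not CLEX itself; the context lengths are illustrative.

```python
import numpy as np

def rope_angles(positions, d_head, scale=1.0, base=10000.0):
    """Rotary (RoPE) rotation angles with optional position-interpolation scaling.

    scale > 1 compresses positions back into the trained range, the simple
    length-extrapolation trick that continuous scaling approaches generalize.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, d_head, 2) / d_head))  # (d_head/2,)
    return np.outer(positions / scale, inv_freq)                   # (seq, d_head/2)

def apply_rope(x, positions, scale=1.0):
    """Rotate query/key features pairwise by position-dependent angles."""
    seq, d_head = x.shape
    theta = rope_angles(positions, d_head, scale)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Extrapolating from a 4k training context to 16k: compress positions by 4x.
q = np.random.randn(16384, 64)
q_rotated = apply_rope(q, np.arange(16384), scale=16384 / 4096)
```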

CAPE: Encoding relative positions with continuous augmented positional embeddings

T Likhomanenko, Q Xu, G Synnaeve… - Advances in …, 2021 - proceedings.neurips.cc
Without positional information, attention-based Transformer neural networks are permutation-
invariant. Absolute or relative positional embeddings are the most popular ways to feed …
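
The continuous-augmentation idea can be sketched in a few lines: mean-center the positions, then randomly shift and rescale them as a whole during training before they reach the sinusoidal embedding. This is a simplified rendering; the paper's exact augmentations (including local per-token shifts) and their ranges differ.

```python
import numpy as np

def augmented_positions(seq_len, max_global_shift=5.0, max_scale=1.1, training=True):
    """Continuous position augmentation in the spirit of CAPE (simplified).

    Positions are mean-centered, then randomly shifted and rescaled as a
    whole during training, so the model cannot latch onto exact absolute
    indices. At inference the positions are deterministic.
    """
    pos = np.arange(seq_len, dtype=np.float64)
    pos = pos - pos.mean()                       # mean-centering
    if training:
        pos = pos + np.random.uniform(-max_global_shift, max_global_shift)
        pos = pos * np.exp(np.random.uniform(-np.log(max_scale), np.log(max_scale)))
    return pos  # fed to a sinusoidal embedding instead of integer indices
```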

Your transformer may not be as powerful as you expect

S Luo, S Li, S Zheng, TY Liu… - Advances in Neural …, 2022 - proceedings.neurips.cc
Relative Positional Encoding (RPE), which encodes the relative distance between
any pair of tokens, is one of the most successful modifications to the original Transformer. As …
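
One common concrete form of RPE: a learned scalar bias indexed by the clipped relative distance j - i, added to the attention logits before the softmax. The sketch below is a generic single-head version with illustrative sizes, not the specific variants analyzed in the paper.

```python
import numpy as np

def rpe_attention(q, k, v, rel_bias, max_dist=16):
    """Single-head attention with a learned relative-position bias.

    rel_bias has shape (2 * max_dist + 1,): one scalar per clipped
    relative distance j - i in [-max_dist, max_dist], added to the logits.
    """
    seq, d = q.shape
    logits = q @ k.T / np.sqrt(d)                              # (seq, seq)
    rel = np.clip(np.arange(seq)[None, :] - np.arange(seq)[:, None],
                  -max_dist, max_dist) + max_dist              # indices into rel_bias
    logits = logits + rel_bias[rel]
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq, d = 10, 8
out = rpe_attention(np.random.randn(seq, d), np.random.randn(seq, d),
                    np.random.randn(seq, d), rel_bias=np.random.randn(2 * 16 + 1))
```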

Mega: Moving average equipped gated attention

X Ma, C Zhou, X Kong, J He, L Gui, G Neubig… - arXiv preprint arXiv …, 2022 - arxiv.org
The design choices in the Transformer attention mechanism, including weak inductive bias
and quadratic computational complexity, have limited its application for modeling long …
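
Mega's extra inductive bias comes from a damped exponential moving average applied to the inputs before gated attention. Below is a single-factor version of that recurrence, y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, with scalar alpha and delta for readability; the model itself uses a learned, multi-dimensional EMA inside a full gated-attention block.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped exponential moving average along the time axis.

    y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}

    This is the kind of recurrence Mega applies (per embedding dimension,
    with learned alpha and damping delta) before its gated attention.
    """
    seq_len, dim = x.shape
    y = np.zeros_like(x)
    prev = np.zeros(dim)
    for t in range(seq_len):
        prev = alpha * x[t] + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y

smoothed = damped_ema(np.random.randn(128, 16), alpha=0.3, delta=0.9)
```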

Functional interpolation for relative positions improves long context transformers

S Li, C You, G Guruganesh, J Ainslie… - arXiv preprint arXiv …, 2023 - arxiv.org
Preventing the performance decay of Transformers on inputs longer than those used for
training has been an important challenge in extending the context length of these models …
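
The functional-interpolation idea can be sketched as a relative-position bias of the form f_theta(psi(i - j) / psi(max(i, L))) with psi(x) = log(cx + 1), so relative distances are progressively normalized by the query position. The code below is a simplified rendering with a stand-in for the learned MLP f_theta and illustrative constants, not the paper's exact parameterization.

```python
import numpy as np

def fire_like_bias(seq_len, mlp, c=1.0, L=32.0):
    """FIRE-style functional relative-position bias (simplified sketch).

    For causal attention, the bias for query i attending to key j <= i is
    mlp( psi(i - j) / psi(max(i, L)) ) with psi(x) = log(c * x + 1), so
    relative distances are normalized by (a thresholded) query position.
    """
    psi = lambda x: np.log(c * x + 1.0)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    rel = np.maximum(i - j, 0)                     # causal: only j <= i matter
    normalized = psi(rel) / psi(np.maximum(i, L))
    return mlp(normalized)                         # (seq_len, seq_len) bias

# Illustrative stand-in for the learned MLP f_theta.
tiny_mlp = lambda z: np.tanh(z * 2.0 - 1.0)
bias = fire_like_bias(seq_len=64, mlp=tiny_mlp)
```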

Randomized positional encodings boost length generalization of transformers

A Ruoss, G Delétang, T Genewein… - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers have impressive generalization capabilities on tasks with a fixed context
length. However, they fail to generalize to sequences of arbitrary length, even for seemingly …
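
The randomization trick itself is short: during training, replace positions 0..n-1 with an ordered random subset of a much larger position range, so the model has already seen large position values by the time longer test sequences arrive. The maximum range and sequence length below are illustrative.

```python
import numpy as np

def randomized_positions(seq_len, max_len=2048, rng=None):
    """Randomized positional encoding trick (simplified sketch).

    Instead of positions 0..seq_len-1, sample an ordered random subset of
    size seq_len from a much larger range [0, max_len). Short training
    sequences thus already exercise large position values.
    """
    rng = rng or np.random.default_rng()
    positions = rng.choice(max_len, size=seq_len, replace=False)
    return np.sort(positions)  # keep the left-to-right order

pos = randomized_positions(seq_len=40, max_len=2048)
```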

Combiner: Full attention transformer with sparse computation cost

H Ren, H Dai, Z Dai, M Yang… - Advances in …, 2021 - proceedings.neurips.cc
Transformers provide a class of expressive architectures that are extremely effective for
sequence modeling. However, the key limitation of transformers is their quadratic memory …
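
As a rough illustration of the local-plus-summary structure such factorizations build on, the sketch below lets each query attend to its own block directly and to mean-pooled summaries of all blocks. Combiner's actual factorization is more careful (it recovers a proper attention distribution over every token), so treat this only as the shape of the computation; block size and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_plus_summary_attention(q, k, v, block=16):
    """Sparse attention with per-block summaries (simplified sketch).

    Each query attends directly to the keys in its own block and to one
    mean-pooled summary key/value per block, so cost grows with
    seq * (block + n_blocks) instead of seq**2.
    """
    seq, d = q.shape
    n_blocks = seq // block
    k_blocks = k[: n_blocks * block].reshape(n_blocks, block, d)
    v_blocks = v[: n_blocks * block].reshape(n_blocks, block, d)
    k_sum = k_blocks.mean(axis=1)                 # (n_blocks, d) summary keys
    v_sum = v_blocks.mean(axis=1)                 # (n_blocks, d) summary values

    out = np.zeros_like(q)
    for i in range(seq):
        b = min(i // block, n_blocks - 1)
        keys = np.concatenate([k_blocks[b], k_sum], axis=0)
        vals = np.concatenate([v_blocks[b], v_sum], axis=0)
        w = softmax(q[i] @ keys.T / np.sqrt(d))
        out[i] = w @ vals
    return out

seq, d = 64, 32
out = local_plus_summary_attention(np.random.randn(seq, d), np.random.randn(seq, d),
                                   np.random.randn(seq, d))
```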

Stable, fast and accurate: Kernelized attention with relative positional encoding

S Luo, S Li, T Cai, D He, D Peng… - Advances in …, 2021 - proceedings.neurips.cc
The attention module, which is a crucial component in Transformer, cannot scale efficiently
to long sequences due to its quadratic complexity. Many works focus on approximating the …
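
For reference, the kernelized (linear) attention being extended here replaces softmax(QK^T) with phi(Q)phi(K)^T so the phi(K)^T V product can be computed once, dropping the cost from O(n^2 d) to O(n d^2). The sketch below shows that plain mechanism with an ELU+1 feature map; the paper's FFT-based way of folding relative positional encoding into it is not reproduced.

```python
import numpy as np

def kernelized_attention(q, k, v,
                         feature_map=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Linear-time kernelized attention (simplified sketch, no RPE).

    With a positive feature map phi (here ELU(x) + 1), attention becomes
    phi(Q) (phi(K)^T V) normalized row-wise, avoiding the n x n matrix.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)   # (n, d) each
    kv = phi_k.T @ v                                # (d, d), computed once
    normalizer = phi_q @ phi_k.sum(axis=0)          # (n,)
    return (phi_q @ kv) / normalizer[:, None]

n, d = 256, 32
out = kernelized_attention(np.random.randn(n, d), np.random.randn(n, d),
                           np.random.randn(n, d))
```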