Rethinking attention with Performers

K Choromanski, V Likhosherstov, D Dohan… - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce Performers, Transformer architectures which can estimate regular (softmax)
full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to …
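
The entry points to replacing quadratic softmax attention with a linear-complexity, random-feature approximation. Below is a minimal sketch of that general idea, assuming positive exponential random features for the softmax kernel; the feature count m, the variable names, and the omitted 1/sqrt(d) temperature scaling are illustrative choices, not the paper's exact FAVOR+ construction.

```python
import numpy as np

def positive_random_features(x, w):
    # x: (n, d) queries or keys; w: (m, d) Gaussian projections.
    # phi(x) = exp(w @ x - ||x||^2 / 2) / sqrt(m) is a non-negative feature map
    # whose inner products are unbiased estimates of the softmax kernel exp(q . k).
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(q, k, v, m=256, seed=0):
    # Approximate softmax attention in O(n * m * d) time instead of O(n^2 * d).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((m, q.shape[-1]))
    q_f, k_f = positive_random_features(q, w), positive_random_features(k, w)
    kv = k_f.T @ v                       # (m, d_v): keys and values aggregated once
    normalizer = q_f @ k_f.sum(axis=0)   # row sums of the implicit attention matrix
    return (q_f @ kv) / normalizer[:, None]
```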

Masked language modeling for proteins via linearly scalable long-context transformers

K Choromanski, V Likhosherstov, D Dohan… - arXiv preprint arXiv …, 2020 - arxiv.org
Transformer models have achieved state-of-the-art results across a diverse range of
domains. However, concern over the cost of training the attention mechanism to learn …

Learning with hyperspherical uniformity

W Liu, R Lin, Z Liu, L Xiong… - International …, 2021 - proceedings.mlr.press
Due to their over-parameterized nature, neural networks are a powerful tool for nonlinear
function approximation. In order to achieve good generalization on unseen data, a suitable …
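
The truncated sentence gestures at a regularizer that spreads neuron weight vectors uniformly over the unit hypersphere. Here is a hedged sketch of one such penalty, a pairwise inverse-distance (hyperspherical-energy) term on row-normalized weights; the energy function, the epsilon constant, and the weighting lam are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def hyperspherical_energy(weights, eps=1e-6):
    # weights: (n_neurons, d). Project each neuron's weight vector onto the unit
    # hypersphere, then sum inverse pairwise distances; minimizing this energy
    # pushes the directions toward a uniform spread over the sphere.
    w = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + eps)
    dists = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=-1)
    iu = np.triu_indices(len(w), k=1)    # each unordered pair counted once
    return np.sum(1.0 / (dists[iu] + eps))

# Used as an additive penalty: loss = task_loss + lam * hyperspherical_energy(W)
```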

Recurrent neural filters: Learning independent Bayesian filtering steps for time series prediction

B Lim, S Zohren, S Roberts - 2020 International Joint …, 2020 - ieeexplore.ieee.org
Despite the recent popularity of deep generative state space models, few comparisons have
been made between network architectures and the inference steps of the Bayesian filtering …
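
The comparison the entry alludes to is between network components and the predict/update steps of Bayesian filtering. As a reference point, here is the classical linear-Gaussian special case of those steps (a Kalman filter); the recurrent-network parameterization studied in the paper is not reproduced here.

```python
import numpy as np

def kalman_predict(mu, P, A, Q):
    # Propagation step: x_t = A @ x_{t-1} + process noise with covariance Q.
    return A @ mu, A @ P @ A.T + Q

def kalman_update(mu, P, y, H, R):
    # Correction step: fold in observation y = H @ x + noise with covariance R.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    mu_new = mu + K @ (y - H @ mu)
    P_new = (np.eye(len(mu)) - K @ H) @ P
    return mu_new, P_new
```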

Taming graph kernels with random features

KM Choromanski - International Conference on Machine …, 2023 - proceedings.mlr.press
In this paper we introduce the mechanism of graph random features (GRFs). GRFs can be
used to construct unbiased randomized estimators of several important kernels defined on …
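
To illustrate what a walk-based unbiased graph-kernel estimator looks like, here is a minimal sketch that estimates one row of the resolvent kernel (I - alpha*A)^(-1) with terminating random walks. The choice of kernel, halting probability, and reweighting scheme are assumptions for illustration; the paper's GRFs build explicit feature vectors rather than estimating kernel entries directly.

```python
import numpy as np

def resolvent_kernel_row(adj_list, i, alpha=0.1, p_halt=0.5, n_walks=2000, seed=0):
    # Monte Carlo estimate of row i of (I - alpha * A)^(-1), assuming the Neumann
    # series converges (alpha * ||A|| < 1). Unbiased because the per-step weight
    # alpha * deg(v) / (1 - p_halt) cancels the walk and survival probabilities.
    rng = np.random.default_rng(seed)
    est = np.zeros(len(adj_list))
    for _ in range(n_walks):
        v, weight = i, 1.0
        est[v] += weight                           # k = 0 term of the series
        while rng.random() > p_halt and adj_list[v]:
            weight *= alpha * len(adj_list[v]) / (1.0 - p_halt)
            v = adj_list[v][rng.integers(len(adj_list[v]))]
            est[v] += weight
    return est / n_walks
```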

On the expressive power of self-attention matrices

V Likhosherstov, K Choromanski, A Weller - arXiv preprint arXiv …, 2021 - arxiv.org
Transformer networks are able to capture patterns in data coming from many domains (text,
images, videos, proteins, etc.) with little or no change to architecture components. We …

Dense-exponential random features: sharp positive estimators of the Gaussian kernel

V Likhosherstov, KM Choromanski… - Advances in …, 2024 - proceedings.neurips.cc
The problem of efficient approximation of a linear operator induced by the Gaussian or
softmax kernel is often addressed using random features (RFs) which yield an unbiased …
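
For context on the random-feature estimators the entry refers to, below is the classical trigonometric (random Fourier feature) construction that gives an unbiased estimate of the Gaussian kernel; the paper's dense-exponential features are a sharper, positive-valued alternative, and the unit bandwidth here is an assumption.

```python
import numpy as np

def gaussian_kernel_rff(X, Y, m=512, seed=0):
    # Unbiased random-feature estimate of K[i, j] = exp(-||x_i - y_j||^2 / 2)
    # via random Fourier features: E[phi(x) . phi(y)] equals the exact kernel.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((m, X.shape[1]))       # frequencies ~ N(0, I_d)
    b = rng.uniform(0.0, 2 * np.pi, size=m)        # random phases
    phi = lambda Z: np.sqrt(2.0 / m) * np.cos(Z @ w.T + b)
    return phi(X) @ phi(Y).T

# Exact kernel for comparison:
# K = np.exp(-0.5 * np.sum((X[:, None] - Y[None]) ** 2, axis=-1))
```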

General value function networks

M Schlegel, A Jacobsen, Z Abbas, A Patterson… - Journal of Artificial …, 2021 - jair.org
State construction is important for learning in partially observable environments. A general
purpose strategy for state construction is to learn the state update using a Recurrent Neural …
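
The state-construction strategy the entry describes builds on general value functions, i.e. predictions of the discounted sum of an arbitrary cumulant signal. A minimal sketch of one such prediction learned by linear TD(0) follows; the cumulant, continuation factor, and step size are placeholders, and the paper's recurrent-network construction is not reproduced.

```python
import numpy as np

def gvf_td0_update(w, x, x_next, cumulant, gamma, alpha=0.1):
    # One TD(0) step for a general value function: predict the discounted sum of
    # an arbitrary cumulant (not necessarily reward) with continuation gamma.
    delta = cumulant + gamma * (w @ x_next) - (w @ x)
    return w + alpha * delta * x
```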

On the expressive flexibility of self-attention matrices

V Likhosherstov, K Choromanski… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Transformer networks are able to capture patterns in data coming from many domains (text,
images, videos, proteins, etc.) with little or no change to architecture components. We …

Geometrically coupled Monte Carlo sampling

M Rowland, KM Choromanski… - Advances in …, 2018 - proceedings.neurips.cc
Monte Carlo sampling in high-dimensional, low-sample settings is important in many
machine learning tasks. We improve current methods for sampling in Euclidean spaces by …
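
One concrete example of geometrically coupling samples in Euclidean space is to draw Gaussian vectors that are marginally N(0, I) yet mutually orthogonal, which typically lowers the variance of downstream random-feature estimators. The sketch below uses the standard Haar-rotation construction; it is offered as an illustration of coupled sampling, not as the paper's specific coupling.

```python
import numpy as np

def orthogonal_gaussian_samples(d, seed=0):
    # Draw d samples that are each marginally N(0, I_d) but mutually orthogonal:
    # couple the directions through a Haar-random rotation and attach independent
    # chi(d)-distributed radii.
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    q *= np.sign(np.diag(r))                 # sign fix so q is Haar-distributed
    radii = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
    return radii[:, None] * q                # rows: orthogonal Gaussian samples
```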