Rethinking attention with Performers

K Choromanski, V Likhosherstov, D Dohan… - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce Performers, Transformer architectures which can estimate regular (softmax)
full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to …
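
The entry points to replacing quadratic softmax attention with a linear-complexity, random-feature approximation. Below is a minimal sketch of that general idea, assuming positive exponential random features for the softmax kernel; the feature count m, the variable names, and the omitted 1/sqrt(d) temperature scaling are illustrative choices, not the paper's exact FAVOR+ construction.

```python
import numpy as np

def positive_random_features(x, w):
    # x: (n, d) queries or keys; w: (m, d) Gaussian projections.
    # phi(x) = exp(w @ x - ||x||^2 / 2) / sqrt(m) is a non-negative feature map
    # whose inner products are unbiased estimates of the softmax kernel exp(q . k).
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(q, k, v, m=256, seed=0):
    # Approximate softmax attention in O(n * m * d) time instead of O(n^2 * d).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((m, q.shape[-1]))
    q_f, k_f = positive_random_features(q, w), positive_random_features(k, w)
    kv = k_f.T @ v                       # (m, d_v): keys and values aggregated once
    normalizer = q_f @ k_f.sum(axis=0)   # row sums of the implicit attention matrix
    return (q_f @ kv) / normalizer[:, None]
```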

Masked language modeling for proteins via linearly scalable long-context transformers

K Choromanski, V Likhosherstov, D Dohan… - arXiv preprint arXiv …, 2020 - arxiv.org
Transformer models have achieved state-of-the-art results across a diverse range of
domains. However, concern over the cost of training the attention mechanism to learn …

Learning with hyperspherical uniformity

W Liu, R Lin, Z Liu, L Xiong… - International …, 2021 - proceedings.mlr.press
Due to their over-parameterized nature, neural networks are a powerful tool for nonlinear
function approximation. In order to achieve good generalization on unseen data, a suitable …
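
The truncated sentence gestures at a regularizer that spreads neuron weight vectors uniformly over the unit hypersphere. Here is a hedged sketch of one such penalty, a pairwise inverse-distance (hyperspherical-energy) term on row-normalized weights; the energy function, the epsilon constant, and the weighting lam are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def hyperspherical_energy(weights, eps=1e-6):
    # weights: (n_neurons, d). Project each neuron's weight vector onto the unit
    # hypersphere, then sum inverse pairwise distances; minimizing this energy
    # pushes the directions toward a uniform spread over the sphere.
    w = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + eps)
    dists = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=-1)
    iu = np.triu_indices(len(w), k=1)    # each unordered pair counted once
    return np.sum(1.0 / (dists[iu] + eps))

# Used as an additive penalty: loss = task_loss + lam * hyperspherical_energy(W)
```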

Recurrent neural filters: Learning independent Bayesian filtering steps for time series prediction

B Lim, S Zohren, S Roberts - 2020 International Joint …, 2020 - ieeexplore.ieee.org
Despite the recent popularity of deep generative state space models, few comparisons have
been made between network architectures and the inference steps of the Bayesian filtering …
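
The comparison the entry alludes to is between network components and the predict/update steps of Bayesian filtering. As a reference point, here is the classical linear-Gaussian special case of those steps (a Kalman filter); the recurrent-network parameterization studied in the paper is not reproduced here.

```python
import numpy as np

def kalman_predict(mu, P, A, Q):
    # Propagation step: x_t = A @ x_{t-1} + process noise with covariance Q.
    return A @ mu, A @ P @ A.T + Q

def kalman_update(mu, P, y, H, R):
    # Correction step: fold in observation y = H @ x + noise with covariance R.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    mu_new = mu + K @ (y - H @ mu)
    P_new = (np.eye(len(mu)) - K @ H) @ P
    return mu_new, P_new
```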

Taming graph kernels with random features

KM Choromanski - International Conference on Machine …, 2023 - proceedings.mlr.press
In this paper we introduce the mechanism of graph random features (GRFs). GRFs can be
used to construct unbiased randomized estimators of several important kernels defined on …
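
To illustrate what a walk-based unbiased graph-kernel estimator looks like, here is a minimal sketch that estimates one row of the resolvent kernel (I - alpha*A)^(-1) with terminating random walks. The choice of kernel, halting probability, and reweighting scheme are assumptions for illustration; the paper's GRFs build explicit feature vectors rather than estimating kernel entries directly.

```python
import numpy as np

def resolvent_kernel_row(adj_list, i, alpha=0.1, p_halt=0.5, n_walks=2000, seed=0):
    # Monte Carlo estimate of row i of (I - alpha * A)^(-1), assuming the Neumann
    # series converges (alpha * ||A|| < 1). Unbiased because the per-step weight
    # alpha * deg(v) / (1 - p_halt) cancels the walk and survival probabilities.
    rng = np.random.default_rng(seed)
    est = np.zeros(len(adj_list))
    for _ in range(n_walks):
        v, weight = i, 1.0
        est[v] += weight                           # k = 0 term of the series
        while rng.random() > p_halt and adj_list[v]:
            weight *= alpha * len(adj_list[v]) / (1.0 - p_halt)
            v = adj_list[v][rng.integers(len(adj_list[v]))]
            est[v] += weight
    return est / n_walks
```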

On the expressive power of self-attention matrices

V Likhosherstov, K Choromanski, A Weller - arXiv preprint arXiv …, 2021 - arxiv.org
Transformer networks are able to capture patterns in data coming from many domains (text,
images, videos, proteins, etc.) with little or no change to architecture components. We …

Dense-exponential random features: sharp positive estimators of the Gaussian kernel

V Likhosherstov, KM Choromanski… - Advances in …, 2024 - proceedings.neurips.cc
The problem of efficient approximation of a linear operator induced by the Gaussian or
softmax kernel is often addressed using random features (RFs) which yield an unbiased …
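
For context on the random-feature estimators the entry refers to, below is the classical trigonometric (random Fourier feature) construction that gives an unbiased estimate of the Gaussian kernel; the paper's dense-exponential features are a sharper, positive-valued alternative, and the unit bandwidth here is an assumption.

```python
import numpy as np

def gaussian_kernel_rff(X, Y, m=512, seed=0):
    # Unbiased random-feature estimate of K[i, j] = exp(-||x_i - y_j||^2 / 2)
    # via random Fourier features: E[phi(x) . phi(y)] equals the exact kernel.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((m, X.shape[1]))       # frequencies ~ N(0, I_d)
    b = rng.uniform(0.0, 2 * np.pi, size=m)        # random phases
    phi = lambda Z: np.sqrt(2.0 / m) * np.cos(Z @ w.T + b)
    return phi(X) @ phi(Y).T

# Exact kernel for comparison:
# K = np.exp(-0.5 * np.sum((X[:, None] - Y[None]) ** 2, axis=-1))
```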

General value function networks

M Schlegel, A Jacobsen, Z Abbas, A Patterson… - Journal of Artificial …, 2021 - jair.org
State construction is important for learning in partially observable environments. A general
purpose strategy for state construction is to learn the state update using a Recurrent Neural …
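
The state-construction strategy the entry describes builds on general value functions, i.e. predictions of the discounted sum of an arbitrary cumulant signal. A minimal sketch of one such prediction learned by linear TD(0) follows; the cumulant, continuation factor, and step size are placeholders, and the paper's recurrent-network construction is not reproduced.

```python
import numpy as np

def gvf_td0_update(w, x, x_next, cumulant, gamma, alpha=0.1):
    # One TD(0) step for a general value function: predict the discounted sum of
    # an arbitrary cumulant (not necessarily reward) with continuation gamma.
    delta = cumulant + gamma * (w @ x_next) - (w @ x)
    return w + alpha * delta * x
```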

On the expressive flexibility of self-attention matrices

V Likhosherstov, K Choromanski… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Transformer networks are able to capture patterns in data coming from many domains (text,
images, videos, proteins, etc.) with little or no change to architecture components. We …

Geometrically coupled Monte Carlo sampling

M Rowland, KM Choromanski… - Advances in …, 2018 - proceedings.neurips.cc
Monte Carlo sampling in high-dimensional, low-sample settings is important in many
machine learning tasks. We improve current methods for sampling in Euclidean spaces by …
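
One concrete example of geometrically coupling samples in Euclidean space is to draw Gaussian vectors that are marginally N(0, I) yet mutually orthogonal, which typically lowers the variance of downstream random-feature estimators. The sketch below uses the standard Haar-rotation construction; it is offered as an illustration of coupled sampling, not as the paper's specific coupling.

```python
import numpy as np

def orthogonal_gaussian_samples(d, seed=0):
    # Draw d samples that are each marginally N(0, I_d) but mutually orthogonal:
    # couple the directions through a Haar-random rotation and attach independent
    # chi(d)-distributed radii.
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    q *= np.sign(np.diag(r))                 # sign fix so q is Haar-distributed
    radii = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
    return radii[:, None] * q                # rows: orthogonal Gaussian samples
```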