Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges

BN Patro, VS Agneeswaran - arXiv preprint arXiv:2404.16112, 2024 - arxiv.org
Sequence modeling is a crucial area across various domains, including Natural Language
Processing (NLP), speech recognition, time series forecasting, music generation, and …
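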
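As context for the model families this survey covers, the common core of S4/Mamba-style layers is a discrete linear state space recurrence h_t = A h_{t-1} + B u_t, y_t = C h_t. The NumPy sketch below is an illustrative toy of that recurrence only; the structured parameterizations, discretization, and parallel scans of the surveyed models are not shown, and all names and dimensions here are made up.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Toy discrete state space model: h_t = A h_{t-1} + B u_t, y_t = C h_t.

    A: (d, d) state transition, B: (d, 1) input map, C: (1, d) readout,
    u: (T,) scalar input sequence. Returns y: (T,) scalar outputs.
    Illustrative only -- real SSM layers (S4, Mamba, ...) use structured A,
    a discretization step, and parallel scans instead of this Python loop.
    """
    d = A.shape[0]
    h = np.zeros((d, 1))
    ys = []
    for u_t in u:
        h = A @ h + B * u_t              # state update
        ys.append((C @ h).item())        # scalar readout
    return np.array(ys)

rng = np.random.default_rng(0)
d, T = 4, 16
A = 0.9 * np.eye(d)                      # stable toy dynamics
B = rng.normal(size=(d, 1))
C = rng.normal(size=(1, d))
u = rng.normal(size=T)
print(ssm_scan(A, B, C, u).shape)        # (16,)
```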

Griffin: Mixing gated linear recurrences with local attention for efficient language models

S De, SL Smith, A Fernando, A Botev… - arXiv preprint arXiv …, 2024 - arxiv.org
Recurrent neural networks (RNNs) have fast inference and scale efficiently on long
sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with …
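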
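The two ingredients named in the title are gated linear recurrences and local (sliding-window) attention. The sketch below shows a generic element-wise gated linear recurrence and a causal sliding-window mask; it is not the paper's RG-LRU block or Griffin's actual layer mix, and the gating form and shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_recurrence(x, W_a):
    """Element-wise gated linear recurrence (toy version):
        a_t = sigmoid(x_t W_a)
        h_t = a_t * h_{t-1} + (1 - a_t) * x_t
    x: (T, d) inputs, W_a: (d, d). Returns (T, d) hidden states.
    Linear in h, so recurrent inference is a cheap per-step update.
    """
    T, d = x.shape
    h = np.zeros(d)
    out = np.empty_like(x)
    for t in range(T):
        a = sigmoid(x[t] @ W_a)          # input-dependent decay gate
        h = a * h + (1.0 - a) * x[t]
        out[t] = h
    return out

def local_attention_mask(T, window):
    """Causal sliding-window mask: token t may attend to [t-window+1, t]."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 8))
h = gated_linear_recurrence(x, 0.1 * rng.normal(size=(8, 8)))
print(h.shape, local_attention_mask(12, window=4).sum())
```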

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …
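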
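The duality in the title can be illustrated in a scalar toy case: the same sequence map can be computed either as a linear recurrence or as multiplication by a lower-triangular (semiseparable) matrix, which is the attention-like view. The check below is a hand-rolled illustration of that general idea, not the paper's SSD algorithm.

```python
import numpy as np

def ssm_recurrent(a, x):
    """Scalar SSM as a recurrence: h_t = a_t * h_{t-1} + x_t, y_t = h_t."""
    h, ys = 0.0, []
    for a_t, x_t in zip(a, x):
        h = a_t * h + x_t
        ys.append(h)
    return np.array(ys)

def ssm_as_matrix(a, x):
    """Same map written as y = L x, where L is the lower-triangular
    'attention-like' matrix L[t, s] = a_{s+1} * ... * a_t for s <= t
    (empty product = 1 on the diagonal): a 1-semiseparable matrix."""
    T = len(a)
    L = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            L[t, s] = np.prod(a[s + 1:t + 1])
    return L @ x

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=10)   # per-step decays
x = rng.normal(size=10)
print(np.allclose(ssm_recurrent(a, x), ssm_as_matrix(a, x)))  # True
```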

Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues

A Orvieto, S De, C Gulcehre, R Pascanu… - Forty-first International …, 2024 - openreview.net
Deep neural networks based on linear RNNs interleaved with position-wise MLPs are
gaining traction as competitive approaches for sequence modeling. Examples of such …
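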
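The architecture class studied here is linear RNNs (with possibly complex, diagonal state matrices) interleaved with position-wise MLPs. Below is a minimal NumPy sketch of one such block, with made-up shapes and initialization; it is only meant to show where the complex eigenvalues and the nonlinear projection sit, not to reproduce any model from the paper.

```python
import numpy as np

def complex_diag_recurrence(lmbda, B, x):
    """Diagonal linear recurrence with complex eigenvalues:
        h_t = lmbda * h_{t-1} + B x_t   (element-wise in the complex state)
    lmbda: (n,) complex eigenvalues with |lmbda| < 1, B: (n, d) complex,
    x: (T, d) real. Returns real features (T, 2n) = [Re(h), Im(h)]."""
    T, n = x.shape[0], lmbda.shape[0]
    h = np.zeros(n, dtype=complex)
    feats = np.empty((T, 2 * n))
    for t in range(T):
        h = lmbda * h + B @ x[t]
        feats[t] = np.concatenate([h.real, h.imag])
    return feats

def mlp(z, W1, W2):
    """Position-wise nonlinear projection applied independently at each step."""
    return np.maximum(z @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
n, d, T = 6, 4, 20
radius = rng.uniform(0.8, 0.99, n)
theta = rng.uniform(0, np.pi, n)
lmbda = radius * np.exp(1j * theta)          # eigenvalues inside the unit disk
B = rng.normal(size=(n, d)) + 1j * rng.normal(size=(n, d))
feats = complex_diag_recurrence(lmbda, B, rng.normal(size=(T, d)))
y = mlp(feats, rng.normal(size=(2 * n, 16)), rng.normal(size=(16, d)))
print(y.shape)  # (20, 4)
```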

Zamba: A Compact 7B SSM Hybrid Model

P Glorioso, Q Anthony, Y Tokpanov… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which
achieves competitive performance against leading open-weight models at a comparable …
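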
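The snippet does not describe Zamba's specific layout, so the sketch below only shows the generic idea of an SSM-transformer hybrid: mostly SSM blocks with attention blocks inserted periodically. The schedule and all names are hypothetical and for illustration only.

```python
def hybrid_layer_plan(n_layers, attn_every):
    """Generic SSM/attention hybrid schedule (illustrative, not Zamba's layout):
    mostly SSM blocks, with a full-attention block every `attn_every` layers."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "ssm"
        for i in range(n_layers)
    ]

print(hybrid_layer_plan(n_layers=12, attn_every=6))
# ['ssm', 'ssm', 'ssm', 'ssm', 'ssm', 'attention', 'ssm', ...]
```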

MemLLM: Finetuning LLMs to use an explicit read-write memory

A Modarressi, A Köksal, A Imani, M Fayyaz… - arXiv preprint arXiv …, 2024 - arxiv.org
While current large language models (LLMs) demonstrate some capabilities in knowledge-
intensive tasks, they are limited by relying on their parameters as an implicit storage …
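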
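To make "explicit read-write memory" concrete, here is a toy key-value store with the kind of read/write calls such an interface might expose. The actual MemLLM API is not shown in the snippet, so every name and method here is hypothetical.

```python
class ExplicitMemory:
    """Toy read-write memory an LLM could be finetuned to call via tool-style
    commands. Hypothetical interface for illustration; not MemLLM's actual API."""

    def __init__(self):
        self._store = {}                       # (subject, relation) -> set of objects

    def write(self, subject, relation, obj):
        self._store.setdefault((subject, relation), set()).add(obj)

    def read(self, subject, relation):
        return sorted(self._store.get((subject, relation), set()))

mem = ExplicitMemory()
mem.write("Marie Curie", "field", "physics")
mem.write("Marie Curie", "field", "chemistry")
print(mem.read("Marie Curie", "field"))        # ['chemistry', 'physics']
```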

RNNs are not Transformers (yet): The key bottleneck on in-context retrieval

K Wen, X Dang, K Lyu - arXiv preprint arXiv:2402.18510, 2024 - arxiv.org
This paper investigates the gap in representation powers of Recurrent Neural Networks
(RNNs) and Transformers in the context of solving algorithmic problems. We focus on …
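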
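The specific algorithmic tasks studied in the paper are not visible in the snippet; the generator below is only a standard associative-recall-style probe of the in-context retrieval ability named in the title, where a fixed-size recurrent state becomes the bottleneck once the context holds more pairs than the state can store.

```python
import random

def associative_recall_example(n_pairs, seed=0):
    """Toy in-context retrieval probe: list key-value pairs, then query one key.
    Answering requires retrieving from the full context, which is easy for
    attention but hard for a fixed-size recurrent state at large n_pairs.
    Illustrative task format only."""
    rng = random.Random(seed)
    keys = rng.sample(range(100, 1000), n_pairs)
    pairs = {k: rng.randint(0, 9) for k in keys}
    query = rng.choice(keys)
    prompt = " ".join(f"{k}->{v}" for k, v in pairs.items()) + f" ? {query}"
    return prompt, pairs[query]

prompt, answer = associative_recall_example(n_pairs=8)
print(prompt)
print("expected answer:", answer)
```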

Separations in the Representational Capabilities of Transformers and Recurrent Architectures

S Bhattamishra, M Hahn, P Blunsom… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer architectures have been widely adopted in foundation models. Due to their high
inference costs, there is renewed interest in exploring the potential of efficient recurrent …

Linear Transformers with Learnable Kernel Functions are Better In-Context Models

Y Aksenov, N Balagansky, SMLC Vaina… - arXiv preprint arXiv …, 2024 - arxiv.org
Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in
the rapidly evolving field of natural language processing. Current innovations, including …
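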
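Linear attention replaces the softmax with kernel feature maps phi(q), phi(k) so that causal attention reduces to running sums, giving subquadratic cost. The sketch below uses a generic learnable feature map (an affine map followed by elu + 1) as a stand-in; it is not the kernel parameterization proposed in this paper.

```python
import numpy as np

def feature_map(x, W):
    """Learnable kernel feature map phi(x) = elu(x W) + 1 (a common choice;
    the paper parameterizes its own kernel -- this is a generic stand-in)."""
    z = x @ W
    return np.where(z > 0, z + 1.0, np.exp(z))   # elu(z) + 1, strictly positive

def causal_linear_attention(Q, K, V, W):
    """Causal linear attention: running sums over phi(k_s) v_s^T and phi(k_s)
    replace the softmax, giving O(T) time and constant-size state per step."""
    phi_q, phi_k = feature_map(Q, W), feature_map(K, W)
    d_k, d_v = phi_q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))          # running sum of phi(k_s) v_s^T
    z = np.zeros(d_k)                 # running sum of phi(k_s), for normalization
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
T, d = 16, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
print(causal_linear_attention(Q, K, V, rng.normal(size=(d, d))).shape)  # (16, 8)
```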

An Empirical Study of Mamba-based Language Models

R Waleffe, W Byeon, D Riach, B Norick… - arXiv preprint arXiv …, 2024 - arxiv.org
Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of
Transformers, such as quadratic computational complexity with sequence length and large …
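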
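The "selective" in selective SSMs refers to input-dependent recurrence parameters: the discretization step and the input/output projections are computed from the current token, so the state can retain or forget context selectively. The sketch below is a simplified single-output scan in that spirit; the shapes and parameterization are assumptions, not the Mamba implementation used in this study.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Simplified selective SSM scan (Mamba-style in spirit, not the real kernel):
        delta_t = softplus(x_t W_delta)                 # per-state step sizes
        h_t     = exp(delta_t * A) * h_{t-1} + delta_t * (W_B x_t)
        y_t     = (W_C x_t) . h_t
    x: (T, d); A: (n,) negative reals for stability. Returns y: (T,)."""
    T, n = x.shape[0], A.shape[0]
    h = np.zeros(n)
    y = np.empty(T)
    for t in range(T):
        delta = softplus(x[t] @ W_delta)               # input-dependent discretization
        h = np.exp(delta * A) * h + delta * (W_B @ x[t])
        y[t] = (W_C @ x[t]) @ h                        # input-dependent readout
    return y

rng = np.random.default_rng(0)
T, d, n = 32, 8, 4
x = rng.normal(size=(T, d))
A = -np.abs(rng.normal(size=n))                        # negative decays => stable state
y = selective_ssm(x, A, rng.normal(size=(d, n)),
                  rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(y.shape)  # (32,)
```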