Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges

BN Patro, VS Agneeswaran - arXiv preprint arXiv:2404.16112, 2024 - arxiv.org
Sequence modeling is a crucial area across various domains, including Natural Language
Processing (NLP), speech recognition, time series forecasting, music generation, and …

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …
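The "duality" named in the title can be illustrated with a toy scalar state-space model: the same input-output map can be computed either as a linear-time recurrence or as multiplication by a lower-triangular (semiseparable) matrix. A minimal sketch only, not the paper's algorithm; all names and the scalar setting are illustrative assumptions.

```python
import numpy as np

# Toy 1-D state-space model with per-step decay a_t:
#   h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c_t * h_t
def ssm_recurrent(a, b, c, x):
    # Linear-time form: one pass over the sequence with a scalar state.
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return np.array(ys)

def ssm_matrix(a, b, c, x):
    # Quadratic "dual" form: the same map as a lower-triangular matrix
    # with entries M[t, s] = c_t * (a_{s+1} ... a_t) * b_s.
    T = len(x)
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            decay = np.prod(a[s + 1:t + 1])  # empty product = 1
            M[t, s] = c[t] * decay * b[s]
    return M @ x

rng = np.random.default_rng(0)
T = 8
a = rng.uniform(0.5, 1.0, T)
b, c, x = rng.normal(size=T), rng.normal(size=T), rng.normal(size=T)
assert np.allclose(ssm_recurrent(a, b, c, x), ssm_matrix(a, b, c, x))
```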

xLSTM: Extended Long Short-Term Memory

M Beck, K Pöppel, M Spanring, A Auer… - arXiv preprint arXiv …, 2024 - arxiv.org
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …
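For the terms in this snippet, a textbook LSTM cell is sketched below: the additive cell-state update is the "constant error carousel", and the sigmoid gates control what is written, kept, and exposed. This is the classic formulation only; xLSTM's extensions are not reproduced here, and all parameter shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold stacked parameters for the input (i), forget (f),
    # output (o) gates and the cell candidate (g).
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    # Constant error carousel: the cell state is updated additively,
    # so error signals can flow through c over long spans.
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

d_x, d_h = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * d_h, d_x)) * 0.1
U = rng.normal(size=(4 * d_h, d_h)) * 0.1
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):  # run a few steps
    h, c = lstm_step(x_t, h, c, W, U, b)
```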

MambaMixer: Efficient selective state space models with dual token and channel selection

A Behrouz, M Santacatterina, R Zabih - arXiv preprint arXiv:2403.19888, 2024 - arxiv.org
Recent advances in deep learning have mainly relied on Transformers due to their data
dependency and ability to learn at scale. The attention module in these architectures …

HGRN2: Gated linear RNNs with state expansion

Z Qin, S Yang, W Sun, X Shen, D Li, W Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Hierarchically gated linear RNN (HGRN, Qin et al. 2023) has demonstrated competitive
training speed and performance in language modeling, while offering efficient inference …

Learning to (learn at test time): RNNs with expressive hidden states

Y Sun, X Li, K Dalal, J Xu, A Vikram, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-attention performs well in long context but has quadratic complexity. Existing RNN
layers have linear complexity, but their performance in long context is limited by the …
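The complexity contrast in this snippet can be made concrete with a toy comparison (illustrative only, not the paper's test-time-training layer): full self-attention materialises a T x T score matrix, while a recurrent layer makes one pass over the sequence with a fixed-size state.

```python
import numpy as np

def attention_scores(q, k):
    # Full self-attention builds a T x T score matrix, so memory and
    # compute grow quadratically with sequence length T.
    return (q @ k.T) / np.sqrt(q.shape[-1])

def rnn_scan(x, w):
    # A simple recurrent layer touches each token once (linear in T),
    # but compresses all history into one fixed-size hidden state.
    h = np.zeros(w.shape[0])
    outs = []
    for x_t in x:
        h = np.tanh(w @ h + x_t)
        outs.append(h.copy())
    return np.stack(outs)

T, d = 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
w = rng.normal(size=(d, d)) * 0.1
print(attention_scores(x, x).shape)  # (16, 16): T x T scores
print(rnn_scan(x, w).shape)          # (16, 8): one state per step
```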

Zamba: A Compact 7B SSM Hybrid Model

P Glorioso, Q Anthony, Y Tokpanov… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which
achieves competitive performance against leading open-weight models at a comparable …

Retrieval head mechanistically explains long-context factuality

W Wu, Y Wang, G Xiao, H Peng, Y Fu - arXiv preprint arXiv:2404.15574, 2024 - arxiv.org
Despite the recent progress in long-context language models, it remains elusive how
transformer-based models exhibit the capability to retrieve relevant information from arbitrary …

From Large Language Models to Large Multimodal Models: A Literature Review

D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com
With the deepening of research on Large Language Models (LLMs), significant progress has
been made in recent years on the development of Large Multimodal Models (LMMs), which …

Scalable MatMul-free Language Modeling

RJ Zhu, Y Zhang, E Sifferman, T Sheaves… - arXiv preprint arXiv …, 2024 - arxiv.org
Matrix multiplication (MatMul) typically dominates the overall computational cost of large
language models (LLMs). This cost only grows as LLMs scale to larger embedding …
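A back-of-the-envelope count shows why MatMul dominates and why the cost grows with embedding width. The constants below assume a standard decoder block with four d x d attention projections and a 4x-wide MLP, ignore the attention score computation, and are illustrative only.

```python
def matmul_flops_per_token(d_model: int) -> int:
    # Rough MatMul FLOPs per token for one decoder block.
    attn_proj = 4 * 2 * d_model * d_model    # Q, K, V, output projections
    mlp = 2 * 2 * d_model * (4 * d_model)    # up- and down-projection
    return attn_proj + mlp                   # ~24 * d_model**2

for d in (1024, 2048, 4096, 8192):
    print(d, f"{matmul_flops_per_token(d):.3e}")
# Doubling the embedding width roughly quadruples the MatMul cost,
# which is why removing MatMul becomes more attractive at scale.
```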