T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown …
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …
Recent advances in deep learning have relied mainly on Transformers, owing to their data-dependent computation and ability to learn at scale. The attention module in these architectures …
The hierarchically gated linear RNN (HGRN; Qin et al., 2023) has demonstrated competitive training speed and performance in language modeling, while offering efficient inference …
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the …
P Glorioso, Q Anthony, Y Tokpanov… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present Zamba, a novel 7B-parameter SSM-Transformer hybrid model that achieves competitive performance against leading open-weight models at a comparable …
Despite recent progress in long-context language models, it remains elusive how Transformer-based models are able to retrieve relevant information from arbitrary …
D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com
As research on Large Language Models (LLMs) has deepened, significant progress has been made in recent years on the development of Large Multimodal Models (LMMs), which …
Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding …