Foundational models defining a new era in vision: A survey and outlook

M Awais, M Naseer, S Khan, RM Anwer… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

A survey on visual Mamba

H Zhang, Y Zhu, D Wang, L Zhang, T Chen, Z Wang… - Applied Sciences, 2024 - mdpi.com
State space models (SSMs) with selection mechanisms and hardware-aware architectures,
namely Mamba, have recently shown significant potential in long-sequence modeling. Since …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Mamba: Linear-time sequence modeling with selective state spaces

A Gu, T Dao - arXiv preprint arXiv:2312.00752, 2023 - arxiv.org
Foundation models, now powering most of the exciting applications in deep learning, are
almost universally based on the Transformer architecture and its core attention module …
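
For reference, the linear-time recurrence behind the title is the discretized state-space update; the sketch below is ours, assuming the standard zero-order-hold discretization, with Mamba's selection mechanism additionally making \Delta, B, and C functions of the input rather than constants:

    h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,
    \qquad \bar{A} = \exp(\Delta A), \quad \bar{B} \approx \Delta B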

Vision Mamba: Efficient visual representation learning with bidirectional state space model

L Zhu, B Liao, Q Zhang, X Wang, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, state space models (SSMs) with efficient hardware-aware designs, i.e., the
Mamba deep learning model, have shown great potential for long sequence modeling …

RMT: Retentive networks meet vision transformers

Q Fan, H Huang, M Chen, H Liu… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Vision Transformer (ViT) has gained increasing attention in the computer vision
community in recent years. However, the core component of ViT, Self-Attention, lacks explicit …

Scaling Transformer to 1M tokens and beyond with RMT

A Bulatov, Y Kuratov, Y Kapushev… - arXiv preprint arXiv …, 2023 - arxiv.org
A major limitation for the broader scope of problems solvable by transformers is the
quadratic scaling of computational complexity with input size. In this study, we investigate …
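
As a rough worked example of that scaling (standard dense-attention cost, not taken from this abstract): full self-attention materializes an L x L score matrix per head, so

    \text{cost} \sim O(L^2 d), \qquad L = 10^6 \;\Rightarrow\; L^2 = 10^{12} \text{ pairwise scores},

which is why contexts of this length call for a recurrent memory scheme rather than dense attention.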

VideoMamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …

Gated linear attention transformers with hardware-efficient training

S Yang, B Wang, Y Shen, R Panda, Y Kim - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers with linear attention allow for efficient parallel training but can simultaneously
be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with …
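
A minimal NumPy sketch of the equivalence the snippet describes, assuming the ungated, unnormalized form of causal linear attention (no feature map; the gating that gives GLA its name is omitted, and all names and shapes are illustrative):

import numpy as np

def linear_attention_parallel(Q, K, V):
    # Parallel form: masked (Q K^T) V with no softmax; builds an O(L^2) score matrix.
    L = Q.shape[0]
    scores = Q @ K.T                      # (L, L) pairwise dot products
    mask = np.tril(np.ones((L, L)))       # causal mask
    return (scores * mask) @ V            # (L, d_v)

def linear_attention_recurrent(Q, K, V):
    # Recurrent form: S_t = S_{t-1} + k_t v_t^T, y_t = q_t^T S_t.
    # The hidden state S is a 2D (d_k x d_v) matrix, so total cost is O(L d_k d_v).
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    Y = np.zeros((Q.shape[0], d_v))
    for t in range(Q.shape[0]):
        S = S + np.outer(K[t], V[t])      # rank-1 state update
        Y[t] = Q[t] @ S                   # read out with the query
    return Y

# The two forms agree up to floating-point error.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))

The recurrent form carries only a d_k x d_v state between steps, which keeps inference memory constant in sequence length, while the masked matrix-multiply form is what permits efficient parallel training.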

Towards graph foundation models: A survey and beyond

J Liu, C Yang, Z Lu, J Chen, Y Li, M Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Emerging as fundamental building blocks for diverse artificial intelligence applications,
foundation models have achieved notable success across natural language processing and …