OmniMAE: Single model masked pretraining on images and videos

R Girdhar, A El-Nouby, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Transformer-based architectures have become competitive across a variety of visual
domains, most notably images and videos. While prior work studies these modalities in …
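
As a rough illustration of what "masked pretraining on images and videos" means mechanically, the NumPy sketch below patchifies an input clip (an image being a single-frame clip), hides a random subset of patch tokens, and scores reconstruction only on the hidden ones. The function names, the 90% mask ratio, and the toy predictor are illustrative assumptions, not OmniMAE's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(clip, patch=16):
    """Flatten a clip (T, H, W, C) into patch tokens; an image is just T=1."""
    T, H, W, C = clip.shape
    tokens = clip.reshape(T, H // patch, patch, W // patch, patch, C)
    tokens = tokens.transpose(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch * C)
    return tokens  # (num_patches, patch_dim)

def masked_reconstruction_loss(clip, model, mask_ratio=0.9):
    """Hide a random subset of patches and penalise reconstruction error
    only on the hidden ones, in the masked-autoencoder style."""
    tokens = patchify(clip)
    n = len(tokens)
    hidden = rng.permutation(n)[: int(n * mask_ratio)]
    visible = np.setdiff1d(np.arange(n), hidden)
    pred = model(tokens[visible], hidden, n)   # predictor sees only visible patches
    return np.mean((pred - tokens[hidden]) ** 2)

# Toy "model": predict every hidden patch as the mean of the visible ones.
toy_model = lambda vis, hidden, n: np.tile(vis.mean(axis=0), (len(hidden), 1))
image = np.random.rand(1, 224, 224, 3)   # an image is a single-frame clip
video = np.random.rand(16, 224, 224, 3)
print(masked_reconstruction_loss(image, toy_model),
      masked_reconstruction_loss(video, toy_model))
```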

VideoLLM: Modeling video sequence with large language models

G Chen, YD Zheng, J Wang, J Xu, Y Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential growth of video data, there is an urgent need for automated technology
to analyze and comprehend video content. However, existing video understanding models …

MARLIN: Masked autoencoder for facial video representation learning

Z Cai, S Ghosh, K Stefanov, A Dhall… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper proposes a self-supervised approach to learn universal facial representations
from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute …

Rethinking video ViTs: Sparse video tubes for joint image and video learning

AJ Piergiovanni, W Kuo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We present a simple approach that turns a ViT encoder into an efficient video model
which can seamlessly work with both image and video inputs. By sparsely sampling the …
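
To make the "sparsely sampling" idea concrete, the NumPy sketch below cuts a video into space-time tubes whose strides are larger than the tube size, so the resulting token count stays small compared with dense 3D patching. The tube shape and strides here are illustrative, not the paper's actual configuration.

```python
import numpy as np

def sparse_tube_tokens(video, tube_shape=(4, 16, 16), stride=(8, 32, 32)):
    """Cut a video (T, H, W, C) into sparsely strided space-time tubes
    and flatten each tube into one token vector, ready for a linear embed."""
    T, H, W, C = video.shape
    tt, th, tw = tube_shape
    st, sh, sw = stride
    tokens = []
    for t0 in range(0, T - tt + 1, st):
        for y0 in range(0, H - th + 1, sh):
            for x0 in range(0, W - tw + 1, sw):
                tube = video[t0:t0 + tt, y0:y0 + th, x0:x0 + tw]
                tokens.append(tube.reshape(-1))
    return np.stack(tokens)  # (num_tokens, tt * th * tw * C)

video = np.random.rand(16, 224, 224, 3)
print(sparse_tube_tokens(video).shape)  # far fewer tokens than dense 3D patches
```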

Hydra attention: Efficient attention with many heads

D Bolya, CY Fu, X Dai, P Zhang, J Hoffman - European Conference on …, 2022 - Springer
While transformers have begun to dominate many tasks in vision, applying them to large
images is still computationally difficult. A large reason for this is that self-attention scales …
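
The snippet cuts off at the key point: standard self-attention builds an N x N token-to-token score matrix, so its cost grows quadratically with the number of tokens, whereas Hydra attention (with as many heads as feature channels) collapses to a linear-time elementwise gating of a single global summary. The NumPy sketch below contrasts the two; function names and shapes are illustrative, not the authors' code.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard self-attention: the N x N score matrix makes cost O(N^2 * d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def hydra_attention(q, k, v, eps=1e-6):
    # With as many heads as features, attention reduces to an elementwise
    # gating of one global key-value summary, so the cost is O(N * d).
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)  # cosine-normalise
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    global_kv = (k * v).sum(axis=0)   # d-dim summary over all tokens
    return q * global_kv              # per-token elementwise gate

rng = np.random.default_rng(0)
n_tokens, dim = 1024, 64
q, k, v = rng.normal(size=(3, n_tokens, dim))
print(softmax_attention(q, k, v).shape, hydra_attention(q, k, v).shape)
```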

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - arXiv preprint arXiv …, 2023 - arxiv.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Diffusion models as masked autoencoders

C Wei, K Mangalam, PY Huang, Y Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
There has been a longstanding belief that generation can facilitate a true understanding of
visual data. In line with this, we revisit generatively pre-training visual representations in light …

VideoMamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …
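
For intuition about why a state-space model suits long video token sequences, the toy sketch below runs a fixed diagonal state-space recurrence over flattened patch tokens: each step updates only a small hidden state, so cost grows linearly with sequence length. This is a simplification for illustration; it omits Mamba's input-dependent (selective) parameters and VideoMamba's bidirectional scan, and all names and shapes are assumptions.

```python
import numpy as np

def ssm_scan(x, a, B, C):
    """Toy diagonal state-space recurrence over a token sequence:
    h_t = a * h_{t-1} + B @ x_t,   y_t = C @ h_t
    Each step touches only the previous state, so the cost is linear in
    sequence length rather than quadratic as in self-attention."""
    seq_len, _ = x.shape
    h = np.zeros(B.shape[0])
    ys = np.empty((seq_len, C.shape[0]))
    for t in range(seq_len):
        h = a * h + B @ x[t]
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3136, 192))   # e.g. flattened video patch tokens
a = np.full(16, 0.9)                    # per-channel decay of the hidden state
B = rng.normal(size=(16, 192)) * 0.05
C = rng.normal(size=(192, 16)) * 0.05
print(ssm_scan(tokens, a, B, C).shape)  # (3136, 192)
```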

CORE: Cooperative reconstruction for multi-agent perception

B Wang, L Zhang, Z Wang, Y Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper presents CORE, a conceptually simple, effective and communication-efficient
model for multi-agent cooperative perception. It addresses the task from a novel perspective …

Knowledge graph self-supervised rationalization for recommendation

Y Yang, C Huang, L Xia, C Huang - … of the 29th ACM SIGKDD conference …, 2023 - dl.acm.org
In this paper, we introduce a new self-supervised rationalization method, called KGRec, for
knowledge-aware recommender systems. To effectively identify informative knowledge …