OmniMAE: Single model masked pretraining on images and videos

R Girdhar, A El-Nouby, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Transformer-based architectures have become competitive across a variety of visual
domains, most notably images and videos. While prior work studies these modalities in …
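
As a rough illustration of what "masked pretraining on images and videos" means mechanically, the NumPy sketch below patchifies an input clip (an image being a single-frame clip), hides a random subset of patch tokens, and scores reconstruction only on the hidden ones. The function names, the 90% mask ratio, and the toy predictor are illustrative assumptions, not OmniMAE's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(clip, patch=16):
    """Flatten a clip (T, H, W, C) into patch tokens; an image is just T=1."""
    T, H, W, C = clip.shape
    tokens = clip.reshape(T, H // patch, patch, W // patch, patch, C)
    tokens = tokens.transpose(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch * C)
    return tokens  # (num_patches, patch_dim)

def masked_reconstruction_loss(clip, model, mask_ratio=0.9):
    """Hide a random subset of patches and penalise reconstruction error
    only on the hidden ones, in the masked-autoencoder style."""
    tokens = patchify(clip)
    n = len(tokens)
    hidden = rng.permutation(n)[: int(n * mask_ratio)]
    visible = np.setdiff1d(np.arange(n), hidden)
    pred = model(tokens[visible], hidden, n)   # predictor sees only visible patches
    return np.mean((pred - tokens[hidden]) ** 2)

# Toy "model": predict every hidden patch as the mean of the visible ones.
toy_model = lambda vis, hidden, n: np.tile(vis.mean(axis=0), (len(hidden), 1))
image = np.random.rand(1, 224, 224, 3)   # an image is a single-frame clip
video = np.random.rand(16, 224, 224, 3)
print(masked_reconstruction_loss(image, toy_model),
      masked_reconstruction_loss(video, toy_model))
```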

VideoLLM: Modeling video sequence with large language models

G Chen, YD Zheng, J Wang, J Xu, Y Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential growth of video data, there is an urgent need for automated technology
to analyze and comprehend video content. However, existing video understanding models …

MARLIN: Masked autoencoder for facial video representation learning

Z Cai, S Ghosh, K Stefanov, A Dhall… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper proposes a self-supervised approach to learn universal facial representations
from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute …

Rethinking video ViTs: Sparse video tubes for joint image and video learning

AJ Piergiovanni, W Kuo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We present a simple approach that turns a ViT encoder into an efficient video model
which can seamlessly work with both image and video inputs. By sparsely sampling the …
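
To make the "sparsely sampling" idea concrete, the NumPy sketch below cuts a video into space-time tubes whose strides are larger than the tube size, so the resulting token count stays small compared with dense 3D patching. The tube shape and strides here are illustrative, not the paper's actual configuration.

```python
import numpy as np

def sparse_tube_tokens(video, tube_shape=(4, 16, 16), stride=(8, 32, 32)):
    """Cut a video (T, H, W, C) into sparsely strided space-time tubes
    and flatten each tube into one token vector, ready for a linear embed."""
    T, H, W, C = video.shape
    tt, th, tw = tube_shape
    st, sh, sw = stride
    tokens = []
    for t0 in range(0, T - tt + 1, st):
        for y0 in range(0, H - th + 1, sh):
            for x0 in range(0, W - tw + 1, sw):
                tube = video[t0:t0 + tt, y0:y0 + th, x0:x0 + tw]
                tokens.append(tube.reshape(-1))
    return np.stack(tokens)  # (num_tokens, tt * th * tw * C)

video = np.random.rand(16, 224, 224, 3)
print(sparse_tube_tokens(video).shape)  # far fewer tokens than dense 3D patches
```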

Hydra attention: Efficient attention with many heads

D Bolya, CY Fu, X Dai, P Zhang, J Hoffman - European Conference on …, 2022 - Springer
While transformers have begun to dominate many tasks in vision, applying them to large
images is still computationally difficult. A large reason for this is that self-attention scales …
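
The snippet cuts off at the key point: standard self-attention builds an N x N token-to-token score matrix, so its cost grows quadratically with the number of tokens, whereas Hydra attention (with as many heads as feature channels) collapses to a linear-time elementwise gating of a single global summary. The NumPy sketch below contrasts the two; function names and shapes are illustrative, not the authors' code.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard self-attention: the N x N score matrix makes cost O(N^2 * d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def hydra_attention(q, k, v, eps=1e-6):
    # With as many heads as features, attention reduces to an elementwise
    # gating of one global key-value summary, so the cost is O(N * d).
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)  # cosine-normalise
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    global_kv = (k * v).sum(axis=0)   # d-dim summary over all tokens
    return q * global_kv              # per-token elementwise gate

rng = np.random.default_rng(0)
n_tokens, dim = 1024, 64
q, k, v = rng.normal(size=(3, n_tokens, dim))
print(softmax_attention(q, k, v).shape, hydra_attention(q, k, v).shape)
```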

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - arXiv preprint arXiv …, 2023 - arxiv.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Diffusion models as masked autoencoders

C Wei, K Mangalam, PY Huang, Y Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
There has been a longstanding belief that generation can facilitate a true understanding of
visual data. In line with this, we revisit generatively pre-training visual representations in light …

VideoMamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …
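
For intuition about why a state-space model suits long video token sequences, the toy sketch below runs a fixed diagonal state-space recurrence over flattened patch tokens: each step updates only a small hidden state, so cost grows linearly with sequence length. This is a simplification for illustration; it omits Mamba's input-dependent (selective) parameters and VideoMamba's bidirectional scan, and all names and shapes are assumptions.

```python
import numpy as np

def ssm_scan(x, a, B, C):
    """Toy diagonal state-space recurrence over a token sequence:
    h_t = a * h_{t-1} + B @ x_t,   y_t = C @ h_t
    Each step touches only the previous state, so the cost is linear in
    sequence length rather than quadratic as in self-attention."""
    seq_len, _ = x.shape
    h = np.zeros(B.shape[0])
    ys = np.empty((seq_len, C.shape[0]))
    for t in range(seq_len):
        h = a * h + B @ x[t]
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3136, 192))   # e.g. flattened video patch tokens
a = np.full(16, 0.9)                    # per-channel decay of the hidden state
B = rng.normal(size=(16, 192)) * 0.05
C = rng.normal(size=(192, 16)) * 0.05
print(ssm_scan(tokens, a, B, C).shape)  # (3136, 192)
```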

CORE: Cooperative reconstruction for multi-agent perception

B Wang, L Zhang, Z Wang, Y Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper presents CORE, a conceptually simple, effective and communication-efficient
model for multi-agent cooperative perception. It addresses the task from a novel perspective …

Knowledge graph self-supervised rationalization for recommendation

Y Yang, C Huang, L Xia, C Huang - … of the 29th ACM SIGKDD conference …, 2023 - dl.acm.org
In this paper, we introduce a new self-supervised rationalization method, called KGRec, for
knowledge-aware recommender systems. To effectively identify informative knowledge …