Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

X Li, X Yin, C Li, P Zhang, X Hu, L Zhang… - Computer Vision–ECCV …, 2020 - Springer
Large-scale pre-training methods for learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

InternVideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …

ACAV100M: Automatic curation of large-scale datasets for audio-visual video representation learning

S Lee, J Chung, Y Yu, G Kim, T Breuel… - Proceedings of the …, 2021 - openaccess.thecvf.com
The natural association between visual observations and their corresponding sound
provides powerful self-supervisory signals for learning video representations, which makes …

Masking modalities for cross-modal video retrieval

V Gabeur, A Nagrani, C Sun… - Proceedings of the …, 2022 - openaccess.thecvf.com
Pre-training on large-scale unlabelled datasets has shown impressive performance
improvements in the fields of computer vision and natural language processing. Given the …

MAP: Multimodal uncertainty-aware vision-language pre-training model

Y Ji, J Wang, Y Gong, L Zhang, Y Zhu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal semantic understanding often has to deal with uncertainty, meaning that the
obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our …

Understanding Chinese video and language via contrastive multimodal pre-training

C Lei, S Luo, Y Liu, W He, J Wang, G Wang… - Proceedings of the 29th …, 2021 - dl.acm.org
Pre-trained neural models have recently achieved impressive performance in
understanding multimodal content. However, it is still very challenging to pre-train neural …

Audiovisual masked autoencoders

MI Georgescu, E Fonseca, RT Ionescu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Can we leverage the audiovisual information already present in video to improve self-
supervised representation learning? To answer this question, we study various pretraining …

OmniVL: One foundation model for image-language and video-language tasks

J Wang, D Chen, Z Wu, C Luo, L Zhou… - Advances in neural …, 2022 - proceedings.neurips.cc
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce LaViLa, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …