Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

VideoChat: Chat-centric video understanding

KC Li, Y He, Y Wang, Y Li, W Wang, P Luo… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we initiate an attempt to develop an end-to-end chat-centric video
understanding system, coined VideoChat. It integrates video foundation models and …

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation, and …

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build a video
understanding system has been shown to overcome the limitations of specific pre-defined vision tasks. Yet …

InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …

mPLUG-2: A modularized multi-modal foundation model across text, image and video

H Xu, Q Ye, M Yan, Y Shi, J Ye, Y Xu… - International …, 2023 - proceedings.mlr.press
Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

InternVid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation

H Luo, J Bao, Y Wu, X He, T Li - International Conference on …, 2023 - proceedings.mlr.press
Recently, contrastive language-image pre-training, e.g., CLIP, has demonstrated
promising results on various downstream tasks. The pre-trained model can capture enriched …