InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

InternVid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of the size of its components and the breadth of its training task …

Panda-70M: Captioning 70M videos with multiple cross-modality teachers

TS Chen, A Siarohin, W Menapace… - Proceedings of the …, 2024 - openaccess.thecvf.com
The quality of the data and annotation upper-bounds the quality of a downstream model.
While there exist large text corpora and image-text pairs, high-quality video-text data is much …

LongVLM: Efficient long video understanding via large language models

Y Weng, M Han, H He, X Chang, B Zhuang - European Conference on …, 2024 - Springer
Empowered by Large Language Models (LLMs), recent advancements in Video-based
LLMs (VideoLLMs) have driven progress in various video understanding tasks. These …

Large models for time series and spatio-temporal data: A survey and outlook

M Jin, Q Wen, Y Liang, C Zhang, S Xue, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end, we explore video-first architectures, building on the common …

Unified coarse-to-fine alignment for video-text retrieval

Z Wang, YL Sung, F Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained
alignment between visual and textual information. However, retrieving the correct video …