Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis

Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

被引用次数：37 相关文章所有 2 个版本

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

被引用次数：117 相关文章所有 3 个版本

[PDF] arxiv.org

Internvid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

被引用次数：212 相关文章所有 4 个版本

[PDF] thecvf.com

Onellm: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However existing works rely heavily on modality …

被引用次数：91 相关文章所有 3 个版本

[PDF] arxiv.org

Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

B Zhu, B Lin, M Ning, Y Yan, J Cui, HF Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

The video-language (VL) pretraining has achieved remarkable improvement in multiple
downstream tasks. However, the current VL pretraining framework is hard to extend to …

被引用次数：153 相关文章所有 3 个版本

[PDF] arxiv.org

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

被引用次数：56 相关文章所有 2 个版本

[PDF] arxiv.org

Large models for time series and spatio-temporal data: A survey and outlook

M Jin, Q Wen, Y Liang, C Zhang, S Xue, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …

被引用次数：104 相关文章所有 3 个版本

[PDF] arxiv.org

A comprehensive survey of foundation models in medicine

W Khan, S Leem, KB See, JK Wong… - IEEE Reviews in …, 2025 - ieeexplore.ieee.org

Foundation models (FMs) are large-scale deeplearning models that are developed using
large datasets and self-supervised learning methods. These models serve as a base for …

被引用次数：3 相关文章

[PDF] thecvf.com

Omnivec: Learning robust representations with cross modal sharing

S Srivastava, G Sharma - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com

Majority of research in learning based methods has been towards designing and training
networks for specific tasks. However, many of the learning based tasks, across modalities …

被引用次数：68 相关文章所有 5 个版本

[PDF] arxiv.org

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

被引用次数：31 相关文章所有 2 个版本

高级搜索

QQ 群