Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Internvid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

Onellm: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However existing works rely heavily on modality …

Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

B Zhu, B Lin, M Ning, Y Yan, J Cui, HF Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
The video-language (VL) pretraining has achieved remarkable improvement in multiple
downstream tasks. However, the current VL pretraining framework is hard to extend to …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Large models for time series and spatio-temporal data: A survey and outlook

M Jin, Q Wen, Y Liang, C Zhang, S Xue, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …

A comprehensive survey of foundation models in medicine

W Khan, S Leem, KB See, JK Wong… - IEEE Reviews in …, 2025 - ieeexplore.ieee.org
Foundation models (FMs) are large-scale deeplearning models that are developed using
large datasets and self-supervised learning methods. These models serve as a base for …

Omnivec: Learning robust representations with cross modal sharing

S Srivastava, G Sharma - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
Majority of research in learning based methods has been towards designing and training
networks for specific tasks. However, many of the learning based tasks, across modalities …

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …