InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

When does Sora show: The beginning of TAO to imaginative intelligence and scenarios engineering

FY Wang, Q Miao, L Li, Q Ni, X Li, J Li… - IEEE/CAA Journal of …, 2024 - ieeexplore.ieee.org
During our discussion at workshops for writing “What Does ChatGPT Say: The DAO from
Algorithmic Intelligence to Linguistic Intelligence” [1], we had expected the next milestone for …

Prospective role of foundation models in advancing autonomous vehicles

J Wu, B Gao, J Gao, J Yu, H Chu, Q Yu, X Gong… - Research, 2024 - spj.science.org
With the development of artificial intelligence and breakthroughs in deep learning, large-scale
foundation models (FMs), such as generative pre-trained transformer (GPT), Sora, etc …

Modeling caption diversity in contrastive vision-language pretraining

S Lavoie, P Kirichenko, M Ibrahim, M Assran… - arXiv preprint arXiv …, 2024 - arxiv.org
There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP),
on the other hand, works by mapping an image and its caption to a single vector, limiting …
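
The single-vector mapping this snippet critiques is easy to make concrete. Below is a minimal PyTorch-style sketch (not the paper's code; function and variable names are illustrative) of the symmetric contrastive objective used by CLIP-like models, in which each image and each caption is reduced to one embedding and matched pairs are pulled together:

    import torch
    import torch.nn.functional as F

    def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of (image, caption) pairs."""
        img_emb = F.normalize(img_emb, dim=-1)        # one unit vector per image
        txt_emb = F.normalize(txt_emb, dim=-1)        # one unit vector per caption
        logits = img_emb @ txt_emb.t() / temperature  # pairwise cosine similarities
        targets = torch.arange(len(img_emb))          # i-th image matches i-th caption
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Toy usage: a batch of 4 pairs with 512-dimensional embeddings.
    loss = clip_style_loss(torch.randn(4, 512), torch.randn(4, 512))

Because everything about an image is squeezed into that single target vector, the many valid captions for one image compete for the same embedding, which is the limitation the paper addresses.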

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

World models for autonomous driving: An initial survey

Y Guan, H Liao, Z Li, J Hu, R Yuan, Y Li… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
In the rapidly evolving landscape of autonomous driving, the capability to accurately predict
future events and assess their implications is paramount for both safety and efficiency …

ViC-MAE: Self-supervised representation learning from images and video with contrastive masked autoencoders

J Hernandez, R Villegas, V Ordonez - European Conference on Computer …, 2024 - Springer
We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and
contrastive learning. ViC-MAE is trained using a global representation obtained by pooling …
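
The snippet names the two ingredients ViC-MAE combines; the sketch below shows roughly how such a joint objective could look, assuming mean pooling for the global representation, an InfoNCE loss between two views, and an equal weighting of the two terms (all assumptions for illustration, not the paper's actual recipe):

    import torch
    import torch.nn.functional as F

    def vic_mae_style_objective(tokens_a, tokens_b, recon, target,
                                contrastive_weight=1.0, temperature=0.1):
        """tokens_*: (B, N, D) patch features from two views; recon/target: pixel tensors."""
        recon_loss = F.mse_loss(recon, target)           # MAE-style pixel reconstruction
        g_a = F.normalize(tokens_a.mean(dim=1), dim=-1)  # pooled global vector, view A
        g_b = F.normalize(tokens_b.mean(dim=1), dim=-1)  # pooled global vector, view B
        logits = g_a @ g_b.t() / temperature             # cross-view similarities
        contrastive = F.cross_entropy(logits, torch.arange(len(g_a)))
        return recon_loss + contrastive_weight * contrastive

    # Toy usage: two augmented views of 2 clips, 16 patch tokens of dimension 128 each.
    loss = vic_mae_style_objective(torch.randn(2, 16, 128), torch.randn(2, 16, 128),
                                   torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32))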

Lexicon3D: Probing visual foundation models for complex 3D scene understanding

Y Man, S Zheng, Z Bao, M Hebert, LY Gui… - arXiv preprint arXiv …, 2024 - arxiv.org
Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies playing a crucial role in this success. However, the optimal scene encoding …

IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI

X Chen, J Guo, T He, C Zhang, P Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Image-GOal Representations (IGOR), aiming to learn a unified, semantically
consistent action space across humans and various robots. Through this unified latent action …

Exploring the interplay between video generation and world models in autonomous driving: A survey

A Fu, Y Zhou, T Zhou, Y Yang, B Gao, Q Li… - arXiv preprint arXiv …, 2024 - arxiv.org
World models and video generation are pivotal technologies in the domain of autonomous
driving, each playing a critical role in enhancing the robustness and reliability of …