InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

When does Sora show: The beginning of TAO to imaginative intelligence and scenarios engineering

FY Wang, Q Miao, L Li, Q Ni, X Li, J Li… - IEEE/CAA Journal of …, 2024 - ieeexplore.ieee.org
During our discussion at workshops for writing “What Does ChatGPT Say: The DAO from
Algorithmic Intelligence to Linguistic Intelligence” [1], we had expected the next milestone for …

Prospective role of foundation models in advancing autonomous vehicles

J Wu, B Gao, J Gao, J Yu, H Chu, Q Yu, X Gong… - Research, 2024 - spj.science.org
With the development of artificial intelligence and breakthroughs in deep learning, large-scale
foundation models (FMs), such as generative pre-trained transformer (GPT), Sora, etc …

Modeling caption diversity in contrastive vision-language pretraining

S Lavoie, P Kirichenko, M Ibrahim, M Assran… - arXiv preprint arXiv …, 2024 - arxiv.org
There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP),
on the other hand, works by mapping an image and its caption to a single vector, limiting …
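
The single-vector mapping this snippet critiques is easy to make concrete. Below is a minimal PyTorch-style sketch (not the paper's code; function and variable names are illustrative) of the symmetric contrastive objective used by CLIP-like models, in which each image and each caption is reduced to one embedding and matched pairs are pulled together:

    import torch
    import torch.nn.functional as F

    def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of (image, caption) pairs."""
        img_emb = F.normalize(img_emb, dim=-1)        # one unit vector per image
        txt_emb = F.normalize(txt_emb, dim=-1)        # one unit vector per caption
        logits = img_emb @ txt_emb.t() / temperature  # pairwise cosine similarities
        targets = torch.arange(len(img_emb))          # i-th image matches i-th caption
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Toy usage: a batch of 4 pairs with 512-dimensional embeddings.
    loss = clip_style_loss(torch.randn(4, 512), torch.randn(4, 512))

Because everything about an image is squeezed into that single target vector, the many valid captions for one image compete for the same embedding, which is the limitation the paper addresses.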

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

World models for autonomous driving: An initial survey

Y Guan, H Liao, Z Li, J Hu, R Yuan, Y Li… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
In the rapidly evolving landscape of autonomous driving, the capability to accurately predict
future events and assess their implications is paramount for both safety and efficiency …

ViC-MAE: Self-supervised representation learning from images and video with contrastive masked autoencoders

J Hernandez, R Villegas, V Ordonez - European Conference on Computer …, 2024 - Springer
We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and
contrastive learning. ViC-MAE is trained using a global representation obtained by pooling …
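
The snippet names the two ingredients ViC-MAE combines; the sketch below shows roughly how such a joint objective could look, assuming mean pooling for the global representation, an InfoNCE loss between two views, and an equal weighting of the two terms (all assumptions for illustration, not the paper's actual recipe):

    import torch
    import torch.nn.functional as F

    def vic_mae_style_objective(tokens_a, tokens_b, recon, target,
                                contrastive_weight=1.0, temperature=0.1):
        """tokens_*: (B, N, D) patch features from two views; recon/target: pixel tensors."""
        recon_loss = F.mse_loss(recon, target)           # MAE-style pixel reconstruction
        g_a = F.normalize(tokens_a.mean(dim=1), dim=-1)  # pooled global vector, view A
        g_b = F.normalize(tokens_b.mean(dim=1), dim=-1)  # pooled global vector, view B
        logits = g_a @ g_b.t() / temperature             # cross-view similarities
        contrastive = F.cross_entropy(logits, torch.arange(len(g_a)))
        return recon_loss + contrastive_weight * contrastive

    # Toy usage: two augmented views of 2 clips, 16 patch tokens of dimension 128 each.
    loss = vic_mae_style_objective(torch.randn(2, 16, 128), torch.randn(2, 16, 128),
                                   torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32))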

Lexicon3D: Probing visual foundation models for complex 3D scene understanding

Y Man, S Zheng, Z Bao, M Hebert, LY Gui… - arXiv preprint arXiv …, 2024 - arxiv.org
Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies playing a crucial role in this success. However, the optimal scene encoding …

IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI

X Chen, J Guo, T He, C Zhang, P Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Image-GOal Representations (IGOR), aiming to learn a unified, semantically
consistent action space across humans and various robots. Through this unified latent action …

Exploring the interplay between video generation and world models in autonomous driving: A survey

A Fu, Y Zhou, T Zhou, Y Yang, B Gao, Q Li… - arXiv preprint arXiv …, 2024 - arxiv.org
World models and video generation are pivotal technologies in the domain of autonomous
driving, each playing a critical role in enhancing the robustness and reliability of …