We describe a DNN for fine-grained action classification and video captioning. It gives state-of-the-art performance on the challenging Something-Something dataset, with over 220,000 …
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches …
Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in the video …
With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video …
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …
We are creating multimedia content every day and everywhere. While automatic content generation has posed a fundamental challenge to the multimedia community for decades …
Y Liu, L Xu, P Xiong, Q Jin - Proceedings of the AAAI Conference on …, 2023 - ojs.aaai.org
Applying large-scale pre-trained image-language models to video-language tasks has recently become a trend, which brings two challenges. One is how to effectively transfer …
The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational …
Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes …