Generating an image or video caption has long been a fundamental problem in Artificial Intelligence, and it is usually addressed with Deep Learning methods …
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g. …
Existing video captioning approaches typically first sample video frames from a decoded video and then run a subsequent process (e.g., feature extraction and/or …
H Fang, Z Yang, Y Wei, X Zang, C Ban… - Proceedings of the …, 2023 - openaccess.thecvf.com
Pre-trained models have demonstrated considerable performance, especially in enhancing cross-modal understanding between videos and text. However, fine-tuning them at scale …
Y Zhang, Z Chen, L Guo, Y Xu, B Hu, Z Liu… - Proceedings of the 47th …, 2024 - dl.acm.org
Multi-modal knowledge graph completion (MMKGC) aims to automatically discover the unobserved factual knowledge from a given multi-modal knowledge graph by collaboratively …
Video captioning is a multi-modal task spanning computer vision and natural language processing. Previous methods generally follow two paradigms, i.e., template-based and …
X Gu, H Fan, Y Huang, T Luo… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The spatio-temporal video grounding (STVG) task aims to locate a spatio-temporal tube for a specific instance given a text query. Despite advancements, current methods easily suffer …
X Wang, Y Li, T Gan, Z Zhang, J Lv, L Nie - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared …
J Choi, S Lee, J Chu, M Choi… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Video Transformers have become the prevalent solution for various downstream video tasks thanks to their superior expressive power and flexibility. However, these video transformers suffer from …