Video recognition has been dominated by the end-to-end learning paradigm: first initializing a video recognition model with weights of a pretrained image model and then conducting …
Y Liu, P Xiong, L Xu, S Cao, Q Jin - European conference on computer …, 2022 - Springer
Text-Video retrieval is a task of great practical value and has received increasing attention, among which learning spatial-temporal video representation is one of the research hotspots …
Disentangled Representation Learning (DRL) aims to learn a model capable of identifying and disentangling the underlying factors hidden in the observable data in representation …
W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios …
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized …
The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from a large scale of web-collected image-text data. In light …
H Ha, S Song - arXiv preprint arXiv:2207.11514, 2022 - arxiv.org
We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual …
C Zhu, Q Jia, W Chen, Y Guo, Y Liu - International Journal of Multimedia …, 2023 - Springer
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence, and vice versa. In general, this retrieval task is composed of …