Socratic Models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint arXiv …, 2022 - arxiv.org
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
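The composition in the title works by routing everything through language: one model's output becomes another model's prompt. A minimal sketch of that pattern, with both model calls stubbed out (in practice they would be a real VLM captioner and an LLM; the stub outputs here are placeholders):

```python
def vlm_caption(image) -> str:
    """Stub for a vision-language model that describes an image in words."""
    return "a person slicing vegetables in a kitchen"

def llm_complete(prompt: str) -> str:
    """Stub for a large language model completion call."""
    return "They are most likely preparing a meal."

def answer_about_image(image, question: str) -> str:
    # 1) Translate the visual input into language via the VLM.
    caption = vlm_caption(image)
    # 2) Let the LLM reason over caption + question, zero-shot.
    prompt = f"Scene: {caption}\nQuestion: {question}\nAnswer:"
    return llm_complete(prompt)

print(answer_about_image(None, "What is the person doing?"))
```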

Frozen CLIP models are efficient video learners

Z Lin, S Geng, R Zhang, P Gao, G De Melo… - … on Computer Vision, 2022 - Springer
Video recognition has been dominated by the end-to-end learning paradigm: first initializing
a video recognition model with weights of a pretrained image model and then conducting …
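The efficiency claim comes from keeping the pretrained image backbone frozen and training only a small temporal module over its per-frame features. A minimal PyTorch sketch of that pattern; the encoder interface, layer counts, and classification head are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FrozenBackboneVideoHead(nn.Module):
    """Frozen per-frame image encoder + a small trainable temporal module.
    `image_encoder` is assumed to map (B, 3, H, W) -> (B, dim); in practice
    it would be a frozen CLIP visual backbone. dim must divide by nhead."""

    def __init__(self, image_encoder: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False  # the backbone stays frozen
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)  # trainable
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t = video.shape[:2]               # video: (B, T, 3, H, W)
        with torch.no_grad():                # no gradients through the backbone
            feats = self.image_encoder(video.flatten(0, 1)).view(b, t, -1)
        feats = self.temporal(feats)         # lightweight temporal reasoning
        return self.head(feats.mean(dim=1))  # pool over time, then classify
```

Only the temporal encoder and the head receive gradients, which is what makes the recipe cheap relative to end-to-end fine-tuning.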

TS2-Net: Token shift and selection transformer for text-video retrieval

Y Liu, P Xiong, L Xu, S Cao, Q Jin - European conference on computer …, 2022 - Springer
Text-video retrieval is a task of great practical value and has received increasing attention;
within it, learning spatial-temporal video representations is one of the research hotspots …
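The "token shift" part of the title refers to exchanging token features between adjacent frames so a per-frame transformer sees temporal context at zero parameter cost. A simplified sketch in that spirit; the channel split and zero-padded boundaries follow the TSM convention and are not the paper's exact module:

```python
import torch

def temporal_token_shift(x: torch.Tensor, fold: int = 8) -> torch.Tensor:
    """x: (B, T, N, D) token features over T frames. 1/fold of the channels
    are shifted forward in time, another 1/fold backward, mixing information
    across adjacent frames without adding any parameters."""
    d = x.size(-1) // fold
    out = x.clone()
    out[:, 1:, :, :d] = x[:, :-1, :, :d]            # shift forward in time
    out[:, :-1, :, d:2 * d] = x[:, 1:, :, d:2 * d]  # shift backward in time
    out[:, 0, :, :d] = 0                            # zero-pad the boundaries
    out[:, -1, :, d:2 * d] = 0
    return out
```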

Disentangled representation learning

X Wang, H Chen, S Tang, Z Wu, W Zhu - arXiv preprint arXiv:2211.11695, 2022 - arxiv.org
Disentangled Representation Learning (DRL) aims to learn a model capable of identifying
and disentangling the underlying factors hidden in the observable data in representation …
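A common concrete instantiation is the beta-VAE objective, which up-weights the KL term of a variational autoencoder so the latent dimensions are pressured toward independent factors of variation. A minimal sketch of that loss, assuming a diagonal-Gaussian encoder:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta: float = 4.0):
    """Reconstruction term plus a KL term scaled by beta > 1; the extra KL
    pressure encourages statistically independent latent dimensions."""
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl
```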

Cap4Video: What can auxiliary captions do for text-video retrieval?

W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …
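The auxiliary captions act as a second textual view of each video at matching time. A hedged sketch of one way such evidence could be fused; the embeddings, names, and the linear fusion rule here are illustrative assumptions, not Cap4Video's exact design:

```python
import torch
import torch.nn.functional as F

def fused_retrieval_scores(q, v, c, alpha: float = 0.5):
    """q: (Nq, D) query-text embeddings; v: (Nv, D) video embeddings;
    c: (Nv, D) embeddings of captions generated for each video.
    Returns an (Nq, Nv) score matrix mixing query-video and
    query-caption cosine similarity."""
    q, v, c = (F.normalize(t, dim=-1) for t in (q, v, c))
    return alpha * (q @ v.T) + (1 - alpha) * (q @ c.T)
```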

VindLU: A recipe for effective video-and-language pretraining

F Cheng, X Wang, J Lei, D Crandall… - Proceedings of the …, 2023 - openaccess.thecvf.com
The last several years have witnessed remarkable progress in video-and-language (VidL)
understanding. However, most modern VidL approaches use complex and specialized …

CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment

H Xue, Y Sun, B Liu, J Fu, R Song, H Li… - arXiv preprint arXiv …, 2022 - arxiv.org
Pre-trained image-text models such as CLIP have demonstrated strong vision-language
representations learned from large-scale web-collected image-text data. In light …

Semantic abstraction: Open-world 3D scene understanding from 2D vision-language models

H Ha, S Song - arXiv preprint arXiv:2207.11514, 2022 - arxiv.org
We study open-world 3D scene understanding, a family of tasks that require agents to
reason about their 3D environment with an open-set vocabulary and out-of-domain visual …

Deep learning for video-text retrieval: a review

C Zhu, Q Jia, W Chen, Y Guo, Y Liu - International Journal of Multimedia …, 2023 - Springer
Video-Text Retrieval (VTR) aims to search for the video most relevant to the
semantics of a given sentence, and vice versa. In general, this retrieval task is composed of …
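At inference time, the dual-encoder setup this review surveys reduces to embedding both modalities into a shared space and ranking by similarity; a minimal sketch, assuming the embeddings come from any CLIP-style dual encoder:

```python
import torch
import torch.nn.functional as F

def rank_videos(text_emb: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
    """text_emb: (Nq, D) query embeddings; video_embs: (Nv, D) video
    embeddings. Returns (Nq, Nv) indices, best-matching video first."""
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(video_embs, dim=-1).T
    return sims.argsort(dim=-1, descending=True)
```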

Revisiting temporal modeling for CLIP-based image-to-video knowledge transferring

R Liu, J Huang, G Li, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal
knowledge learned from large-scale image-text data pairs, thus attracting increasing …