Socratic Models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint arXiv …, 2022 - arxiv.org
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …
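The composition in the title works by routing everything through language: one model's output becomes another model's prompt. A minimal sketch of that pattern, with both model calls stubbed out (in practice they would be a real VLM captioner and an LLM; the stub outputs here are placeholders):

```python
def vlm_caption(image) -> str:
    """Stub for a vision-language model that describes an image in words."""
    return "a person slicing vegetables in a kitchen"

def llm_complete(prompt: str) -> str:
    """Stub for a large language model completion call."""
    return "They are most likely preparing a meal."

def answer_about_image(image, question: str) -> str:
    # 1) Translate the visual input into language via the VLM.
    caption = vlm_caption(image)
    # 2) Let the LLM reason over caption + question, zero-shot.
    prompt = f"Scene: {caption}\nQuestion: {question}\nAnswer:"
    return llm_complete(prompt)

print(answer_about_image(None, "What is the person doing?"))
```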

Frozen CLIP models are efficient video learners

Z Lin, S Geng, R Zhang, P Gao, G De Melo… - … on Computer Vision, 2022 - Springer
Video recognition has been dominated by the end-to-end learning paradigm: first initializing
a video recognition model with weights of a pretrained image model and then conducting …
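The efficiency claim comes from keeping the pretrained image backbone frozen and training only a small temporal module over its per-frame features. A minimal PyTorch sketch of that pattern; the encoder interface, layer counts, and classification head are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FrozenBackboneVideoHead(nn.Module):
    """Frozen per-frame image encoder + a small trainable temporal module.
    `image_encoder` is assumed to map (B, 3, H, W) -> (B, dim); in practice
    it would be a frozen CLIP visual backbone. dim must divide by nhead."""

    def __init__(self, image_encoder: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False  # the backbone stays frozen
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)  # trainable
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t = video.shape[:2]               # video: (B, T, 3, H, W)
        with torch.no_grad():                # no gradients through the backbone
            feats = self.image_encoder(video.flatten(0, 1)).view(b, t, -1)
        feats = self.temporal(feats)         # lightweight temporal reasoning
        return self.head(feats.mean(dim=1))  # pool over time, then classify
```

Only the temporal encoder and the head receive gradients, which is what makes the recipe cheap relative to end-to-end fine-tuning.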

TS2-Net: Token shift and selection transformer for text-video retrieval

Y Liu, P Xiong, L Xu, S Cao, Q Jin - European conference on computer …, 2022 - Springer
Text-video retrieval is a task of great practical value and has received increasing attention;
within it, learning spatial-temporal video representations is one of the research hotspots …
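The "token shift" part of the title refers to exchanging token features between adjacent frames so a per-frame transformer sees temporal context at zero parameter cost. A simplified sketch in that spirit; the channel split and zero-padded boundaries follow the TSM convention and are not the paper's exact module:

```python
import torch

def temporal_token_shift(x: torch.Tensor, fold: int = 8) -> torch.Tensor:
    """x: (B, T, N, D) token features over T frames. 1/fold of the channels
    are shifted forward in time, another 1/fold backward, mixing information
    across adjacent frames without adding any parameters."""
    d = x.size(-1) // fold
    out = x.clone()
    out[:, 1:, :, :d] = x[:, :-1, :, :d]            # shift forward in time
    out[:, :-1, :, d:2 * d] = x[:, 1:, :, d:2 * d]  # shift backward in time
    out[:, 0, :, :d] = 0                            # zero-pad the boundaries
    out[:, -1, :, d:2 * d] = 0
    return out
```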

Disentangled representation learning

X Wang, H Chen, S Tang, Z Wu, W Zhu - arXiv preprint arXiv:2211.11695, 2022 - arxiv.org
Disentangled Representation Learning (DRL) aims to learn a model capable of identifying
and disentangling the underlying factors hidden in the observable data in representation …
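A common concrete instantiation is the beta-VAE objective, which up-weights the KL term of a variational autoencoder so the latent dimensions are pressured toward independent factors of variation. A minimal sketch of that loss, assuming a diagonal-Gaussian encoder:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta: float = 4.0):
    """Reconstruction term plus a KL term scaled by beta > 1; the extra KL
    pressure encourages statistically independent latent dimensions."""
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl
```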

Cap4Video: What can auxiliary captions do for text-video retrieval?

W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …
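The auxiliary captions act as a second textual view of each video at matching time. A hedged sketch of one way such evidence could be fused; the embeddings, names, and the linear fusion rule here are illustrative assumptions, not Cap4Video's exact design:

```python
import torch
import torch.nn.functional as F

def fused_retrieval_scores(q, v, c, alpha: float = 0.5):
    """q: (Nq, D) query-text embeddings; v: (Nv, D) video embeddings;
    c: (Nv, D) embeddings of captions generated for each video.
    Returns an (Nq, Nv) score matrix mixing query-video and
    query-caption cosine similarity."""
    q, v, c = (F.normalize(t, dim=-1) for t in (q, v, c))
    return alpha * (q @ v.T) + (1 - alpha) * (q @ c.T)
```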

VindLU: A recipe for effective video-and-language pretraining

F Cheng, X Wang, J Lei, D Crandall… - Proceedings of the …, 2023 - openaccess.thecvf.com
The last several years have witnessed remarkable progress in video-and-language (VidL)
understanding. However, most modern VidL approaches use complex and specialized …

CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment

H Xue, Y Sun, B Liu, J Fu, R Song, H Li… - arXiv preprint arXiv …, 2022 - arxiv.org
Pre-trained image-text models such as CLIP have demonstrated strong vision-language
representations learned from large-scale web-collected image-text data. In light …

Semantic abstraction: Open-world 3D scene understanding from 2D vision-language models

H Ha, S Song - arXiv preprint arXiv:2207.11514, 2022 - arxiv.org
We study open-world 3D scene understanding, a family of tasks that require agents to
reason about their 3D environment with an open-set vocabulary and out-of-domain visual …

Deep learning for video-text retrieval: a review

C Zhu, Q Jia, W Chen, Y Guo, Y Liu - International Journal of Multimedia …, 2023 - Springer
Video-Text Retrieval (VTR) aims to search for the video most relevant to the
semantics of a given sentence, and vice versa. In general, this retrieval task is composed of …
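At inference time, the dual-encoder setup this review surveys reduces to embedding both modalities into a shared space and ranking by similarity; a minimal sketch, assuming the embeddings come from any CLIP-style dual encoder:

```python
import torch
import torch.nn.functional as F

def rank_videos(text_emb: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
    """text_emb: (Nq, D) query embeddings; video_embs: (Nv, D) video
    embeddings. Returns (Nq, Nv) indices, best-matching video first."""
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(video_embs, dim=-1).T
    return sims.argsort(dim=-1, descending=True)
```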

Revisiting temporal modeling for CLIP-based image-to-video knowledge transferring

R Liu, J Huang, G Li, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal
knowledge learned from large-scale image-text data pairs, thus attracting increasing …