Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Expectation-maximization contrastive learning for compact video-and-language representations

P Jin, J Huang, F Liu, X Wu, S Ge… - Advances in neural …, 2022 - proceedings.neurips.cc
Most video-and-language representation learning approaches employ contrastive learning,
e.g., CLIP, to project the video and text features into a common latent space according to the …

A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi, F Pourpanah… - arXiv preprint arXiv …, 2023 - arxiv.org
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work
in the fields of computer vision, natural language processing (NLP), linguistics, and human …

Tagging before alignment: Integrating multi-modal tags for video-text retrieval

Y Chen, J Wang, L Lin, Z Qi, J Ma, Y Shan - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Vision-language alignment learning for video-text retrieval arouses a lot of attention in
recent years. Most of the existing methods either transfer the knowledge of image-text …

Cross-lingual cross-modal retrieval with noise-robust learning

Y Wang, J Dong, T Liang, M Zhang, R Cai… - Proceedings of the 30th …, 2022 - dl.acm.org
Despite the recent developments in the field of cross-modal retrieval, there has been less
research focusing on low-resource languages due to the lack of manually annotated …

Token mixing: parameter-efficient transfer learning from image-language to video-language

Y Liu, L Xu, P Xiong, Q Jin - Proceedings of the AAAI Conference on …, 2023 - ojs.aaai.org
Applying large scale pre-trained image-language model to video-language tasks has
recently become a trend, which brings two challenges. One is how to effectively transfer …

Balance act: Mitigating hubness in cross-modal retrieval with query and gallery banks

Y Wang, X Jian, B Xue - arXiv preprint arXiv:2310.11612, 2023 - arxiv.org
In this work, we present a post-processing solution to address the hubness problem in
cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently …

Boosting video-text retrieval with explicit high-level semantics

H Wang, D Xu, D He, F Li, Z Ji, J Han… - Proceedings of the 30th …, 2022 - dl.acm.org
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding,
which aims to search for relevant video (text) given a query (video). Existing methods …

Visual commonsense-aware representation network for video captioning

P Zeng, H Zhang, L Gao, X Li, J Qian… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Generating consecutive descriptions for videos, that is, video captioning, requires taking full
advantage of visual representation along with the generation process. Existing video …

Relation Triplet Construction for Cross-modal Text-to-Video Retrieval

X Song, J Chen, YG Jiang - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
Cross-modal text-to-video retrieval aims to find semantically related videos for a text query.
Since video and text are distinct modalities, the major challenge comes from building the …