P Jin, J Huang, F Liu, X Wu, S Ge… - Advances in neural …, 2022 - proceedings.neurips.cc
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the …
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in the fields of computer vision, natural language processing (NLP), linguistics, and human …
Y Chen, J Wang, L Lin, Z Qi, J Ma, Y Shan - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent years. Most of the existing methods either transfer the knowledge of image-text …
Y Wang, J Dong, T Liang, M Zhang, R Cai… - Proceedings of the 30th …, 2022 - dl.acm.org
Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated …
Y Liu, L Xu, P Xiong, Q Jin - Proceedings of the AAAI Conference on …, 2023 - ojs.aaai.org
Applying large-scale pre-trained image-language models to video-language tasks has recently become a trend, which brings two challenges. One is how to effectively transfer …
Y Wang, X Jian, B Xue - arXiv preprint arXiv:2310.11612, 2023 - arxiv.org
In this work, we present a post-processing solution to address the hubness problem in cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently …
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for relevant video (text) given a query (video). Existing methods …
P Zeng, H Zhang, L Gao, X Li, J Qian… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Generating consecutive descriptions for videos, that is, video captioning, requires taking full advantage of visual representation along with the generation process. Existing video …
X Song, J Chen, YG Jiang - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
Cross-modal text-to-video retrieval aims to find semantically related videos for a text query. Since video and text are distinct modalities, the major challenge comes from building the …