SX Zhang, H Wang, X Zhu, W Gu, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Video-language alignment is a crucial multi-modal task that benefits various downstream
applications, e.g., video-text retrieval and video question answering. Existing methods either …