HiT: Hierarchical transformer with momentum contrast for video-text retrieval

S Liu, H Fan, S Qian, Y Chen… - Proceedings of the …, 2021 - openaccess.thecvf.com
Video-Text Retrieval has been a hot research topic with the growth of multimedia
data on the internet. Transformer for video-text learning has attracted increasing attention …

Dual learning with dynamic knowledge distillation for partially relevant video retrieval

J Dong, M Zhang, Z Zhang, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Almost all previous text-to-video retrieval works assume that videos are pre-trimmed with
short durations. However, in practice, videos are generally untrimmed containing much …

Reading-strategy inspired visual representation learning for text-to-video retrieval

J Dong, Y Wang, X Chen, X Qu, X Li… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
This paper aims for the task of text-to-video retrieval, where given a query in the form of a
natural-language sentence, it is asked to retrieve videos which are semantically relevant to …

Partially relevant video retrieval

J Dong, X Chen, M Zhang, X Yang, S Chen… - Proceedings of the 30th …, 2022 - dl.acm.org
Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning
oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is …

Spatial-temporal graphs for cross-modal text2video retrieval

X Song, J Chen, Z Wu, YG Jiang - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Cross-modal text to video retrieval aims to find relevant videos given text queries, which is
crucial for various real-world applications. The key to address this task is to build the …

HANet: Hierarchical alignment networks for video-text retrieval

P Wu, X He, M Tang, Y Lv, J Liu - Proceedings of the 29th ACM …, 2021 - dl.acm.org
Video-text retrieval is an important yet challenging task in vision-language understanding,
which aims to learn a joint embedding space where related video and text instances are …

Multi-View Visual Semantic Embedding.

Z Li, C Guo, Z Feng, JN Hwang, X Xue - IJCAI, 2022 - ijcai.org
Visual Semantic Embedding (VSE) is a dominant method for vision-language
retrieval. Its purpose is to learn an embedding space so that visual data can be embedded in …

Hierarchical cross-modal graph consistency learning for video-text retrieval

W Jin, Z Zhao, P Zhang, J Zhu, X He… - Proceedings of the 44th …, 2021 - dl.acm.org
Due to the popularity of video contents on the Internet, the information retrieval between
videos and texts has attracted broad interest from researchers, which is a challenging cross …

Semantics-aware spatial-temporal binaries for cross-modal video retrieval

M Qi, J Qin, Y Yang, Y Wang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
With the current exponential growth of video-based social networks, video retrieval using
natural language is receiving ever-increasing attention. Most existing approaches tackle this …

Using multimodal contrastive knowledge distillation for video-text retrieval

W Ma, Q Chen, T Zhou, S Zhao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Cross-modal retrieval aims to enable a flexible bi-directional retrieval experience across
different modalities (e.g., searching for videos with texts). Many existing efforts tend to learn a …