Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years …
Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance, which pursue semantic interaction upon pre …
Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it faces low inference efficiency due to heavy attention layers …
Z Jin, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
3D visual language reasoning plays an important role in effective human-computer interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …
The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their …
H Liao, H Shen, Z Li, C Wang, G Li, Y Bie… - … in Transportation Research, 2024 - Elsevier
In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This …
Z Luo, P Zhao, C Xu, X Geng, T Shen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from the other modality. The conventional dense retrieval paradigm relies on encoding images …
Pretrained models have produced great success in both Computer Vision (CV) and Natural Language Processing (NLP). This progress leads to learning joint representations of vision …
Text-based person search aims to retrieve the specified person images given a textual description. The key to tackling such a challenging task is to learn powerful multi-modal …