X Li, Y Wu, X Jiang, Z Guo, M Gong… - Proceedings of the …, 2024 - openaccess.thecvf.com
… a contrastive learning approach, which leverages a multimodal encoder to obtain the multimodal
features (i.e., visual, … Ultimately, the visual representations from the image encoder are …