Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from …
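To make the retrieval step named in this snippet concrete, here is a minimal sketch of one common baseline: scoring a small external knowledge base against the question with TF-IDF similarity. The knowledge entries and the question are hypothetical placeholders, not drawn from the paper, and real systems typically use dense retrievers rather than TF-IDF.

```python
# Illustrative sketch only: rank external knowledge snippets by TF-IDF
# similarity to the question and keep the best match. All strings here
# are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "A fire hydrant supplies water for firefighting.",
    "Zebras are native to the African savanna.",
    "A kite needs wind to stay in the air.",
]
question = "What does the animal in the image eat?"

vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(knowledge_base)
q_vector = vectorizer.transform([question])

# Cosine similarity between the question and every knowledge entry.
scores = cosine_similarity(q_vector, kb_vectors)[0]
best = scores.argmax()
print(knowledge_base[best], scores[best])
```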
Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years …
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The …
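One common design for using a frozen LLM in VL pre-training (in the spirit of BLIP-2 or LiMBeR, not necessarily this paper's method) keeps every LLM weight fixed and trains only a small projection that maps visual features into the LLM's token-embedding space. The sketch below assumes GPT-2 via Hugging Face transformers and a stand-in vision feature tensor; the feature dimension and sequence length are arbitrary.

```python
# Sketch of a frozen-LLM VL setup: only the linear projector is trainable.
import torch
from transformers import GPT2LMHeadModel

llm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in llm.parameters():                      # freeze every LLM weight
    p.requires_grad_(False)

vis_dim, txt_dim = 768, llm.config.n_embd
projector = torch.nn.Linear(vis_dim, txt_dim)   # the only trainable module

image_features = torch.randn(1, 16, vis_dim)    # stand-in for a vision encoder
prefix = projector(image_features)              # visual tokens in LLM space
text_embeds = llm.get_input_embeddings()(torch.tensor([[50256]]))
inputs_embeds = torch.cat([prefix, text_embeds], dim=1)

out = llm(inputs_embeds=inputs_embeds)          # gradients reach the projector only
print(out.logits.shape)
```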
S Mo, M Kim, K Lee, J Shin - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often …
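As background for the CLIP results this snippet refers to, the following is a minimal zero-shot classification sketch using the Hugging Face transformers CLIP interface: the model scores an image against natural-language class descriptions without any task-specific training. The image path and candidate labels are placeholders.

```python
# Zero-shot CLIP sketch: compare one image against text prompts.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a chest X-ray"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```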
X Hu, C Zhang, Y Zhang, B Hai… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained Visual-Language Models (VLMs), such as CLIP, have shown enhanced performance across a range of tasks that involve the integration of visual and linguistic …
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various …
Y Bai, J Wang, M Cao, C Chen, Z Cao, L Nie… - Proceedings of the 31st …, 2023 - dl.acm.org
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description. Existing methods are …
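The retrieval setup described in this snippet can be sketched as follows: embed the query sentence and every gallery image with a shared vision-language model (CLIP here as a stand-in, not the paper's own model), then rank the gallery by cosine similarity. The file names are placeholders.

```python
# TBPS-style retrieval sketch: rank gallery images against a text query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

gallery = [Image.open(f) for f in ["p1.jpg", "p2.jpg", "p3.jpg"]]  # placeholders
query = "a woman in a red coat carrying a black backpack"

with torch.no_grad():
    img = model.get_image_features(**processor(images=gallery, return_tensors="pt"))
    txt = model.get_text_features(**processor(text=[query], return_tensors="pt",
                                              padding=True))

# Normalize, then sort the gallery by cosine similarity, best match first.
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
ranking = (txt @ img.T).squeeze(0).argsort(descending=True)
print(ranking.tolist())
```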
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there …
Advanced image fusion techniques aim to synthesise fusion results by integrating the complementary information provided by the source inputs. However, the inherent differences …
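To make the fusion task concrete, here is a deliberately naive baseline: combining two aligned source images (e.g. infrared and visible) by a per-pixel maximum. Real fusion methods weight complementary information far more carefully; the file names are placeholders and this is not the technique proposed in the snippet above.

```python
# Naive image fusion baseline: per-pixel maximum of two aligned grayscale images.
import numpy as np
from PIL import Image

a = np.asarray(Image.open("infrared.png").convert("L"), dtype=np.float32)
b = np.asarray(Image.open("visible.png").convert("L"), dtype=np.float32)

fused = np.maximum(a, b)            # keep the stronger response per pixel
Image.fromarray(fused.astype(np.uint8)).save("fused.png")
```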