Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from …
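To make the retrieval step named in this snippet concrete, here is a minimal sketch of one common baseline: scoring a small external knowledge base against the question with TF-IDF similarity. The knowledge entries and the question are hypothetical placeholders, not drawn from the paper, and real systems typically use dense retrievers rather than TF-IDF.

```python
# Illustrative sketch only: rank external knowledge snippets by TF-IDF
# similarity to the question and keep the best match. All strings here
# are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "A fire hydrant supplies water for firefighting.",
    "Zebras are native to the African savanna.",
    "A kite needs wind to stay in the air.",
]
question = "What does the animal in the image eat?"

vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(knowledge_base)
q_vector = vectorizer.transform([question])

# Cosine similarity between the question and every knowledge entry.
scores = cosine_similarity(q_vector, kb_vectors)[0]
best = scores.argmax()
print(knowledge_base[best], scores[best])
```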
Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years …
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The …
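One common design for using a frozen LLM in VL pre-training (in the spirit of BLIP-2 or LiMBeR, not necessarily this paper's method) keeps every LLM weight fixed and trains only a small projection that maps visual features into the LLM's token-embedding space. The sketch below assumes GPT-2 via Hugging Face transformers and a stand-in vision feature tensor; the feature dimension and sequence length are arbitrary.

```python
# Sketch of a frozen-LLM VL setup: only the linear projector is trainable.
import torch
from transformers import GPT2LMHeadModel

llm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in llm.parameters():                      # freeze every LLM weight
    p.requires_grad_(False)

vis_dim, txt_dim = 768, llm.config.n_embd
projector = torch.nn.Linear(vis_dim, txt_dim)   # the only trainable module

image_features = torch.randn(1, 16, vis_dim)    # stand-in for a vision encoder
prefix = projector(image_features)              # visual tokens in LLM space
text_embeds = llm.get_input_embeddings()(torch.tensor([[50256]]))
inputs_embeds = torch.cat([prefix, text_embeds], dim=1)

out = llm(inputs_embeds=inputs_embeds)          # gradients reach the projector only
print(out.logits.shape)
```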
S Mo, M Kim, K Lee, J Shin - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often …
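As background for the CLIP results this snippet refers to, the following is a minimal zero-shot classification sketch using the Hugging Face transformers CLIP interface: the model scores an image against natural-language class descriptions without any task-specific training. The image path and candidate labels are placeholders.

```python
# Zero-shot CLIP sketch: compare one image against text prompts.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a chest X-ray"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```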
X Hu, C Zhang, Y Zhang, B Hai… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained Visual-Language Models (VLMs), such as CLIP, have shown enhanced performance across a range of tasks that involve the integration of visual and linguistic …
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various …
Y Bai, J Wang, M Cao, C Chen, Z Cao, L Nie… - Proceedings of the 31st …, 2023 - dl.acm.org
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description. Existing methods are …
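The retrieval setup described in this snippet can be sketched as follows: embed the query sentence and every gallery image with a shared vision-language model (CLIP here as a stand-in, not the paper's own model), then rank the gallery by cosine similarity. The file names are placeholders.

```python
# TBPS-style retrieval sketch: rank gallery images against a text query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

gallery = [Image.open(f) for f in ["p1.jpg", "p2.jpg", "p3.jpg"]]  # placeholders
query = "a woman in a red coat carrying a black backpack"

with torch.no_grad():
    img = model.get_image_features(**processor(images=gallery, return_tensors="pt"))
    txt = model.get_text_features(**processor(text=[query], return_tensors="pt",
                                              padding=True))

# Normalize, then sort the gallery by cosine similarity, best match first.
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
ranking = (txt @ img.T).squeeze(0).argsort(descending=True)
print(ranking.tolist())
```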
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there …
Advanced image fusion techniques aim to synthesise fusion results by integrating the complementary information provided by the source inputs. However, the inherent differences …
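To make the fusion task concrete, here is a deliberately naive baseline: combining two aligned source images (e.g. infrared and visible) by a per-pixel maximum. Real fusion methods weight complementary information far more carefully; the file names are placeholders and this is not the technique proposed in the snippet above.

```python
# Naive image fusion baseline: per-pixel maximum of two aligned grayscale images.
import numpy as np
from PIL import Image

a = np.asarray(Image.open("infrared.png").convert("L"), dtype=np.float32)
b = np.asarray(Image.open("visible.png").convert("L"), dtype=np.float32)

fused = np.maximum(a, b)            # keep the stronger response per pixel
Image.fromarray(fused.astype(np.uint8)).save("fused.png")
```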