Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective

C Chen, Y Wu, Q Dai, HY Zhou, M Xu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Graph Neural Networks (GNNs) have gained momentum in graph representation learning
and boosted the state of the art in a variety of areas, such as data mining (e.g., social network …

Winoground: Probing vision and language models for visio-linguistic compositionality

T Thrush, R Jiang, M Bartolo, A Singh… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition

SC Huang, L Shen, MP Lungren… - Proceedings of the …, 2021 - openaccess.thecvf.com
In recent years, the growing number of medical imaging studies is placing an
ever-increasing burden on radiologists. Deep learning provides a promising solution for …

Large-scale domain-specific pretraining for biomedical vision-language processing

S Zhang, Y Xu, N Usuyama, J Bagga… - arXiv preprint arXiv …, 2023 - researchgate.net
Contrastive pretraining on parallel image-text data has attained great success in
vision-language processing (VLP), as exemplified by CLIP and related methods. However, prior …

Negative-aware attention framework for image-text matching

K Zhang, Z Mao, Q Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Image-text matching, as a fundamental task, bridges the gap between vision and language.
The key of this task is to accurately measure similarity between these two modalities. Prior …

Multi-granularity cross-modal alignment for generalized medical visual representation learning

F Wang, Y Zhou, S Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc
Learning medical visual representations directly from paired radiology reports has become
an emerging topic in representation learning. However, existing medical image-text joint …

Similarity reasoning and filtration for image-text matching

H Diao, Y Zhang, L Ma, H Lu - Proceedings of the AAAI conference on …, 2021 - ojs.aaai.org
Image-text matching plays a critical role in bridging the vision and language, and great
progress has been made by exploiting the global alignment between image and sentence …

Dual-level representation enhancement on characteristic and context for image-text retrieval

S Yang, Q Li, W Li, X Li, AA Liu - IEEE Transactions on Circuits …, 2022 - ieeexplore.ieee.org
Image-text retrieval is a fundamental and vital task in multi-media retrieval and has received
growing attention since it connects heterogeneous data. Previous methods that perform well …