MedCLIP: Contrastive learning from unpaired medical images and text

Z Wang, Z Wu, D Agarwal, J Sun - arXiv preprint arXiv:2210.10163, 2022 - arxiv.org
Existing vision-text contrastive learning like CLIP aims to match the paired image and
caption embeddings while pushing others apart, which improves representation …

LiT: Zero-shot transfer with locked-image text tuning

X Zhai, X Wang, B Mustafa, A Steiner… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents contrastive-tuning, a simple method employing contrastive training to
align image and text models while still taking advantage of their pre-training. In our empirical …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

Negative-aware attention framework for image-text matching

K Zhang, Z Mao, Q Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Image-text matching, as a fundamental task, bridges the gap between vision and language.
The key to this task is accurately measuring the similarity between the two modalities. Prior …

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Combined scaling for zero-shot transfer learning

H Pham, Z Dai, G Ghiasi, K Kawaguchi, H Liu, AW Yu… - Neurocomputing, 2023 - Elsevier
Recent developments in multimodal training methodologies, including CLIP and ALIGN,
obviate the necessity for individual data labeling. These approaches utilize pairs of data and …

COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval

H Lu, N Fei, Y Huo, Y Gao, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Large-scale single-stream pre-training has achieved impressive performance in image-text
retrieval. Regrettably, it suffers low inference efficiency due to heavy attention layers …

Fine-grained image-text matching by cross-modal hard aligning network

Z Pan, F Wu, B Zhang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Current state-of-the-art image-text matching methods implicitly align the visual-semantic
fragments, like regions in images and words in sentences, and adopt cross-attention …

Expectation-maximization contrastive learning for compact video-and-language representations

P Jin, J Huang, F Liu, X Wu, S Ge… - Advances in neural …, 2022 - proceedings.neurips.cc
Most video-and-language representation learning approaches employ contrastive learning,
e.g., CLIP, to project the video and text features into a common latent space according to the …

Learning semantic relationship among instances for image-text matching

Z Fu, Z Mao, Y Song, Y Zhang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Image-text matching, a bridge connecting image and language, is an important task that
generally learns a holistic cross-modal embedding to achieve a high-quality semantic …