MedCLIP: Contrastive learning from unpaired medical images and text

Z Wang, Z Wu, D Agarwal, J Sun - arXiv preprint arXiv:2210.10163, 2022 - arxiv.org
Existing vision-text contrastive learning like CLIP aims to match the paired image and
caption embeddings while pushing others apart, which improves representation …

LiT: Zero-shot transfer with locked-image text tuning

X Zhai, X Wang, B Mustafa, A Steiner… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents contrastive-tuning, a simple method employing contrastive training to
align image and text models while still taking advantage of their pre-training. In our empirical …

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

Negative-aware attention framework for image-text matching

K Zhang, Z Mao, Q Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Image-text matching, as a fundamental task, bridges the gap between vision and language.
The key to this task is accurately measuring the similarity between the two modalities. Prior …

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Combined scaling for zero-shot transfer learning

H Pham, Z Dai, G Ghiasi, K Kawaguchi, H Liu, AW Yu… - Neurocomputing, 2023 - Elsevier
Recent developments in multimodal training methodologies, including CLIP and ALIGN,
obviate the necessity for individual data labeling. These approaches utilize pairs of data and …

COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval

H Lu, N Fei, Y Huo, Y Gao, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Large-scale single-stream pre-training has achieved impressive performance in image-text
retrieval. Regrettably, it suffers low inference efficiency due to heavy attention layers …

Fine-grained image-text matching by cross-modal hard aligning network

Z Pan, F Wu, B Zhang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Current state-of-the-art image-text matching methods implicitly align the visual-semantic
fragments, like regions in images and words in sentences, and adopt cross-attention …

Expectation-maximization contrastive learning for compact video-and-language representations

P Jin, J Huang, F Liu, X Wu, S Ge… - Advances in neural …, 2022 - proceedings.neurips.cc
Most video-and-language representation learning approaches employ contrastive learning,
e.g., CLIP, to project the video and text features into a common latent space according to the …

Learning semantic relationship among instances for image-text matching

Z Fu, Z Mao, Y Song, Y Zhang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Image-text matching, a bridge connecting image and language, is an important task that
generally learns a holistic cross-modal embedding to achieve a high-quality semantic …