With the advent of generative adversarial networks, synthesizing images from text descriptions has recently become an active research area. It is a flexible and intuitive way for …
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical …
Y Su, T Lan, H Li, J Xu, Y Wang, D Cai - arXiv preprint arXiv:2305.16355, 2023 - arxiv.org
We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can …
The field of graph neural networks (GNNs) has seen rapid strides in recent years. Graph neural networks, also known as deep learning on graphs, graph …
Our objective in this work is video-text retrieval; in particular, a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual …
C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human …
W Kim, B Son, I Kim - International conference on machine …, 2021 - proceedings.mlr.press
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on …