CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com
Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

UC2: Universal cross-lingual cross-modal vision-and-language pre-training

M Zhou, L Zhou, S Wang, Y Cheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …

COOKIE: Contrastive cross-modal knowledge sharing pre-training for vision-language representation

K Wen, J Xia, Y Huang, L Li, J Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
There has been a recent surge of interest in cross-modal pre-training. However, existing
approaches pre-train a one-stream model to learn a joint vision-language representation …

ASIF: Coupled data turns unimodal models to multimodal without training

A Norelli, M Fumero, V Maiorca… - Advances in …, 2024 - proceedings.neurips.cc
CLIP proved that aligning visual and language spaces is key to solving many vision tasks
without explicit training, but required training image and text encoders from scratch on a huge …

Distilled dual-encoder model for vision-language understanding

Z Wang, W Wang, H Zhu, M Liu, B Qin, F Wei - arXiv preprint arXiv …, 2021 - arxiv.org
We propose a cross-modal attention distillation framework to train a dual-encoder model for
vision-language understanding tasks, such as visual reasoning and visual question …

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain?

S Eslami, C Meinel, G De Melo - Findings of the Association for …, 2023 - aclanthology.org
Contrastive Language–Image Pre-training (CLIP) has shown remarkable success
in learning with cross-modal supervision from extensive amounts of image–text pairs …

VT-CLIP: Enhancing vision-language models with visual-guided texts

L Qiu, R Zhang, Z Guo, Z Zeng, Z Guo, Y Li… - arXiv preprint arXiv …, 2021 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for
its transferable visual representation learning. However, due to the semantic gap within …

LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs

C Schuhmann, R Vencu, R Beaumont… - arXiv preprint arXiv …, 2021 - arxiv.org
Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g.
CLIP, DALL-E) have recently surged in popularity, showing remarkable capability to perform zero- or few-…
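Several of the entries above (CLIP, PubMedCLIP, VT-CLIP, LAION-400M) revolve around contrastive language-image pre-training. The following is a minimal, illustrative sketch (not taken from any of the cited papers) of the symmetric contrastive objective that style of pre-training typically uses: image and text embeddings from a batch of paired examples are normalized, compared in a shared space, and matched along the diagonal. The function name, dimensions, and temperature value are placeholder assumptions.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Assumes image_emb[i] and text_emb[i] form a matching pair within the batch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)         # unit-norm image embeddings
    text_emb = F.normalize(text_emb, dim=-1)           # unit-norm text embeddings
    logits = image_emb @ text_emb.t() / temperature    # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))             # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt))
```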