Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

被引用次数：145 相关文章所有 7 个版本

[PDF] springer.com

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer

With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

被引用次数：118 相关文章所有 8 个版本

[PDF] mlr.press

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

J Li, D Li, S Savarese, S Hoi - International conference on …, 2023 - proceedings.mlr.press

The cost of vision-and-language pre-training has become increasingly prohibitive due to
end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and …

被引用次数：2685 相关文章所有 7 个版本

[PDF] arxiv.org

Minigpt-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

被引用次数：1388 相关文章所有 7 个版本

[PDF] arxiv.org

Palm-e: An embodied multimodal language model

D Driess, F Xia, MSM Sajjadi, C Lynch… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models excel at a wide range of complex tasks. However, enabling general
inference in the real world, eg, for robotics problems, raises the challenge of grounding. We …

被引用次数：1038 相关文章所有 6 个版本

[PDF] thecvf.com

Objaverse: A universe of annotated 3d objects

M Deitke, D Schwenk, J Salvador… - Proceedings of the …, 2023 - openaccess.thecvf.com

Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and
LAION have propelled recent dramatic progress in AI. Large neural models trained on such …

被引用次数：440 相关文章所有 5 个版本

[PDF] thecvf.com

Image as a foreign language: Beit pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com

A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

被引用次数：369 相关文章所有 5 个版本

[PDF] thecvf.com

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com

We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

被引用次数：452 相关文章所有 5 个版本

[PDF] neurips.cc

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2024 - proceedings.neurips.cc

A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

被引用次数：325 相关文章所有 5 个版本

[PDF] thecvf.com

Reproducible scaling laws for contrastive language-image learning

M Cherti, R Beaumont, R Wightman… - Proceedings of the …, 2023 - openaccess.thecvf.com

Scaling up neural networks has led to remarkable performance across a wide range of
tasks. Moreover, performance often follows reliable scaling laws as a function of training set …

被引用次数：400 相关文章所有 6 个版本

高级搜索

QQ 群