Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

被引用次数：145 相关文章所有 7 个版本

[PDF] springer.com

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer

With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

被引用次数：118 相关文章所有 8 个版本

[PDF] mlr.press

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

J Li, D Li, S Savarese, S Hoi - International conference on …, 2023 - proceedings.mlr.press

The cost of vision-and-language pre-training has become increasingly prohibitive due to
end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and …

被引用次数：2685 相关文章所有 7 个版本

[PDF] neurips.cc

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc

Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

被引用次数：2333 相关文章所有 15 个版本

[PDF] arxiv.org

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org

Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

被引用次数：1968 相关文章所有 4 个版本

[PDF] arxiv.org

Minigpt-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

被引用次数：1388 相关文章所有 7 个版本

[PDF] thecvf.com

Image as a foreign language: Beit pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com

A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

被引用次数：369 相关文章所有 5 个版本

[PDF] thecvf.com

Internimage: Exploring large-scale vision foundation models with deformable convolutions

W Wang, J Dai, Z Chen, Z Huang, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Compared to the great progress of large-scale vision transformers (ViTs) in recent years,
large-scale models based on convolutional neural networks (CNNs) are still in an early …

被引用次数：497 相关文章所有 8 个版本

[PDF] thecvf.com

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com

We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

被引用次数：452 相关文章所有 5 个版本

[PDF] neurips.cc

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2024 - proceedings.neurips.cc

A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

被引用次数：325 相关文章所有 5 个版本

高级搜索

QQ 群