Diffusion models have emerged as a powerful new family of deep generative models with record-breaking performance in many applications, including image synthesis, video …
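This snippet names diffusion models only at a high level, so here is a minimal sketch of the mechanism the whole family shares: the DDPM forward-noising step and the noise-prediction training loss. The linear beta schedule and the `model` argument (a hypothetical noise-prediction network) are illustrative assumptions, not code from any paper listed here.

```python
import torch

# Minimal DDPM sketch: forward noising and the noise-prediction loss.
# `model` is a hypothetical network eps_theta(x_t, t); the linear beta
# schedule is one common choice, not the only one.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # abar_t = prod_s (1 - beta_s)

def ddpm_loss(model, x0):
    """Sample t and noise eps, form x_t = sqrt(abar_t) * x0
    + sqrt(1 - abar_t) * eps, then regress the model onto eps."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    abar = alphas_bar.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return torch.nn.functional.mse_loss(model(xt, t), eps)
```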
J Li, D Li, S Savarese, S Hoi - International conference on …, 2023 - proceedings.mlr.press
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and …
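BLIP-2 checkpoints are distributed through Hugging Face transformers, so a hedged usage sketch follows. It assumes the published `Salesforce/blip2-opt-2.7b` checkpoint and the transformers-side `Blip2Processor`/`Blip2ForConditionalGeneration` classes, which wrap the paper's frozen-encoder-plus-frozen-LLM design; this is inference with a released model, not the authors' training code.

```python
# Sketch: visual question answering with a released BLIP-2 checkpoint
# via Hugging Face transformers (assumes a recent version installed).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

name = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(
    name, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg").convert("RGB")  # any local image
inputs = processor(
    images=image, text="Question: what is shown here? Answer:",
    return_tensors="pt",
).to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```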
H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea …
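To make "machine-generated instruction-following data" concrete, here is what a single visual-instruction-tuning record looks like in the conversation format used by the public LLaVA codebase. The field names and the `<image>` placeholder follow that repo; the file path and the answer text are hypothetical, shown only to illustrate the shape of the data.

```python
# One LLaVA-style visual instruction record (illustrative, not from the
# actual dataset). "<image>" marks where image features enter the prompt.
record = {
    "id": "000001",
    "image": "coco/train2017/000000000001.jpg",  # hypothetical path
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt",
         "value": "A man is ironing clothes on the back of a moving taxi."},
    ],
}
```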
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
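The snippet names the idea; a minimal sketch of the underlying mechanism follows. Each non-image modality encoder is aligned to the image embedding of its naturally paired image with a symmetric InfoNCE loss, so all modalities land in one space. The normalization, temperature, and encoder interfaces below are generic contrastive-learning assumptions, not ImageBind's actual implementation.

```python
import torch
import torch.nn.functional as F

def infonce(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized embeddings of paired
    (image, other-modality) examples; matching pairs share a row index."""
    img = F.normalize(image_emb, dim=-1)
    oth = F.normalize(other_emb, dim=-1)
    logits = img @ oth.t() / temperature
    targets = torch.arange(img.shape[0], device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Because every modality is aligned to images, two modalities that never co-occur in training (say, audio and depth) become comparable through the shared image anchor, which is why not all combinations of paired data are needed.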
In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image. In SEEM, we propose a novel and versatile …
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early …
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …
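EVA's pretext task is masked feature prediction: a vanilla ViT sees a partially masked image and must reproduce the patch features of a frozen image-text-aligned teacher (such as CLIP) at the masked positions. The sketch below shows that objective in hedged form; the cosine loss, encoder interfaces, and tensor shapes are assumptions for illustration, not the released code.

```python
import torch
import torch.nn.functional as F

def masked_feature_loss(student, teacher, images, mask):
    """EVA-style objective, sketched. `student` and `teacher` map images
    to (B, N, D) patch features; `mask` is a (B, N) boolean tensor of
    masked patches. The teacher is frozen and sees the full image; the
    student sees the masked view and predicts the teacher's features."""
    with torch.no_grad():
        target = teacher(images)              # frozen teacher, full view
    pred = student(images, mask=mask)         # student, masked view
    # negative cosine similarity on masked patches only
    p = F.normalize(pred[mask], dim=-1)
    t = F.normalize(target[mask], dim=-1)
    return -(p * t).sum(dim=-1).mean()
```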
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the …
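The released Qwen-VL-Chat checkpoint can be driven through transformers' remote-code path; the sketch below follows the pattern in the model card's README. Note that `from_list_format` and `chat` are helpers defined in the checkpoint's remote code rather than in transformers itself, so treat the exact names as assumptions tied to that release.

```python
# Sketch: chatting with Qwen-VL-Chat via transformers remote code.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", trust_remote_code=True
).eval()

# Both helpers below come from the checkpoint's remote code.
query = tokenizer.from_list_format([
    {"image": "example.jpg"},             # local path or URL
    {"text": "What is in the picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```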