Adding conditional control to text-to-image diffusion models

L Zhang, A Rao, M Agrawala - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
We present ControlNet, a neural network architecture to add spatial conditioning controls to
large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large …
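A rough usage sketch, not code from the paper itself: ControlNet-style spatial conditioning is commonly run through the Hugging Face diffusers library as below; the checkpoint names, the Canny conditioning image, and the prompt are illustrative assumptions.

```python
# Minimal sketch (assumes the diffusers library, a CUDA device, and public Canny ControlNet weights).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Spatial-conditioning branch; the base text-to-image weights stay frozen ("locked").
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A Canny edge map acts as the spatial control signal alongside the text prompt.
edge_map = load_image("https://example.com/canny_edges.png")  # placeholder URL
image = pipe("a bird on a branch, photorealistic", image=edge_map).images[0]
image.save("controlnet_out.png")
```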

Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

S Liu, Z Zeng, T Ren, F Li, H Zhang, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we present an open-set object detector, called Grounding DINO, built by marrying
the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …
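A hedged usage sketch, not from the paper: one common way to run Grounding DINO is the Hugging Face transformers integration shown below; the model id, image path, text prompt, and detection thresholds are illustrative assumptions.

```python
# Minimal sketch (assumes the transformers Grounding DINO integration and a local test image).
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")

image = Image.open("scene.jpg")  # placeholder image path
# Open-set detection: categories are free-form text, lowercase and period-separated.
text = "a cat. a remote control."
inputs = processor(images=image, text=text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert logits and boxes into (label, score, box) detections above the chosen thresholds.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
for label, score, box in zip(results["labels"], results["scores"], results["boxes"]):
    print(label, round(score.item(), 3), box.tolist())
```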

T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

C Mou, X Wang, L Xie, Y Wu, J Zhang, Z Qi… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
The impressive generative ability of large-scale text-to-image (T2I) models demonstrates a strong
capacity for learning complex structures and meaningful semantics. However, relying …
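A minimal sketch, assuming the diffusers T2I-Adapter integration and publicly released adapter weights; the checkpoint names, control image, and prompt are illustrative and not taken from the paper.

```python
# Minimal sketch (assumes the diffusers T2I-Adapter integration and a CUDA device).
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

# The lightweight adapter digests the control signal; the T2I backbone itself is left untouched.
adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

edge_map = load_image("https://example.com/canny_edges.png")  # placeholder control image
image = pipe("a futuristic car, studio lighting", image=edge_map).images[0]
image.save("t2i_adapter_out.png")
```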

Generalized decoding for pixel, image, and language

X Zou, ZY Dou, J Yang, Z Gan, L Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present X-Decoder, a generalized decoding model that can predict pixel-level
segmentation and language tokens seamlessly. X-Decoder takes as input two types of …

DreamBooth3D: Subject-driven text-to-3D generation

A Raj, S Kaza, B Poole, M Niemeyer… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present DreamBooth3D, an approach to personalize text-to-3D generative models from
as few as 3-6 casually captured images of a subject. Our approach combines recent …

InstantBooth: Personalized text-to-image generation without test-time finetuning

J Shi, W Xiong, Z Lin, HJ Jung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recent advances in personalized image generation have enabled pre-trained text-to-image
models to learn new concepts from specific image sets. However, these methods often …

StableVideo: Text-driven consistency-aware diffusion video editing

W Chai, X Guo, G Wang, Y Lu - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Diffusion-based methods can generate realistic images and videos, but they struggle to edit
existing objects in a video while preserving their geometry over time. This prevents diffusion …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models with vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose …

Composer: Creative and controllable image synthesis with composable conditions

L Huang, D Chen, Y Liu, Y Shen, D Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent large-scale generative models trained on big data can synthesize impressive
images yet suffer from limited controllability. This work offers a new generation …

BoxDiff: Text-to-image synthesis with training-free box-constrained diffusion

J Xie, Y Li, Y Huang, H Liu, W Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent text-to-image diffusion models have demonstrated an astonishing capacity to
generate high-quality images. However, researchers have mainly studied ways of synthesizing …