Object-level Visual Prompts for Compositional Image Generation

G Parmar, O Patashnik, KC Wang, D Ostashev… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce a method for composing object-level visual prompts within a text-to-image
diffusion model. Our approach addresses the task of generating semantically coherent …

Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

T Wei, D Chen, Y Zhou, X Pan - arXiv preprint arXiv:2411.18301, 2024 - arxiv.org
Representing the cutting-edge technique of text-to-image models, the latest Multimodal
Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous …

Understanding Multi-Granularity for Open-Vocabulary Part Segmentation

J Choi, S Lee, S Lee, M Lee, H Shim - arXiv preprint arXiv:2406.11384, 2024 - arxiv.org
Open-vocabulary part segmentation (OVPS) is an emerging research area focused on
segmenting fine-grained entities based on diverse and previously unseen vocabularies. Our …

Harnessing Multimodal AI for Creative Design: Performance Evaluation of Stable Diffusion and DALL-E 3 in Fashion Apparel and Typography

KN Sai, U Wable, A Singh, N Koundinya… - 2024 International …, 2024 - ieeexplore.ieee.org
In recent years, multimodal AI (Artificial Intelligence) models have exhibited promising
capabilities in generating diverse forms of creative content. This review paper critically …

Video Diffusion Models Learn the Structure of the Dynamic World

Z Bao, A Bagchi, YX Wang, P Tokmakov, M Hebert - openreview.net
Diffusion models have demonstrated significant progress in visual perception tasks due to
their ability to capture fine-grained, object-centric features through large-scale vision …