T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation

K Huang, K Sun, E Xie, Z Li… - Advances in Neural …, 2023 - proceedings.neurips.cc
Despite the stunning ability of recent text-to-image models to generate high-quality images,
current approaches often struggle to effectively compose objects with different attributes and …

LayoutGPT: Compositional visual planning and generation with large language models

W Feng, W Zhu, T Fu, V Jampani… - Advances in …, 2024 - proceedings.neurips.cc
Attaining a high degree of user controllability in visual generation often requires intricate,
fine-grained inputs like layouts. However, such inputs impose a substantial burden on users …

SparseCtrl: Adding sparse controls to text-to-video diffusion models

Y Guo, C Yang, A Rao, M Agrawala, D Lin… - European Conference on …, 2024 - Springer
The development of text-to-video (T2V), i.e., generating videos from a given text prompt, has
been significantly advanced in recent years. However, relying solely on text prompts often …

Subject-Diffusion: Open-domain personalized text-to-image generation without test-time fine-tuning

J Ma, J Liang, C Chen, H Lu - ACM SIGGRAPH 2024 Conference …, 2024 - dl.acm.org
Recent progress in personalized image generation using diffusion models has been
significant. However, development in the area of open-domain and test-time fine-tuning-free …

Grounded text-to-image synthesis with attention refocusing

Q Phung, S Ge, JB Huang - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Driven by scalable diffusion models trained on large-scale datasets, text-to-image
synthesis methods have shown compelling results. However, these models still fail to …

Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment

R Rassin, E Hirsch, D Glickman… - Advances in …, 2024 - proceedings.neurips.cc
Text-conditioned image generation models often generate incorrect associations between
entities and their visual attributes. This reflects an impaired mapping between linguistic …

Compositional text-to-image synthesis with attention map control of diffusion models

R Wang, Z Chen, C Chen, J Ma, H Lu… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Recent text-to-image (T2I) diffusion models show outstanding performance in generating
high-quality images conditioned on textual prompts. However, they fail to semantically align …

Boosting consistency in story visualization with rich-contextual conditional diffusion models

F Shen, H Ye, S Liu, J Zhang, C Wang, X Han… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research showcases the considerable potential of conditional diffusion models for
generating consistent stories. However, current methods, which predominantly generate …

T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

K Sun, K Huang, X Liu, Y Wu, Z Xu, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video (T2V) generation models have advanced significantly, yet their ability to
compose different objects, attributes, actions, and motions into a video remains unexplored …

CONFORM: Contrast is all you need for high-fidelity text-to-image diffusion models

THS Meral, E Simsar, F Tombari… - Proceedings of the …, 2024 - openaccess.thecvf.com
Images produced by text-to-image diffusion models might not always faithfully represent the
semantic intent of the provided text prompt, where the model might overlook or entirely fail to …