T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation

K Huang, K Sun, E Xie, Z Li… - Advances in Neural …, 2023 - proceedings.neurips.cc
Despite the stunning ability of recent text-to-image models to generate high-quality images,
current approaches often struggle to effectively compose objects with different attributes and …

LayoutGPT: Compositional visual planning and generation with large language models

W Feng, W Zhu, T Fu, V Jampani… - Advances in …, 2024 - proceedings.neurips.cc
Attaining a high degree of user controllability in visual generation often requires intricate,
fine-grained inputs like layouts. However, such inputs impose a substantial burden on users …

SparseCtrl: Adding sparse controls to text-to-video diffusion models

Y Guo, C Yang, A Rao, M Agrawala, D Lin… - European Conference on …, 2024 - Springer
The development of text-to-video (T2V), i.e., generating videos from a given text prompt, has
been significantly advanced in recent years. However, relying solely on text prompts often …

Subject-Diffusion: Open-domain personalized text-to-image generation without test-time fine-tuning

J Ma, J Liang, C Chen, H Lu - ACM SIGGRAPH 2024 Conference …, 2024 - dl.acm.org
Recent progress in personalized image generation using diffusion models has been
significant. However, development in the area of open-domain and test-time fine-tuning-free …

Grounded text-to-image synthesis with attention refocusing

Q Phung, S Ge, JB Huang - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Driven by scalable diffusion models trained on large-scale datasets, text-to-image
synthesis methods have shown compelling results. However, these models still fail to …

Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment

R Rassin, E Hirsch, D Glickman… - Advances in …, 2024 - proceedings.neurips.cc
Text-conditioned image generation models often generate incorrect associations between
entities and their visual attributes. This reflects an impaired mapping between linguistic …

Compositional text-to-image synthesis with attention map control of diffusion models

R Wang, Z Chen, C Chen, J Ma, H Lu… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Recent text-to-image (T2I) diffusion models show outstanding performance in generating
high-quality images conditioned on textual prompts. However, they fail to semantically align …

Boosting consistency in story visualization with rich-contextual conditional diffusion models

F Shen, H Ye, S Liu, J Zhang, C Wang, X Han… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research showcases the considerable potential of conditional diffusion models for
generating consistent stories. However, current methods, which predominantly generate …

T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

K Sun, K Huang, X Liu, Y Wu, Z Xu, Z Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video (T2V) generation models have advanced significantly, yet their ability to
compose different objects, attributes, actions, and motions into a video remains unexplored …

CONFORM: Contrast is all you need for high-fidelity text-to-image diffusion models

THS Meral, E Simsar, F Tombari… - Proceedings of the …, 2024 - openaccess.thecvf.com
Images produced by text-to-image diffusion models might not always faithfully represent the
semantic intent of the provided text prompt, where the model might overlook or entirely fail to …