PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

J Chen, C Ge, E Xie, Y Wu, L Yao, X Ren… - … on Computer Vision, 2025 - Springer
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of
directly generating images at 4K resolution. PixArt-Σ represents a significant …

TextDiffuser-2: Unleashing the power of language models for text rendering

J Chen, Y Huang, T Lv, L Cui, Q Chen, F Wei - European Conference on …, 2025 - Springer
The diffusion model has proven to be a powerful generative model in recent years, yet
generating visual text remains a challenge. Although existing work has endeavored to …

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

LEDITS++: Limitless image editing using text-to-image models

M Brack, F Friedrich, K Kornmeier… - Proceedings of the …, 2024 - openaccess.thecvf.com
Text-to-image diffusion models have recently received increasing interest for their
astonishing ability to produce high-fidelity images from solely text inputs. Subsequent …

Representation alignment for generation: Training diffusion transformers is easier than you think

S Yu, S Kwak, H Jang, J Jeong, J Huang, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …

Pyramidal flow matching for efficient video generative modeling

Y Jin, Z Sun, N Li, K Xu, H Jiang, N Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity, the …

BK-SDM: A lightweight, fast, and cheap version of Stable Diffusion

BK Kim, HK Song, T Castells, S Choi - European Conference on Computer …, 2025 - Springer
Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high
computing demands due to billion-scale parameters. To enhance efficiency, recent studies …

ECLIPSE: A resource-efficient text-to-image prior for image generations

M Patel, C Kim, S Cheng, C Baral… - Proceedings of the …, 2024 - openaccess.thecvf.com
Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2),
achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks at …

Simple and scalable strategies to continually pre-train large language models

A Ibrahim, B Thérien, K Gupta, ML Richter… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start
the process over again once new data becomes available. A much more efficient solution is …

TurboEdit: Instant text-based image editing

Z Wu, N Kolkin, J Brandt, R Zhang… - European Conference on …, 2025 - Springer
We address the challenges of precise image inversion and disentangled image editing in
the context of few-step diffusion models. We introduce an encoder-based iterative inversion …