PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation

J Chen, C Ge, E Xie, Y Wu, L Yao, X Ren… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of
directly generating images at 4K resolution. PixArt-Σ represents a significant …

Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design

A Campbell, J Yim, R Barzilay, T Rainforth… - arXiv preprint arXiv …, 2024 - arxiv.org
Combining discrete and continuous data is an important capability for generative models.
We present Discrete Flow Models (DFMs), a new flow-based model of discrete data that …

Diffusion models meet remote sensing: Principles, methods, and perspectives

Y Liu, J Yue, S Xia, P Ghamisi, W Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
As a newly emerging advance in deep generative models, diffusion models have achieved
state-of-the-art results in many fields, including computer vision, natural language …

A survey on diffusion models for time series and spatio-temporal data

Y Yang, M Jin, H Wen, C Zhang, Y Liang, L Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
The study of time series data is crucial for understanding trends and anomalies over time,
enabling predictive insights across various sectors. Spatio-temporal data, on the other hand …

Diffusion-RWKV: Scaling RWKV-like architectures for diffusion models

Z Fei, M Fan, C Yu, D Li, J Huang - arXiv preprint arXiv:2404.04478, 2024 - arxiv.org
Transformers have catalyzed advancements in computer vision and natural language
processing (NLP) fields. However, substantial computational complexity poses limitations for …

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

X Chu, J Su, B Zhang, C Shen - arXiv preprint arXiv:2403.00522, 2024 - arxiv.org
Large language models are built on top of a transformer-based architecture to process
textual inputs. For example, the LLaMA stands out among many open-source …

Mora: Enabling generalist video generation via a multi-agent framework

Z Yuan, R Chen, Z Li, H Jia, L He, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Sora is the first large-scale generalist video generation model that garnered significant
attention across society. Since its launch by OpenAI in February 2024, no other video …

On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)

JYC Hu, W Wu, Z Li, Z Song, H Liu - arXiv preprint arXiv:2407.01079, 2024 - arxiv.org
We investigate the statistical and computational limits of latent Diffusion Transformers
(DiTs) under the low-dimensional linear latent space assumption …

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

R Li, C Zheng, C Rupprecht, A Vedaldi - arXiv preprint arXiv:2408.04631, 2024 - arxiv.org
We present Puppet-Master, an interactive video generative model that can serve as a motion
prior for part-level dynamics. At test time, given a single image and a sparse set of motion …

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

D Yang, R Huang, Y Wang, H Guo, D Chong… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective
method for improving the diversity and naturalness of synthesized speech. At the high level …