SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers

J Chen, C Ge, E Xie, Y Wu, L Yao, X Ren… - arXiv preprint arXiv …, 2024 - arxiv.org

In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of
directly generating images at 4K resolution. PixArt-Σ represents a significant …

Cited by 30 · Related articles · All 2 versions

Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design

A Campbell, J Yim, R Barzilay, T Rainforth… - arXiv preprint arXiv …, 2024 - arxiv.org

Combining discrete and continuous data is an important capability for generative models.
We present Discrete Flow Models (DFMs), a new flow-based model of discrete data that …

Cited by 21 · Related articles · All 6 versions

Diffusion models meet remote sensing: Principles, methods, and perspectives

Y Liu, J Yue, S Xia, P Ghamisi, W Xie… - arXiv preprint arXiv …, 2024 - arxiv.org

As a newly emerging advance in deep generative models, diffusion models have achieved
state-of-the-art results in many fields, including computer vision, natural language …

Cited by 3 · Related articles · All 2 versions

A survey on diffusion models for time series and spatio-temporal data

Y Yang, M Jin, H Wen, C Zhang, Y Liang, L Ma… - arXiv preprint arXiv …, 2024 - arxiv.org

The study of time series data is crucial for understanding trends and anomalies over time,
enabling predictive insights across various sectors. Spatio-temporal data, on the other hand …

Cited by 12 · Related articles · All 3 versions

Diffusion-RWKV: Scaling RWKV-like architectures for diffusion models

Z Fei, M Fan, C Yu, D Li, J Huang - arXiv preprint arXiv:2404.04478, 2024 - arxiv.org

Transformers have catalyzed advancements in computer vision and natural language
processing (NLP) fields. However, substantial computational complexity poses limitations for …

Cited by 8 · Related articles · All 2 versions

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

X Chu, J Su, B Zhang, C Shen - arXiv preprint arXiv:2403.00522, 2024 - arxiv.org

Large language models are built on top of a transformer-based architecture to process
textual inputs. For example, the LLaMA stands out among many open-source …

Cited by 5 · Related articles · All 2 versions

Mora: Enabling generalist video generation via a multi-agent framework

Z Yuan, R Chen, Z Li, H Jia, L He, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Sora is the first large-scale generalist video generation model that garnered significant
attention across society. Since its launch by OpenAI in February 2024, no other video …

Cited by 10 · Related articles · All 2 versions

On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)

JYC Hu, W Wu, Z Li, Z Song, H Liu - arXiv preprint arXiv:2407.01079, 2024 - arxiv.org

We investigate the statistical and computational limits of latent Diffusion
Transformers (DiTs) under the low-dimensional linear latent space assumption …

Cited by 2 · Related articles

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

R Li, C Zheng, C Rupprecht, A Vedaldi - arXiv preprint arXiv:2408.04631, 2024 - arxiv.org

We present Puppet-Master, an interactive video generative model that can serve as a motion
prior for part-level dynamics. At test time, given a single image and a sparse set of motion …

Cited by 1 · Related articles · All 2 versions

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

D Yang, R Huang, Y Wang, H Guo, D Chong… - arXiv preprint arXiv …, 2024 - arxiv.org

Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective
method for improving the diversity and naturalness of synthesized speech. At the high level …

Cited by 1 · Related articles · All 2 versions
