Related articles - Academic resource search

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

S Yang, Z Zhong, M Zhao, S Takahashi, M Ishii… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, with the realistic generation results and a wide range of personalized
applications, diffusion-based generative models gain huge attention in both visual and …

Cited by 1 Related articles All 2 versions

[PDF] thecvf.com

Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners

Y Xing, Y He, Z Tian, X Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Video and audio content creation serves as the core technique for the movie industry and
professional users. Recently existing diffusion-based methods tackle video and audio …

Cited by 8 Related articles All 3 versions

[PDF] arxiv.org

Audiotoken: Adaptation of text-conditioned diffusion models for audio-to-image generation

G Yariv, I Gat, L Wolf, Y Adi, I Schwartz - arXiv preprint arXiv:2305.13050, 2023 - arxiv.org

In recent years, image generation has shown a great leap in performance, where diffusion
models play a central role. Although generating high-quality images, such models are …

Cited by 13 Related articles All 4 versions

[PDF] arxiv.org

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

K Wang, S Deng, J Shi, D Hatzinakos… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating
high-quality single-modality content, including images, videos, and audio. However, it is still …

I hear your true colors: Image guided audio generation

R Sheffer, Y Adi - … 2023-2023 IEEE International Conference on …, 2023 - ieeexplore.ieee.org

We propose IM2WAV, an image guided open-domain audio generation system. Given an
input image or a sequence of images, IM2WAV generates a semantically relevant sound …

Cited by 33 Related articles All 4 versions

[PDF] apsipa.org

Diverse audio-to-image generation via semantics and feature consistency

PT Yang, FG Su, YCF Wang - 2020 Asia-Pacific Signal and …, 2020 - ieeexplore.ieee.org

Humans are capable of imagining scene images when hearing ambient sounds. Therefore,
audio-to-image synthesis becomes a challenging yet practical topic for both natural …

Cited by 9 Related articles All 2 versions

[PDF] mlr.press

Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press

Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Cited by 160 Related articles All 7 versions

[PDF] arxiv.org

Align, adapt and inject: Sound-guided unified image generation

Y Yang, K Zhang, Y Ge, W Shao, Z Xue, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org

Text-guided image generation has witnessed unprecedented progress due to the
development of diffusion models. Beyond text and image, sound is a vital element within the …

Cited by 4 Related articles All 3 versions

TA2V: Text-Audio Guided Video Generation

M Zhao, W Wang, T Chen, R Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

Recent conditional and unconditional video generation tasks have been accomplished
mainly based on generative adversarial network (GAN), diffusion, and autoregressive …

[PDF] arxiv.org

SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

BC Biner, FM Sofian, UB Karakaş, D Ceylan… - arXiv preprint arXiv …, 2024 - arxiv.org

We are witnessing a revolution in conditional image synthesis with the recent success of
large scale text-to-image generation methods. This success also opens up new …
