Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

S Yang, Z Zhong, M Zhao, S Takahashi, M Ishii… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, with the realistic generation results and a wide range of personalized
applications, diffusion-based generative models gain huge attention in both visual and …

Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners

Y Xing, Y He, Z Tian, X Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Video and audio content creation serves as the core technique for the movie industry and
professional users. Recently existing diffusion-based methods tackle video and audio …

AudioToken: Adaptation of text-conditioned diffusion models for audio-to-image generation

G Yariv, I Gat, L Wolf, Y Adi, I Schwartz - arXiv preprint arXiv:2305.13050, 2023 - arxiv.org
In recent years, image generation has shown a great leap in performance, where diffusion
models play a central role. Although generating high-quality images, such models are …

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

K Wang, S Deng, J Shi, D Hatzinakos… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating
high-quality single-modality content, including images, videos, and audio. However, it is still …

I hear your true colors: Image guided audio generation

R Sheffer, Y Adi - … 2023-2023 IEEE International Conference on …, 2023 - ieeexplore.ieee.org
We propose IM2WAV, an image guided open-domain audio generation system. Given an
input image or a sequence of images, IM2WAV generates a semantically relevant sound …

Diverse audio-to-image generation via semantics and feature consistency

PT Yang, FG Su, YCF Wang - 2020 Asia-Pacific Signal and …, 2020 - ieeexplore.ieee.org
Humans are capable of imagining scene images when hearing ambient sounds. Therefore,
audio-to-image synthesis becomes a challenging yet practical topic for both natural …

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Align, adapt and inject: Sound-guided unified image generation

Y Yang, K Zhang, Y Ge, W Shao, Z Xue, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-guided image generation has witnessed unprecedented progress due to the
development of diffusion models. Beyond text and image, sound is a vital element within the …

TA2V: Text-Audio Guided Video Generation

M Zhao, W Wang, T Chen, R Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Recent conditional and unconditional video generation tasks have been accomplished
mainly based on generative adversarial network (GAN), diffusion, and autoregressive …

SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

BC Biner, FM Sofian, UB Karakaş, D Ceylan… - arXiv preprint arXiv …, 2024 - arxiv.org
We are witnessing a revolution in conditional image synthesis with the recent success of
large scale text-to-image generation methods. This success also opens up new …