Deep learning in human activity recognition with wearable sensors: A review on advances

S Zhang, Y Li, S Zhang, F Shahabi, S Xia, Y Deng… - Sensors, 2022 - mdpi.com
Mobile and wearable devices have enabled numerous applications, including activity
tracking, wellness monitoring, and human–computer interaction, that measure and improve …

Deep multi-view learning methods: A review

X Yan, S Hu, Y Mao, Y Ye, H Yu - Neurocomputing, 2021 - Elsevier
Multi-view learning (MVL) has attracted increasing attention and achieved great practical
success by exploiting complementary information of multiple features or modalities …

MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation

L Ruan, Y Ma, H Yang, H He, B Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose the first joint audio-video generation framework that brings engaging watching
and listening experiences simultaneously, towards high-quality realistic videos. To generate …

Diffsound: Discrete diffusion model for text-to-sound generation

D Yang, J Yu, H Wang, W Wang… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Generating sound effects that people want is an important topic. However, there are limited
studies in this area for sound generation. In this study, we investigate generating sound …

Everything at once – multi-modal fusion transformer for video retrieval

N Shvetsova, B Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks …

VisualVoice: Audio-visual speech separation with cross-modal consistency

R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org
We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …

The sound of pixels

H Zhao, C Gan, A Rouditchenko… - Proceedings of the …, 2018 - openaccess.thecvf.com
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos,
learns to locate image regions which produce sounds and separate the input sounds into a …

Listen to look: Action recognition by previewing audio

R Gao, TH Oh, K Grauman… - Proceedings of the …, 2020 - openaccess.thecvf.com
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly
impractical. We propose a framework for efficient action recognition in untrimmed video that …

Audio-visual event localization in unconstrained videos

Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …

Dancing to music

HY Lee, X Yang, MY Liu, TC Wang… - Advances in neural …, 2019 - proceedings.neurips.cc
Dancing to music is an instinctive move by humans. Learning to model the music-to-dance
generation process is, however, a challenging problem. It requires significant efforts to …