An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

VisualVoice: Audio-visual speech separation with cross-modal consistency

R Gao, K Grauman - 2021 IEEE/CVF Conference on Computer …, 2021 - ieeexplore.ieee.org
We introduce a new approach for audio-visual speech separation. Given a video, the goal is
to extract the speech associated with a face in spite of simultaneous background sounds …

Learning to separate object sounds by watching unlabeled video

R Gao, R Feris, K Grauman - Proceedings of the European …, 2018 - openaccess.thecvf.com
Perceiving a scene most fully requires all the senses. Yet modeling how objects look and
sound is challenging: most natural scenes and events contain multiple objects, and the …

2.5D visual sound

R Gao, K Grauman - … of the IEEE/CVF Conference on …, 2019 - openaccess.thecvf.com
Binaural audio provides a listener with 3D sound sensation, allowing a rich perceptual
experience of the scene. However, binaural recordings are scarcely available and require …

Co-separating sounds of visual objects

R Gao, K Grauman - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Learning how objects sound from video is challenging, since they often heavily overlap in a
single audio channel. Current methods for visually-guided audio source separation sidestep …

Positive sample propagation along the audio-visual event line

J Zhou, L Zheng, Y Zhong, S Hao… - Proceedings of the …, 2021 - openaccess.thecvf.com
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …

Contrastive positive sample propagation along the audio-visual event line

J Zhou, D Guo, M Wang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …

Sep-Stereo: Visually guided stereophonic audio generation by associating source separation

H Zhou, X Xu, D Lin, X Wang, Z Liu - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
Stereophonic audio is an indispensable ingredient to enhance human auditory experience.
Recent research has explored the usage of visual information as guidance to generate …

Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications

B Li, X Liu, K Dinesh, Z Duan… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
We introduce a dataset for facilitating audio-visual analysis of music performances. The
dataset comprises 44 simple multi-instrument classical music pieces assembled from …

Audio-visual speech codecs: Rethinking audio-visual speech enhancement by re-synthesis

K Yang, D Marković, S Krenn… - Proceedings of the …, 2022 - openaccess.thecvf.com
Since facial actions such as lip movements contain significant information about speech
content, it is not surprising that audio-visual speech enhancement methods are more …