An overview of deep-learning-based audio-visual speech enhancement and separation

Authors

Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Publication date

2021/3/17

Journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Volume

Pages

1368-1396

Publisher

IEEE

Description

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted …

Total citations

Cited by 244

Citations per year: 2020: 2, 2021: 31, 2022: 60, 2023: 93, 2024: 56
