Authors
Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen
Publication date
2021/3/17
Source
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume
29
Pages
1368-1396
Publisher
IEEE
Description
Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted …
Total citations
[Chart: citations per year, 2020 to 2024]
Scholar articles
D Michelsanti, ZH Tan, SX Zhang, Y Xu, M Yu, D Yu… - IEEE/ACM Transactions on Audio, Speech, and …, 2021