An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

A Two-stage Audio-Visual Speech Separation Method without Visual Signals for Testing and Tuples Loss with Dynamic Margin

Y Liu, Y Deng, Y Wei - IEEE Journal of Selected Topics in …, 2024 - ieeexplore.ieee.org
Speech separation as a fundamental task in signal processing can be used in many types of
intelligent robots, and audio-visual (AV) speech separation has been proven to be superior …

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

VA Kalkhorani, C Yu, A Kumar, K Tan, B Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Adding visual cues to audio-based speech separation can improve separation performance.
This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement …

Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation

X Wang, X Kong, X Peng, Y Lu - arXiv preprint arXiv:2207.01197, 2022 - arxiv.org
In this paper we propose a multi-modal multi-correlation learning framework targeting the
task of audio-visual speech separation. Although previous efforts have been extensively put …

Cross-modal Speech Separation Without Visual Information During Testing

Y Liu, Y Deng, Y Wei - 2023 IEEE Biomedical Circuits and …, 2023 - ieeexplore.ieee.org
Visual information plays an important role in speech separation. It has been illustrated by
many studies that audio-visual speech separation has better performance than audio-only …

Towards Light-Weight and High Performance Speech Enhancement and Recognition Using Mixed Precision Neural Network Quantization

J Xu - 2022 - search.proquest.com
Automatic speech recognition (ASR) including the speech enhancement front-end,
recognition back-end which is made up of acoustic models and language models stays a …