Deep learning for environmentally robust speech recognition: An overview of recent developments

Z Zhang, J Geiger, J Pohjalainen, AED Mousa… - ACM Transactions on …, 2018 - dl.acm.org
Eliminating the negative effect of non-stationary environmental noise is a long-standing
research topic for automatic speech recognition, yet it remains an important challenge …

Voice separation with an unknown number of multiple speakers

E Nachmani, Y Adi, L Wolf - International Conference on …, 2020 - proceedings.mlr.press
We present a new method for separating a mixed audio sequence, in which multiple voices
speak simultaneously. The new method employs gated neural networks that are trained to …

SpEx: Multi-scale time domain speaker extraction network

C Xu, W Rao, ES Chng, H Li - IEEE/ACM transactions on audio …, 2020 - ieeexplore.ieee.org
Speaker extraction aims to mimic humans' selective auditory attention by extracting a target
speaker's voice from a multi-talker environment. It is common to perform the extraction in …

End-to-end microphone permutation and number invariant multi-channel speech separation

Y Luo, Z Chen, N Mesgarani… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
An important problem in ad-hoc microphone speech separation is how to guarantee the
robustness of a system with respect to the locations and numbers of microphones. The …

ADL-MVDR: All deep learning MVDR beamformer for target speech separation

Z Zhang, Y Xu, M Yu, SX Zhang… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Speech separation algorithms are often used to separate the target speech from other
interfering sources. However, purely neural-network-based speech separation systems often …
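
For context, the MVDR in the title refers to the classical minimum variance distortionless response beamformer; the following is a minimal statement of that standard formulation (not the paper's all-deep-learning variant), assuming a noise spatial covariance matrix and a target steering vector estimated per frequency bin:

\[
\mathbf{w}(f) = \frac{\boldsymbol{\Phi}_{NN}^{-1}(f)\,\mathbf{d}(f)}{\mathbf{d}^{\mathsf H}(f)\,\boldsymbol{\Phi}_{NN}^{-1}(f)\,\mathbf{d}(f)},
\qquad
\hat{S}(t,f) = \mathbf{w}^{\mathsf H}(f)\,\mathbf{X}(t,f),
\]

where \(\boldsymbol{\Phi}_{NN}(f)\) is the noise spatial covariance matrix, \(\mathbf{d}(f)\) is the target steering vector (in neural beamformers both are typically derived from network-estimated masks), \(\mathbf{X}(t,f)\) is the multichannel STFT of the mixture, and \(\hat{S}(t,f)\) is the enhanced target estimate.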

FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing

Y Luo, C Han, N Mesgarani, E Ceolini… - 2019 IEEE automatic …, 2019 - ieeexplore.ieee.org
Beamforming has been extensively investigated for multi-channel audio processing tasks.
Recently, learning-based beamforming methods, sometimes called neural beamformers …

Unified architecture for multichannel end-to-end speech recognition with neural beamforming

T Ochiai, S Watanabe, T Hori… - IEEE Journal of …, 2017 - ieeexplore.ieee.org
This paper proposes a unified architecture for end-to-end automatic speech recognition
(ASR) to encompass microphone-array signal processing such as a state-of-the-art neural …

Multichannel end-to-end speech recognition

T Ochiai, S Watanabe, T Hori… - … conference on machine …, 2017 - proceedings.mlr.press
The field of speech recognition is in the midst of a paradigm shift: end-to-end neural
networks are challenging the dominance of hidden Markov models as a core technology …

End-to-end dereverberation, beamforming, and speech recognition in a cocktail party

W Zhang, X Chang, C Boeddeker… - … on Audio, Speech …, 2022 - ieeexplore.ieee.org
Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention
in recent years. Most existing methods feature a signal processing frontend and an ASR …

Time-domain speaker extraction network

C Xu, W Rao, ES Chng, H Li - 2019 IEEE Automatic Speech …, 2019 - ieeexplore.ieee.org
Speaker extraction extracts a target speaker's voice from multi-talker speech, simulating
the human cocktail-party effect, or selective listening ability. Prior work mostly performs …