UNSSOR: unsupervised neural speech separation by leveraging over-determined training mixtures

ZQ Wang, S Watanabe - Advances in Neural Information …, 2024 - proceedings.neurips.cc
In reverberant conditions with multiple concurrent speakers, each microphone acquires a
mixture signal of multiple speakers at a different location. In over-determined conditions …

TF-Locoformer: Transformer with local modeling by convolution for speech separation and enhancement

K Saijo, G Wichern, FG Germain, Z Pan… - … on Acoustic Signal …, 2024 - ieeexplore.ieee.org
Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation.
While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they …

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

W Zhang, R Scheibler, K Saijo, S Cornell, C Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The last decade has witnessed significant advancements in deep learning-based speech
enhancement (SE). However, most existing SE research has limitations on the coverage of …

Improving Design of Input Condition Invariant Speech Enhancement

W Zhang, J Jung, Y Qian - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Building a single universal speech enhancement (SE) system that can handle arbitrary input
is an in-demand but underexplored research topic. Towards this ultimate goal, one direction …

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

Y Wu, S Maiti, Y Peng, W Zhang, C Li, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in language models have significantly enhanced performance in
multiple speech-related tasks. Existing speech language models typically utilize task …

Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling

L Zhang, W Zhang, C Li, Y Qian - arXiv preprint arXiv:2412.14890, 2024 - arxiv.org
Recent speech enhancement models have shown impressive performance gains by scaling
up model complexity and training data. However, the impact of dataset variability (e.g., text …

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

Y Yang, Z Liu, W Yu, G Sun, Q Kong… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion-based generative models have recently achieved remarkable results in speech
and vocal enhancement due to their ability to model complex speech data distributions …

Task-Aware Unified Source Separation

K Saijo, J Ebbers, FG Germain, G Wichern… - arXiv preprint arXiv …, 2024 - arxiv.org
Several attempts have been made to handle multiple source separation tasks such as
speech enhancement, speech separation, sound event separation, music source separation …

Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

KC Wang, YJ Li, WL Chen, YW Chen, YC Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Noise robustness is critical when applying automatic speech recognition (ASR) in real-world
scenarios. One solution involves the use of speech enhancement (SE) models as the front …

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

W Zhang, K Saijo, J Jung, C Li, S Watanabe… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep learning-based speech enhancement (SE) models have achieved impressive
performance in the past decade. Numerous advanced architectures have been designed to …