R Liu, B Sisman, H Li - arXiv preprint arXiv:2104.01408, 2021 - arxiv.org
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable by its intended emotion …
Y Zhou, X Tian, H Li - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
Cross-lingual personalized speech generation seeks to synthesize a target speaker's voice from only a few training samples that are in a different language. One popular technique is to …
W Wang, Z Pan, X Li, S Wang… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Speech separation seeks to separate individual speech signals from a speech mixture. Typically, most separation models are trained on synthetic data due to the unavailability of …
Y Zhou, Z Wu, X Tian, H Li - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Cross-lingual voice conversion (XVC) transforms the speaker identity of a source speaker to that of a target speaker who speaks a different language. Due to the intrinsic differences …
Z Zhang, Q Liu - IEEE Signal Processing Letters, 2021 - ieeexplore.ieee.org
Feature extraction plays an important role before pattern recognition takes place. The existing artificial neural networks (ANNs), however, ignore to learn and represent temporal …
Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in the realm of audio-visual learning. AVSS transforms one speaker's speech into another's …
Cross-Lingual Voice Conversion (XVC) aims to modify a source speaker identity towards a target while preserving the source linguistic content. This paper introduces a cycle …
X Tian, K Fu, S Gao, Y Gu, K Wang, W Li, Z Ma… - 2022 - isca-archive.org
Automatic speech quality assessment aims to train a model capable of automatically measuring the performance of synthesis systems. This is a challenging task, especially …
S Yan, S Chen, Y Xu, D Ke - International Conference on Artificial …, 2023 - Springer
Voice conversion aims to change the timbre of the source speaker to that of the target speaker without changing the speech content. The cross-lingual voice conversion requires …