An overview of voice conversion systems

SH Mohammadi, A Kain - Speech Communication, 2017 - Elsevier
Voice transformation (VT) aims to change one or more aspects of a speech signal while
preserving linguistic information. A subset of VT, voice conversion (VC) specifically aims to …

FastSpeech 2: Fast and high-quality end-to-end text to speech

Y Ren, C Hu, X Tan, T Qin, S Zhao, Z Zhao… - arXiv preprint arXiv …, 2020 - arxiv.org
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize
speech significantly faster than previous autoregressive models with comparable quality …

Emotional voice conversion: Theory, databases and ESD

K Zhou, B Sisman, R Liu, H Li - Speech Communication, 2022 - Elsevier
In this paper, we first provide a review of the state-of-the-art emotional voice conversion
research, and the existing emotional speech databases. We then motivate the development …

Expressive TTS training with frame and style reconstruction loss

R Liu, B Sisman, G Gao, H Li - IEEE/ACM Transactions on …, 2021 - ieeexplore.ieee.org
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system that
improves the speech styling at utterance level. One of the key challenges in prosody …

Transforming spectrum and prosody for emotional voice conversion with non-parallel training data

K Zhou, B Sisman, H Li - arXiv preprint arXiv:2002.00198, 2020 - arxiv.org
Emotional voice conversion aims to convert the spectrum and prosody to change the
emotional patterns of speech, while preserving the speaker identity and linguistic content …

From speaker to dubber: movie dubbing with prosody and duration consistency learning

Z Zhang, L Li, G Cong, H Yin, Y Gao, C Yan… - Proceedings of the …, 2024 - dl.acm.org
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in
both temporal and emotional aspects while preserving the vocal timbre of one brief …

Converting anyone's emotion: Towards speaker-independent emotional voice conversion

K Zhou, B Sisman, M Zhang, H Li - arXiv preprint arXiv:2005.07025, 2020 - arxiv.org
Emotional voice conversion aims to convert the emotion of speech from one state to another
while preserving the linguistic content and speaker identity. The prior studies on emotional …

Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

H Ming, DY Huang, L Xie, J Wu, M Dong, H Li - Interspeech, 2016 - isca-archive.org
Emotional voice conversion aims at converting speech from one emotion state to another.
This paper proposes to model timbre and prosody features using a deep bidirectional long …

Hierarchical representation and estimation of prosody using continuous wavelet transform

A Suni, J Šimko, D Aalto, M Vainio - Computer Speech & Language, 2017 - Elsevier
Prominences and boundaries are the essential constituents of prosodic structure in speech.
They provide for means to chunk the speech stream into linguistically relevant units by …

Fusion of spectral and prosody modelling for multilingual speech emotion conversion

S Vekkot, D Gupta - Knowledge-Based Systems, 2022 - Elsevier
The paper proposes an integrated speech emotion conversion framework developed using
speaker-independent mixed-lingual training. The key contribution of the work is non-parallel …