Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to …
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is …
D Wang, L Deng, YT Yeung, X Chen, X Liu… - arXiv preprint arXiv …, 2021 - arxiv.org
One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech …
We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of …
DY Wu, YH Chen, HY Lee - arXiv preprint arXiv:2006.04154, 2020 - arxiv.org
Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another speaker's while preserving the linguistic content. It is still a …
YH Chen, DY Wu, TH Wu, H Lee - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
Recently, voice conversion (VC) has been widely studied. Many VC systems use disentanglement-based learning techniques to separate the speaker and the linguistic content …
V Popov, I Vovk, V Gogoryan, T Sadekova… - arXiv preprint arXiv …, 2021 - arxiv.org
Voice conversion is a common speech synthesis task that can be solved in different ways depending on the particular real-world scenario. The most challenging one, often referred to as …
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated …
Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody …