Unsupervised speech decomposition via triple information bottleneck

K Qian, Y Zhang, S Chang… - International …, 2020 - proceedings.mlr.press
Speech information can be roughly decomposed into four components: language content,
timbre, pitch, and rhythm. Obtaining disentangled representations of these components is …

Controllable neural text-to-speech synthesis using intuitive prosodic features

T Raitio, R Rasipuram, D Castellani - arXiv preprint arXiv:2009.06775, 2020 - arxiv.org
Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable
from natural speech. However, the prosody of generated utterances often represents the …

Towards unsupervised speech recognition and synthesis with quantized speech representation learning

AH Liu, T Tu, H Lee, L Lee - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE)
to learn from primarily unpaired audio data and produce sequences of representations …

Combining automatic speaker verification and prosody analysis for synthetic speech detection

L Attorresi, D Salvi, C Borrelli, P Bestagini… - … Conference on Pattern …, 2022 - Springer
The rapid spread of media content synthesis technology and the potentially damaging
impact of audio and video deepfakes on people's lives have raised the need to implement …

Prosody-controllable spontaneous TTS with neural HMMs

H Lameris, S Mehta, GE Henter… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Spontaneous speech has many affective and pragmatic functions that are interesting and
challenging to model in TTS. However, the presence of reduced articulation, fillers …

Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS

K Kurihara, N Seiyama, T Kumano - IEICE Transactions on …, 2021 - search.ieice.org
This paper describes a method to control prosodic features using phonetic and prosodic
symbols as input of attention-based sequence-to-sequence (seq2seq) acoustic modeling …

Exact prosody cloning in zero-shot multispeaker text-to-speech

F Lux, J Koch, NT Vu - 2022 IEEE Spoken Language …, 2023 - ieeexplore.ieee.org
The cloning of a speaker's voice using an untranscribed reference sample is one of the great
advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the …

Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS

T Raitio, J Li, S Seshadri - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from
natural speech. However, the synthetic speech often represents the average prosodic style …

Supervised and unsupervised approaches for controlling narrow lexical focus in sequence-to-sequence speech synthesis

S Shechtman, R Fernandez… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in
speech synthesis, capable of generating outputs that approach the perceptual quality of …

Transplantation of conversational speaking style with interjections in sequence-to-sequence speech synthesis

R Fernandez, D Haws, G Lorberbom… - arXiv preprint arXiv …, 2022 - arxiv.org
Sequence-to-Sequence Text-to-Speech architectures that directly generate low level
acoustic features from phonetic sequences are known to produce natural and expressive …