Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

H Barakat, O Turk, C Demiroglu - EURASIP Journal on Audio, Speech, and …, 2024 - Springer
Speech synthesis has made significant strides thanks to the transition from machine learning
to deep learning models. Contemporary text-to-speech (TTS) models possess the capability …

Beyond Deep Learning: Charting the Next Frontiers of Affective Computing

A Triantafyllopoulos, L Christ, A Gebhard… - Intelligent …, 2024 - spj.science.org
Affective computing (AC), as most other areas of computational research, has benefited
tremendously from advances in deep learning (DL). These advances have opened up new …

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data

M Łajszczak, G Cámbara, Y Li, F Beyhan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big
Adaptive Streamable TTS with Emergent abilities …

Effect of attention and self-supervised speech embeddings on non-semantic speech tasks

P Mohapatra, A Pandey, Y Sui, Q Zhu - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Human emotion understanding is pivotal in making conversational technology mainstream.
We view speech emotion understanding as a perception task which is a more realistic …

Turn Left Turn Right - Delving type and modality of instructions in navigation assistant systems for people with visual impairments

B Kuriakose, IM Ness, M Åskov Tengstedt… - International Journal of …, 2023 - Elsevier
Receiving navigation directions and relevant information through appropriate channels is
crucial for individuals with visual impairments when they use navigation assistant systems …

EMOCONV-Diff: Diffusion-Based Speech Emotion Conversion for Non-Parallel and in-the-Wild Data

NR Prabhu, B Lay, S Welker… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Speech emotion conversion is the task of converting the expressed emotion of a spoken
utterance to a target emotion while preserving the lexical content and speaker identity. While …

MDRT: Multi-domain synthetic speech localization

AKS Yadav, K Bhagtani, S Baireddy… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
With recent advancements in generating synthetic speech, tools to generate high-quality
synthetic speech impersonating any human speaker are easily available. Several incidents …

Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis

S Inoue, K Zhou, S Wang, H Li - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS)
synthesis. Prior studies have primarily focused on learning a global prosodic representation …

Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators

W Hutiri, O Papakyriakopoulos, A Xiang - The 2024 ACM Conference on …, 2024 - dl.acm.org
The rapid and wide-scale adoption of AI to generate human speech poses a range of
significant ethical and safety risks to society that need to be addressed. For example, a …

Improved Dendritic Learning: Activation Function Analysis

Y Wang, Y Yu, T Zhang, K Song, Y Wang, S Gao - Information Sciences, 2024 - Elsevier
This study conducted a thorough evaluation of an improved dendritic learning (DL)
framework, focusing specifically on its application in power load forecasting. The objective …