NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models

YA Li, C Han, V Raghavan… - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

C Zhang, Y Liu, Y Zheng, S Zhao - arXiv preprint arXiv:2406.04633, 2024 - arxiv.org
Scaling text-to-speech (TTS) with autoregressive language model (LM) to large-scale
datasets by quantizing waveform into discrete speech tokens is making great progress to …

Multi-Modal Retrieval For Large Language Model Based Speech Recognition

A Gourav, J Kolehmainen, P Shivakumar… - Findings of the …, 2024 - aclanthology.org
Retrieval is a widely adopted approach for improving language models leveraging external
information. As the field moves towards multi-modal large language models, it is important to …

Automatic Speech Recognition in Psychiatric Interviews: A Rocket to Diagnostic Support in Psychosis

JTG Molina, PA Gaspar, A Figueroa-Barra - Revista Colombiana de …, 2024 - Elsevier
Speech analysis is a crucial tool in discerning the complex cognitive and emotional
subtleties of individuals. It holds a significant role in psychiatric research, particularly in the …

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

D Kim, S Hong, YH Choi - arXiv preprint arXiv:2307.10550, 2023 - arxiv.org
Expressive speech synthesis models are trained by adding corpora with diverse speakers,
various emotions, and different speaking styles to the dataset, in order to control various …

Multi-modal retrieval for large language model based speech recognition

J Kolehmainen, A Gourav, PG Shivakumar… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval is a widely adopted approach for improving language models leveraging external
information. As the field moves towards multi-modal large language models, it is important to …

Towards Natural-Sounding Speech to Text in English

K Saulitis, E Urtans, V Caune - … Conference on Deep Learning Theory and …, 2024 - Springer
This study focuses on a systematic review of the literature and an experimental comparison
of 20 English speech synthesis methods. Nine of the models were subjected to a …

Better Text Compression Using a Large Language Model

D Shin - 2023 - tdcommons.org
Conventional compression techniques for text are based on typical frequencies of individual
letters within the text, independent of higher-level semantics. This disclosure describes a …