FoundationTTS: Text-to-speech for ASR customization with generative language model

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Cited by 133 - Related articles - All 3 versions

StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models

YA Li, C Han, V Raghavan… - Advances in Neural …, 2024 - proceedings.neurips.cc

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …

Cited by 44 - Related articles - All 6 versions

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

C Zhang, Y Liu, Y Zheng, S Zhao - arXiv preprint arXiv:2406.04633, 2024 - arxiv.org

Scaling text-to-speech (TTS) with autoregressive language model (LM) to large-scale
datasets by quantizing waveform into discrete speech tokens is making great progress to …

Multi-Modal Retrieval For Large Language Model Based Speech Recognition

A Gourav, J Kolehmainen, P Shivakumar… - Findings of the …, 2024 - aclanthology.org

Retrieval is a widely adopted approach for improving language models leveraging external
information. As the field moves towards multi-modal large language models, it is important to …

Automatic Speech Recognition in Psychiatric Interviews: A Rocket to Diagnostic Support in Psychosis

JTG Molina, PA Gaspar, A Figueroa-Barra - Revista Colombiana de …, 2024 - Elsevier

Speech analysis is a crucial tool in discerning the complex cognitive and emotional
subtleties of individuals. It holds a significant role in psychiatric research, particularly in the …

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

D Kim, S Hong, YH Choi - arXiv preprint arXiv:2307.10550, 2023 - arxiv.org

Expressive speech synthesis models are trained by adding corpora with diverse speakers,
various emotions, and different speaking styles to the dataset, in order to control various …

Cited by 2 - Related articles - All 2 versions

Multi-modal retrieval for large language model based speech recognition

J Kolehmainen, A Gourav, PG Shivakumar… - arXiv preprint arXiv …, 2024 - arxiv.org

Retrieval is a widely adopted approach for improving language models leveraging external
information. As the field moves towards multi-modal large language models, it is important to …

Towards Natural-Sounding Speech to Text in English

K Saulitis, E Urtans, V Caune - … Conference on Deep Learning Theory and …, 2024 - Springer

This study focuses on a systematic review of the literature and an experimental comparison
of 20 English speech synthesis methods. Nine of the models were subjected to a …

[PDF] Better Text Compression Using a Large Language Model

D Shin - 2023 - tdcommons.org

Conventional compression techniques for text are based on typical frequencies of individual
letters within the text, independent of higher-level semantics. This disclosure describes a …
