ZMM-TTS: Zero-shot multilingual and multispeaker speech synthesis conditioned on self-supervised discrete speech representations

C Gong, X Wang, E Cooper, D Wells… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …

BERTIVITS: The Posterior Encoder Fusion of Pre-Trained Models and Residual Skip Connections for End-to-End Speech Synthesis

Z Wang, M Song, D Zhou - Applied Sciences, 2024 - mdpi.com
Enhancing the naturalness and rhythmicity of generated audio in end-to-end speech
synthesis is crucial. The current state-of-the-art (SOTA) model, VITS, utilizes a conditional …

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

C Gong, E Cooper, X Wang, C Qiang, M Geng… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised learning (SSL) representations from massively multilingual models offer a
promising solution for low-resource language speech tasks. Despite advancements …

An Effective Contextualized Automatic Speech Recognition Approach Leveraging Self-Supervised Phoneme Features

LT Pai, YC Wang, BC Yan, HW Wang… - 2024 Asia Pacific …, 2024 - ieeexplore.ieee.org
Years of scholarly efforts have led to extensive studies on end-to-end automatic speech
recognition (E2E ASR), now demonstrating robust performance in everyday applications …

From ASR to TTS: Enhancing Synthesis with Cleaned ASR Data

A Surinrangsee, A Thangthai - 2024 19th International Joint …, 2024 - ieeexplore.ieee.org
This paper investigates a novel method for handling the scarcity of appropriate speech
corpora for Text-to-Speech (TTS) systems in the Thai language by repurposing Automatic …