As ChatGPT goes viral, generative AI (AIGC, aka AI-generated content) has made headlines everywhere because of its ability to analyze and create text, images, and beyond. With such …
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from …
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists …
Abstract Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural …
E Kharitonov, D Vincent, Z Borsos… - Transactions of the …, 2023 - direct.mit.edu
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we …
Conventional methods for human motion synthesis have either been deterministic or have had to struggle with the trade-off between motion diversity vs motion quality. In response to …
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero …
J Kim, J Kong, J Son - International Conference on Machine …, 2021 - proceedings.mlr.press
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two …
Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios …