ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

H Liu, R Huang, X Lin, W Xu, M Zheng, H Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly
with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the …

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

X Cheng, R Huang, L Li, T Jin, Z Wang, A Yin… - arXiv preprint arXiv …, 2023 - arxiv.org
Direct speech-to-speech translation achieves high-quality results through the introduction of
discrete units obtained from self-supervised learning. This approach circumvents delays and …