VioLA: Conditional language models for speech recognition, synthesis, and translation

T Wang, L Zhou, Z Zhang, Y Wu, S Liu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Recent research shows a big convergence in model architecture, training objectives, and
inference methods across various tasks for different modalities. In this paper, we propose …

SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing

J Ao, R Wang, L Zhou, C Wang, S Ren, Y Wu… - arXiv preprint arXiv …, 2021 - arxiv.org
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural
language processing models, we propose a unified-modal SpeechT5 framework that …

NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

G Mittag, B Naderi, A Chehadi, S Möller - arXiv preprint arXiv:2104.09494, 2021 - arxiv.org
In this paper, we present an update to the NISQA speech quality prediction model that is
focused on distortions that occur in communication networks. In contrast to the previous …

The VoiceMOS Challenge 2022

WC Huang, E Cooper, Y Tsao, HM Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to
promote the study of automatic prediction of the mean opinion score (MOS) of synthetic …

The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation

Y Yoon, P Wolfert, T Kucherenko, C Viegas… - Proceedings of the …, 2022 - dl.acm.org
This paper reports on the second GENEA Challenge to benchmark data-driven automatic
co-speech gesture generation. Participating teams used the same speech and motion dataset …

A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020

T Kucherenko, P Jonell, Y Yoon, P Wolfert… - Proceedings of the 26th …, 2021 - dl.acm.org
Co-speech gestures, gestures that accompany speech, play an important role in human
communication. Automatic co-speech gesture generation is thus a key enabling technology …

LDNet: Unified listener dependent modeling in MOS prediction for synthetic speech

WC Huang, E Cooper, J Yamagishi… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
An effective approach to automatically predict the subjective rating for synthetic speech is to
train on a listening test dataset with human-annotated scores. Although each speech sample …

A review on subjective and objective evaluation of synthetic speech

E Cooper, WC Huang, Y Tsao, HM Wang… - Acoustical Science …, 2024 - jstage.jst.go.jp
Evaluating synthetic speech generated by machines is a complicated process, as it involves
judging along multiple dimensions including naturalness, intelligibility, and whether the …

Boosting large language model for speech synthesis: An empirical study

H Hao, L Zhou, S Liu, J Li, S Hu, R Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have made significant advancements in natural language
processing and are concurrently extending the language ability to other modalities, such as …

TIMIT-TTS: A text-to-speech dataset for multimodal synthetic media detection

D Salvi, B Hosler, P Bestagini, MC Stamm… - IEEE …, 2023 - ieeexplore.ieee.org
With the rapid development of deep learning techniques, the generation and counterfeiting
of multimedia material has become increasingly simple. Current technology enables the …