AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

J Choi, SJ Park, M Kim, YM Ro - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (i.e., …

ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech

J Shi, J Tian, Y Wu, J Jung, JQ Yip… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …

Muskits-ESPnet: A comprehensive toolkit for singing voice synthesis in new paradigm

Y Wu, J Shi, Y Yu, Y Tang, T Qian, Y Lin, J Han… - Proceedings of the …, 2024 - dl.acm.org
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to
Singing Voice Synthesis (SVS) through the application of pretrained audio models in both …

Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation

M Kim, J Choi, D Kim, YM Ro - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
This paper proposes a textless training method for many-to-many multilingual
speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based …

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

J Shi, X Ma, H Inaguma, A Sun, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org
Speech discrete representation has proven effective in various downstream applications
due to its superior compression rate of the waveform, fast convergence during training, and …

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

X Chang, J Shi, J Tian, Y Wu, Y Tang, Y Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Representing speech and audio signals in discrete units has become a compelling
alternative to traditional high-dimensional feature vectors. Numerous studies have …

SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

Y Tang, Y Wu, J Shi, Q Jin - arXiv preprint arXiv:2406.08905, 2024 - arxiv.org
Discrete representation has shown advantages in speech generation tasks, wherein
discrete tokens are derived by discretizing hidden features from self-supervised learning …

SpeechPrompt: Prompting speech language models for speech processing tasks

KW Chang, H Wu, YK Wang, YK Wu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Prompting has become a practical method for utilizing pre-trained language models (LMs).
This approach offers several advantages. It allows an LM to adapt to new tasks with minimal …

Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study

P Chen, S Sun, C Shan, Q Yang, L Xie - arXiv preprint arXiv:2406.18862, 2024 - arxiv.org
Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown
impressive performance across various speech-related tasks, especially in Automatic …

LAST: Language model aware speech tokenization

A Turetzky, Y Adi - arXiv preprint arXiv:2409.03701, 2024 - arxiv.org
Speech tokenization serves as the foundation of speech language model (LM), enabling
them to perform various tasks such as spoken language modeling, text-to-speech, speech-to …