LRSpeech: Extremely low-resource speech synthesis and recognition

J Xu, X Tan, Y Ren, T Qin, J Li, S Zhao… - Proceedings of the 26th …, 2020 - dl.acm.org
Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR)
are important speech tasks, and require a large amount of text and speech pairs for model …

Almost unsupervised text to speech and automatic speech recognition

Y Ren, X Tan, T Qin, S Zhao… - … on machine learning, 2019 - proceedings.mlr.press
Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech
processing and both achieve impressive performance thanks to the recent advance in deep …

Unsupervised cross-modal alignment of speech and text embedding spaces

YA Chung, WH Weng, S Tong… - Advances in neural …, 2018 - proceedings.neurips.cc
Recent research has shown that word embedding spaces learned from text corpora of
different languages can be aligned without any parallel data supervision. Inspired by the …

Unsupervised speech recognition via segmental empirical output distribution matching

CK Yeh, J Chen, C Yu, D Yu - arXiv preprint arXiv:1812.09323, 2018 - arxiv.org
We consider the problem of training speech recognition systems without using any labeled
data, under the assumption that the learner can only access the input utterances and a …

Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval

YC Chen, SF Huang, CH Shen… - 2018 IEEE Spoken …, 2018 - ieeexplore.ieee.org
Word embedding or Word2Vec has been successful in offering semantics for text words
learned from the context of words. Audio Word2Vec was shown to offer phonetic structures …

Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

F López, J Luque - arXiv preprint arXiv:2210.15226, 2022 - arxiv.org
High-quality data labeling from specific domains is costly and human time-consuming. In this
work, we propose a self-supervised domain adaptation method, based upon an iterative …

Improving word recognition in speech transcriptions by decision-level fusion of stemming and two-way phoneme pruning

S Mehra, S Susan - … Computing: 10th International Conference, IACC 2020 …, 2021 - Springer
We introduce an unsupervised approach for correcting highly imperfect speech
transcriptions based on a decision-level fusion of stemming and two-way phoneme pruning …

Almost-unsupervised speech recognition with close-to-zero resource based on phonetic structures learned from very small unpaired speech and text data

YC Chen, CH Shen, SF Huang, H Lee… - arXiv preprint arXiv …, 2018 - arxiv.org
Producing a large amount of annotated speech data for training ASR systems remains
difficult for more than 95% of languages all over the world which are low-resourced …

Learning decision making strategies of non-experts: A NEXT-GAIL model for taxi drivers

M Pan, X Zhang, Y Li, X Zhou, J Luo - Proceedings of the 29th …, 2021 - dl.acm.org
Thanks to the rapid development of mobile sensing techniques, massive human-generated
spatial-temporal data (HSTD) are generated from the urban areas, e.g., passenger-seeking …

Temporally aligning long audio interviews with questions: a case study in multimodal data integration

PS Pasi, K Battepati, P Jyothi, G Ramakrishnan… - arXiv preprint arXiv …, 2023 - arxiv.org
The problem of audio-to-text alignment has seen a significant amount of research using
complete supervision during training. However, this is typically not in the context of long …