Towards unsupervised automatic speech recognition trained by unaligned speech and text only

J Xu, X Tan, Y Ren, T Qin, J Li, S Zhao… - Proceedings of the 26th …, 2020 - dl.acm.org

Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR)
are important speech tasks, and require a large amount of text and speech pairs for model …

Cited by 94 · Related articles · All 4 versions

Almost unsupervised text to speech and automatic speech recognition

Y Ren, X Tan, T Qin, S Zhao… - … on machine learning, 2019 - proceedings.mlr.press

Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech
processing and both achieve impressive performance thanks to the recent advance in deep …

Cited by 122 · Related articles · All 7 versions

Unsupervised cross-modal alignment of speech and text embedding spaces

YA Chung, WH Weng, S Tong… - Advances in neural …, 2018 - proceedings.neurips.cc

Recent research has shown that word embedding spaces learned from text corpora of
different languages can be aligned without any parallel data supervision. Inspired by the …

Cited by 111 · Related articles · All 15 versions

Unsupervised speech recognition via segmental empirical output distribution matching

CK Yeh, J Chen, C Yu, D Yu - arXiv preprint arXiv:1812.09323, 2018 - arxiv.org

We consider the problem of training speech recognition systems without using any labeled
data, under the assumption that the learner can only access to the input utterances and a …

Cited by 45 · Related articles · All 4 versions

Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval

YC Chen, SF Huang, CH Shen… - 2018 IEEE Spoken …, 2018 - ieeexplore.ieee.org

Word embedding or Word2Vec has been successful in offering semantics for text words
learned from the context of words. Audio Word2Vec was shown to offer phonetic structures …

Cited by 40 · Related articles · All 4 versions

Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

F López, J Luque - arXiv preprint arXiv:2210.15226, 2022 - arxiv.org

High-quality data labeling from specific domains is costly and human time-consuming. In this
work, we propose a self-supervised domain adaptation method, based upon an iterative …

Cited by 5 · Related articles · All 7 versions

Improving word recognition in speech transcriptions by decision-level fusion of stemming and two-way phoneme pruning

S Mehra, S Susan - … Computing: 10th International Conference, IACC 2020 …, 2021 - Springer

We introduce an unsupervised approach for correcting highly imperfect speech
transcriptions based on a decision-level fusion of stemming and two-way phoneme pruning …

Cited by 7 · Related articles · All 5 versions

Almost-unsupervised speech recognition with close-to-zero resource based on phonetic structures learned from very small unpaired speech and text data

YC Chen, CH Shen, SF Huang, H Lee… - arXiv preprint arXiv …, 2018 - arxiv.org

Producing a large amount of annotated speech data for training ASR systems remains
difficult for more than 95% of languages all over the world which are low-resourced …

Cited by 14 · Related articles · All 3 versions

Learning decision making strategies of non-experts: A next-GAIL model for taxi drivers

M Pan, X Zhang, Y Li, X Zhou, J Luo - Proceedings of the 29th …, 2021 - dl.acm.org

Thanks to the rapid development of mobile sensing techniques, massive human-generated
spatial-temporal data (HSTD) are generated from the urban areas, e.g., passenger-seeking …

Cited by 3 · Related articles · All 2 versions

Temporally aligning long audio interviews with questions: a case study in multimodal data integration

PS Pasi, K Battepati, P Jyothi, G Ramakrishnan… - arXiv preprint arXiv …, 2023 - arxiv.org

The problem of audio-to-text alignment has seen significant amount of research using
complete supervision during training. However, this is typically not in the context of long …
