Multi-modal pre-training for automated speech recognition

DM Chan, S Ghosh, D Chakrabarty… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).
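The abstract describes fusing a global, multi-modal context encoding into a conventional frame-level ASR pipeline via a deep-fusion framework. The paper's exact fusion mechanism is not given in this snippet; the following is a minimal sketch of one plausible form, a gated additive fusion in PyTorch, where the module name, gating scheme, and tensor shapes are illustrative assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class GatedDeepFusion(nn.Module):
    """Illustrative fusion of a global multi-modal context vector into
    frame-level acoustic features. The gated-additive form is an
    assumption, not the mechanism described in the paper."""

    def __init__(self, acoustic_dim: int, context_dim: int):
        super().__init__()
        # Project the global context into the acoustic feature space.
        self.project = nn.Linear(context_dim, acoustic_dim)
        # Per-frame gate deciding how much global context to inject.
        self.gate = nn.Linear(acoustic_dim + context_dim, acoustic_dim)

    def forward(self, frames: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # frames:  (batch, time, acoustic_dim) from the acoustic encoder
        # context: (batch, context_dim) global multi-modal encoding
        ctx = context.unsqueeze(1).expand(-1, frames.size(1), -1)
        gate = torch.sigmoid(self.gate(torch.cat([frames, ctx], dim=-1)))
        return frames + gate * self.project(ctx)

# Example usage with random tensors standing in for encoder outputs.
frames = torch.randn(2, 100, 256)   # acoustic encoder output
context = torch.randn(2, 512)       # global multi-modal context encoding
fused = GatedDeepFusion(256, 512)(frames, context)
print(fused.shape)  # torch.Size([2, 100, 256])
```

The gate lets the model fall back to the purely local acoustic features when the global context is uninformative, which is one way a deep-fusion design could provide the robustness to unseen noise that the abstract claims.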