P Peng, D Harwath - arXiv preprint arXiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec 2.0 model to associate spoken captions with natural images, we show …
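A minimal sketch of the kind of cross-modal objective such training typically relies on: a symmetric contrastive loss that pulls paired spoken-caption and image embeddings together within a batch. This is not the authors' code; the tensor names (audio_emb, image_emb) and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor,
                     image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired audio/image embeddings."""
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares caption i with image j.
    logits = audio_emb @ image_emb.t() / temperature

    # Matched caption/image pairs lie on the diagonal.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Average the audio-to-image and image-to-audio cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```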
The NLP community has seen substantial recent interest in grounding to facilitate interaction between language technologies and the world. However, as a community, we use the term …
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our …
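As a rough illustration of the quantization component the abstract mentions, the sketch below shows a generic vector quantization layer with a straight-through gradient estimator, of the sort that can be inserted between layers of a speech encoder. The codebook size, feature dimension, and class name are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous frame features to discrete codebook indices."""

    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim) frame-level features from the speech encoder.
        flat = x.reshape(-1, x.size(-1))                      # (batch*time, dim)
        dists = torch.cdist(flat, self.codebook.weight)       # distance to each code
        codes = dists.argmin(dim=-1).view(x.size(0), x.size(1))

        # Look up the nearest codebook vectors (the discrete linguistic units).
        quantized = self.codebook(codes)

        # Straight-through estimator: gradients flow back as if quantization
        # were the identity, so the encoder can still be trained end to end.
        quantized = x + (quantized - x).detach()
        return quantized, codes
```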
G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when …
W Pan, H Shi, Z Zhao, J Zhu, X He… - Proceedings of the …, 2022 - openaccess.thecvf.com
Audio-guided video semantic segmentation is a challenging problem in visual analysis and editing, which automatically separates foreground objects from the background in a video …
Recent computational models of spoken language acquisition via grounding in perception exploit associations between the spoken and visual modalities and learn to …
Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual …
Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based …
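For context, speech-based image retrieval in this setting usually reduces to ranking images by similarity to a spoken-query embedding in a shared space. The sketch below assumes precomputed embeddings; the array names, shapes, and top_k parameter are illustrative, not from the paper.

```python
import numpy as np

def retrieve_images(query_audio_emb: np.ndarray,
                    image_embs: np.ndarray,
                    top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k images most similar to one spoken query.

    query_audio_emb: (dim,) embedding of the spoken query.
    image_embs:      (num_images, dim) embeddings of the candidate images.
    """
    # Normalize so dot products become cosine similarities.
    q = query_audio_emb / np.linalg.norm(query_audio_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

    # Rank images by descending similarity to the query.
    sims = imgs @ q
    return np.argsort(-sims)[:top_k]
```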
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate …