Text-free image-to-speech synthesis using learned segmental units

WN Hsu, D Harwath, C Song, J Glass - arXiv preprint arXiv:2012.15454, 2020 - arxiv.org
In this paper we present the first model for directly synthesizing fluent, natural-sounding
spoken audio captions for images that does not require natural language text as an …
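
At a high level, the text-free pipeline this title describes maps an image to a sequence of learned discrete units and then vocodes those units into a waveform, with no text anywhere in between. The sketch below shows only that data flow; both module names are stand-in assumptions, not the paper's actual components.

```python
import torch

def image_to_speech(image: torch.Tensor,
                    unit_generator,   # hypothetical: image -> discrete unit IDs
                    unit_vocoder):    # hypothetical: unit IDs -> waveform
    """Illustrative two-stage flow: the intermediate representation is a
    sequence of learned segmental units rather than text."""
    units = unit_generator(image)     # e.g. LongTensor of shape (seq_len,)
    waveform = unit_vocoder(units)    # e.g. FloatTensor of shape (num_samples,)
    return waveform
```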

Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arXiv preprint arXiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec 2.0 model to associate spoken captions with natural images, we show …
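
A generic sketch of the kind of audio-visual matching objective such models are trained with is a symmetric InfoNCE loss over paired (spoken caption, image) embeddings. The encoders, temperature value, and function name here are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb: torch.Tensor,
                         image_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired embeddings.
    Matched (caption, image) pairs sit on the diagonal of the
    similarity matrix; all other pairs serve as negatives."""
    a = F.normalize(audio_emb, dim=-1)   # (batch, dim)
    v = F.normalize(image_emb, dim=-1)   # (batch, dim)
    logits = a @ v.t() / temperature     # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Cross-entropy in both retrieval directions: audio->image, image->audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```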

Grounding 'grounding' in NLP

KR Chandu, Y Bisk, AW Black - arXiv preprint arXiv:2106.02192, 2021 - arxiv.org
The NLP community has seen substantial recent interest in grounding to facilitate interaction
between language technologies and the world. However, as a community, we use the term …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
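
As a rough illustration of the vector-quantization idea this abstract mentions, the following is a minimal straight-through VQ layer in the VQ-VAE style. The codebook size, feature dimension, and class name are assumptions; the paper's actual architecture and training losses differ in detail.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal straight-through vector-quantization layer.
    Maps each continuous speech frame to its nearest codebook entry,
    yielding a sequence of discrete unit IDs."""
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim) frame-level speech features.
        w = self.codebook.weight                      # (num_codes, dim)
        # Squared Euclidean distance from every frame to every code.
        dists = (x.pow(2).sum(-1, keepdim=True)
                 - 2 * x @ w.t()
                 + w.pow(2).sum(-1))
        ids = dists.argmin(dim=-1)                    # (batch, time) unit IDs
        quantized = self.codebook(ids)                # (batch, time, dim)
        # Straight-through estimator: gradients flow back to x as if
        # quantization were the identity function.
        quantized = x + (quantized - x).detach()
        return quantized, ids
```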

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …

Wnet: Audio-guided video object segmentation via wavelet-based cross-modal denoising networks

W Pan, H Shi, Z Zhao, J Zhu, X He… - Proceedings of the …, 2022 - openaccess.thecvf.com
Audio-guided video semantic segmentation is a challenging problem in visual analysis and
editing, which automatically separates foreground objects from background in a video …

Learning English with Peppa Pig

M Nikolaus, A Alishahi, G Chrupała - Transactions of the Association …, 2022 - direct.mit.edu
Recent computational models of the acquisition of spoken language via grounding in
perception exploit associations between spoken and visual modalities and learn to …

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? A computational investigation

K Khorrami, O Räsänen - arXiv preprint arXiv:2109.14200, 2021 - arxiv.org
Decades of research have studied how language-learning infants learn to discriminate
speech sounds, segment words, and associate words with their meanings. While gradual …

Talk, don't write: A study of direct speech-based image retrieval

R Sanabria, A Waters, J Baldridge - arXiv preprint arXiv:2104.01894, 2021 - arxiv.org
Speech-based image retrieval has been studied as a proxy for joint representation learning,
usually without emphasis on retrieval itself. As such, it is unclear how well speech-based …
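
To make the retrieval setup concrete, a minimal scoring step might look like the sketch below, assuming pretrained speech and image encoders have already produced fixed-dimensional embeddings in a shared space. The function name and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def rank_images(query_speech_emb: torch.Tensor,
                image_embs: torch.Tensor,
                k: int = 5) -> torch.Tensor:
    """Return indices of the k images most similar to a spoken query,
    scored by cosine similarity in the shared embedding space."""
    q = F.normalize(query_speech_emb, dim=-1)  # (dim,)
    g = F.normalize(image_embs, dim=-1)        # (num_images, dim)
    scores = g @ q                             # (num_images,) cosine scores
    return scores.topk(k).indices
```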

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

P Peng, SW Li, O Räsänen, A Mohamed… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we show that representations capturing syllabic units emerge when training a
self-supervised speech model with a visually-grounded training objective. We demonstrate …