Text-free image-to-speech synthesis using learned segmental units

WN Hsu, D Harwath, C Song, J Glass - arXiv preprint arXiv:2012.15454, 2020 - arxiv.org
In this paper we present the first model for directly synthesizing fluent, natural-sounding
spoken audio captions for images that does not require natural language text as an …
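
At a high level, the text-free pipeline this title describes maps an image to a sequence of learned discrete units and then vocodes those units into a waveform, with no text anywhere in between. The sketch below shows only that data flow; both module names are stand-in assumptions, not the paper's actual components.

```python
import torch

def image_to_speech(image: torch.Tensor,
                    unit_generator,   # hypothetical: image -> discrete unit IDs
                    unit_vocoder):    # hypothetical: unit IDs -> waveform
    """Illustrative two-stage flow: the intermediate representation is a
    sequence of learned segmental units rather than text."""
    units = unit_generator(image)     # e.g. LongTensor of shape (seq_len,)
    waveform = unit_vocoder(units)    # e.g. FloatTensor of shape (num_samples,)
    return waveform
```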

Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arXiv preprint arXiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec 2.0 model to associate spoken captions with natural images, we show …
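
A generic sketch of the kind of audio-visual matching objective such models are trained with is a symmetric InfoNCE loss over paired (spoken caption, image) embeddings. The encoders, temperature value, and function name here are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb: torch.Tensor,
                         image_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired embeddings.
    Matched (caption, image) pairs sit on the diagonal of the
    similarity matrix; all other pairs serve as negatives."""
    a = F.normalize(audio_emb, dim=-1)   # (batch, dim)
    v = F.normalize(image_emb, dim=-1)   # (batch, dim)
    logits = a @ v.t() / temperature     # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Cross-entropy in both retrieval directions: audio->image, image->audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```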

Grounding 'grounding' in NLP

KR Chandu, Y Bisk, AW Black - arXiv preprint arXiv:2106.02192, 2021 - arxiv.org
The NLP community has seen substantial recent interest in grounding to facilitate interaction
between language technologies and the world. However, as a community, we use the term …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
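
As a rough illustration of the vector-quantization idea this abstract mentions, the following is a minimal straight-through VQ layer in the VQ-VAE style. The codebook size, feature dimension, and class name are assumptions; the paper's actual architecture and training losses differ in detail.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal straight-through vector-quantization layer.
    Maps each continuous speech frame to its nearest codebook entry,
    yielding a sequence of discrete unit IDs."""
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim) frame-level speech features.
        w = self.codebook.weight                      # (num_codes, dim)
        # Squared Euclidean distance from every frame to every code.
        dists = (x.pow(2).sum(-1, keepdim=True)
                 - 2 * x @ w.t()
                 + w.pow(2).sum(-1))
        ids = dists.argmin(dim=-1)                    # (batch, time) unit IDs
        quantized = self.codebook(ids)                # (batch, time, dim)
        # Straight-through estimator: gradients flow back to x as if
        # quantization were the identity function.
        quantized = x + (quantized - x).detach()
        return quantized, ids
```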

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …

Wnet: Audio-guided video object segmentation via wavelet-based cross-modal denoising networks

W Pan, H Shi, Z Zhao, J Zhu, X He… - Proceedings of the …, 2022 - openaccess.thecvf.com
Audio-guided video semantic segmentation is a challenging problem in visual analysis and
editing, which automatically separates foreground objects from background in a video …

Learning English with Peppa Pig

M Nikolaus, A Alishahi, G Chrupała - Transactions of the Association …, 2022 - direct.mit.edu
Recent computational models of the acquisition of spoken language via grounding in
perception exploit associations between spoken and visual modalities and learn to …

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? A computational investigation

K Khorrami, O Räsänen - arXiv preprint arXiv:2109.14200, 2021 - arxiv.org
Decades of research have studied how language-learning infants learn to discriminate
speech sounds, segment words, and associate words with their meanings. While gradual …

Talk, don't write: A study of direct speech-based image retrieval

R Sanabria, A Waters, J Baldridge - arXiv preprint arXiv:2104.01894, 2021 - arxiv.org
Speech-based image retrieval has been studied as a proxy for joint representation learning,
usually without emphasis on retrieval itself. As such, it is unclear how well speech-based …
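
To make the retrieval setup concrete, a minimal scoring step might look like the sketch below, assuming pretrained speech and image encoders have already produced fixed-dimensional embeddings in a shared space. The function name and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def rank_images(query_speech_emb: torch.Tensor,
                image_embs: torch.Tensor,
                k: int = 5) -> torch.Tensor:
    """Return indices of the k images most similar to a spoken query,
    scored by cosine similarity in the shared embedding space."""
    q = F.normalize(query_speech_emb, dim=-1)  # (dim,)
    g = F.normalize(image_embs, dim=-1)        # (num_images, dim)
    scores = g @ q                             # (num_images,) cosine scores
    return scores.topk(k).indices
```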

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

P Peng, SW Li, O Räsänen, A Mohamed… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we show that representations capturing syllabic units emerge when training a
self-supervised speech model with a visually-grounded training objective. We demonstrate …