AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Current methods for learning visually grounded language from videos often rely on text annotation, such as human-generated captions or machine-generated automatic speech …

SpeechCLIP: Integrating speech with pre-trained vision and language model

YJ Shih, HF Wang, HJ Chang, L Berry… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Data-driven speech processing models usually perform well with a large amount of text
supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech …

Self-supervised representation learning for speech using visual grounding and masked language modeling

P Peng, D Harwath - arXiv preprint arXiv:2202.03543, 2022 - arxiv.org
In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and
SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS …

Fast-Slow transformer for visually grounding speech

P Peng, D Harwath - ICASSP 2022-2022 IEEE International …, 2022 - ieeexplore.ieee.org
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS
is a Transformer-based model for learning the associations between raw speech waveforms …

ConceptBeam: Concept driven target speech extraction

Y Ohishi, M Delcroix, T Ochiai, S Araki… - Proceedings of the 30th …, 2022 - dl.acm.org
We propose a novel framework for target speech extraction based on semantic information,
called ConceptBeam. Target speech extraction means extracting the speech of a target …

Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases

H Xie, O Räsänen, K Drossos… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
We investigate unsupervised learning of correspondences between sound events and
textual phrases through aligning audio clips with textual captions describing the content of a …

Synthesizing spoken descriptions of images

X Wang, J Van Der Hout, J Zhu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's …

Spoken ObjectNet: A bias-controlled spoken caption dataset

I Palmer, A Rouditchenko, A Barbu, B Katz… - arXiv preprint arXiv …, 2021 - arxiv.org
Visually grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets …

Cascaded multilingual audio-visual learning from videos

A Rouditchenko, A Boggust, D Harwath… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we explore self-supervised audio-visual models that learn from instructional
videos. Prior work has shown that these models can relate spoken words and sounds to …

A translation framework for visually grounded spoken unit discovery

L Wang, M Hasegawa-Johnson - 2021 55th Asilomar …, 2021 - ieeexplore.ieee.org
Multimodal acoustic unit discovery (MAUD) is a key task in self-supervised spoken language
learning and low-resource speech recognition. In this paper, we propose two models for …