AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human-generated captions or machine-generated automatic speech …

Cross-modal discrete representation learning

AH Liu, SY Jin, CIJ Lai, A Rouditchenko, A Oliva… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent advances in representation learning have demonstrated an ability to represent
information from different modalities such as video, text, and audio in a single high-level …

A DNN-HMM-DNN hybrid model for discovering word-like units from spoken captions and image regions

L Wang, M Hasegawa-Johnson - Interspeech, 2020 - par.nsf.gov
Discovering word-like units without textual transcriptions is an important step in low-resource
speech technology. In this work, we demonstrate a model inspired by statistical machine …

Multimodal word discovery and retrieval with spoken descriptions and visual concepts

L Wang, M Hasegawa-Johnson - IEEE/ACM Transactions on …, 2020 - ieeexplore.ieee.org
In the absence of dictionaries, translators, or grammars, it is still possible to learn some of
the words of a new language by listening to spoken descriptions of images. If several …

[PDF][PDF] Cross-Modal Discrete Representation Learning

AH Liu, SY Jin, CIJ Lai, A Rouditchenko, A Oliva, J Glass - olivalab.mit.edu
In contrast to recent advances focusing on high-level representation learning across
modalities, in this work we present a self-supervised learning framework that is able to learn …

Learning Audio-Video Language Representations

A Rouditchenko - 2021 - dspace.mit.edu
Automatic speech recognition has seen recent advancements powered by machine
learning, but it is still only available for a small fraction of the more than 7,000 languages …