Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …

UWSpeech: Speech to speech translation for unwritten languages

C Zhang, X Tan, Y Ren, T Qin, K Zhang… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Existing speech to speech translation systems heavily rely on the text of target language:
they usually translate source language either to target text and then synthesize target …

Survey: Transformer-based Models in Data Modality Conversion

E Rashno, A Eskandari, A Anand… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have made significant strides across various artificial intelligence domains,
including natural language processing, computer vision, and audio processing. This …

Safety Helmet‐Wearing Detection System for Manufacturing Workshop Based on Improved YOLOv7

X Chen, Q Xie - Journal of Sensors, 2023 - Wiley Online Library
Safety helmets play a vital role in protecting workers' heads. In order to improve the accuracy
of the detection model in complex environments, such as complex backgrounds and …

Discovering phonetic inventories with crosslingual automatic speech recognition

P Żelasko, S Feng, LM Velazquez, A Abavisani… - Computer Speech & …, 2022 - Elsevier
The high cost of data acquisition makes Automatic Speech Recognition (ASR) model
training problematic for most existing languages, including languages that do not even have …

How phonotactics affect multilingual and zero-shot ASR performance

S Feng, P Żelasko, L Moro-Velázquez… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
The idea of combining multiple languages' recordings to train a single automatic speech
recognition (ASR) model brings the promise of the emergence of universal speech …

Generating images from spoken descriptions

X Wang, T Qiao, J Zhu, A Hanjalic… - … /ACM Transactions on …, 2021 - ieeexplore.ieee.org
Text-based technologies, such as text translation from one language to another, and image
captioning, are gaining popularity. However, approximately half of the world's languages are …

Modelling human word learning and recognition using visually grounded speech

D Merkx, S Scholten, SL Frank, M Ernestus… - Cognitive …, 2023 - Springer
Many computational models of speech recognition assume that the set of target words is
already given. This implies that these models learn to recognise speech in a biologically …

Keyword localisation in untranscribed speech using visually grounded speech models

K Olaleye, D Oneaţă, H Kamper - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Keyword localisation is the task of finding where in a speech utterance a given query
keyword occurs. We investigate to what extent keyword localisation is possible using a …

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

S Feng, M Tu, R Xia, C Huang, Y Wang - arXiv preprint arXiv:2305.11569, 2023 - arxiv.org
We improve low-resource ASR by integrating the ideas of multilingual training and
self-supervised learning. Concretely, we leverage an International Phonetic Alphabet (IPA) …