Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Audio-visual event localization in unconstrained videos

Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …

Audio retrieval with natural language queries: A benchmark study

AS Koepke, AM Oncescu, JF Henriques… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the
goal is to retrieve the audio content from a pool of candidates that best matches a given …

TVLT: Textless vision-language transformer

Z Tang, J Cho, Y Nie, M Bansal - Advances in neural …, 2022 - proceedings.neurips.cc
In this work, we present the Textless Vision-Language Transformer (TVLT), where
homogeneous transformer blocks take raw visual and audio inputs for vision-and-language …

Fine-grained action retrieval through multiple parts-of-speech embeddings

M Wray, D Larlus, G Csurka… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We address the problem of cross-modal fine-grained action retrieval between text and video.
Cross-modal retrieval is commonly achieved through learning a shared embedding space …

Sound-guided semantic image manipulation

SH Lee, W Roh, W Byeon, SH Yoon… - Proceedings of the …, 2022 - openaccess.thecvf.com
The recent success of the generative model shows that leveraging the multi-modal
embedding space can manipulate an image using text information. However, manipulating …

Noisy correspondence learning with meta similarity correction

H Han, K Miao, Q Zheng, M Luo - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Despite the success of multimodal learning in cross-modal retrieval tasks, the remarkable
progress relies on the correct correspondence among multimedia data. However, collecting …

MovieFactory: Automatic movie creation from text using large generative models for language and images

J Zhu, H Yang, H He, W Wang, Z Tuo… - Proceedings of the 31st …, 2023 - dl.acm.org
In this paper, we present MovieFactory, a powerful framework to generate cinematic-picture
(3072x1280), film-style (multi-scene), and multi-modality (sounding) movies on the demand …

It's time for artistic correspondence in music and video

D Surís, C Vondrick, B Russell… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present an approach for recommending a music track for a given video, and vice versa,
based on both their temporal alignment and their correspondence at an artistic level. We …