Cross-modal embeddings for video and audio retrieval

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer

Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Cited by 190 · Related articles · All 12 versions

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org

Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Cited by 62 · Related articles · All 2 versions

Audio-visual event localization in unconstrained videos

Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com

In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …

Cited by 534 · Related articles · All 11 versions

Audio retrieval with natural language queries: A benchmark study

AS Koepke, AM Oncescu, JF Henriques… - IEEE Transactions …, 2022 - ieeexplore.ieee.org

The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the
goal is to retrieve the audio content from a pool of candidates that best matches a given …

Cited by 114 · Related articles · All 10 versions

TVLT: Textless vision-language transformer

Z Tang, J Cho, Y Nie, M Bansal - Advances in neural …, 2022 - proceedings.neurips.cc

In this work, we present the Textless Vision-Language Transformer (TVLT), where
homogeneous transformer blocks take raw visual and audio inputs for vision-and-language …

Cited by 37 · Related articles · All 7 versions

Fine-grained action retrieval through multiple parts-of-speech embeddings

M Wray, D Larlus, G Csurka… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

We address the problem of cross-modal fine-grained action retrieval between text and video.
Cross-modal retrieval is commonly achieved through learning a shared embedding space …

Cited by 176 · Related articles · All 10 versions

Sound-guided semantic image manipulation

SH Lee, W Roh, W Byeon, SH Yoon… - Proceedings of the …, 2022 - openaccess.thecvf.com

The recent success of the generative model shows that leveraging the multi-modal
embedding space can manipulate an image using text information. However, manipulating …

Cited by 59 · Related articles · All 9 versions

Noisy correspondence learning with meta similarity correction

H Han, K Miao, Q Zheng, M Luo - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Despite the success of multimodal learning in cross-modal retrieval task, the remarkable
progress relies on the correct correspondence among multimedia data. However, collecting …

Cited by 25 · Related articles · All 5 versions

MovieFactory: Automatic movie creation from text using large generative models for language and images

J Zhu, H Yang, H He, W Wang, Z Tuo… - Proceedings of the 31st …, 2023 - dl.acm.org

In this paper, we present MovieFactory, a powerful framework to generate cinematic-picture
(3072x1280), film-style (multi-scene), and multi-modality (sounding) movies on the demand …

Cited by 32 · Related articles · All 3 versions

It's time for artistic correspondence in music and video

D Surís, C Vondrick, B Russell… - Proceedings of the …, 2022 - openaccess.thecvf.com

We present an approach for recommending a music track for a given video, and vice versa,
based on both their temporal alignment and their correspondence at an artistic level. We …

Cited by 36 · Related articles · All 6 versions
