Audio-text retrieval in context

Exposing and mitigating spurious correlations for cross-modal retrieval

JM Kim, A Koepke, C Schmid… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Cross-modal retrieval methods are the preferred tool to search databases for the text that
best matches a query image and vice versa. However, image-text retrieval models commonly …

Cited by 18 · Related articles · All 5 versions

[PDF] arxiv.org

Cross-modal retrieval: a systematic review of methods and future directions

L Zhu, T Wang, F Li, J Li, Z Zhang, HT Shen - arXiv preprint arXiv …, 2023 - arxiv.org

With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users demanding access to data from various …

Cited by 6 · Related articles · All 3 versions

[PDF] arxiv.org

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org

As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …

Cited by 25 · Related articles · All 5 versions

[PDF] arxiv.org

Separate anything you describe

X Liu, Q Kong, Y Zhao, H Liu, Y Yuan, Y Liu… - arXiv preprint arXiv …, 2023 - arxiv.org

Language-queried audio source separation (LASS) is a new paradigm for computational
auditory scene analysis (CASA). LASS aims to separate a target sound from an audio …

Cited by 14 · Related articles · All 3 versions

[PDF] dcase.community

[PDF] The SJTU system for DCASE2022 challenge task 6: Audio captioning with audio-text retrieval pre-training

X Xu, Z Xie, M Wu, K Yu - Tech. Rep., DCASE2022 Challenge, 2022 - dcase.community

This technical report describes the system submitted to the Detection and Classification of
Acoustic Scenes and Events (DCASE) 2022 challenge Task 6. There are two involving …

Cited by 31 · Related articles · All 2 versions

[PDF] arxiv.org

Improving text-audio retrieval by text-aware attention pooling and prior matrix revised loss

Y Xin, D Yang, Y Zou - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org

In text-audio retrieval (TAR) tasks, due to the heterogeneity of contents between text and
audio, the semantic information contained in the text is only similar to certain frames within …

Cited by 19 · Related articles · All 5 versions

[PDF] arxiv.org

Zero-shot audio captioning with audio-language model guidance and audio context keywords

L Salewski, S Fauth, A Koepke, Z Akata - arXiv preprint arXiv:2311.08396, 2023 - arxiv.org

Zero-shot audio captioning aims at automatically generating descriptive textual captions for
audio content without prior training for this task. Different from speech recognition which …

Cited by 3 · Related articles · All 3 versions

Cooperative game modeling with weighted token-level alignment for audio-text retrieval

Y Xin, B Wang, L Shang - IEEE Signal Processing Letters, 2023 - ieeexplore.ieee.org

Previous audio-text retrieval (ATR) methods primarily concentrate on constructing
contrastive pairs between entire audio clips and full caption sentences, while neglecting fine …

Cited by 3 · Related articles · All 2 versions

[PDF] neurips.cc

Towards effective multi-modal interchanges in zero-resource sounding object localization

Y Zhao, C Zhang, H Huang, H Li… - Advances in Neural …, 2022 - proceedings.neurips.cc

Aiming to locate the object that emits a specified sound in complex scenes, the task of
sounding object localization bridges two perception-oriented modalities of vision and …

Cited by 5 · Related articles · All 4 versions

[PDF] arxiv.org

Improving audio-text retrieval via hierarchical cross-modal interaction and auxiliary captions

Y Xin, Y Zou - arXiv preprint arXiv:2307.15344, 2023 - arxiv.org

Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs
between whole audio clips and complete caption sentences, while ignoring fine-grained …

Cited by 4 · Related articles · All 4 versions
