Exposing and mitigating spurious correlations for cross-modal retrieval

JM Kim, A Koepke, C Schmid… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Cross-modal retrieval methods are the preferred tool to search databases for the text that
best matches a query image and vice versa. However, image-text retrieval models commonly …

Cross-modal retrieval: a systematic review of methods and future directions

L Zhu, T Wang, F Li, J Li, Z Zhang, HT Shen - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users demanding access to data from various …

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …

Separate anything you describe

X Liu, Q Kong, Y Zhao, H Liu, Y Yuan, Y Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Language-queried audio source separation (LASS) is a new paradigm for computational
auditory scene analysis (CASA). LASS aims to separate a target sound from an audio …

The SJTU system for DCASE2022 challenge task 6: Audio captioning with audio-text retrieval pre-training

X Xu, Z Xie, M Wu, K Yu - Tech. Rep., DCASE2022 Challenge, 2022 - dcase.community
This technical report describes the system submitted to the Detection and Classification of
Acoustic Scenes and Events (DCASE) 2022 challenge Task 6. There are two involving …

Improving text-audio retrieval by text-aware attention pooling and prior matrix revised loss

Y Xin, D Yang, Y Zou - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
In text-audio retrieval (TAR) tasks, due to the heterogeneity of contents between text and
audio, the semantic information contained in the text is only similar to certain frames within …

Zero-shot audio captioning with audio-language model guidance and audio context keywords

L Salewski, S Fauth, A Koepke, Z Akata - arXiv preprint arXiv:2311.08396, 2023 - arxiv.org
Zero-shot audio captioning aims at automatically generating descriptive textual captions for
audio content without prior training for this task. Different from speech recognition which …

Cooperative game modeling with weighted token-level alignment for audio-text retrieval

Y Xin, B Wang, L Shang - IEEE Signal Processing Letters, 2023 - ieeexplore.ieee.org
Previous audio-text retrieval (ATR) methods primarily concentrate on constructing
contrastive pairs between entire audio clips and full caption sentences, while neglecting fine …

Towards effective multi-modal interchanges in zero-resource sounding object localization

Y Zhao, C Zhang, H Huang, H Li… - Advances in Neural …, 2022 - proceedings.neurips.cc
Aiming to locate the object that emits a specified sound in complex scenes, the task of
sounding object localization bridges two perception-oriented modalities of vision and …

Improving audio-text retrieval via hierarchical cross-modal interaction and auxiliary captions

Y Xin, Y Zou - arXiv preprint arXiv:2307.15344, 2023 - arxiv.org
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs
between whole audio clips and complete caption sentences, while ignoring fine-grained …